How to Use the Data Processing Agent

Welcome to the 2.0 release! If your dataset is structured differently from what BojAI expects, you can use the Data Processing Agent to adapt the pipeline automatically, based on a description you provide.

⚠️ This is a beta feature, and it currently works via Ollama, which must be installed locally.


Prerequisites

To use the agent, you must have Ollama installed on your system.

Download Ollama

Then download the Mistral model locally by running:

ollama pull mistral
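
Before starting the CLI flow, it can help to confirm that Ollama is actually reachable from your shell. The following is a minimal sketch, not part of BojAI: the helper name `ollama_on_path` is hypothetical, and it only checks that an `ollama` executable is on PATH, not that the model has been pulled.

```python
import shutil


def ollama_on_path() -> bool:
    """Return True if an `ollama` executable can be found on PATH."""
    # shutil.which returns the full path of the executable, or None.
    return shutil.which("ollama") is not None


if __name__ == "__main__":
    if ollama_on_path():
        print("Ollama found. Run `ollama pull mistral` if the model is not downloaded yet.")
    else:
        print("Ollama not found on PATH. Install it before using the agent.")
```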

When to Use the Agent

During the CLI initialization flow, after you enter the initialization data, BojAI will ask:

Would you like to use the agent? [Y/n]

If you select Y, you’ll go through a few guided steps to describe your dataset so the agent can adapt the pipeline for you.


Steps When Using the Agent

  1. Confirm Usage

    You’ll be asked:

    Would you like to use the agent? [Y/n]
    

    Enter Y to continue or N to skip.

  2. Confirm Image Use

    You’ll be asked:

    Does your data contain images? [Y/n]
    

    This helps the agent know whether to treat inputs as image paths.

  3. Describe Your Dataset

    You’ll be prompted to enter a clear description of how your data is organized.
    Example:

    My data folder contains a .txt file where each line has input and output separated by a comma.
    

    Or for images:

    Each subfolder contains one image and a label.txt file with the class.
    

    Entering 0 at this step will cancel agent use.


What the Agent Does

  • Creates a temporary copy of your custom_data_processor.py file
  • Generates a new get_inputs_outputs method using a local LLM (via Ollama)
  • Injects the method into the temporary copy and tests it
  • On success: overwrites the original file with the new one
  • On failure: retries up to three times, feeding the error back to the LLM, before giving up

On success, the YourDataProcessor class has a working get_inputs_outputs() method tailored to your data.
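
To give a feel for the result, here is a rough sketch of what a generated get_inputs_outputs could look like for the text example from step 3 (a .txt file where each line has input and output separated by a comma). It is illustrative only: it is written as a standalone function, the `data_dir` parameter is an assumption, and the signature of the real method in custom_data_processor.py may differ.

```python
import os
import tempfile


def get_inputs_outputs(data_dir):
    """Illustrative only: read 'input,output' pairs from every .txt file in data_dir.

    The data_dir parameter is an assumption made for this sketch.
    """
    inputs, outputs = [], []
    for name in sorted(os.listdir(data_dir)):
        if not name.endswith(".txt"):
            continue
        with open(os.path.join(data_dir, name), encoding="utf-8") as handle:
            for line in handle:
                line = line.strip()
                if not line or "," not in line:
                    continue  # skip blank or malformed lines
                # Split on the first comma only, so the output may contain commas.
                source, target = line.split(",", 1)
                inputs.append(source.strip())
                outputs.append(target.strip())
    return inputs, outputs
```

Splitting on the first comma only is a design choice for this sketch. If your inputs can contain commas themselves, say so in your description so the agent can pick a different parsing strategy.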


Guidelines for Descriptions

The better your description, the better the result. Be sure to include:

  • What files are present (e.g. .csv, .txt, image folders)
  • How inputs and outputs are separated or paired
  • Any formats (e.g. line-based, JSON objects, folder-based)

What Happens Under the Hood

  • The agent constructs a prompt with your description
  • It asks Ollama to generate a valid Python method
  • It extracts the generated method with a regular expression and replaces the existing one
  • It then tests the result by calling get_item_untokenized(0)

If the function runs correctly, the changes are saved. If not, the agent tries to fix it based on the error, up to three times.
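
The generate, extract, test, and retry cycle described above can be sketched roughly as follows. This is not BojAI's actual code. The names `build_prompt`, `extract_function`, and `adapt`, the prompt wording, and the regular expression are all assumptions. The `generate` callable stands in for the request to Ollama, and `check` stands in for the call to get_item_untokenized(0). For simplicity the sketch treats get_inputs_outputs as a top-level function rather than a class method, and it makes three attempts in total.

```python
import re
from typing import Callable, Optional

MAX_ATTEMPTS = 3  # assumption: three attempts in total

# Match from "def get_inputs_outputs(" up to the next unindented line (or end of text).
FUNC_PATTERN = re.compile(
    r"^def get_inputs_outputs\(.*?(?=^\S|\Z)", re.DOTALL | re.MULTILINE
)


def build_prompt(description: str, feedback: Optional[str] = None) -> str:
    """Build the LLM prompt; the wording here is purely illustrative."""
    prompt = (
        "Write a Python function get_inputs_outputs(data_dir) that returns "
        "(inputs, outputs) for a dataset described as follows:\n" + description
    )
    if feedback:
        prompt += "\nThe previous attempt failed with this error:\n" + feedback
    return prompt


def extract_function(response: str) -> Optional[str]:
    """Pull the function source out of the raw LLM response, if present."""
    match = FUNC_PATTERN.search(response)
    return match.group(0).rstrip() if match else None


def adapt(
    description: str,
    generate: Callable[[str], str],
    check: Callable[[Callable], None],
) -> Optional[str]:
    """Return working function source, or None if every attempt fails."""
    feedback = None
    for _ in range(MAX_ATTEMPTS):
        source = extract_function(generate(build_prompt(description, feedback)))
        if source is None:
            feedback = "no get_inputs_outputs function found in the response"
            continue
        try:
            namespace = {}
            exec(source, namespace)  # define the generated function in isolation
            check(namespace["get_inputs_outputs"])  # raises if the function is broken
            return source
        except Exception as error:
            # Feed the error back into the next prompt.
            feedback = f"{type(error).__name__}: {error}"
    return None
```

Passing `generate` and `check` in as callables keeps the loop independent of any particular LLM client, which also makes it easy to exercise with a stub instead of a live model.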