How to code the data processing part

This file defines how raw data is loaded, processed, and passed to the training pipeline. In BojAI, data processing is modular and fully customizable—this is where your dataset is transformed into something your model can learn from.

At the heart of this file is one class: YourDataProcessor.
It inherits from Processor, which is a BojAI abstract base class that handles much of the backend integration for you.

You only need to fill in the required methods for:

  • loading the data
  • splitting it into training and evaluation sets
  • tokenizing and formatting each sample

Overall Class Behavior

class YourDataProcessor(Processor):

This class will be instantiated three times automatically by the pipeline:

  1. To load and split the full dataset (when is_main=True)
  2. To provide the train dataset (division='train')
  3. To provide the eval dataset (division='eval')

You don’t need to worry about managing these three modes—BojAI does that for you using flags like is_main and division.
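Putting it together, the class you fill in is just the five methods covered below. This is a bare skeleton whose bodies are the same placeholders shown in the method-by-method breakdown; the Processor base class takes care of construction and the is_main/division handling:

class YourDataProcessor(Processor):
    def get_inputs_outputs(self, data_dir):
        # Load raw data from data_dir and return (inputs, outputs)
        return [], []

    def get_train_eval(self):
        # Split self.inputs / self.outputs into train and eval partitions
        return [], [], [], []

    def __len__(self):
        # Number of examples in this dataset
        return len(self.inputs)

    def __getitem__(self, idx):
        # Tokenize and return one example in the format your model expects
        return None

    def get_item_untokenized(self, idx):
        # Return the raw (untokenized) input/output pair
        return None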


Method-by-Method Breakdown

1. get_inputs_outputs(self, data_dir)

def get_inputs_outputs(self, data_dir):
    return [], []

This is the method where you load your raw data.

  • Input: The directory path to your dataset.
  • Output: A tuple of two lists:
    • inputs: e.g., list of source sentences or questions.
    • outputs: e.g., target sentences or answers.

Example for a translation dataset:

return ["Translate this", "Another example"], ["Traduce esto", "Otro ejemplo"]

2. get_train_eval(self)

def get_train_eval(self):
    return [], [], [], []

This method splits the full dataset into:

  • inputs_train
  • inputs_eval
  • outputs_train
  • outputs_eval

You can use slicing or any stratified logic. A simple 80/20 split:

train_size = int(0.8 * len(self.inputs))
return (
    self.inputs[:train_size],
    self.inputs[train_size:],
    self.outputs[:train_size],
    self.outputs[train_size:],
)
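If your data is ordered (for example grouped by label or by length), consider shuffling before slicing. A minimal sketch using Python's random module, with a fixed seed so the split is reproducible:

import random

def get_train_eval(self):
    # Shuffle paired examples together so inputs and outputs stay aligned
    pairs = list(zip(self.inputs, self.outputs))
    random.Random(42).shuffle(pairs)
    split = int(0.8 * len(pairs))
    inputs_train = [src for src, _ in pairs[:split]]
    inputs_eval = [src for src, _ in pairs[split:]]
    outputs_train = [tgt for _, tgt in pairs[:split]]
    outputs_eval = [tgt for _, tgt in pairs[split:]]
    return inputs_train, inputs_eval, outputs_train, outputs_eval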

3. __len__(self)

def __len__(self):
    return len(self.inputs)

This should return the number of examples in the dataset. If you’ve correctly initialized self.inputs, this default implementation is fine.


4. __getitem__(self, idx)

def __getitem__(self, idx):
    return None

This is the core of your dataset. It defines how each example is returned to the model during training.

  • You should use your tokenizer here. You can define a custom tokenizer below and use it in YourDataProcessor.
  • The return format is entirely up to you, but keep in mind that whatever you return here is what your model will receive when you code it.

An example format could be a dictionary with fields like:

return {
    "input_ids": torch.tensor(...),
    "labels": torch.tensor(...)
}

🛠 Example for classification:

encoded = self.tokenizer(self.inputs[idx], padding="max_length", truncation=True, return_tensors="pt")
return {
    "input_ids": encoded["input_ids"].squeeze(0),
    "attention_mask": encoded["attention_mask"].squeeze(0),
    "labels": torch.tensor(self.outputs[idx])
}
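Continuing the translation example from get_inputs_outputs, a sketch for a sequence-to-sequence setup (this assumes self.tokenizer behaves like a Hugging Face-style tokenizer, as in the classification example; adapt it to whatever tokenizer you define):

def __getitem__(self, idx):
    source = self.tokenizer(self.inputs[idx], padding="max_length",
                            truncation=True, return_tensors="pt")
    # Tokenize the target text as well and use its token ids as labels
    target = self.tokenizer(self.outputs[idx], padding="max_length",
                            truncation=True, return_tensors="pt")
    return {
        "input_ids": source["input_ids"].squeeze(0),
        "attention_mask": source["attention_mask"].squeeze(0),
        "labels": target["input_ids"].squeeze(0),
    }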

5. get_item_untokenized(self, idx)

def get_item_untokenized(self, idx):
    return None

This returns the raw (untokenized) input/output pair, mostly used for:

  • debugging
  • visual inspection in the GUI or logs
  • string conversion during evaluation

🛠 Example:

return self.inputs[idx], self.outputs[idx]

If you’ve correctly initialized self.inputs and self.outputs, this default implementation is fine.


Final Notes

Once you finish implementing this file:

  • BojAI will be able to load, split, and feed your dataset into your model.
  • This powers the first stage of the pipeline: prepare.
  • You can run this stage by building your pipeline and then starting it.

In CLI:
    bojai start --pipeline name-from-build --directory where/the/editing/files/are --stage prepare

In UI:
    bojai start --pipeline name-from-build --directory where/the/editing/files/are --stage prepare --ui


✅ Summary Table

Method                  Purpose
get_inputs_outputs      Load raw data and return inputs/outputs
get_train_eval          Split data into train/eval partitions
__len__                 Return dataset size
__getitem__             Tokenize and return a single item (for training)
get_item_untokenized    Return the raw version (for inspection/logging)