How to code the data processing part
This file defines how raw data is loaded, processed, and passed to the training pipeline. In BojAI, data processing is modular and fully customizable—this is where your dataset is transformed into something your model can learn from.
At the heart of this file is one class: `YourDataProcessor`. It inherits from `Processor`, a BojAI abstract base class that handles much of the backend integration for you.
You only need to fill in the required methods for:
- loading the data
- splitting it into training and evaluation sets
- tokenizing and formatting each sample
Overall Class Behavior
```python
class YourDataProcessor(Processor):
```
This class will be instantiated three times automatically by the pipeline:
- To load and split the full dataset (when `is_main=True`)
- To provide the train dataset (`division='train'`)
- To provide the eval dataset (`division='eval'`)
You don’t need to worry about managing these three modes; BojAI does that for you using flags like `is_main` and `division`.
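For orientation, here is a minimal sketch of the class with all five methods stubbed out, matching the defaults described in the method-by-method breakdown below. It assumes the generated file already imports `Processor` (and `torch`, if you use it in `__getitem__`); the exact template BojAI generates may differ slightly.

```python
class YourDataProcessor(Processor):
    def get_inputs_outputs(self, data_dir):
        # Load raw data from data_dir and return (inputs, outputs).
        return [], []

    def get_train_eval(self):
        # Split self.inputs / self.outputs into train and eval partitions.
        return [], [], [], []

    def __len__(self):
        # Number of examples held by this instance.
        return len(self.inputs)

    def __getitem__(self, idx):
        # Tokenize and format one example for the model.
        return None

    def get_item_untokenized(self, idx):
        # Return the raw, human-readable example.
        return self.inputs[idx], self.outputs[idx]
```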
Method-by-Method Breakdown
1. get_inputs_outputs(self, data_dir)
```python
def get_inputs_outputs(self, data_dir):
    return [], []
```
This is the method where you load your raw data.
- Input: the directory path to your dataset.
- Output: a tuple of two lists:
  - `inputs`: e.g., a list of source sentences or questions.
  - `outputs`: e.g., target sentences or answers.
Example for a translation dataset:
return ["Translate this", "Another example"], ["Traduce esto", "Otro ejemplo"]
2. get_train_eval(self)
```python
def get_train_eval(self):
    return [], [], [], []
```
This method splits the full dataset into:
- `inputs_train`
- `inputs_eval`
- `outputs_train`
- `outputs_eval`
You can use slicing or any stratified logic. A simple 80/20 split:
```python
train_size = int(0.8 * len(self.inputs))
return (
    self.inputs[:train_size],
    self.inputs[train_size:],
    self.outputs[:train_size],
    self.outputs[train_size:],
)
```
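If the raw data is ordered (for example, grouped by label or by source file), a plain slice can make the eval set unrepresentative. Here is a sketch of a shuffled 80/20 split, assuming `self.inputs` and `self.outputs` are plain lists:

```python
import random

def get_train_eval(self):
    # Shuffle index positions so the eval set is not just the tail of the dataset.
    indices = list(range(len(self.inputs)))
    random.Random(42).shuffle(indices)  # fixed seed keeps the split reproducible

    train_size = int(0.8 * len(indices))
    train_idx, eval_idx = indices[:train_size], indices[train_size:]

    return (
        [self.inputs[i] for i in train_idx],
        [self.inputs[i] for i in eval_idx],
        [self.outputs[i] for i in train_idx],
        [self.outputs[i] for i in eval_idx],
    )
```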
3. __len__(self)
```python
def __len__(self):
    return len(self.inputs)
```
This should return the number of examples in the dataset. If you’ve correctly initialized `self.inputs`, this default implementation is fine.
4. __getitem__(self, idx)
```python
def __getitem__(self, idx):
    return None
```
This is the core of your dataset. It defines how each example is returned to the model during training.
- You should use your `tokenizer` here. You can define a custom tokenizer below and use it in your `YourDataProcessor`.
- The format of the returned item is completely up to you. Keep in mind that this format is what your model receives, so use it consistently when you code your model.
An example format could be a dictionary with fields like:
```python
return {
    "input_ids": torch.tensor(...),
    "labels": torch.tensor(...),
}
```
🛠 Example for classification:
```python
encoded = self.tokenizer(self.inputs[idx], padding="max_length", truncation=True, return_tensors="pt")
return {
    "input_ids": encoded["input_ids"].squeeze(0),
    "attention_mask": encoded["attention_mask"].squeeze(0),
    "labels": torch.tensor(self.outputs[idx]),
}
```
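For a sequence-to-sequence task such as the translation example above, the labels are usually the token ids of the target text rather than a class index. A sketch assuming `self.tokenizer` is a Hugging Face-style tokenizer, as in the classification example:

```python
def __getitem__(self, idx):
    # Tokenize the source sentence.
    encoded = self.tokenizer(
        self.inputs[idx], padding="max_length", truncation=True, return_tensors="pt"
    )
    # Tokenize the target sentence; its token ids become the labels.
    target = self.tokenizer(
        self.outputs[idx], padding="max_length", truncation=True, return_tensors="pt"
    )
    return {
        "input_ids": encoded["input_ids"].squeeze(0),
        "attention_mask": encoded["attention_mask"].squeeze(0),
        "labels": target["input_ids"].squeeze(0),
    }
```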
5. get_item_untokenized(self, idx)
```python
def get_item_untokenized(self, idx):
    return None
```
This returns the raw (untokenized) input/output pair, mostly used for:
- debugging
- visual inspection in the GUI or logs
- string conversion during evaluation
🛠 Example:
```python
return self.inputs[idx], self.outputs[idx]
```
If you’ve correctly initialized `self.inputs` and `self.outputs`, this default implementation is fine.
Final Notes
Once you finish implementing this file:
- BojAI will be able to load, split, and feed your dataset into your model.
- This powers the first stage of the pipeline: `prepare`.
- You can run this stage by building it, then starting the pipeline:
In CLI:
```bash
bojai start --pipeline name-from-build --directory where/the/editing/files/are --stage prepare
```
In UI:
```bash
bojai start --pipeline name-from-build --directory where/the/editing/files/are --stage prepare --ui
```
✅ Summary Table
| Method | Purpose |
|---|---|
| `get_inputs_outputs` | Load raw data and return inputs/outputs |
| `get_train_eval` | Split data into train/eval partitions |
| `__len__` | Return dataset size |
| `__getitem__` | Tokenize and return a single item (for training) |
| `get_item_untokenized` | Return the raw version (for inspection/logging) |