How to create your own custom pipeline

Welcome to the 2.0 release!

BojAI lets you define your own end-to-end machine learning pipelines tailored to your data and modeling needs. A custom pipeline gives you full control over how data is processed, how models are trained, and how results are generated, while still benefiting from BojAI’s modular structure and visualization features.

To define a custom pipeline, you will need to implement five Python modules, each responsible for a different aspect of the ML workflow. These files are:

| File Name | Description |
| --- | --- |
| custom_data_processor.py | Defines how your raw data is preprocessed, tokenized, and transformed before training. This includes tasks like cleaning, splitting, or encoding data. |
| custom_pipeline_user.py | Acts as the orchestrator of your pipeline. It loads configurations and connects your data processor, model, and trainer into a runnable pipeline. |
| custom_trainer.py | Contains the logic for training your model, tracking metrics, and saving checkpoints. You define how training epochs run and how evaluations are performed. |
| global_vars.py | Stores all constants and paths (e.g., model names, dataset paths, tokenizer type) in one place. This keeps your pipeline easy to update and configure. |
| model.py | Defines the architecture of the model you are training. This can be a HuggingFace model, a PyTorch nn.Module, or any other structure supported by BojAI. |
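
To make the division of responsibilities concrete, here is a minimal sketch that collapses the five roles into a single runnable script. It is an illustration only: every class, function, and constant name below (DataProcessor, LinearModel, Trainer, run_pipeline, TRAIN_FRACTION, and so on) is hypothetical and is not taken from BojAI's actual API, and the toy linear model stands in for whatever model.py would really define.

```python
# Hypothetical sketch: none of these names come from BojAI's API.
import random

# --- role of global_vars.py: constants kept in one place ---
TRAIN_FRACTION = 0.8
LEARNING_RATE = 0.1
NUM_EPOCHS = 500
SEED = 0

# --- role of custom_data_processor.py: clean and split raw data ---
class DataProcessor:
    def __init__(self, raw_rows):
        # Drop rows with missing values, then shuffle and split.
        rows = [(x, y) for x, y in raw_rows if x is not None and y is not None]
        random.Random(SEED).shuffle(rows)
        cut = int(len(rows) * TRAIN_FRACTION)
        self.train, self.eval = rows[:cut], rows[cut:]

# --- role of model.py: the model (here, a one-feature linear model) ---
class LinearModel:
    def __init__(self):
        self.w, self.b = 0.0, 0.0

    def predict(self, x):
        return self.w * x + self.b

# --- role of custom_trainer.py: training loop and evaluation ---
class Trainer:
    def __init__(self, model, data):
        self.model, self.data = model, data

    def train(self):
        for _ in range(NUM_EPOCHS):
            for x, y in self.data.train:
                err = self.model.predict(x) - y
                # Stochastic gradient step on the squared error.
                self.model.w -= LEARNING_RATE * err * x
                self.model.b -= LEARNING_RATE * err

    def evaluate(self):
        # Mean squared error on the held-out split.
        errs = [(self.model.predict(x) - y) ** 2 for x, y in self.data.eval]
        return sum(errs) / len(errs)

# --- role of custom_pipeline_user.py: wire the pieces together ---
def run_pipeline(raw_rows):
    data = DataProcessor(raw_rows)
    model = LinearModel()
    trainer = Trainer(model, data)
    trainer.train()
    return model, trainer.evaluate()

# Toy data following y = 2x + 1, plus one row with a missing value.
raw = [(x / 10, 2 * (x / 10) + 1) for x in range(10)] + [(None, 3.0)]
model, mse = run_pipeline(raw)
print(f"w={model.w:.2f} b={model.b:.2f} eval MSE={mse:.6f}")
```

In a real custom pipeline, each commented section would live in its own file from the table above, and the orchestrator would import the others rather than share one module.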

Follow the steps in this section to create and use your own custom pipeline.