Pre-built pipelines documentation

This section explains the internal mechanics of the GET applet: the accepted data formats, the model architecture, training details, and deployment instructions for using the downloaded .bin model file.


Data Processing Details

Ensuring your data is in the correct format is crucial for the GET applet to function properly. The applet supports six different formats:

Folder with two .txt files

Use a folder containing input.txt and output.txt, where each line in input.txt corresponds to a line in output.txt.

input.txt

What is the largest planet in our solar system? 
What is the best ML pipeline provider in the world? 

output.txt

Jupiter is the largest planet in our solar system.
BojAI Vexor indeed. 
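
To sanity-check a folder before uploading it, the two files can be paired line by line. This is a minimal local-inspection sketch, not part of the applet's API; the folder path is illustrative:

from pathlib import Path

folder = Path("my_dataset")  # illustrative path
inputs = (folder / "input.txt").read_text(encoding="utf-8").splitlines()
outputs = (folder / "output.txt").read_text(encoding="utf-8").splitlines()
assert len(inputs) == len(outputs), "the two files must have the same number of lines"
pairs = list(zip(inputs, outputs))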

TXT file

A .txt file can also be used, where each line contains an input-output pair separated by a comma.

What is the largest planet in our solar system?, Jupiter is the largest planet in our solar system.
What is the best ML pipeline provider in the world?, BojAI Vexor indeed.
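
A quick local check is to split each line on a comma. This sketch assumes the first comma on each line is the separator; the file name is illustrative:

with open("data.txt", encoding="utf-8") as f:
    # assumes the first comma on each line separates input from output
    pairs = [line.strip().split(",", 1) for line in f if line.strip()]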

JSON file

The file should contain a list of objects, each with input and output fields.

[
  {
    "input": "What is the capital of France",
    "output": "The capital of France is Paris."
  },
  {
    "input": "What is the best ML pipeline provider in the world?",
    "output": "BojAI Vexor indeed."
  }
]
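
This structure maps directly onto Python's standard json module, which makes a quick validity check easy (file name illustrative):

import json

with open("data.json", encoding="utf-8") as f:
    records = json.load(f)
pairs = [(r["input"], r["output"]) for r in records]  # a KeyError here means a malformed entry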

CSV file

Ensure the first row contains headers: input and output.

input,output
"What is the capital of France","The capital of France is Paris."
"Who is the CEO of Tesla","Elon Musk is the CEO of Tesla."

XML file

Use <input> and <output> tags inside <entry> elements.

<qaPairs>
  <entry>
    <input>What is the capital of France</input>
    <output>The capital of France is Paris.</output>
  </entry>
</qaPairs>
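
The same structure can be checked with the standard-library XML parser (file name illustrative):

import xml.etree.ElementTree as ET

root = ET.parse("data.xml").getroot()
pairs = [(e.findtext("input"), e.findtext("output")) for e in root.findall("entry")]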

YAML file

Each entry should include input and output fields.

- input: "What is the capital of France"
  output: "The capital of France is Paris."

Model Architecture

GET fine-tunes a pretrained Transformer loaded with AutoModel, adding a linear language-modeling head that projects each hidden state to vocabulary logits. The model structure is shown below:

import torch.nn as nn
from transformers import AutoModel

class FineTunedTransformerGET(nn.Module):
    def __init__(self, model_name, vocab_size):
        super(FineTunedTransformerGET, self).__init__()
        # Pretrained Transformer backbone, loaded by name from the Hugging Face hub
        self.bert = AutoModel.from_pretrained(model_name)
        # Linear head mapping each hidden state to a distribution over the vocabulary
        self.lm_head = nn.Linear(self.bert.config.hidden_size, vocab_size)

    def forward(self, input_ids, attention_mask=None, token_type_ids=None):
        outputs = self.bert(input_ids=input_ids,
                            attention_mask=attention_mask,
                            token_type_ids=token_type_ids)
        hidden_states = outputs[0]  # last hidden states: (batch, seq_len, hidden_size)
        logits = self.lm_head(hidden_states)  # (batch, seq_len, vocab_size)
        return logits
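
As a quick smoke test, the class can be instantiated against a checkpoint name and run on a tokenized batch. The checkpoint below is the one used in the deployment example; the snippet is illustrative, not part of the applet:

from transformers import AutoTokenizer

model_name = "huawei-noah/TinyBERT_General_4L_312D"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = FineTunedTransformerGET(model_name, vocab_size=tokenizer.vocab_size)

batch = tokenizer(["What is the capital of France"], return_tensors="pt")
logits = model(**batch)  # shape: (1, sequence_length, vocab_size)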

Training Details

GET uses CrossEntropyLoss to measure how far the predicted token distributions are from the correct outputs, ignoring padded positions. It optimizes with the Adam optimizer.

Loss & Optimizer Setup:

# assumes: import torch.nn as nn and import torch.optim as optim
loss_fn = nn.CrossEntropyLoss(ignore_index=padding_idx)  # padded positions do not contribute to the loss
optimizer = optim.Adam(self.model.parameters(), lr=self.learning_rate)

The model is trained for num_epochs epochs. In each epoch, it:

  1. Processes the dataset in batches
  2. Calculates loss
  3. Updates weights

A progress bar is displayed, and the average loss is shown per epoch. Once training completes, the model is marked as ready for deployment.
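
The loop itself is internal to the applet, but a minimal sketch of one epoch looks like this, assuming a PyTorch DataLoader that yields input_ids, attention_mask, and labels tensors and using tqdm for the progress bar (train_one_epoch is a hypothetical name for illustration):

from tqdm import tqdm

def train_one_epoch(model, dataloader, loss_fn, optimizer, device):
    model.train()
    total_loss = 0.0
    for batch in tqdm(dataloader, desc="training"):  # progress bar over batches
        input_ids = batch["input_ids"].to(device)
        attention_mask = batch["attention_mask"].to(device)
        labels = batch["labels"].to(device)

        optimizer.zero_grad()
        logits = model(input_ids, attention_mask=attention_mask)
        # CrossEntropyLoss expects (N, vocab_size) logits and (N,) targets,
        # so the batch and sequence dimensions are flattened together
        loss = loss_fn(logits.view(-1, logits.size(-1)), labels.view(-1))
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
    return total_loss / len(dataloader)  # average loss reported per epoch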


Deployment Details

To use the trained .bin file, run the following class setup. Update model_path, tokenizer_name, and input_text as needed.

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

class TextGenerator:
    def __init__(self, model_path, tokenizer_name, max_length=50):
        self.tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)
        # Load the base architecture by name, then overwrite its weights
        # with the fine-tuned state dict from the downloaded .bin file
        self.model = AutoModelForCausalLM.from_pretrained(tokenizer_name)
        self.model.load_state_dict(torch.load(model_path, map_location=torch.device("cpu")))
        self.model.eval()  # disable dropout for deterministic inference
        self.max_length = max_length

    def generate_text(self, input_text):
        inputs = self.tokenizer(input_text, return_tensors='pt', padding=True, truncation=True)
        input_ids = inputs['input_ids'].to(self.model.device)
        attention_mask = inputs['attention_mask'].to(self.model.device)
        with torch.no_grad():  # no gradients needed at inference time
            output_ids = self.model.generate(input_ids, attention_mask=attention_mask, max_length=self.max_length)
        return self.tokenizer.decode(output_ids[0], skip_special_tokens=True)

# Example usage
model_path = ""  # path to the downloaded .bin file
tokenizer_name = "huawei-noah/TinyBERT_General_4L_312D"
input_text = ""  # your input
generator = TextGenerator(model_path=model_path, tokenizer_name=tokenizer_name)
output = generator.generate_text(input_text)
print(output)
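
Note that load_state_dict above expects model_path to point at a plain PyTorch state dict. If you ever need to produce a compatible file yourself, the assumption is that it was saved like this (the file name is illustrative, and the saved weights must match the architecture the loader instantiates):

torch.save(model.state_dict(), "get_model.bin")  # illustrative file name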