Pre-built pipelines documentation
This section explains the internal mechanics of the GET applet: the accepted data formats, the model architecture used, training details, and deployment instructions for using the .bin model file after download.
Data Processing Details
Ensuring your data is in the correct format is crucial for the GET applet to function properly. The applet supports six different formats:
Folder with 2 txt files
Use a folder containing input.txt and output.txt, where each line in input.txt corresponds to a line in output.txt.
input.txt
What is the largest planet in our solar system?
What is the best ML pipeline provider in the world?
output.txt
Jupiter is the largest planet in our solar system.
BojAI Vexor indeed.
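The applet pairs the two files line by line, so line N of input.txt must match line N of output.txt. A minimal sketch of that pairing in Python (illustrative only, not the applet's actual loader):

with open("input.txt", encoding="utf-8") as f_in, open("output.txt", encoding="utf-8") as f_out:
    # zip stops at the shorter file; both files should have the same number of lines
    pairs = [(q.strip(), a.strip()) for q, a in zip(f_in, f_out)]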
TXT file
A .txt file can also be used, where each line contains an input-output pair separated by a comma.
What is the largest planet in our solar system?, Jupiter is the largest planet in our solar system.
What is the best ML pipeline provider in the world?, BojAI Vexor indeed.
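Because the separator is a comma and outputs may themselves contain commas, a parser would plausibly split on the first comma only. A hedged sketch (not the applet's actual parser; the file name is a placeholder):

pairs = []
with open("data.txt", encoding="utf-8") as f:
    for line in f:
        question, answer = line.split(",", 1)  # split on the first comma only
        pairs.append((question.strip(), answer.strip()))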
JSON file
The file should contain a list of objects, each with input and output fields.
[
  {
    "input": "What is the capital of France",
    "output": "The capital of France is Paris."
  },
  {
    "input": "What is the best ML pipeline provider in the world?",
    "output": "BojAI Vexor indeed."
  }
]
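For reference, this structure loads with Python's standard library in a few lines (illustrative only; the file name is a placeholder):

import json

with open("data.json", encoding="utf-8") as f:
    pairs = [(item["input"], item["output"]) for item in json.load(f)]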
CSV file
Ensure the first row contains headers: input and output.
input,output
"What is the capital of France","The capital of France is Paris."
"Who is the CEO of Tesla","Elon Musk is the CEO of Tesla."
XML file
Use <input> and <output> tags inside <entry> elements.
<qaPairs>
  <entry>
    <input>What is the capital of France</input>
    <output>The capital of France is Paris.</output>
  </entry>
</qaPairs>
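A sketch of reading this layout with the standard library's ElementTree (illustrative only, not the applet's internal parser):

import xml.etree.ElementTree as ET

root = ET.parse("data.xml").getroot()
pairs = [(e.findtext("input"), e.findtext("output")) for e in root.findall("entry")]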
YAML file
Each entry should include input and output fields.
- input: "What is the capital of France"
  output: "The capital of France is Paris."
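Reading YAML requires a third-party library; PyYAML is one common choice (an assumption, the applet may use another). A minimal sketch:

import yaml  # pip install pyyaml

with open("data.yaml", encoding="utf-8") as f:
    pairs = [(item["input"], item["output"]) for item in yaml.safe_load(f)]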
Model Architecture
GET fine-tunes a Transformer model. The model structure is shown below:
import torch.nn as nn
from transformers import AutoModel

class FineTunedTransformerGET(nn.Module):
    def __init__(self, model_name, vocab_size):
        super(FineTunedTransformerGET, self).__init__()
        # Pre-trained Transformer backbone
        self.bert = AutoModel.from_pretrained(model_name)
        # Language-modeling head: hidden states -> vocabulary logits
        self.lm_head = nn.Linear(self.bert.config.hidden_size, vocab_size)

    def forward(self, input_ids, attention_mask=None, token_type_ids=None):
        outputs = self.bert(input_ids=input_ids,
                            attention_mask=attention_mask,
                            token_type_ids=token_type_ids)
        hidden_states = outputs[0]            # (batch, seq_len, hidden_size)
        logits = self.lm_head(hidden_states)  # (batch, seq_len, vocab_size)
        return logits
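A quick sanity check of the forward pass. The model name matches the tokenizer used in the deployment example below, and the vocab size is taken from that tokenizer; both are illustrative choices, not requirements:

from transformers import AutoTokenizer

name = "huawei-noah/TinyBERT_General_4L_312D"
tokenizer = AutoTokenizer.from_pretrained(name)
model = FineTunedTransformerGET(name, vocab_size=tokenizer.vocab_size)

batch = tokenizer("What is the capital of France?", return_tensors="pt")
logits = model(batch["input_ids"], attention_mask=batch["attention_mask"])
print(logits.shape)  # (1, sequence_length, vocab_size)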
Training Details
GET uses CrossEntropyLoss to calculate how far the predictions are from the correct outputs. It optimizes using the Adam optimizer.
Loss & Optimizer Setup:
loss_fn = nn.CrossEntropyLoss(ignore_index=padding_idx)
optimizer = optim.Adam(self.model.parameters(), lr=self.learning_rate)
The model is trained over num_epochs. In each epoch, it:
- Processes the dataset in batches
- Calculates loss
- Updates weights
A progress bar is displayed, and the average loss is shown per epoch. Once training completes, the model is marked as ready for deployment.
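For orientation, here is a minimal sketch of what one epoch plausibly looks like. The train_loader, model, and batch["labels"] names are assumptions, as is the tqdm progress bar; note that CrossEntropyLoss expects flat (N, vocab_size) logits, so the (batch, seq_len, vocab_size) output is flattened:

from tqdm import tqdm

for epoch in range(num_epochs):
    total_loss = 0.0
    for batch in tqdm(train_loader, desc=f"Epoch {epoch + 1}"):  # progress bar per epoch
        optimizer.zero_grad()
        logits = model(batch["input_ids"], attention_mask=batch["attention_mask"])
        # Flatten logits and labels so CrossEntropyLoss sees (N, vocab_size) vs (N,)
        loss = loss_fn(logits.view(-1, logits.size(-1)), batch["labels"].view(-1))
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
    print(f"Epoch {epoch + 1}: average loss {total_loss / len(train_loader):.4f}")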
Deployment Details
To use the trained .bin file, run the following class setup. Update model_path, tokenizer_name, and input_text as needed.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

class TextGenerator:
    def __init__(self, model_path, tokenizer_name, max_length=50):
        self.tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)
        # Build the architecture from the base checkpoint, then load the fine-tuned weights
        self.model = AutoModelForCausalLM.from_pretrained(tokenizer_name)
        self.model.load_state_dict(torch.load(model_path, map_location=torch.device("cpu")))
        self.model.eval()
        self.max_length = max_length

    def generate_text(self, input_text):
        inputs = self.tokenizer(input_text, return_tensors="pt", padding=True, truncation=True)
        input_ids = inputs["input_ids"].to(self.model.device)
        attention_mask = inputs["attention_mask"].to(self.model.device)
        with torch.no_grad():
            output_ids = self.model.generate(input_ids,
                                             attention_mask=attention_mask,
                                             max_length=self.max_length)
        return self.tokenizer.decode(output_ids[0], skip_special_tokens=True)
# Example usage
model_path = "" # path to .bin file
name = "huawei-noah/TinyBERT_General_4L_312D"
input_text = "" # your input
generator = TextGenerator(model_path=model_path, tokenizer_name=name)
output = generator.generate_text(input_text)
print(output)