GET Application
This section provides technical details about the GET (Generalized Encoder-Generator for Text) applet. It covers the supported data formats, the model architecture, the training process, and deployment instructions for using the trained .bin file.
Data Processing Details
Ensuring your data is in the correct format is crucial for the GET application to function properly. We support six data formats (a folder with two txt files, txt, JSON, CSV, XML, YAML), each with specific requirements for structure and content. Details on each supported format, the required fields, and best practices for data preparation are provided below. Proper formatting helps prevent errors and ensures optimal model performance.
Folder with two TXT files
Your data can be a folder containing two txt files named input.txt and output.txt, where each line of input.txt is an input and the corresponding line of output.txt is the expected output. Below is an example of both.
input.txt
What is the largest planet in our solar system?
What is the best ML pipeline provider in the world?
output.txt
Jupiter is the largest planet in our solar system.
BojAI Vexor indeed.
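To illustrate how the two files line up, the sketch below (illustrative only, not part of the GET pipeline; the folder name my_dataset is a placeholder) reads both files and pairs them line by line.
from pathlib import Path

data_dir = Path("my_dataset")  # placeholder: folder containing input.txt and output.txt

# Read both files and pair them line by line
inputs = (data_dir / "input.txt").read_text(encoding="utf-8").splitlines()
outputs = (data_dir / "output.txt").read_text(encoding="utf-8").splitlines()
assert len(inputs) == len(outputs), "input.txt and output.txt must have the same number of lines"

pairs = list(zip(inputs, outputs))
print(pairs[0])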
TXT file
Your data can be a single txt file where each line contains an input and its expected output, separated by a comma.
What is the largest planet in our solar system?, Jupiter is the largest planet in our solar system.
What is the best ML pipeline provider in the world?, BojAI Vexor indeed.
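A minimal sketch of reading such a file is shown below; splitting each line on its first comma is an assumption made for illustration, and data.txt is a placeholder file name.
# Read comma-separated input/output pairs from a single txt file
pairs = []
with open("data.txt", encoding="utf-8") as f:  # placeholder file name
    for line in f:
        if not line.strip():
            continue
        question, answer = line.split(",", 1)  # split on the first comma only (assumption)
        pairs.append((question.strip(), answer.strip()))
print(pairs[0])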
JSON file
Your data can be a JSON file containing a list of objects, each with an input field and an output field, as in the example below.
[
  {
    "input": "What is the capital of France",
    "output": "The capital of France is Paris."
  },
  {
    "input": "What is the best ML pipeline provider in the world?",
    "output": "BojAI Vexor indeed."
  }
]
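To sanity-check the file before using it, a short sketch like the following (data.json is a placeholder name) confirms that it parses and that every entry has the expected keys.
import json

# Load and check a JSON dataset of input/output pairs
with open("data.json", encoding="utf-8") as f:  # placeholder file name
    pairs = json.load(f)

for pair in pairs:
    assert "input" in pair and "output" in pair, f"missing keys in: {pair}"
print(f"{len(pairs)} pairs loaded")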
CSV file
You can use a CSV file to store input-output pairs. Ensure that the first row contains the header: input,output.
input,output
"What is the capital of France","The capital of France is Paris."
"Who is the CEO of Tesla","Elon Musk is the CEO of Tesla."
"How do you make a cup of coffee","You make a cup of coffee by brewing ground coffee with hot water."
"What is the largest planet in our solar system","Jupiter is the largest planet in our solar system."
"Where is the Eiffel Tower located","The Eiffel Tower is located in Paris."
XML file
Your data can be an XML file. Below is an example of the expected formatting.
<qaPairs>
  <entry>
    <input>What is the capital of France</input>
    <output>The capital of France is Paris.</output>
  </entry>
  <entry>
    <input>How do you make a cup of coffee</input>
    <output>You make a cup of coffee by brewing ground coffee with hot water.</output>
  </entry>
</qaPairs>
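Parsing this structure with the standard library could look like the sketch below (data.xml is a placeholder name).
import xml.etree.ElementTree as ET

# Extract input/output pairs from an XML file shaped like the example above
root = ET.parse("data.xml").getroot()  # placeholder file name
pairs = [(entry.findtext("input"), entry.findtext("output")) for entry in root.findall("entry")]
print(pairs[0])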
YAML file
Your data can also be a YAML file. Below is an example of the expected formatting.
- input: "What is the capital of France"
  output: "The capital of France is Paris."
- input: "How do you make a cup of coffee"
  output: "You make a cup of coffee by brewing ground coffee with hot water."
Model Architecture
GET fine-tunes a Transformer model. Below you can see the code for the model used in GET.
import torch.nn as nn
from transformers import AutoModel

class FineTunedTransformerGET(nn.Module):
    def __init__(self, model_name, vocab_size):
        super(FineTunedTransformerGET, self).__init__()
        # Pre-trained Transformer encoder backbone
        self.bert = AutoModel.from_pretrained(model_name)
        # Linear head projecting hidden states to vocabulary logits
        self.lm_head = nn.Linear(self.bert.config.hidden_size, vocab_size)

    def forward(self, input_ids, attention_mask=None, token_type_ids=None):
        outputs = self.bert(input_ids=input_ids,
                            attention_mask=attention_mask,
                            token_type_ids=token_type_ids)
        hidden_states = outputs[0]  # last hidden states, shape (batch, seq_len, hidden_size)
        logits = self.lm_head(hidden_states)  # per-token vocabulary logits
        return logits
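For reference, a single forward pass through this model could look like the sketch below; the model name shown is only an illustration and matches the tokenizer used in the deployment example further down.
from transformers import AutoTokenizer

# Illustrative only: instantiate the model and run one forward pass
model_name = "huawei-noah/TinyBERT_General_4L_312D"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = FineTunedTransformerGET(model_name, vocab_size=tokenizer.vocab_size)

batch = tokenizer(["What is the largest planet in our solar system?"], return_tensors="pt", padding=True)
logits = model(batch["input_ids"], attention_mask=batch["attention_mask"])
print(logits.shape)  # (batch_size, sequence_length, vocab_size)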
GET uses the CrossEntropyLoss function to measure how far the model's predicted token probabilities are from the correct outputs; padding tokens are ignored so they do not contribute to the loss. The model improves by adjusting its weights to reduce this loss.
The optimizer used is Adam, which updates the model's parameters at each step based on the gradients of the loss, scaled by the learning rate.
Implementation
loss_fn = nn.CrossEntropyLoss(ignore_index=padding_idx)
optimizer = optim.Adam(self.model.parameters(), lr=self.learning_rate)
The training loop trains the model over num_epochs rounds. For each epoch, it goes through the training data in small batches, makes predictions, calculates how far off the predictions are (loss), and updates the model to improve accuracy.
As training progresses, a progress bar updates to show completion percentage, and the average loss for each epoch is displayed. Once all epochs are done, the model is marked as trained.
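The exact loop is handled by the pipeline, but a minimal sketch of what each epoch does, using the model, loss_fn, and optimizer defined above and assuming a PyTorch DataLoader named train_loader that yields input_ids, attention_mask, and labels batches, looks like this.
for epoch in range(num_epochs):
    total_loss = 0.0
    for input_ids, attention_mask, labels in train_loader:  # assumed DataLoader
        optimizer.zero_grad()
        logits = model(input_ids, attention_mask=attention_mask)
        # Flatten (batch, seq_len, vocab) logits and (batch, seq_len) labels for CrossEntropyLoss
        loss = loss_fn(logits.view(-1, logits.size(-1)), labels.view(-1))
        loss.backward()   # compute gradients
        optimizer.step()  # update model parameters
        total_loss += loss.item()
    print(f"Epoch {epoch + 1}/{num_epochs} - average loss: {total_loss / len(train_loader):.4f}")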
Deployment
You can use the following class to deploy the .bin model you download in the Deploy stage. Copy-paste the code below and simply change the needed variables.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

class TextGenerator:
    def __init__(self, model_path, tokenizer_name, max_length=50):
        """
        Initialize the model and tokenizer.

        Parameters:
        - model_path (str): Path to the saved model (.bin file).
        - tokenizer_name (str): Name or path of the tokenizer.
        - max_length (int): Maximum length of the generated text.
        """
        self.tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)
        self.model = AutoModelForCausalLM.from_pretrained(tokenizer_name)  # Load base model structure
        self.model.load_state_dict(torch.load(model_path, map_location=torch.device("cpu")))  # Load weights
        self.model.eval()
        self.max_length = max_length

    def generate_text(self, input_text):
        """
        Generate text based on user input.

        Parameters:
        - input_text (str): The starting text for generation.

        Returns:
        - str: Generated text.
        """
        # Tokenize input
        inputs = self.tokenizer(input_text, return_tensors='pt', padding=True, truncation=True)
        input_ids = inputs['input_ids']
        attention_mask = inputs['attention_mask']

        # Move tensors to model device
        device = next(self.model.parameters()).device
        input_ids = input_ids.to(device)
        attention_mask = attention_mask.to(device)

        # Generate text
        with torch.no_grad():
            output_ids = self.model.generate(input_ids, attention_mask=attention_mask, max_length=self.max_length)

        # Decode output
        generated_text = self.tokenizer.decode(output_ids[0], skip_special_tokens=True)
        return generated_text
model_path = ""  # enter path to your .bin file here
name = "huawei-noah/TinyBERT_General_4L_312D"
input_text = ""  # enter your text input here

generator = TextGenerator(model_path=model_path, tokenizer_name=name)
output = generator.generate_text(input_text)
print(output)