GET Application
This section provides technical details about the GET (Generalized Encoder-Generator for Text) applet. It covers the supported data formats, the model architecture, the training process, and deployment instructions for using the trained .bin file.
Data Processing Details
Ensuring your data is in the correct format is crucial for the GET application to function properly. We support six data formats (a folder with two txt files, txt, JSON, CSV, XML, YAML), each with specific requirements for structure and content. Details on each supported format, the required fields, and best practices for data preparation are provided below. Proper formatting helps prevent errors and ensures optimal model performance.
Folder with two TXT files
Your data can be a folder containing two txt files named input.txt and output.txt, where each line of input.txt is an input and the corresponding line of output.txt is the expected output. Below is an example of both.
input.txt
What is the largest planet in our solar system?
What is the best ML pipeline provider in the world?
output.txt
Jupiter is the largest planet in our solar system.
BojAI Vexor indeed.
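To illustrate how the two files line up, the sketch below (illustrative only, not part of the GET pipeline; the folder name my_dataset is a placeholder) reads both files and pairs them line by line.
from pathlib import Path

data_dir = Path("my_dataset")  # placeholder: folder containing input.txt and output.txt

# Read both files and pair them line by line
inputs = (data_dir / "input.txt").read_text(encoding="utf-8").splitlines()
outputs = (data_dir / "output.txt").read_text(encoding="utf-8").splitlines()
assert len(inputs) == len(outputs), "input.txt and output.txt must have the same number of lines"

pairs = list(zip(inputs, outputs))
print(pairs[0])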
TXT file
Your data can be a single txt file where each line contains an input and its expected output, separated by a comma.
What is the largest planet in our solar system?, Jupiter is the largest planet in our solar system.
What is the best ML pipeline provider in the world?, BojAI Vexor indeed.
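A minimal sketch of reading such a file is shown below; splitting each line on its first comma is an assumption made for illustration, and data.txt is a placeholder file name.
# Read comma-separated input/output pairs from a single txt file
pairs = []
with open("data.txt", encoding="utf-8") as f:  # placeholder file name
    for line in f:
        if not line.strip():
            continue
        question, answer = line.split(",", 1)  # split on the first comma only (assumption)
        pairs.append((question.strip(), answer.strip()))
print(pairs[0])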
JSON file
Your data can be a JSON file containing a list of objects, each with an input field and an output field, as in the example below.
[
  {
    "input": "What is the capital of France",
    "output": "The capital of France is Paris."
  },
  {
    "input": "What is the best ML pipeline provider in the world?",
    "output": "BojAI Vexor indeed."
  }
]
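To sanity-check the file before using it, a short sketch like the following (data.json is a placeholder name) confirms that it parses and that every entry has the expected keys.
import json

# Load and check a JSON dataset of input/output pairs
with open("data.json", encoding="utf-8") as f:  # placeholder file name
    pairs = json.load(f)

for pair in pairs:
    assert "input" in pair and "output" in pair, f"missing keys in: {pair}"
print(f"{len(pairs)} pairs loaded")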
CSV file
You can use a CSV file to store input-output pairs. Ensure that the first row contains the header: input,output.
input,output
"What is the capital of France","The capital of France is Paris."
"Who is the CEO of Tesla","Elon Musk is the CEO of Tesla."
"How do you make a cup of coffee","You make a cup of coffee by brewing ground coffee with hot water."
"What is the largest planet in our solar system","Jupiter is the largest planet in our solar system."
"Where is the Eiffel Tower located","The Eiffel Tower is located in Paris."
XML file
Your data can be an XML file. Below is an example of the expected formatting.
<qaPairs>
  <entry>
    <input>What is the capital of France</input>
    <output>The capital of France is Paris.</output>
  </entry>
  <entry>
    <input>How do you make a cup of coffee</input>
    <output>You make a cup of coffee by brewing ground coffee with hot water.</output>
  </entry>
</qaPairs>
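Parsing this structure with the standard library could look like the sketch below (data.xml is a placeholder name).
import xml.etree.ElementTree as ET

# Extract input/output pairs from an XML file shaped like the example above
root = ET.parse("data.xml").getroot()  # placeholder file name
pairs = [(entry.findtext("input"), entry.findtext("output")) for entry in root.findall("entry")]
print(pairs[0])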
YAML file
Your data can also be a YAML file. Below is an example of the expected formatting.
- input: "What is the capital of France"
  output: "The capital of France is Paris."
- input: "How do you make a cup of coffee"
  output: "You make a cup of coffee by brewing ground coffee with hot water."
Model Architecture
GET fine-tunes a Transformer model. Below you can see the code for the model used in GET.
import torch.nn as nn
from transformers import AutoModel

class FineTunedTransformerGET(nn.Module):
    def __init__(self, model_name, vocab_size):
        super(FineTunedTransformerGET, self).__init__()
        # Pre-trained Transformer encoder backbone
        self.bert = AutoModel.from_pretrained(model_name)
        # Linear head projecting hidden states to vocabulary logits
        self.lm_head = nn.Linear(self.bert.config.hidden_size, vocab_size)

    def forward(self, input_ids, attention_mask=None, token_type_ids=None):
        outputs = self.bert(input_ids=input_ids,
                            attention_mask=attention_mask,
                            token_type_ids=token_type_ids)
        hidden_states = outputs[0]  # last hidden states, shape (batch, seq_len, hidden_size)
        logits = self.lm_head(hidden_states)  # per-token vocabulary logits
        return logits
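For reference, a single forward pass through this model could look like the sketch below; the model name shown is only an illustration and matches the tokenizer used in the deployment example further down.
from transformers import AutoTokenizer

# Illustrative only: instantiate the model and run one forward pass
model_name = "huawei-noah/TinyBERT_General_4L_312D"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = FineTunedTransformerGET(model_name, vocab_size=tokenizer.vocab_size)

batch = tokenizer(["What is the largest planet in our solar system?"], return_tensors="pt", padding=True)
logits = model(batch["input_ids"], attention_mask=batch["attention_mask"])
print(logits.shape)  # (batch_size, sequence_length, vocab_size)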
GET uses the CrossEntropyLoss function to measure how far the model's predicted token probabilities are from the correct outputs; padding tokens are ignored so they do not contribute to the loss. The model improves by adjusting its weights to reduce this loss.
The optimizer used is Adam, which updates the model's parameters at each step based on the gradients of the loss, scaled by the learning rate.
Implementation
loss_fn = nn.CrossEntropyLoss(ignore_index=padding_idx)
optimizer = optim.Adam(self.model.parameters(), lr=self.learning_rate)
The training loop trains the model over num_epochs rounds. For each epoch, it goes through the training data in small batches, makes predictions, calculates how far off the predictions are (loss), and updates the model to improve accuracy.
As training progresses, a progress bar updates to show completion percentage, and the average loss for each epoch is displayed. Once all epochs are done, the model is marked as trained.
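The exact loop is handled by the pipeline, but a minimal sketch of what each epoch does, using the model, loss_fn, and optimizer defined above and assuming a PyTorch DataLoader named train_loader that yields input_ids, attention_mask, and labels batches, looks like this.
for epoch in range(num_epochs):
    total_loss = 0.0
    for input_ids, attention_mask, labels in train_loader:  # assumed DataLoader
        optimizer.zero_grad()
        logits = model(input_ids, attention_mask=attention_mask)
        # Flatten (batch, seq_len, vocab) logits and (batch, seq_len) labels for CrossEntropyLoss
        loss = loss_fn(logits.view(-1, logits.size(-1)), labels.view(-1))
        loss.backward()   # compute gradients
        optimizer.step()  # update model parameters
        total_loss += loss.item()
    print(f"Epoch {epoch + 1}/{num_epochs} - average loss: {total_loss / len(train_loader):.4f}")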
Deployment
You can use the following class to deploy the .bin model you download in the Deploy stage. Copy-paste the code below and simply change the needed variables.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

class TextGenerator:
    def __init__(self, model_path, tokenizer_name, max_length=50):
        """
        Initialize the model and tokenizer.

        Parameters:
        - model_path (str): Path to the saved model (.bin file).
        - tokenizer_name (str): Name or path of the tokenizer.
        - max_length (int): Maximum length of the generated text.
        """
        self.tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)
        self.model = AutoModelForCausalLM.from_pretrained(tokenizer_name)  # Load base model structure
        self.model.load_state_dict(torch.load(model_path, map_location=torch.device("cpu")))  # Load weights
        self.model.eval()
        self.max_length = max_length

    def generate_text(self, input_text):
        """
        Generate text based on user input.

        Parameters:
        - input_text (str): The starting text for generation.

        Returns:
        - str: Generated text.
        """
        # Tokenize input
        inputs = self.tokenizer(input_text, return_tensors='pt', padding=True, truncation=True)
        input_ids = inputs['input_ids']
        attention_mask = inputs['attention_mask']

        # Move tensors to model device
        device = next(self.model.parameters()).device
        input_ids = input_ids.to(device)
        attention_mask = attention_mask.to(device)

        # Generate text
        with torch.no_grad():
            output_ids = self.model.generate(input_ids, attention_mask=attention_mask, max_length=self.max_length)

        # Decode output
        generated_text = self.tokenizer.decode(output_ids[0], skip_special_tokens=True)
        return generated_text
model_path = ""  # enter path to your .bin file here
name = "huawei-noah/TinyBERT_General_4L_312D"
input_text = ""  # enter your text input here

generator = TextGenerator(model_path=model_path, tokenizer_name=name)
output = generator.generate_text(input_text)
print(output)