Fine-tuning a pre-trained model is a powerful technique that lets you adapt a model to a specific task with minimal additional training. Whether you’re working on sentiment analysis, question answering, or text generation, fine-tuning can significantly improve performance on your dataset.
In this tutorial, you will learn how to:
- Load and prepare a dataset for fine-tuning.
- Select a suitable pre-trained model.
- Set up training arguments and fine-tune the model.
- Evaluate the model’s performance and save the fine-tuned model for future use.
Specifically, we will fine-tune a pre-trained language model to answer questions about Taylor Swift, using the Hugging Face Transformers library.
Prerequisites
- Python 3.8 or higher (recent versions of transformers no longer support Python 3.6)
- Install necessary libraries:
pip install "transformers[torch]" datasets
The [torch] extra already pulls in PyTorch, so there is no need to install torch separately.
Step-by-Step Tutorial
Step 1: Import Libraries
Start by importing the necessary libraries.
from transformers import AutoTokenizer, T5ForConditionalGeneration, Trainer, TrainingArguments
from datasets import load_dataset
Step 2: Load a Pre-trained Model and Tokenizer
Choose a pre-trained model from Hugging Face’s Model Hub. For this tutorial we’ll use flan-t5-small, a good choice when you want to balance generation quality with efficiency on local hardware.
# Define the model name
model_name = "google/flan-t5-small"
# Load the tokenizer and model
pretrained_tokenizer = AutoTokenizer.from_pretrained(model_name)
pretrained_model = T5ForConditionalGeneration.from_pretrained(model_name)
# Note: T5 tokenizers already define a dedicated pad token ('<pad>'), so there is
# no need to reassign eos_token as the pad_token here. That workaround is only
# required for GPT-style models that ship without a pad token, and applying it to
# T5 would cause the eos token to be masked out of the labels later on.
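You can quickly confirm the special tokens before moving on (a small sanity check; the printed values are what T5 tokenizers ship with):
# Inspect the tokenizer's special tokens
print(pretrained_tokenizer.pad_token)  # '<pad>'
print(pretrained_tokenizer.eos_token)  # '</s>'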
Step 3: Prepare the Dataset
Given the goal of this tutorial, let’s use the lamini/taylor_swift dataset from the Hugging Face Hub, which is structured as question-answer pairs about Taylor Swift. Feel free to explore other datasets available on the Hub.
Load the dataset:
# Load the Taylor Swift dataset
dataset = load_dataset("lamini/taylor_swift")
# Split into training and test sets (a fixed seed keeps the split reproducible)
train_test_split = dataset['train'].train_test_split(test_size=0.2, seed=42)
train_dataset = train_test_split['train']
test_dataset = train_test_split['test']
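Before going further, it is worth printing one record to confirm the fields the rest of the tutorial relies on:
# Inspect a single example to confirm the expected 'question' and 'answer' fields
print(train_dataset[0])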
Step 4: Test the Pre-trained Model
Test the pre-trained model before fine-tuning. As you will see, it does not produce accurate or coherent answers to questions about Taylor Swift.
# Generate a response from a model for a given input text
def generate_response(input_text, model, tokenizer, max_length=512):
    # Tokenize the input text
    inputs = tokenizer(input_text, return_tensors="pt", truncation=True, max_length=max_length)

    # Generate the response with adjusted parameters
    output_sequences = model.generate(
        input_ids=inputs['input_ids'],
        attention_mask=inputs['attention_mask'],
        max_length=max_length,
        repetition_penalty=1.2,   # Apply repetition penalty
        no_repeat_ngram_size=2,   # Prevent repetition of 2-grams
        do_sample=True,           # Enable sampling for non-deterministic output
        temperature=0.7,          # Control randomness (lower is less random)
        top_k=50,                 # Sample only from the 50 most likely tokens
        top_p=0.95                # Sample from tokens with cumulative probability 0.95
    )

    # Decode and return the response
    return tokenizer.decode(output_sequences[0], skip_special_tokens=True)
# Test the pre-trained model
print("################################################\n")
print("Testing pre-trained model:\n")
print("################################################\n")

for x in range(3):
    sample_question = train_dataset[x]['question']
    sample_answer = train_dataset[x]['answer']
    print(f"\n\nQuestion: {sample_question}\n")
    print(f"Generated Answer: {generate_response(sample_question, pretrained_model, pretrained_tokenizer)}\n")
    print(f"Expected Answer: {sample_answer}\n\n")
As expected, the responses are poor:
Question: How many Grammy Awards has Taylor Swift won?
Generated Answer: one
Expected Answer: Taylor Swift had won a total of 12 Grammy Awards.
Step 5: Tokenize the Dataset
Prepare the dataset for training by tokenizing the input data.
# Tokenize data
def preprocess_function(examples):
    # Tokenize the input (question)
    model_inputs = pretrained_tokenizer(examples["question"], max_length=512, truncation=True, padding='max_length')

    # Tokenize the target (answer) using text_target
    labels = pretrained_tokenizer(text_target=examples["answer"], max_length=512, truncation=True, padding='max_length').input_ids

    # Important: Replace the padding token id with -100 so padding is ignored in the loss
    labels = [[(label if label != pretrained_tokenizer.pad_token_id else -100) for label in label_example] for label_example in labels]
    model_inputs["labels"] = labels
    return model_inputs
tokenized_train_dataset = train_dataset.map(preprocess_function, batched=True)
tokenized_test_dataset = test_dataset.map(preprocess_function, batched=True)
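A quick round-trip decode is a handy sanity check that preprocessing produced what you expect:
# Decode one tokenized example back to text to verify the preprocessing
example = tokenized_train_dataset[0]
print(pretrained_tokenizer.decode(example["input_ids"], skip_special_tokens=True))
# Labels contain -100 at padded positions, so filter those out before decoding
label_ids = [i for i in example["labels"] if i != -100]
print(pretrained_tokenizer.decode(label_ids, skip_special_tokens=True))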
Step 6: Set Up Training Arguments
Define the training parameters. The values below are sized for a local run; adjust them to your needs and resources.
# Define the training parameters
training_args = TrainingArguments(
    output_dir='./results',          # Directory where model checkpoints and logs are saved
    eval_strategy="epoch",           # Evaluate at the end of each epoch
    learning_rate=5e-5,              # Step size for weight updates; controls how much the model changes per update
    per_device_train_batch_size=4,   # Training batch size per device; larger batches may help but need more memory
    per_device_eval_batch_size=4,    # Evaluation batch size per device; keeping it equal to training is good practice
    num_train_epochs=3,              # Complete passes through the training data; more can help but risks overfitting
    weight_decay=0.01,               # Regularization that penalizes large weights to reduce overfitting
    logging_dir='./logs'             # Directory where logs are stored for monitoring training
)
Step 7: Initialize the Trainer
Set up the Trainer class with the model, training arguments, and datasets.
# Set up the Trainer
trainer = Trainer(
    model=pretrained_model,
    args=training_args,
    train_dataset=tokenized_train_dataset,
    eval_dataset=tokenized_test_dataset
)
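The generic Trainer works here because every example was padded to a fixed length in preprocessing. As a side note, Transformers also provides a seq2seq-specific stack that pads dynamically per batch; the sketch below shows how it would slot into this setup, assuming preprocess_function only truncates (drop padding='max_length' and the manual -100 masking, since the collator handles both):
from transformers import Seq2SeqTrainer, Seq2SeqTrainingArguments, DataCollatorForSeq2Seq

# The collator pads inputs and labels per batch and masks label padding with -100
data_collator = DataCollatorForSeq2Seq(pretrained_tokenizer, model=pretrained_model)

seq2seq_trainer = Seq2SeqTrainer(
    model=pretrained_model,
    args=Seq2SeqTrainingArguments(output_dir='./results', eval_strategy="epoch"),
    train_dataset=tokenized_train_dataset,
    eval_dataset=tokenized_test_dataset,
    data_collator=data_collator
)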
Step 8: Fine-Tune the Model
Begin the fine-tuning process.
# Fine-tune the model with the training data.
trainer.train()
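Training on a CPU can take over an hour. The Trainer writes checkpoints to output_dir as it goes, so if a run is interrupted you can resume from the latest one:
# Resume from the most recent checkpoint in output_dir instead of starting over
trainer.train(resume_from_checkpoint=True)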
Step 9: Evaluate the Model
Assess the model’s performance on the test dataset. The fine-tuning results are discussed in the last section of this tutorial.
# Assess the model's performance
results = trainer.evaluate()
print(f"Test Results: {results}")
Step 10: Save the Fine-Tuned Model
Save the model and tokenizer for future use.
# Save the fine-tuned model and tokenizer
pretrained_model.save_pretrained('./fine-tuned-taylor-swift')
pretrained_tokenizer.save_pretrained('./fine-tuned-taylor-swift')
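The saved folder can be loaded anywhere the Transformers library is available, including through the high-level pipeline API for quick inference:
from transformers import pipeline

# Load the saved model into a text2text-generation pipeline
qa_pipeline = pipeline("text2text-generation", model='./fine-tuned-taylor-swift')
print(qa_pipeline("How many Grammy Awards has Taylor Swift won?")[0]['generated_text'])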
Step 11: Test the Fine-Tuned Model
Test the fine-tuned model. You probably still won’t get perfect answers to questions about Taylor Swift, but you should notice a clear improvement over the initial model.
# Load the fine-tuned model and tokenizer
ft_tokenizer = AutoTokenizer.from_pretrained('./fine-tuned-taylor-swift')
ft_model = T5ForConditionalGeneration.from_pretrained('./fine-tuned-taylor-swift')

# Evaluate the model's performance after fine-tuning to see improvements
print("################################################\n")
print("Testing fine-tuned model:\n")
print("################################################\n")

for x in range(3):
    sample_question = train_dataset[x]['question']
    sample_answer = train_dataset[x]['answer']
    print(f"\n\nQuestion: {sample_question}\n")
    print(f"Generated Answer: {generate_response(sample_question, ft_model, ft_tokenizer)}\n")
    print(f"Expected Answer: {sample_answer}\n\n")
The responses now show some improvement:
Question: How many Grammy Awards has Taylor Swift won?
Generated Answer: Taylor Swift has won 12 Grammy Awards. Her career has been ranked number one on the Billboard Hot 100, with 5 nominations. She has also won 36 Grammy Award nomination albums. The album's lead single, "Back to December," was released in 2009 and was certified platinum by the American Music Association. Overall, Taylor swift has gained significant fame through her career and performances.
Expected Answer: Taylor Swift had won a total of 12 Grammy Awards.
Discussing the Fine-Tuning Results
As you confirmed, fine-tuning improved the model’s performance somewhat, but it clearly still needs additional work. Let’s look at the fine-tuning results obtained:
{'eval_loss': 2.5875954627990723, 'eval_runtime': 91.9853, 'eval_samples_per_second': 1.707, 'eval_steps_per_second': 0.435, 'epoch': 1.0}
{'eval_loss': 2.547515392303467, 'eval_runtime': 91.1446, 'eval_samples_per_second': 1.723, 'eval_steps_per_second': 0.439, 'epoch': 2.0}
{'eval_loss': 2.53344988822937, 'eval_runtime': 81.2187, 'eval_samples_per_second': 1.933, 'eval_steps_per_second': 0.492, 'epoch': 3.0}
{'train_runtime': 4131.6995, 'train_samples_per_second': 0.455, 'train_steps_per_second': 0.114, 'train_loss': 2.7784913664410826, 'epoch': 3.0}
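These dictionaries are emitted by the Trainer’s logging; you can also recover them programmatically after training, for example to plot the loss curve:
# The Trainer keeps every logged metric in log_history
eval_losses = [log["eval_loss"] for log in trainer.state.log_history if "eval_loss" in log]
print(eval_losses)  # e.g. [2.5876, 2.5475, 2.5334]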
Evaluating fine-tuning results involves examining the evaluation loss and training loss values to assess how well the model is learning. Here’s an analysis based on the provided results:
- Evaluation Loss: The evaluation loss (eval_loss) decreases slightly across epochs, which suggests the model is learning and improving on the validation set. A continued decrease generally indicates that the model is adapting to the task.
- Training Loss: The training loss (train_loss) is higher than the final evaluation loss. Since train_loss is averaged over the entire run, including the early epochs, some gap is expected; it may also hint that the model is still slightly underfitting and would benefit from more training.
Based on this analysis, while the results show some improvements, more iterations and adjustments may be needed to optimize the model’s performance fully.
If you decide to fine-tune this model further, you should consider the following:
- Perform Further Training: Given the decrease in evaluation loss, consider running additional epochs to determine whether the loss continues to decline or plateaus (see the sketch after this list).
- Hyperparameter Tuning: Experiment with different learning rates, batch sizes, or optimizer settings to potentially improve both training and evaluation loss.
- Validation Set: Ensure the validation set is representative of the task and verify that the loss calculations are accurately reflecting model performance.
- Hardware Optimization: If the throughput is a concern, explore optimizing data loading pipelines, adjusting batch sizes, or utilizing more powerful hardware if available.
- Model Complexity: Evaluate whether a more complex model might achieve better performance, depending on the task complexity and available data. For more accurate responses, it may be beneficial to use a more advanced model than flan-t5-small.
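For the first suggestion, the sketch below shows one way to let the Trainer run longer while stopping automatically once eval_loss stops improving; the epoch count and patience values are illustrative, not tuned:
from transformers import EarlyStoppingCallback, Trainer, TrainingArguments

# Train for up to 10 epochs, stop early if eval_loss fails to improve for
# 2 consecutive evaluations, and reload the best checkpoint at the end
training_args = TrainingArguments(
    output_dir='./results',
    eval_strategy="epoch",
    save_strategy="epoch",             # must match eval_strategy when loading the best model
    num_train_epochs=10,               # upper bound; early stopping may end the run sooner
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
    greater_is_better=False
)

trainer = Trainer(
    model=pretrained_model,
    args=training_args,
    train_dataset=tokenized_train_dataset,
    eval_dataset=tokenized_test_dataset,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=2)]
)
trainer.train()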
Find the full code described in this tutorial here.