Fine-Tune an Open-Source LLM on a Single GPU in 2025

Fine-tune Llama 3.2 or Mistral using QLoRA on a 24GB RTX 4090 with minimal code. Covers data prep, training script, and pro tips for production-ready behavior and tone changes.

June 2026 · 12 min read

Everyone says fine-tuning is the secret sauce for making open-source LLMs actually useful. And they're right — but the process is still surrounded by a weird mystique. People think you need a cluster of A100s and a PhD to do it. In 2025, you don't.

You can fine-tune a capable model like Llama 3.2 or Mistral on your own data, on a single consumer GPU (yes, even a 24GB RTX 4090), and have something production-ready in an afternoon. Here's exactly how.

Why Bother Fine-Tuning at All?

Prompt engineering and RAG (Retrieval-Augmented Generation) get you 80% of the way there. Fine-tuning gets you the rest. It's not about teaching the model new facts — that's what retrieval is for. It's about changing behavior and tone.

Your use cases:

  • Format control: You want JSON output, bullet-point summaries, or very specific report structures — not paragraphs of waffle.
  • Domain style: You need the model to sound like a technical writer, a customer support agent, or a legal analyst, not a generic chatbot.
  • Removing guardrails: You want a model that stops refusing valid instructions (e.g. "summarize this internal document").
  • Teaching new tasks: Classification, entity extraction, or rewriting in a house style.

Fine-tuning is the "unlock" for proprietary workflows.

The Method That Actually Works: QLoRA

Full fine-tuning (updating every parameter) is costly and often wasteful. LoRA (Low-Rank Adaptation) freezes the base model and adds small trainable low-rank matrices next to selected weight matrices, usually the attention projections. It's like sticking adjustable tuning knobs on a pre-built engine instead of rebuilding the engine.
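To make the "tuning knobs" idea concrete, here is a minimal sketch of a LoRA-adapted linear layer in plain Python. The helper names (matvec, lora_forward), the 3x3 sizes, and the toy values are illustrative assumptions, not peft's API. The alpha / r scaling follows the usual LoRA formulation.

```python
# Toy LoRA layer: y = W x + (alpha / r) * B (A x)
# W stays frozen; only A (r x d_in) and B (d_out x r) would be trained.

def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in M]

def lora_forward(W, A, B, x, alpha, r):
    base = matvec(W, x)               # frozen path
    update = matvec(B, matvec(A, x))  # low-rank path: d_in -> r -> d_out
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]

# Hypothetical 3x3 frozen weight with a rank-1 adapter
W = [[1.0, 0.0, 0.0],
     [0.0, 2.0, 0.0],
     [0.0, 0.0, 3.0]]
A = [[0.5, 0.5, 0.5]]                # r x d_in = 1 x 3
B_zero = [[0.0], [0.0], [0.0]]       # d_out x r, zero at the start of training
B_trained = [[1.0], [0.0], [-1.0]]   # made-up values after some training

x = [1.0, 1.0, 1.0]
y_start = lora_forward(W, A, B_zero, x, alpha=2, r=1)
y_tuned = lora_forward(W, A, B_trained, x, alpha=2, r=1)
```

Because B starts at zero, the adapted layer initially reproduces the frozen layer exactly. Training then moves only A and B, which is why the adapter files you save later are so small.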

QLoRA adds 4-bit quantization on top. This drops the memory footprint drastically.

  • A 7B parameter model in full 16-bit floats: ~14GB VRAM just to load.
  • QLoRA version (4-bit): ~4GB VRAM.

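That means a 24GB card such as an RTX 3090 can hold a 13B model's 4-bit weights (about 6.5GB) with room left for activations, adapter gradients, and optimizer state. 7B models fit on 12GB cards if you keep batch size and sequence length modest.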
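The numbers above come from simple arithmetic on parameter count and bits per parameter. A rough sketch follows. The function name is made up for illustration, and the estimate covers weights only: it ignores quantization constants, activations, LoRA gradients, and optimizer state, so real usage will be higher.

```python
def weight_memory_gb(n_params, bits_per_param):
    """Approximate memory for model weights alone, in GB (10^9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9

fp16_7b = weight_memory_gb(7e9, 16)   # 7B model in 16-bit floats
nf4_7b = weight_memory_gb(7e9, 4)     # same model, 4-bit weights
nf4_13b = weight_memory_gb(13e9, 4)   # 13B model, 4-bit weights

print(f"7B fp16: {fp16_7b:.1f} GB, 7B 4-bit: {nf4_7b:.1f} GB, 13B 4-bit: {nf4_13b:.1f} GB")
```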

Tools you'll use:

  • transformers and peft from Hugging Face
  • bitsandbytes for quantization
  • trl (Transformer Reinforcement Learning) for the trainer
  • datasets for loading your data

Step 1: Prepare Your Data (The Hard Part)

Fine-tuning fails most often because of bad data. Not bad code.

You need a dataset of instruction-completion pairs, stored one JSON object per line (JSONL). Each example should look like this (pretty-printed here for readability):

{
  "instruction": "Summarize the quarterly sales report for Q3.",
  "input": "Here is the raw report text...",
  "output": "Q3 sales grew 12% YoY, driven by... [concise summary]"
}

Or a simpler chat-style format for chat models:

{
  "messages": [
    {"role": "user", "content": "What is the refund policy?"},
    {"role": "assistant", "content": "Our refund policy allows returns within 30 days..."}
  ]
}
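If your data is in the first format and you want the second, the conversion is mechanical. This is a sketch only: to_messages is a hypothetical helper, and joining the instruction and input with a blank line is an assumption you may want to change.

```python
def to_messages(record):
    """Convert an instruction/input/output record to chat-style messages."""
    user_content = record["instruction"]
    if record.get("input"):
        # Assumption: separate the instruction from its input with a blank line
        user_content = f"{user_content}\n\n{record['input']}"
    return {
        "messages": [
            {"role": "user", "content": user_content},
            {"role": "assistant", "content": record["output"]},
        ]
    }

example = to_messages({
    "instruction": "Summarize the quarterly sales report for Q3.",
    "input": "Here is the raw report text...",
    "output": "Q3 sales grew 12% YoY.",
})
```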

Key rules for your data:

  • Quality over quantity: 500 excellent examples beat 10,000 noisy ones.
  • Consistency: If you want bullet-point answers, every example should end with bullet points.
  • Coverage: Include edge cases, meaning questions the model usually gets wrong in your tests.
  • No duplicates: Deduplicate aggressively. Repetition causes the model to memorize, not generalize.
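The deduplication and basic validation rules can be partly automated. The sketch below assumes the instruction/input/output format shown earlier. The helper names are hypothetical, and it only catches duplicates that differ in case or whitespace, not paraphrases.

```python
import json

REQUIRED_KEYS = ("instruction", "output")

def normalize(text):
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return " ".join(text.lower().split())

def clean_examples(lines):
    """Parse JSONL lines, dropping malformed, incomplete, and duplicate rows."""
    seen = set()
    kept = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            row = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip malformed JSON
        if not isinstance(row, dict):
            continue
        # Require non-empty string values for the mandatory fields
        if not all(isinstance(row.get(k), str) and row[k].strip() for k in REQUIRED_KEYS):
            continue
        key = (normalize(row["instruction"]), normalize(row.get("input") or ""))
        if key in seen:
            continue  # same prompt already kept
        seen.add(key)
        kept.append(row)
    return kept

raw_lines = [
    '{"instruction": "Summarize the report.", "input": "Q3 text", "output": "Sales grew."}',
    '{"instruction": "summarize  the report.", "input": "Q3 text", "output": "Sales grew."}',
    '{"instruction": "What is the refund policy?", "output": ""}',
    'not json',
]
cleaned = clean_examples(raw_lines)
print(f"kept {len(cleaned)} of {len(raw_lines)} rows")
```

In practice you would pass the lines of training_data.jsonl instead of raw_lines and write the cleaned rows back out before training.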

Pro tip: Use GPT-4 or Claude to generate a first draft of your training data. Then manually edit 20-50 to get the format right. It's faster than writing from scratch.

Step 2: The Code — Minimal, Complete

This is the actual script to fine-tune a 7B model. It runs on a single 16GB GPU.

import torch
from transformers import (
    AutoModelForCausalLM, 
    AutoTokenizer, 
    BitsAndBytesConfig,
    TrainingArguments
)
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from trl import SFTTrainer
from datasets import load_dataset

# 1. Load quantized model
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

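# Llama 3.2 text models come in 1B and 3B sizes (there is no 7B variant).
# For a 7B model, use "mistralai/Mistral-7B-Instruct-v0.3" instead.
model_name = "meta-llama/Llama-3.2-3B-Instruct"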
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

# 2. Prepare for LoRA
model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(
    r=8,  # rank - higher = more expressiveness, more memory
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],  # attention layers
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)
model = get_peft_model(model, lora_config)

# 3. Load your data (JSONL format)
dataset = load_dataset("json", data_files="training_data.jsonl", split="train")

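# Format each example with the model's chat template
def format_example(example):
    # Append the optional "input" field so the source text is not dropped
    user_content = example["instruction"]
    if example.get("input"):
        user_content += "\n\n" + example["input"]
    messages = [
        {"role": "user", "content": user_content},
        {"role": "assistant", "content": example["output"]}
    ]
    text = tokenizer.apply_chat_template(messages, tokenize=False)
    return {"text": text}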

dataset = dataset.map(format_example)

# 4. Train
training_args = TrainingArguments(
    output_dir="./llama-finetuned",
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    num_train_epochs=3,
    learning_rate=2e-4,
    bf16=True,  # matches bnb_4bit_compute_dtype; use fp16=True on GPUs without bfloat16
    save_steps=200,
    logging_steps=20,
    save_total_limit=2,
)

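# Argument names below follow older trl releases. Newer releases take
# processing_class= instead of tokenizer= and expect dataset_text_field and
# the maximum sequence length on SFTConfig, so check your installed version.
trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    tokenizer=tokenizer,
    dataset_text_field="text",
    max_seq_length=2048,
)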

trainer.train()

# 5. Save only the LoRA adapters (not the whole base model)
model.save_pretrained("./llama-finetuned-lora")
tokenizer.save_pretrained("./llama-finetuned-lora")

That's it. Seriously. The script above, with your data in a training_data.jsonl file, will produce a working fine-tuned model.
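Before launching, it helps to know how many optimizer steps the training arguments imply, because save_steps and logging_steps are counted in those steps. The sketch below is approximate: training_steps is a hypothetical helper, the 500-example dataset size is an assumption, and the Trainer's exact count can differ slightly depending on how it handles the last partial accumulation window.

```python
import math

def training_steps(num_examples, per_device_batch, grad_accum, epochs, num_gpus=1):
    """Estimate the effective batch size and total optimizer steps."""
    effective_batch = per_device_batch * grad_accum * num_gpus
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    return effective_batch, steps_per_epoch * epochs

# Values from the script: batch size 2, accumulation 4, 3 epochs
effective_batch, total_steps = training_steps(
    num_examples=500, per_device_batch=2, grad_accum=4, epochs=3
)
print(f"effective batch: {effective_batch}, total steps: ~{total_steps}")
```

With 500 examples the estimate comes out below 200 steps, so save_steps=200 would never trigger an intermediate checkpoint during the run. Lower it if you want one.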

Step 3: Load and Use Your Fine-Tuned Model

Loading the adapters is just as simple:

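from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Reuse model_name and bnb_config from the training script so the base model
# loads in 4-bit again instead of full precision
base_model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto"
)
model = PeftModel.from_pretrained(base_model, "./llama-finetuned-lora")
tokenizer = AutoTokenizer.from_pretrained("./llama-finetuned-lora")

# Build the prompt with the same chat template used during training
messages = [{"role": "user", "content": "What is our return policy?"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt", return_dict=True
).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))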

Pro Tips (Learned from Pain)

1. Overfitting is sneaky

If your training loss goes to 0.0 and your model spits out exact matches of training data in production — you overfit. Fix: add more variety, drop duplicate examples, use lower rank (r=4) and more dropout.
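One way to catch this before production is to hold out a slice of your data and to measure how often generations reproduce a training answer word for word. The helpers below are hypothetical names for illustration. The 10% holdout fraction is an assumption, and exact string matching is only a crude memorization signal.

```python
import random

def split_holdout(examples, holdout_fraction=0.1, seed=0):
    """Shuffle deterministically and return (train, holdout) lists."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_holdout = max(1, int(len(shuffled) * holdout_fraction))
    return shuffled[n_holdout:], shuffled[:n_holdout]

def memorized_fraction(generations, training_outputs):
    """Share of generations that exactly match some training output."""
    if not generations:
        return 0.0
    train_set = {text.strip() for text in training_outputs}
    hits = sum(1 for g in generations if g.strip() in train_set)
    return hits / len(generations)

examples = [{"instruction": f"q{i}", "output": f"a{i}"} for i in range(20)]
train, holdout = split_holdout(examples)

# Made-up generations: one copies a training answer, one does not
rate = memorized_fraction(["a1", "something new"], ["a1", "a2"])
```

Train only on train, then compare the model's behavior on holdout prompts. If training loss keeps falling while holdout quality gets worse, the model is memorizing.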

2. Format matters more than content

Your model will learn the structure of your data faster than the facts. Make sure your training data has the exact output format you want in production — same headers, same punctuation, same level of detail.
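You can check format consistency mechanically, both on your training outputs before training and on model generations afterwards. The sketch below assumes the target format is a JSON object. The key names "summary" and "sentiment" are hypothetical placeholders for whatever structure you need.

```python
import json

def has_expected_format(output_text, required_keys):
    """True if the text parses as a JSON object containing every required key."""
    try:
        parsed = json.loads(output_text)
    except json.JSONDecodeError:
        return False
    return isinstance(parsed, dict) and all(k in parsed for k in required_keys)

def format_pass_rate(outputs, required_keys):
    """Fraction of outputs that match the expected format."""
    if not outputs:
        return 0.0
    passed = sum(1 for text in outputs if has_expected_format(text, required_keys))
    return passed / len(outputs)

# Hypothetical outputs: two well-formed, one free-text answer
sample_outputs = [
    '{"summary": "Sales grew 12%.", "sentiment": "positive"}',
    '{"summary": "Churn increased.", "sentiment": "negative"}',
    "Sure! Here is a summary of the report...",
]
pass_rate = format_pass_rate(sample_outputs, ["summary", "sentiment"])
```

Anything below a 100% pass rate on the training set itself points to inconsistent examples that are worth fixing before you train.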

3. Don't train on every layer

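Targeting only q_proj and v_proj (the query and value projections in attention) is a cheap default. If you need more capacity, add k_proj and o_proj, then the MLP projections (gate_proj, up_proj, down_proj). Every extra module adds trainable parameters, memory, and training time, so widen the target list when evaluation shows the smaller set is not enough.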
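The cost of each extra target module is easy to estimate: a rank-r adapter on a d_in x d_out weight adds r * (d_in + d_out) trainable parameters. The sketch below assumes a hidden size of 4096, 32 layers, and square attention projections. Those shapes are assumptions for illustration: models with grouped-query attention, such as Mistral 7B, have smaller k_proj and v_proj matrices, so their counts will be lower.

```python
def lora_param_count(r, n_layers, module_shapes):
    """Trainable LoRA parameters; module_shapes maps name -> (d_in, d_out) per layer."""
    per_layer = sum(r * (d_in + d_out) for d_in, d_out in module_shapes.values())
    return per_layer * n_layers

HIDDEN = 4096   # assumed hidden size
LAYERS = 32     # assumed number of transformer layers
square = (HIDDEN, HIDDEN)  # assumed projection shape (no grouped-query attention)

qv_only = lora_param_count(8, LAYERS, {"q_proj": square, "v_proj": square})
all_attention = lora_param_count(
    8, LAYERS, {name: square for name in ("q_proj", "k_proj", "v_proj", "o_proj")}
)
print(f"q_proj+v_proj: {qv_only:,} params, all attention: {all_attention:,} params")
```

Under these assumptions both settings stay in the single-digit millions of trainable parameters, a tiny fraction of the billions of frozen base weights.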

4. Test immediately after training

Run 10-20 manual test prompts right after training finishes. Don't wait until tomorrow. The moment you save the model, test edge cases you know the base model failed on. That's the whole point.

What About Larger Models?

If you need a 70B or 120B model (like Llama 3.3 70B), you'll need more than one consumer GPU. Use the same QLoRA method, but with:

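  • Multi-GPU setup: Two RTX 3090s or 4090s, either with device_map="auto" (which spreads the model's layers across the cards) or with a distributed launcher such as DeepSpeed or FSDP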
  • Cloud instances: RunPod, Lambda Labs, or Google Colab Pro+ with A100s

The code stays identical. The config changes.

The Bottom Line

Fine-tuning an open-source LLM on your own data is no longer a research project. It's an engineering task. The tools are mature, the memory requirements are now within reach of a decent desktop PC, and the results are immediately measurable in your application.

Start with 100 examples. See if it fixes your biggest model behavior problem. You'll be surprised how little data it takes to make a big difference.
