LoRA Fine-Tuning on a Student Budget: Llama on a Single GPU
When Meta released Llama in February, I immediately wanted to fine-tune it. The problem was obvious: I'm a grad student. I don't have a cluster of A100s. I have access to Northeastern's HPC with a few T4 GPUs and a Google Colab Pro subscription. Full fine-tuning of a 7B parameter model requires roughly 28GB of VRAM just for the weights in fp32. That's before gradients, optimizer states, and activations. A single T4 has 16GB. The math doesn't work.
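The arithmetic behind that claim is simple enough to sketch. This counts weights only; gradients, optimizer states, and activations come on top, which is why even fp16 weights that technically fit leave no room to train:

```python
# Back-of-the-envelope VRAM needed just to hold model weights, by precision.
def weight_memory_gb(n_params: int, bytes_per_param: float) -> float:
    return n_params * bytes_per_param / 1e9

n = 7_000_000_000
print(weight_memory_gb(n, 4))    # fp32: 28.0 GB -- no chance on a 16GB T4
print(weight_memory_gb(n, 2))    # fp16: 14.0 GB -- fits, but no room to train
print(weight_memory_gb(n, 0.5))  # 4-bit: 3.5 GB -- leaves headroom for training
```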
Then I found LoRA, and the math started working.
What LoRA Actually Does
LoRA stands for Low-Rank Adaptation. The key insight from the original paper (Hu et al., 2021) is that the weight updates during fine-tuning have a low intrinsic rank. You don't need to modify all the parameters: you can decompose the weight update into the product of two small matrices and train only those.
For a weight matrix W of dimensions d x d, instead of learning a full update delta_W (also d x d), LoRA learns two matrices: A (d x r) and B (r x d), where r is the rank, typically 8, 16, or 32. The effective update is A * B, which has the same shape as W but uses far fewer parameters. (In practice the update is also scaled by a factor alpha / r, which is the lora_alpha hyperparameter you'll see in the config.)
Concretely: a 4096 x 4096 weight matrix has ~16.7M parameters. With LoRA at rank 16, you're training two matrices of sizes 4096 x 16 and 16 x 4096, totaling ~131K parameters. That's a 128x reduction for that layer. Apply this across the attention layers of the model and you can fine-tune a 7B model while updating only a tiny fraction of the total parameters, from well under 1% up to a few percent depending on the rank and how many modules you target.
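Those numbers check out in a few lines of numpy. This is a sketch of the shapes and parameter counts only, not a trainable implementation:

```python
import numpy as np

d, r = 4096, 16                      # hidden size and LoRA rank

A = np.zeros((d, r))                 # trainable factor, d x r
B = np.zeros((r, d))                 # trainable factor, r x d

delta_W = A @ B                      # effective update, same shape as W
assert delta_W.shape == (d, d)

full_params = d * d                  # 16,777,216 in the frozen weight
lora_params = A.size + B.size        # 131,072 trainable
print(full_params // lora_params)    # 128x fewer parameters for this layer
```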
The Setup
Here's what I used for fine-tuning Llama-2-7B with LoRA using Hugging Face's PEFT library.
```python
import torch
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments,
    Trainer,
)
from peft import LoraConfig, get_peft_model, TaskType
from datasets import load_dataset

# Load base model in 4-bit (QLoRA)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=bnb_config,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

# Configure LoRA
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# Output: trainable params: 4,194,304 || all params: 6,742,609,920 || trainable%: 0.062
```

That last line is the key number. We're training 0.06% of the model's parameters. The rest stay frozen.
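What get_peft_model does to each targeted projection can be sketched conceptually in numpy. This illustrates the math, not PEFT's actual implementation: the base weight stays frozen, the low-rank path is scaled by lora_alpha / r, and one factor is initialized to zero so the adapted model starts out identical to the base model. (I use a small d here to keep the sketch cheap; Llama's hidden size is 4096.)

```python
import numpy as np

d, r, alpha = 256, 16, 32                # toy hidden size, LoRA rank, lora_alpha
rng = np.random.default_rng(0)

W = rng.standard_normal((d, d)) * 0.02   # frozen base weight -- never updated
A = rng.standard_normal((d, r)) * 0.02   # trainable factor, Gaussian init
B = np.zeros((r, d))                     # trainable factor, zero init

def lora_forward(x):
    # Base path plus low-rank path, scaled by alpha / r (32 / 16 = 2 here).
    return x @ W + (alpha / r) * (x @ A @ B)

x = rng.standard_normal((1, d))
# With one factor at zero, the LoRA path contributes nothing at step 0,
# so training starts from exactly the base model's behavior.
assert np.allclose(lora_forward(x), x @ W)
```

Training then nudges only A and B; the 6.7B frozen parameters never see a gradient, which is where the memory savings come from.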
Real Numbers
Here's what fine-tuning actually looked like on the hardware I had access to.
On a T4 (16GB, Colab Pro): Using QLoRA (4-bit quantization plus LoRA), the model fits in about 10GB of VRAM. Training on the Alpaca 52K dataset took roughly 4 hours. Total cost on Colab Pro: about $3 worth of compute units.
On an A100 (40GB, university HPC): Full 16-bit LoRA without quantization. Same dataset, finished in about 90 minutes. The A100's memory bandwidth makes a noticeable difference.
Adapter size on disk: The final LoRA adapter is approximately 17MB. Not 17GB. Megabytes. You can email it. The base model stays unchanged, and you just load the adapter on top at inference time.
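That size follows directly from the trainable-parameter count printed earlier, assuming the adapter weights are saved in fp32 (4 bytes per parameter):

```python
trainable_params = 4_194_304               # from print_trainable_parameters()
adapter_mb = trainable_params * 4 / 1e6    # fp32 storage: 4 bytes each
print(f"{adapter_mb:.1f} MB")              # 16.8 MB -- the ~17MB on disk
```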
Why This Matters
A year ago, fine-tuning a large language model was something only well-funded labs and companies could do. You needed serious hardware, significant engineering effort, and a budget measured in thousands of dollars. LoRA, combined with quantization techniques like QLoRA, collapsed those requirements.
A grad student with a Colab subscription can now fine-tune a 7B parameter model in an afternoon. That's a fundamental shift in who gets to participate in LLM development. It means researchers at smaller institutions can experiment with custom models. It means startups can build domain-specific LLMs without raising a Series A first. It means the gap between "using an API" and "having your own model" just got much smaller.
The broader pattern here is one I keep seeing in ML: techniques that democratize access matter more than techniques that push state-of-the-art. LoRA isn't the most powerful fine-tuning method. Full fine-tuning with enough compute will generally perform better. But LoRA made fine-tuning accessible, and accessibility compounds in ways that raw performance doesn't. More people experimenting means more ideas, more applications, more progress.
I've been running my LoRA-tuned Llama locally on my laptop for two weeks now. It's not GPT-4. But it's mine, it runs offline, and I understand every part of how it was built. For someone who spent years doing on-device ML, that feeling of ownership over a model never gets old.