pkg in packages:
subprocess.check_call([sys.executable, "-m", "pip", "install", pkg, "--quiet"])
install_training_stack()
### Step 2: Model Loading & Adapter Configuration
Load the base instruction model with 4-bit quantization. The quantization strategy compresses weights to NF4 format, preserving instruction-following capabilities while drastically reducing memory footprint. LoRA adapters are injected into attention and feed-forward projection layers. RSLoRA is enabled to normalize the scaling factor, ensuring stable gradient flow regardless of rank selection.
```python
from unsloth import FastLanguageModel
from peft import LoraConfig
MODEL_ID = "Qwen/Qwen2.5-3B-Instruct"
SEQ_LEN = 1024
ADAPTER_RANK = 32
ADAPTER_ALPHA = 64
DROPOUT_RATE = 0.05
def initialize_model_and_adapter():
base_model, tokenizer = FastLanguageModel.from_pretrained(
model_name=MODEL_ID,
max_seq_length=SEQ_LEN,
dtype=None,
load_in_4bit=True,
)
lora_config = LoraConfig(
r=ADAPTER_RANK,
lora_alpha=ADAPTER_ALPHA,
lora_dropout=DROPOUT_RATE,
target_modules=[
"q_proj", "k_proj", "v_proj", "o_proj",
"gate_proj", "up_proj", "down_proj"
],
bias="none",
use_rslora=True,
task_type="CAUSAL_LM"
)
adapted_model = FastLanguageModel.get_peft_model(
base_model,
lora_config=lora_config,
use_gradient_checkpointing="unsloth"
)
return adapted_model, tokenizer
trained_model, chat_tokenizer = initialize_model_and_adapter()
Architecture Rationale:
load_in_4bit=True: Compresses weights to ~1.8GB. Without this, optimizer states alone exceed T4 capacity.
target_modules: Injecting adapters into both attention projections and MLP layers captures cross-domain instruction patterns more effectively than attention-only targeting.
use_rslora=True: Standard LoRA scales updates by α/r. When r changes, the effective learning rate shifts unpredictably. RSLoRA normalizes this to α/√r, stabilizing convergence across hyperparameter sweeps.
use_gradient_checkpointing="unsloth": Discards intermediate activations during the forward pass and recomputes them during backpropagation. Trades ~15% compute time for ~40% VRAM savings.
Step 3: Data Pipeline & Template Alignment
Instruction-tuned models require strict role-based formatting. The dataset must be converted into the model's native chat template, which uses control tokens to delineate system, user, and assistant turns. A 90/10 stratified split ensures unbiased evaluation.
import json
from datasets import Dataset
DATA_PATH = "/content/training_corpus.jsonl"
def load_and_format_dataset(tokenizer):
raw_records = []
with open(DATA_PATH, "r", encoding="utf-8") as fh:
for line in fh:
stripped = line.strip()
if stripped:
raw_records.append(json.loads(stripped))
hf_dataset = Dataset.from_list(raw_records)
def apply_template(record):
formatted_text = tokenizer.apply_chat_template(
record["messages"],
tokenize=False,
add_generation_prompt=False
)
return {"processed_text": formatted_text}
formatted_ds = hf_dataset.map(
apply_template,
remove_columns=hf_dataset.column_names
)
partitioned = formatted_ds.train_test_split(test_size=0.1, seed=42)
return partitioned["train"], partitioned["test"]
training_split, validation_split = load_and_format_dataset(chat_tokenizer)
print(f"Training samples: {len(training_split)} | Validation samples: {len(validation_split)}")
Key Design Choices:
- Line-by-line JSON parsing prevents memory spikes when processing large corpora.
tokenize=False defers tokenization to the data collator, enabling dynamic padding and reducing preprocessing overhead.
add_generation_prompt=False ensures the template includes the complete assistant response, allowing the model to learn next-token prediction across the full conversation rather than stopping at the prompt boundary.
Step 4: Training Orchestration
The SFTTrainer from TRL manages the supervised fine-tuning loop. Hyperparameters are tuned for single-GPU constraints: moderate batch size, gradient accumulation to simulate larger batches, and cosine decay for stable convergence.
from trl import SFTTrainer, SFTConfig
def launch_training(model, tokenizer, train_ds, eval_ds):
training_args = SFTConfig(
output_dir="./qwen_adapter_output",
per_device_train_batch_size=2,
gradient_accumulation_steps=4,
num_train_epochs=3,
learning_rate=2e-4,
weight_decay=0.01,
lr_scheduler_type="cosine",
warmup_ratio=0.05,
logging_steps=10,
eval_strategy="steps",
eval_steps=50,
save_strategy="steps",
save_steps=50,
save_total_limit=2,
fp16=True,
optim="adamw_8bit",
report_to="none"
)
trainer = SFTTrainer(
model=model,
tokenizer=tokenizer,
train_dataset=train_ds,
eval_dataset=eval_ds,
args=training_args,
dataset_text_field="processed_text",
max_seq_length=SEQ_LEN,
packing=False
)
trainer.train()
return trainer
training_engine = launch_training(trained_model, chat_tokenizer, training_split, validation_split)
Why These Hyperparameters:
gradient_accumulation_steps=4 with per_device_train_batch_size=2 simulates an effective batch size of 8 without exceeding VRAM limits.
optim="adamw_8bit" reduces optimizer state memory by 50% compared to 32-bit AdamW.
packing=False prevents sequence concatenation, which can corrupt chat template boundaries and cause role leakage during training.
save_total_limit=2 caps disk usage, critical in ephemeral environments where storage quotas are strict.
Pitfall Guide
1. Chat Template Mismatch
Explanation: Training on raw text or manually concatenated strings ignores the model's pre-trained control tokens (<|im_start|>, <|im_end|>). The model fails to learn role boundaries, resulting in assistant responses that bleed into user prompts or ignore system instructions.
Fix: Always use tokenizer.apply_chat_template() with add_generation_prompt=False. Verify output samples before training.
2. Sequence Length Overallocation
Explanation: VRAM scales quadratically with sequence length. Setting max_seq_length=4096 on a 3B model will OOM even with 4-bit quantization, as attention matrices dominate memory allocation.
Fix: Cap at 1024-2048 tokens. Truncate or filter longer examples. Use dynamic padding via the data collator instead of static padding.
3. Alpha/Rank Decoupling
Explanation: Setting lora_alpha arbitrarily (e.g., alpha=16, r=32) breaks the scaling relationship. Standard LoRA expects alpha ≈ 2 * r for stable gradient magnitudes. Deviations cause vanishing or exploding updates.
Fix: Enable use_rslora=True. It automatically normalizes scaling to α/√r, making the adapter robust to rank changes.
4. Validation Set Neglect
Explanation: Training without a held-out evaluation split masks overfitting. Loss curves will continuously decrease while generation quality degrades, as the model memorizes training examples rather than learning generalizable patterns.
Fix: Reserve 10-15% of data for evaluation. Monitor validation loss divergence. Implement early stopping if validation loss plateaus or increases.
5. Ephemeral Storage Loss
Explanation: Colab and similar cloud notebooks reset on disconnect. Checkpoints saved only to /content/ or local RAM are permanently lost.
Fix: Mount Google Drive early in the session. Configure output_dir to point to a Drive path. Implement periodic checkpointing with save_steps.
6. Attention-Only Targeting
Explanation: Restricting LoRA to q_proj, k_proj, v_proj limits adaptation capacity. Feed-forward layers (gate_proj, up_proj, down_proj) store factual knowledge and reasoning patterns.
Fix: Target all major projection matrices. The marginal VRAM cost is negligible compared to the improvement in domain alignment.
7. Deployment Friction from Unmerged Adapters
Explanation: Shipping LoRA adapters separately requires inference runtimes to support dynamic weight merging. Many production servers (vLLM, TGI) expect standalone model directories.
Fix: Post-training, merge adapters into base weights using model.merge_and_unload(). Export the fused model to a standard Hugging Face directory structure for universal compatibility.
Production Bundle
Action Checklist
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|
| Domain-specific instruction tuning | 4-bit QLoRA + RSLoRA | Balances adaptation quality with VRAM constraints; preserves base capabilities | Near-zero (free-tier GPU) |
| Full architectural modification | Full FP16 fine-tuning | Required when changing model head, adding new tokenizers, or altering layer count | High (multi-GPU A100/H100) |
| Rapid prototyping / few-shot tasks | Prompt engineering + RAG | Avoids training overhead; leverages base model's existing instruction following | Zero compute cost |
| Multi-domain generalization | Mixture-of-Experts or LoRA routing | Prevents catastrophic interference; isolates domain-specific adapters | Moderate (inference routing overhead) |
Configuration Template
# fine_tuning_config.yaml
model:
base_id: "Qwen/Qwen2.5-3B-Instruct"
quantization: "nf4"
max_seq_length: 1024
adapter:
rank: 32
alpha: 64
dropout: 0.05
target_modules:
- "q_proj"
- "k_proj"
- "v_proj"
- "o_proj"
- "gate_proj"
- "up_proj"
- "down_proj"
use_rslora: true
training:
epochs: 3
batch_size: 2
gradient_accumulation: 4
learning_rate: 2.0e-4
scheduler: "cosine"
warmup_ratio: 0.05
optimizer: "adamw_8bit"
eval_strategy: "steps"
eval_steps: 50
save_steps: 50
save_limit: 2
data:
format: "jsonl"
chat_template: "qwen2.5"
train_eval_split: 0.9
seed: 42
Quick Start Guide
- Prepare Environment: Launch a fresh Colab notebook. Set runtime to T4 GPU. Execute the dependency installation cell. Allow 2-3 minutes for compilation.
- Upload & Format Data: Place your
training_corpus.jsonl in /content/. Run the dataset formatter cell. Verify printed sample counts and template output.
- Initialize & Train: Execute the model loading and training orchestration cells. Monitor the console for validation loss trends. Training typically completes in 45-70 minutes on a T4.
- Persist & Export: Save checkpoints to mounted cloud storage. Run the adapter merge routine. Download the fused model directory for deployment to vLLM, TGI, or local inference servers.