SOTA Reference (Oxford-IIIT Pet) | 94.0 | N/A | N/A |
Key Findings:
- Gemma 4's baseline accuracy jumps 22% over Gemma 3, reducing the fine-tuning gap needed to reach SOTA.
- QLoRA (4-bit) reduces base memory footprint from ~62GB to ~18β20GB, freeing ~50GB+ VRAM for activations and long-context batches.
- A Rank 64 / Alpha 64 LoRA configuration with a 5e-5 learning rate provides sufficient "surface area" for visual feature refinement without destabilizing the multimodal backbone.
- Serverless execution on Cloud Run Jobs achieves full-dataset convergence in under 4.5 hours with zero manual GPU provisioning.
Core Solution
Gemma 4 supports interleaved inputs and native system roles, but for stable fine-tuning, we enforce a single-turn structure with images preceding text. This simplifies custom masking logic and preserves instruction-following precision.
full_user_content = f"{prompt}\n\nIdentify the breed of the animal in this image."
messages = [
{
"role": "user",
"content": [
{"type": "image"}, # Image must come first!
{"type": "text", "text": full_user_content},
],
},
{
"role": "assistant",
"content": [{"type": "text", "text": example["caption"]}]
}
]
2. Loading the Correct Multimodal Architecture
Gemma 4 natively processes images, video, and audio. We explicitly load the model using AutoModelForMultimodalLM to ensure the vision tower, audio encoders, and projection layers are correctly initialized.
from transformers import AutoModelForMultimodalLM
model = AutoModelForMultimodalLM.from_pretrained(model_id, **model_kwargs)
3. Label Masking for Multimodal Data
Dynamic image soft tokens cause tokenizer boundary quirks when concatenating text and control tokens. Instead of calculating prompt length in isolation, we implement a backward-search collator:
- Tokenize the full chat template.
- Search
_input_ids for the exact breed label tokens.
- Step backward to locate the
<|turn> control token marking the assistant's response start.
- Mask everything before
<|turn>. This guarantees mathematically precise gradient alignment.
4. Bypassing Custom Layers & Unlocking the Vision Tower
Standard LoRA targeting (q_proj, v_proj) breaks Gemma4ClippableLinear wrappers, bypassing activation clipping and freezing the vision tower. The solution is to use target_modules="all-linear", which recursively scans the model tree, wraps nested linear layers safely, preserves clipping logic, and adapts both language and vision projection layers.
5. Managing VRAM with QLoRA & Gradient Checkpointing
To ensure absolute stability during long-running Cloud Run Jobs, we combine 4-bit QLoRA (via bitsandbytes) with gradient checkpointing and dynamic batch sizing. This keeps peak VRAM well under the 96GB limit while maximizing throughput. Gradient checkpointing trades compute for memory by recomputing activations during backward passes, essential for long-context multimodal batches. We also configure per_device_train_batch_size=2 and gradient_accumulation_steps=4 to simulate larger batches without OOM errors.
Pitfall Guide
- Hardcoding Token Lengths for Masking: Gemma 4's dynamic image soft tokens shift token boundaries when text is concatenated. Hardcoded lengths cause misaligned labels and accuracy drops. Always use backward-search collation targeting
<|turn> and label tokens.
- Bypassing Activation Clipping in LoRA: Targeting inner
.linear weights strips Gemma4ClippableLinear wrappers, removing min/max clipping. This causes activation drift and loss explosion. Use target_modules="all-linear" to preserve architectural stability.
- Freezing the Vision Tower: Text-only LoRA configs often miss vision projection layers. The model's "eyes" remain frozen, preventing visual feature adaptation. Ensure LoRA targets span both language and vision towers.
- Ignoring Multimodal Input Order: Placing text before images disrupts the chat template's internal routing. Always place
{"type": "image"} first in the user content array to guarantee correct token injection.
- VRAM Overflow on 31B Models: FP16 base weights consume ~62GB, leaving insufficient headroom for multimodal activations. Always apply QLoRA (4-bit) + gradient checkpointing, and tune batch size/accumulation steps to stay under GPU limits.
Deliverables
- π¦ Fine-Tuning Blueprint: Complete GitHub repository with Cloud Run Job configuration, Dockerfile, and training scripts optimized for RTX 6000 Pro. Includes QLoRA setup, backward-search collator, and multimodal data pipeline.
- β
Pre-Flight Checklist: VRAM capacity calculator, LoRA target validation matrix, multimodal token alignment verification, and Cloud Run job timeout/memory limits configuration guide.
- βοΈ Configuration Templates: Ready-to-deploy
training_args.json, lora_config.json (Rank 64/Alpha 64), and cloud_run_job.yaml with GPU resource requests, environment variables, and retry policies for serverless execution.