Fine-Tuning Qwen2.5-0.5B to Write SRE Post-Mortem Summaries

Optimizing Small Language Models for Structured Incident Reporting: A LoRA Fine-Tuning Blueprint

Current Situation Analysis

Site Reliability Engineering (SRE) teams face a persistent bottleneck in post-incident analysis. Writing root-cause summaries is labor-intensive, and the quality of these documents varies wildly based on the author's experience. Junior engineers frequently omit contributing factors or fail to link symptoms to root causes, while senior engineers produce summaries that lack a standardized structure, making historical trend analysis difficult.

The industry has largely turned to zero-shot Large Language Models (LLMs) to automate this drafting process. However, general-purpose models struggle with domain-specific constraints. They tend to produce verbose, generic narratives that ignore organizational conventions. They often hallucinate technical details or fail to adhere to the strict format required for actionable post-mortems.

A critical oversight in current workflows is the assumption that model size correlates with task performance for structured generation. Evidence from domain adaptation work suggests that a small model fine-tuned on high-quality incident data can outperform zero-shot large models on rubric-based compliance. In controlled evaluations using 700 real-world incident timelines mapped to root-cause summaries, a fine-tuned 0.5B-parameter model achieved a rubric compliance score above 60%, whereas zero-shot baselines from a proprietary model and a larger open-weight model topped out at 50% and 35%, respectively. The fine-tuned model produces structured, concise summaries and runs on consumer hardware at a fraction of the inference cost.

WOW Moment: Key Findings

The most compelling insight from this domain adaptation is that structure adherence is a function of training data, not model scale. When the evaluation criteria prioritize specific SRE conventions—such as referencing the timeline, identifying contributing factors, naming specific components, and prescribing prevention actions—the fine-tuned small model dominates.

Approach	Rubric Compliance	Inference Cost	Hardware Requirement	Latency Profile
Zero-Shot Proprietary (Nano-class)	35–50%	~$0.002 per call	Cloud API	Variable (Network bound)
Zero-Shot Open (Plus-class)	20–35%	~$0.0005 per call	Cloud API	Variable (Network bound)
Fine-Tuned Qwen2.5-0.5B	> 60%	~$0.00001 per call	8GB VRAM / CPU	Consistent (<200ms)

Why this matters: This comparison flips the cost-performance curve. Organizations can deploy a model that produces higher-quality, format-compliant summaries while eliminating API dependency and reducing inference costs by orders of magnitude. The fine-tuned adapter runs locally, ensuring data privacy and enabling real-time integration into incident management tools without network latency.
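To make "orders of magnitude" concrete, the per-call prices from the table can be compared directly. The sketch below is arithmetic only; the monthly volume of 10,000 summaries is a hypothetical figure chosen for illustration, not a number from the evaluation, and the function names are likewise illustrative.

```python
from decimal import Decimal

# Per-call prices copied from the comparison table above.
COST_PER_CALL = {
    "zero_shot_proprietary": Decimal("0.002"),
    "zero_shot_open": Decimal("0.0005"),
    "fine_tuned_local": Decimal("0.00001"),
}


def monthly_cost(calls_per_month: int) -> dict:
    """Return the monthly spend in dollars for each approach."""
    return {name: price * calls_per_month for name, price in COST_PER_CALL.items()}


def cost_ratio(approach: str, baseline: str = "fine_tuned_local") -> Decimal:
    """How many times more expensive `approach` is than `baseline` per call."""
    return COST_PER_CALL[approach] / COST_PER_CALL[baseline]


if __name__ == "__main__":
    # 10,000 calls per month is an assumed volume for illustration.
    for name, dollars in monthly_cost(10_000).items():
        print(f"{name}: ${dollars:.2f}")
```

At the table's prices, the proprietary API costs 200 times as much per call as the local adapter and the open-weight API 50 times as much. Hardware and maintenance costs for local hosting are not part of this comparison.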

Core Solution

The solution relies on Parameter-Efficient Fine-Tuning (PEFT) using Low-Rank Adaptation (LoRA) on the Qwen/Qwen2.5-0.5B-Instruct model. LoRA trains small low-rank adapter matrices attached to the model's attention layers while the base weights stay frozen, which keeps resource requirements minimal.

1. Data Preparation Strategy

The foundation of the fine-tuning process is a curated dataset of incident timelines paired with gold-standard summaries. The dataset should consist of JSONL records where each entry contains the raw timeline data and the target summary.

Dataset Schema:

{
  "incident_id": "INC-2024-0892",
  "timeline": "14:02 - Latency spike detected on payment-service.\n14:05 - Auto-scaling triggered, new pods failing health checks.\n14:12 - Root cause identified: Database connection pool exhaustion due to slow query.\n14:20 - Query optimized, connection pool reset, latency normalized.",
  "summary": "The payment-service experienced elevated latency due to database connection pool exhaustion. A slow query in the transaction handler caused connections to hang, preventing new requests from acquiring resources. Auto-scaling attempts failed as new pods also encountered the pool limit. The issue was resolved by optimizing the query and resetting the connection pool. Prevention includes implementing query timeouts and monitoring connection pool utilization metrics."
}
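Because every record must carry the same three fields, a quick validation pass over the JSONL file can catch malformed entries before training. The following is a minimal sketch under that assumption; `validate_record` and `validate_jsonl_lines` are hypothetical helper names, not part of any library.

```python
import json

# Field names taken from the dataset schema above.
REQUIRED_FIELDS = ("incident_id", "timeline", "summary")


def validate_record(record: dict) -> list:
    """Return a list of problems found in one record (empty list means valid)."""
    problems = []
    for field in REQUIRED_FIELDS:
        value = record.get(field)
        if not isinstance(value, str) or not value.strip():
            problems.append(f"missing or empty field: {field}")
    return problems


def validate_jsonl_lines(lines) -> dict:
    """Map 1-based line numbers to their problems; valid lines are omitted."""
    report = {}
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # tolerate blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            report[number] = [f"invalid JSON: {exc.msg}"]
            continue
        if not isinstance(record, dict):
            report[number] = ["record is not a JSON object"]
            continue
        problems = validate_record(record)
        if problems:
            report[number] = problems
    return report
```

Passing an open file handle as `lines` works, since iterating a text file yields one line at a time.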

Implementation:

import json
from datasets import Dataset

class IncidentDatasetProcessor:
    def __init__(self, data_path: str, tokenizer):
        self.data_path = data_path
        self.tokenizer = tokenizer

    def load_and_format(self) -> Dataset:
        raw_data = []
        with open(self.data_path, 'r', encoding='utf-8') as f:
            for line in f:
                if line.strip():  # skip blank lines so json.loads does not fail
                    raw_data.append(json.loads(line))
        
        # Format for instruction tuning
        formatted_data = []
        for item in raw_data:
            formatted_data.append({
                "text": (
                    f"<|im_start|>system\nYou are an SRE assistant generating post-mortem summaries.<|im_end|>\n"
                    f"<|im_start|>user\nGenerate a root-cause summary for this incident timeline:\n{item['timeline']}<|im_end|>\n"
                    f"<|im_start|>assistant\n{item['summary']}<|im_end|>"
                )
            })
        return Dataset.from_list(formatted_data)
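Before formatting, it can help to set aside a held-out portion of the raw records so that evaluation never sees training incidents. The sketch below assumes a fixed hold-out of 100 records and a fixed seed; `split_records` is a hypothetical helper, and both defaults are illustrative choices.

```python
import random


def split_records(records: list, holdout_size: int = 100, seed: int = 42):
    """Shuffle a copy of `records` deterministically and split off a held-out set.

    Returns a (train, held_out) tuple; the input list is left untouched.
    """
    if not 0 < holdout_size < len(records):
        raise ValueError("holdout_size must be between 1 and len(records) - 1")
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    return shuffled[holdout_size:], shuffled[:holdout_size]
```

With 700 records this yields 600 for training and 100 for evaluation. If several records describe the same underlying incident, splitting by incident rather than by record would avoid leakage between the two sets.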

2. LoRA Configuration and Training

We apply 4-bit quantization to the base model to reduce VRAM usage, allowing training on an 8GB consumer GPU. LoRA targets the query and value projection layers, which are critical for capturing domain-specific patterns.

Training Configuration:

from transformers import AutoModelForCausalLM, BitsAndBytesConfig, TrainingArguments
from peft import LoraConfig, get_peft_model

# Quantization setup for 4-bit training
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype="bfloat16",
    bnb_4bit_use_double_quant=True
)

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-0.5B-Instruct",
    quantization_config=bnb_config,
    device_map="auto"
)

# LoRA adapter configuration
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

model = get_peft_model(model, lora_config)

training_args = TrainingArguments(
    output_dir="./output/postmortem_adapter",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    num_train_epochs=3,
    bf16=True,  # matches bnb_4bit_compute_dtype above; use fp16=True on GPUs without bfloat16 support
    logging_steps=10,
    save_strategy="epoch",
    eval_strategy="epoch",  # named evaluation_strategy in older transformers releases
    load_best_model_at_end=True
)

Rationale:

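Target Modules: Restricting LoRA to q_proj and v_proj, the pairing studied in the original LoRA paper, keeps the number of trainable parameters small. If format compliance stalls, extending the adapter to the remaining attention and MLP projections is a common next step.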
Epochs: Three epochs are sufficient for a 700-sample dataset. Additional epochs risk overfitting to the specific phrasing of the training data rather than the underlying structure.
Learning Rate: A rate of 2e-4 balances convergence speed with stability for small models.
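The rationale above can be checked with some quick arithmetic: how many parameters the adapter actually trains, and how many optimizer steps three epochs amount to. The sketch below assumes the Qwen2.5-0.5B dimensions as I understand them from the published model configuration (hidden size 896, 24 layers, 2 key/value heads of dimension 64); verify them against the model's config.json before relying on the numbers.

```python
import math

# Assumed Qwen2.5-0.5B dimensions; confirm against the model's config.json.
HIDDEN_SIZE = 896
NUM_LAYERS = 24
NUM_KV_HEADS = 2
HEAD_DIM = 64


def lora_params(in_features: int, out_features: int, rank: int) -> int:
    """LoRA adds A (rank x in) and B (out x rank) to each adapted linear layer."""
    return rank * (in_features + out_features)


def adapter_size(rank: int = 16) -> int:
    """Trainable parameters when only q_proj and v_proj are adapted."""
    q_proj = lora_params(HIDDEN_SIZE, HIDDEN_SIZE, rank)
    # With grouped-query attention, v_proj maps to num_kv_heads * head_dim.
    v_proj = lora_params(HIDDEN_SIZE, NUM_KV_HEADS * HEAD_DIM, rank)
    return NUM_LAYERS * (q_proj + v_proj)


def optimizer_steps(num_samples: int, batch_size: int = 4,
                    grad_accum: int = 4, epochs: int = 3) -> int:
    """Approximate optimizer updates on a single device."""
    effective_batch = batch_size * grad_accum  # 4 * 4 = 16 samples per update
    return math.ceil(num_samples / effective_batch) * epochs
```

Under these assumptions the adapter has about 1.08 million trainable parameters, roughly 0.2% of a 0.5B-parameter model, and 700 samples give around 132 optimizer updates over three epochs. The exact step count depends on how the trainer handles the final partial batch.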

3. Rubric-Based Evaluation

Evaluation must go beyond perplexity. We implement a structured rubric that scores outputs against four critical SRE criteria. Each criterion is weighted equally, and a pass threshold of 0.60 is enforced.

Evaluator Logic:

class PostMortemRubricEvaluator:
    def __init__(self):
        self.criteria = [
            "timeline_reference",
            "contributing_factors",
            "specific_component",
            "prevention_action"
        ]

    def score_summary(self, generated_text: str, ground_truth: str) -> dict:
        scores = {}
        for criterion in self.criteria:
            # In production, use an LLM-as-a-judge or regex/NLP heuristics
            # Here we simulate the scoring logic
            scores[criterion] = self._check_criterion(generated_text, criterion)
        
        weighted_score = sum(scores.values()) / len(self.criteria)
        return {
            "criteria_scores": scores,
            "weighted_score": weighted_score,
            "passed": weighted_score >= 0.60
        }

    def _check_criterion(self, text: str, criterion: str) -> float:
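        # Placeholder: this only matches the literal criterion name (e.g. "prevention_action"),
        # so it scores 0.0 on almost any real summary. Replace it with actual detection logic,
        # e.g., checking for timestamps, causal language, component names, action verbs.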
        return 1.0 if criterion in text.lower() else 0.0
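One way to replace the placeholder check is with simple regex and keyword heuristics, one per criterion. The sketch below is only an illustration of that idea: the patterns, the `known_components` parameter, and the function names are all assumptions, and an LLM-as-a-judge would likely be more robust than any of these rules.

```python
import re

CRITERIA = (
    "timeline_reference",
    "contributing_factors",
    "specific_component",
    "prevention_action",
)

# Hypothetical patterns for illustration; tune them to your own post-mortem vocabulary.
PATTERNS = {
    # A clock time such as 14:02 counts as a reference to the timeline.
    "timeline_reference": re.compile(r"\b\d{1,2}:\d{2}\b"),
    # Causal phrasing suggests contributing factors are being linked to symptoms.
    "contributing_factors": re.compile(
        r"\b(?:due to|caused by|because|as a result of|contributed to|led to)\b", re.I
    ),
    # Prevention language suggests a follow-up action is prescribed.
    "prevention_action": re.compile(
        r"\b(?:prevent(?:ion|ing|s)?|to avoid|going forward|action items?)\b", re.I
    ),
}


def check_criterion(text: str, criterion: str, known_components=()) -> float:
    """Return 1.0 if the heuristic for `criterion` fires on `text`, else 0.0."""
    if criterion == "specific_component":
        lowered = text.lower()
        # Components are matched against a caller-supplied service inventory.
        return 1.0 if any(name.lower() in lowered for name in known_components) else 0.0
    return 1.0 if PATTERNS[criterion].search(text) else 0.0


def score_text(text: str, known_components=()) -> dict:
    """Score one summary against all four criteria."""
    return {c: check_criterion(text, c, known_components) for c in CRITERIA}
```

The component check deliberately relies on a list of real service names rather than a regex, since a pattern for hyphenated words would also match terms like "auto-scaling".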

Pitfall Guide

Data Noise and Inconsistency
- Explanation: Training on incident summaries with varying formats or factual errors teaches the model to replicate those inconsistencies.
- Fix: Curate the dataset rigorously. Ensure all 700 samples follow the exact target structure. Remove entries with ambiguous root causes or missing prevention actions.
Rubric Misalignment
- Explanation: If the evaluation rubric does not match your organization's actual post-mortem template, the model will optimize for the wrong output.
- Fix: Define the rubric based on your internal SRE standards. Include criteria specific to your workflow, such as "impact assessment" or "stakeholder notification," if required.
Overfitting on Small Datasets
- Explanation: With only 700 samples, the model may memorize specific incidents rather than learning the general structure of a summary.
- Fix: Use a held-out test set (e.g., 100 samples) for validation. Implement early stopping if validation loss plateaus or increases. Monitor the gap between training and validation scores.
Quantization Degradation
- Explanation: 4-bit quantization can introduce artifacts that degrade the model's ability to generate precise technical terms.
- Fix: Use nf4 (NormalFloat4) quantization and bfloat16 compute dtype. Spot-check samples generated from an early checkpoint to verify that critical technical terms in the output are not corrupted.
Ignoring the Baseline
- Explanation: Proceeding with fine-tuning without establishing a zero-shot baseline makes it impossible to measure the true value added by the adapter.
- Fix: Always run the same evaluation rubric against a zero-shot model (e.g., gpt-5.4-nano or qwen3.6-plus:free) before training. This quantifies the improvement and justifies the engineering effort.
Context Window Mismatch
- Explanation: Incident timelines can be lengthy. If the input exceeds the model's context window, truncation may cut off critical early events.
- Fix: Implement a preprocessing step to chunk or summarize long timelines before feeding them to the model. Ensure the max_seq_length in training covers the typical length of your incident data.
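For the context window pitfall, one simple preprocessing option is to keep the opening events and the most recent events of a long timeline and drop the middle. The sketch below uses whitespace-separated words as a crude stand-in for tokens, which is an assumption; a real pipeline would count with the model's tokenizer. `trim_timeline` and its defaults are hypothetical.

```python
def trim_timeline(timeline: str, max_words: int = 300, head_events: int = 3) -> str:
    """Keep the first `head_events` lines plus as many trailing lines as fit.

    Words are a rough proxy for tokens. The omission marker is not counted
    against the budget, and the head is kept even if it alone exceeds it.
    """
    events = [line for line in timeline.splitlines() if line.strip()]
    if sum(len(event.split()) for event in events) <= max_words:
        return "\n".join(events)

    head = events[:head_events]
    budget = max_words - sum(len(event.split()) for event in head)

    # Walk backwards from the end so the resolution steps are preserved.
    tail = []
    for event in reversed(events[head_events:]):
        cost = len(event.split())
        if cost > budget:
            break
        tail.append(event)
        budget -= cost
    tail.reverse()

    omitted = len(events) - len(head) - len(tail)
    marker = [f"[... {omitted} events omitted ...]"] if omitted else []
    return "\n".join(head + marker + tail)
```

Keeping both ends reflects the concern raised above: early events often carry the first symptoms, while the final events carry the fix. Whether dropping the middle is acceptable depends on where your timelines record the root cause.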

Production Bundle

Action Checklist

Audit Incident Data: Collect 700+ high-quality incident timelines and summaries. Ensure consistent formatting.
Define Evaluation Rubric: Create a JSON rubric with criteria matching your SRE standards. Set a pass threshold (e.g., 0.60).
Run Zero-Shot Baseline: Execute the rubric against a zero-shot model to establish current performance metrics.
Configure LoRA Training: Set up 4-bit quantization, target modules, and hyperparameters. Allocate an 8GB VRAM GPU.
Execute Fine-Tuning: Train for 3 epochs with gradient accumulation. Monitor loss and validation scores.
Validate Output: Run the fine-tuned adapter against the held-out test set. Verify rubric compliance > 60%.
Deploy Adapter: Export the LoRA weights. Integrate into your incident management pipeline for inference.

Decision Matrix

Scenario	Recommended Approach	Why	Cost Impact
Sensitive Incident Data	Fine-Tuned Qwen2.5-0.5B	Local inference ensures data never leaves your infrastructure.	Low (Hardware amortized)
Rapid Prototyping	Zero-Shot API	No training required; immediate results for proof-of-concept.	Medium (API costs)
High-Volume Incident Flow	Fine-Tuned Qwen2.5-0.5B	Inference cost is negligible; scales with local hardware.	Near Zero
Strict Format Compliance	Fine-Tuned Qwen2.5-0.5B	Model is trained specifically on your format, outperforming generalists.	Low
Limited Engineering Resources	Zero-Shot API	Avoids maintenance of training pipeline and model hosting.	Medium

Configuration Template

Use this YAML configuration to standardize your training runs across environments.

model:
  base: "Qwen/Qwen2.5-0.5B-Instruct"
  quantization:
    enabled: true
    type: "nf4"
    compute_dtype: "bfloat16"

lora:
  rank: 16
  alpha: 32
  target_modules:
    - "q_proj"
    - "v_proj"
  dropout: 0.05

training:
  epochs: 3
  batch_size: 4
  gradient_accumulation: 4
  learning_rate: 2e-4
  max_seq_length: 512
  early_stopping_patience: 2

evaluation:
  rubric_path: "data/rubric.json"
  pass_threshold: 0.60
  test_split: "data/test_100.jsonl"
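The template uses its own key names, so a training script has to translate them into the keyword arguments used earlier for LoraConfig and TrainingArguments. The sketch below shows one possible mapping on an already-parsed config dictionary; the helper names are hypothetical, and the script that consumes this config is not shown in this article.

```python
def to_lora_kwargs(config: dict) -> dict:
    """Translate the `lora` section into LoraConfig-style keyword arguments."""
    lora = config["lora"]
    return {
        "r": lora["rank"],
        "lora_alpha": lora["alpha"],
        "target_modules": list(lora["target_modules"]),
        "lora_dropout": lora["dropout"],
        "bias": "none",
        "task_type": "CAUSAL_LM",
    }


def to_training_kwargs(config: dict, output_dir: str) -> dict:
    """Translate the `training` section into TrainingArguments-style keyword arguments."""
    training = config["training"]
    return {
        "output_dir": output_dir,
        "num_train_epochs": training["epochs"],
        "per_device_train_batch_size": training["batch_size"],
        "gradient_accumulation_steps": training["gradient_accumulation"],
        # Coerce explicitly: some YAML loaders return "2e-4" as a string.
        "learning_rate": float(training["learning_rate"]),
    }
```

The explicit float() matters because PyYAML, for example, is known to read `2e-4` (no decimal point) as a string rather than a number. The `max_seq_length` and `early_stopping_patience` keys are left out of the mapping on purpose: they are not TrainingArguments fields and would need to be applied elsewhere, for instance at tokenization time and through an early-stopping callback.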

Quick Start Guide

Environment Setup:

python3 -m venv sre-llm-env
source sre-llm-env/bin/activate
pip install transformers peft datasets accelerate bitsandbytes

Prepare Data: Place your train.jsonl and test_100.jsonl files in the data/ directory. Ensure they follow the dataset schema defined in the Core Solution.
Run Training: Execute the training script with the configuration template. The script will handle quantization, LoRA injection, and checkpointing.
```
python train_adapter.py --config config/training.yaml
```
Evaluate Results: After training, run the evaluation script to compare the fine-tuned adapter against the rubric.
```
python evaluate_adapter.py --model_path ./output/postmortem_adapter --test_data data/test_100.jsonl
```
Deploy: Load the adapter in your production service using PeftModel and serve via a lightweight inference server like vLLM or TGI for low-latency incident summarization.
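Whichever server hosts the adapter, the inference prompt should mirror the training format and stop at the open assistant turn so the model writes the summary itself. The sketch below rebuilds that prompt from the strings used in the data preparation step; `build_inference_prompt` and `extract_summary` are hypothetical helpers, and the model-loading and generation calls are intentionally left out.

```python
# System prompt copied from the training format in the data preparation step.
SYSTEM_PROMPT = "You are an SRE assistant generating post-mortem summaries."


def build_inference_prompt(timeline: str) -> str:
    """Build the training-time prompt, ending at the open assistant turn."""
    return (
        f"<|im_start|>system\n{SYSTEM_PROMPT}<|im_end|>\n"
        f"<|im_start|>user\nGenerate a root-cause summary for this incident timeline:\n"
        f"{timeline}<|im_end|>\n"
        "<|im_start|>assistant\n"
    )


def extract_summary(generated: str) -> str:
    """Cut the completion at the first end-of-turn marker and trim whitespace."""
    return generated.split("<|im_end|>", 1)[0].strip()
```

A prompt that drifts from the training template, for example a different system message, can reduce the format compliance the adapter was trained for, so keeping both in one shared module is a reasonable design choice.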
