A Mechanistic Investigation of Supervised Fine-Tuning

By Codcompass Team · 9 min read

Diagnosing Representational Shift in Fine-Tuned Models Using Sparse Autoencoders

Current Situation Analysis

Engineering teams evaluating the impact of Supervised Fine-Tuning (SFT) frequently rely on dense vector metrics to assess representational stability. The standard practice involves computing the cosine similarity between hidden activations of the base model and the fine-tuned model across a validation dataset. A high cosine similarity score is typically interpreted as evidence that the fine-tuning process preserved the model's underlying knowledge structure, suggesting minimal catastrophic forgetting or geometric distortion.
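
The standard check described above can be sketched in a few lines. This is a minimal illustration in plain Python, assuming the activations have already been extracted as paired lists of floats; the names `base_acts` and `sft_acts` and the toy values are hypothetical and do not come from any real model.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def mean_cosine_similarity(base_acts, sft_acts):
    """Average the per-example cosine similarity over paired activations."""
    sims = [cosine_similarity(b, s) for b, s in zip(base_acts, sft_acts)]
    return sum(sims) / len(sims)

# Hypothetical hidden activations for two validation examples.
base_acts = [[1.0, 0.0, 2.0], [0.5, 1.5, 0.0]]
sft_acts = [[1.1, 0.1, 1.9], [0.4, 1.6, 0.1]]

score = mean_cosine_similarity(base_acts, sft_acts)
print(f"mean cosine similarity: {score:.4f}")
```

A score close to 1.0 here says only that the paired vectors point in nearly the same direction; it says nothing about which features produced them.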

This reliance on dense similarity metrics is a critical blind spot. Cosine similarity measures the angular alignment of high-dimensional vectors but fails to capture changes in the sparse, semantic composition of those vectors. Research indicates that while the cosine similarity between base and SFT model activations remains exceptionally high, the underlying feature representations diverge significantly when analyzed through a mechanistic lens.

The misconception arises because dense embeddings can maintain directional similarity even when the constituent features driving those directions have been fundamentally altered. SFT may rewire the model to prioritize different semantic concepts while keeping the aggregate vector orientation similar. Without a high-resolution diagnostic tool, engineers cannot distinguish between a model that has retained its reasoning capabilities and one that has shifted to superficial pattern matching or safety-compliant heuristics.
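
A toy construction may make this concrete. The sketch below assumes a hypothetical overcomplete dictionary (three feature directions in a two-dimensional space) in which two directions are nearly parallel. The dictionary, the codes, and the helper `decode` are all invented for illustration and are not taken from the article's experiments.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def decode(codes, dictionary):
    """Dense vector = sum over features of code_i * direction_i."""
    dim = len(dictionary[0])
    return [sum(c * d[j] for c, d in zip(codes, dictionary)) for j in range(dim)]

# Hypothetical dictionary: features 0 and 1 are distinct but nearly parallel.
dictionary = [
    [1.0, 0.0],
    [0.98, 0.2],
    [0.0, 1.0],
]

base_codes = [1.0, 0.0, 0.0]  # base model uses feature 0 only
sft_codes = [0.0, 1.0, 0.0]   # fine-tuned model uses feature 1 only

dense_sim = cosine_similarity(decode(base_codes, dictionary),
                              decode(sft_codes, dictionary))
base_active = {i for i, c in enumerate(base_codes) if c > 0}
sft_active = {i for i, c in enumerate(sft_codes) if c > 0}
shared = base_active & sft_active

print(f"dense cosine similarity: {dense_sim:.3f}")
print(f"shared active features: {sorted(shared)}")
```

In this toy case the dense vectors are almost aligned even though the two models share no active feature, which is the failure mode the paragraph above describes.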

Sparse Autoencoders (SAEs) pretrained on the base model provide the necessary resolution to detect these shifts. By projecting activations through a fixed, interpretable dictionary of features, SAEs reveal that SFT induces substantial divergence in sparse latents, even when dense metrics suggest stability. This divergence is not random; it exhibits task-specific and layer-specific distributions, indicating that fine-tuning systematically targets and alters precise semantic features rather than uniformly perturbing the model.
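
As a rough sketch of what projecting through a fixed dictionary means, the block below applies one frozen encoder to activations from both models. It assumes a simple ReLU encoder of the form ReLU(W_enc x + b_enc); real SAE implementations vary (for example, some subtract a decoder bias first or use a top-k activation), and the weights and activations here are made-up toy values.

```python
def sae_encode(x, w_enc, b_enc):
    """Sparse latents under an assumed ReLU encoder: max(0, W_enc @ x + b_enc)."""
    return [
        max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
        for row, b in zip(w_enc, b_enc)
    ]

# Hypothetical frozen encoder, trained on the base model and never updated.
W_ENC = [
    [1.0, -1.0],
    [-1.0, 1.0],
    [0.5, 0.5],
]
B_ENC = [0.0, 0.0, -0.5]

base_activation = [1.0, 0.8]  # toy activation from the base model
sft_activation = [0.8, 1.0]   # toy activation from the fine-tuned model

# The same encoder is applied to both models, so any difference in the
# latents comes from the activations, not from the dictionary.
base_latents = sae_encode(base_activation, W_ENC, B_ENC)
sft_latents = sae_encode(sft_activation, W_ENC, B_ENC)

print("base latents:", base_latents)
print("sft latents: ", sft_latents)
```

Here latent 2 fires equally for both inputs, while latents 0 and 1 swap roles: a shift that is visible only at the feature level.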

WOW Moment: Key Findings

The discrepancy between dense and sparse metrics reveals a hidden layer of model behavior that standard evaluation pipelines miss. The following comparison illustrates the divergence detected when moving from aggregate vector similarity to feature-level analysis.

| Metric Category | Evaluation Method | Base vs. SFT Model Result | Interpretation |
| --- | --- | --- | --- |
| Dense Geometry | Cosine Similarity | > 0.96 | Suggests minimal change; model geometry appears preserved. |
| Sparse Latents | SAE Feature Overlap | < 0.45 | Reveals significant divergence; underlying features have shifted. |
| Feature Magnitude | L1 Divergence | High | Indicates substantial redistribution of activation energy across features. |
| Layer Profile | Layer-wise Analysis | Non-uniform | Safety alignment shows distinct update patterns compared to task-specific layers. |
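
The article does not define how the overlap and divergence figures are computed, so the sketch below is one plausible reading rather than the method behind the table. It assumes "SAE Feature Overlap" is the Jaccard overlap of the active-feature sets and "L1 Divergence" is the sum of absolute differences between latent vectors. The latent values are invented.

```python
def active_features(latents, threshold=0.0):
    """Indices of latents that fire above the threshold."""
    return {i for i, v in enumerate(latents) if v > threshold}

def feature_overlap(base_latents, sft_latents, threshold=0.0):
    """Jaccard overlap of active-feature sets (assumed definition)."""
    base_set = active_features(base_latents, threshold)
    sft_set = active_features(sft_latents, threshold)
    union = base_set | sft_set
    if not union:
        return 1.0  # nothing fires in either model: treat as identical
    return len(base_set & sft_set) / len(union)

def l1_divergence(base_latents, sft_latents):
    """Sum of absolute latent differences (assumed definition)."""
    return sum(abs(b - s) for b, s in zip(base_latents, sft_latents))

# Hypothetical latents for one token under the same frozen SAE.
base_latents = [0.9, 0.0, 0.4, 0.0, 0.2]
sft_latents = [0.0, 0.7, 0.4, 0.3, 0.0]

overlap = feature_overlap(base_latents, sft_latents)
divergence = l1_divergence(base_latents, sft_latents)
print(f"feature overlap: {overlap:.2f}, L1 divergence: {divergence:.2f}")
```

In practice these per-token values would be averaged over a validation set, and the activation threshold would need tuning for the SAE in use.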

Why This Matters: Because sparse latents diverge significantly even when cosine similarity stays high, SAE-based analysis lets engineers detect representational drift that dense metrics mask. This capability is essential for:

  1. Safety Verification: Identifying whether safety fine-tuning has inadvertently suppressed critical reasoning features or introduced brittle heuristics.
  2. Task Specificity: Pinpointing exactly which semantic features are modified during domain adaptation, allowing for targeted interventions.
  3. Model Diagnostics: Moving beyond black-box evaluation to mechanistic understanding of how fine-tuning alters internal computation.

Core Solution

To mechanistically investigate SFT impact, implement a diagnostic pipeline that projects model activations through a Sparse Autoencoder pretrained on the base model. This approach treats the SAE as a fixed dictionary, ensuring that changes in latent activations reflect genuine shifts in the model's feature usage rather than changes in the dictionary itself.
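
One way such a pipeline might be organized is sketched below: for each layer, the same frozen SAE encodes the base and fine-tuned activations, and a dense and a sparse score are reported side by side. This is an assumption-laden outline, not the article's implementation. The function names, the ReLU encoder form, the Jaccard overlap, and all toy weights and activations are hypothetical.

```python
import math

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def sae_encode(x, w_enc, b_enc):
    """Assumed ReLU encoder: max(0, W_enc @ x + b_enc)."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(w_enc, b_enc)]

def diagnose_layer(base_act, sft_act, w_enc, b_enc):
    """Compare one layer's activations densely and through the frozen SAE."""
    base_on = {i for i, v in enumerate(sae_encode(base_act, w_enc, b_enc)) if v > 0}
    sft_on = {i for i, v in enumerate(sae_encode(sft_act, w_enc, b_enc)) if v > 0}
    union = base_on | sft_on
    return {
        "dense_cosine": cosine_similarity(base_act, sft_act),
        "sparse_overlap": len(base_on & sft_on) / len(union) if union else 1.0,
    }

def diagnose_model(base_acts, sft_acts, saes):
    """Run the per-layer comparison for every layer that has an SAE."""
    return {layer: diagnose_layer(base_acts[layer], sft_acts[layer], *saes[layer])
            for layer in saes}

# Hypothetical per-layer SAE weights (w_enc, b_enc), frozen after base-model training.
toy_sae = ([[1.0, -1.0], [-1.0, 1.0], [0.5, 0.5]], [0.0, 0.0, -0.5])
saes = {0: toy_sae, 1: toy_sae}

# Toy activations: layer 0 is untouched by fine-tuning, layer 1 has shifted.
base_acts = {0: [1.0, 0.8], 1: [1.0, 0.8]}
sft_acts = {0: [1.0, 0.8], 1: [0.8, 1.0]}

report = diagnose_model(base_acts, sft_acts, saes)
for layer, scores in report.items():
    print(layer, scores)
```

In this toy run, layer 1 keeps a dense cosine similarity above 0.97 while only one of three active features is shared, so the per-layer report surfaces a shift that the dense score alone would hide.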

Architecture Decisions

  1. Fixed SAE Dictionary: A
