How LLMs Transform Writing Style: A Stylometric Experiment

By Codcompass Team·2026-05-28·9 min read

Quantifying Stylistic Displacement in LLM Rewrites: A Function-Word Approach

Current Situation Analysis

Engineering teams increasingly deploy large language models for text transformation tasks: tone adjustment, localization, summarization, and style normalization. The industry's default quality metric remains semantic similarity or human preference scoring. Both approaches miss a critical dimension: stylistic displacement. When an LLM rewrites a passage, it doesn't just change words; it shifts the underlying syntactic and grammatical habits of the text. Treating style as a binary "AI vs human" classification problem is theoretically fragile and practically useless for production pipelines. Style is a continuous, measurable distribution, not a switch.

The misunderstanding stems from conflating semantic content with syntactic structure. Modern embedding models (BERT, OpenAI embeddings, etc.) are optimized for meaning preservation. They compress text into dense vectors where stylistic markers are drowned out by semantic signals. A rewrite that preserves the exact same plot points but replaces literary subordination with analytical coordination will score nearly identically on semantic similarity, yet read completely differently to a human editor.

Empirical evidence confirms that stylistic drift is systematic, not random. In a controlled experiment using 80 French literary passages (40 from Émile Zola, 40 from Guy de Maupassant), four major models were instructed to rewrite each passage into a neutral, factual register while preserving meaning. Each text was mapped to a 57-dimensional vector of L2-normalized function-word frequencies. The stylistic shift was quantified using cosine distance between the original and rewritten vectors. The results revealed consistent, reproducible displacement patterns across all models. Gemini Pro averaged a shift of 0.230, Claude 3 averaged 0.170, Mistral 7B averaged 0.139, and GPT-4 averaged 0.132. These differences persisted across two distinct prompt variations, which suggests that models leave measurable stylistic traces that are robust to the prompt phrasings tested.

The industry pain point is clear: without a quantitative baseline for stylistic displacement, teams cannot guarantee style consistency in automated rewriting pipelines. You cannot route, validate, or fallback effectively if you cannot measure how far the output has drifted from the source register.

WOW Moment: Key Findings

The most actionable insight from the experiment is not that every model produces a unique fingerprint. The statistical structure reveals two effective clusters with one clear outlier, not four distinct profiles.

Model	Mean Shift	95% Bootstrap CI	Statistical Group
Gemini Pro	0.230	[0.204, 0.256]	High-Drift Outlier
Claude 3	0.170	[0.147, 0.195]	Intermediate / Ambiguous
Mistral 7B	0.139	[0.124, 0.157]	Low-Drift Cluster
GPT-4	0.132	[0.113, 0.152]	Low-Drift Cluster

Pairwise permutation tests (Bonferroni-corrected, 10,000 permutations) confirm the clustering:

GPT-4 vs Mistral 7B: p = 1.000 (statistically indistinguishable)
GPT-4 vs Claude 3: p = 0.103 (not significant after correction)
Claude 3 vs Gemini Pro: p = 0.007 (significant)
GPT-4 vs Gemini Pro: p < 0.001 (significant)
Mistral 7B vs Gemini Pro: p < 0.001 (significant)
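
The pairwise comparison above can be sketched as a two-sample permutation test over per-passage drift values. This is only an illustration, not the experiment's actual code: the function names (`seededRandom`, `permutationPValue`, `bonferroni`), the label-shuffling scheme, and the small seeded generator are all assumptions chosen to keep the example reproducible.

```typescript
// Park-Miller style seeded generator so repeated runs give the same p-value.
// An assumption of this sketch; any uniform source on [0, 1) would do.
function seededRandom(seed: number): () => number {
  let state = (Math.floor(Math.abs(seed)) % 2147483646) + 1;
  return function (): number {
    state = (state * 16807) % 2147483647;
    return (state - 1) / 2147483646;
  };
}

function meanOf(values: number[]): number {
  let total = 0;
  for (let i = 0; i < values.length; i++) total += values[i];
  return total / values.length;
}

// Two-sided permutation test on the absolute difference of group means.
function permutationPValue(
  groupA: number[],
  groupB: number[],
  permutations: number,
  random: () => number
): number {
  const observed = Math.abs(meanOf(groupA) - meanOf(groupB));
  const pooled = groupA.concat(groupB);
  let atLeastAsExtreme = 0;

  for (let p = 0; p < permutations; p++) {
    // Fisher-Yates shuffle reassigns group labels at random.
    for (let i = pooled.length - 1; i > 0; i--) {
      const j = Math.floor(random() * (i + 1));
      const tmp = pooled[i];
      pooled[i] = pooled[j];
      pooled[j] = tmp;
    }
    const permA = pooled.slice(0, groupA.length);
    const permB = pooled.slice(groupA.length);
    if (Math.abs(meanOf(permA) - meanOf(permB)) >= observed) atLeastAsExtreme++;
  }
  // Add-one correction keeps the estimate from being exactly zero.
  return (atLeastAsExtreme + 1) / (permutations + 1);
}

// Bonferroni correction: multiply by the number of comparisons and cap at 1.
function bonferroni(pValue: number, comparisons: number): number {
  return Math.min(1, pValue * comparisons);
}
```

The `comparisons` argument should equal the number of pairwise tests actually run. The cap at 1 is why a corrected value such as p = 1.000 can appear for two models whose drift distributions overlap heavily.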

Why this matters: You don't need four separate style profiles. You can design routing logic around two behavioral tiers. The low-drift cluster (GPT-4, Mistral 7B) preserves syntactic density and literary rhythm, making these models suitable for creative localization or tone-preserving edits. The high-drift tier (Gemini Pro) systematically injects explicit causal connectors and analytical framing, which is ideal for technical documentation, compliance summaries, or educational content. Claude 3 occupies a transitional zone, useful when you need moderate formalization without full analytical restructuring.

This finding enables predictable style-aware routing. Instead of guessing which model "sounds right," you can set a drift threshold (e.g., ≤0.15 for creative, ≥0.20 for analytical) and automatically select or fallback based on measured displacement.
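
As a rough sketch of what threshold-based routing could look like, the function below classifies a measured drift value into one of three bands. The names `classifyDrift`, `DriftThresholds`, and `StyleRoute` are hypothetical, and sending in-between values to a `review` band is an assumption of this example rather than something the experiment prescribes.

```typescript
type StyleRoute = 'creative' | 'analytical' | 'review';

interface DriftThresholds {
  maxCreativeDrift: number;   // e.g. 0.15, the creative ceiling mentioned above
  minAnalyticalDrift: number; // e.g. 0.20, the analytical floor mentioned above
}

// Maps a cosine-distance drift value onto a routing band.
function classifyDrift(drift: number, thresholds: DriftThresholds): StyleRoute {
  if (!isFinite(drift) || drift < 0) {
    throw new RangeError('drift must be a finite, non-negative cosine distance');
  }
  if (drift <= thresholds.maxCreativeDrift) return 'creative';
  if (drift >= thresholds.minAnalyticalDrift) return 'analytical';
  return 'review'; // between the two thresholds: neither goal is clearly met
}

// Checks whether an output meets the style goal of the task that produced it,
// which is the signal a fallback step would act on.
function meetsStyleGoal(
  drift: number,
  goal: 'creative' | 'analytical',
  thresholds: DriftThresholds
): boolean {
  return classifyDrift(drift, thresholds) === goal;
}
```

With thresholds of 0.15 and 0.20, the reported mean shifts for GPT-4 (0.132) and Mistral 7B (0.139) fall in the creative band, Gemini Pro (0.230) in the analytical band, and Claude 3 (0.170) in between. Individual rewrites will scatter around those means, so per-output measurement still matters.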

Core Solution

Building a stylometric drift measurement pipeline requires separating syntactic style from semantic content. The most robust approach uses function-word frequency vectors, cosine distance, and non-parametric statistical validation. Below is a production-ready TypeScript implementation that abstracts the measurement logic into a reusable pipeline.

Architecture Decisions & Rationale

Function-Word Lexicon over Embeddings: Dense embeddings conflate meaning and structure. Function words (articles, prepositions, conjunctions, pronouns) carry syntactic habits with minimal semantic load. A curated 50–60 word lexicon captures style without noise.
L2 Normalization: Texts vary in length. L2 normalization ensures vector magnitude doesn't bias distance calculations, making cosine distance a pure measure of directional alignment in style-space.
Cosine Distance: 1 - cosine_similarity measures angular separation. A value of 0 means identical function-word distribution; 0.23 indicates substantial structural reorganization.
Bootstrap Confidence Intervals: Style distributions are rarely normal. Resampling (5,000 iterations) provides robust CIs without parametric assumptions.
Permutation Testing: Validates whether observed drift differences are statistically meaningful or within corpus variance.
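
To make the normalization and distance steps concrete, here is a small worked example on toy counts. The three-word lexicon and the counts are invented for illustration, and the helper names `l2Normalize` and `cosineDistance` are assumptions of this sketch.

```typescript
// Scales a count vector to unit length; a zero vector is returned unchanged.
function l2Normalize(counts: number[]): number[] {
  let sumSquares = 0;
  for (let i = 0; i < counts.length; i++) sumSquares += counts[i] * counts[i];
  const norm = Math.sqrt(sumSquares);
  if (norm === 0) return counts.slice();
  const unit: number[] = [];
  for (let i = 0; i < counts.length; i++) unit.push(counts[i] / norm);
  return unit;
}

// Cosine distance: 1 minus the dot product of the two unit vectors.
function cosineDistance(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error('vectors must share the same lexicon');
  const ua = l2Normalize(a);
  const ub = l2Normalize(b);
  let dot = 0;
  for (let i = 0; i < ua.length; i++) dot += ua[i] * ub[i];
  return 1 - dot;
}

// Toy counts for a hypothetical three-word lexicon, e.g. ["de", "et", "que"].
const original = [3, 4, 0];
const rewrite = [4, 3, 0];
const doubled = [6, 8, 0]; // same proportions as `original`, twice as many tokens
const disjoint = [0, 0, 5];

console.log(cosineDistance(original, rewrite));  // dot = 24, norms 5 and 5: 1 - 24/25, about 0.04
console.log(cosineDistance(original, doubled));  // about 0: text length alone is not drift
console.log(cosineDistance(original, disjoint)); // 1: no function words in common
```

Because function-word counts are never negative, the distance stays between 0 and 1, which is what makes a single scalar threshold workable.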

Implementation

// style-drift-pipeline.ts
import { createHash } from 'crypto';

export interface FunctionWordLexicon {
  [word: string]: number; // frequency weight or binary flag
}

export interface StyleVector {
  dimensions: number[];
  length: number;
}

export interface DriftMeasurement {
  sourceId: string;
  targetId: string;
  cosineDistance: number;
  bootstrapCI: [number, number];
  permutationPValue?: number;
}

export class StylometricAnalyzer {
  private lexicon: string[];
  private vectorDimension: number;

  constructor(lexicon: string[]) {
    this.lexicon = lexicon.map(w => w.toLowerCase());
    this.vectorDimension = this.lexicon.length;
  }

  /**
   * Converts raw text into an L2-normalized function-word frequency vector.
   * Strips punctuation, lowercases, and counts occurrences of lexicon terms.
   */
  public buildStyleVector(text: string): StyleVector {
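    // Punctuation, apostrophes included, becomes whitespace, so a French elision
    // such as "l'homme" yields the tokens "l" and "homme". List elided forms
    // ("l", "d", "qu") in the lexicon if they should be counted.
    const normalized = text
      .replace(/[^\w\sàâäéèêëïîôùûüÿçœæ]/gi, ' ')
      .toLowerCase()
      .split(/\s+/)
      .filter(Boolean);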

    const frequencies = new Array(this.vectorDimension).fill(0);
    const wordCounts = new Map<string, number>();

    normalized.forEach(token => {
      wordCounts.set(token, (wordCounts.get(token) || 0) + 1);
    });

    this.lexicon.forEach((lexeme, idx) => {
      frequencies[idx] = wordCounts.get(lexeme) || 0;
    });

    return this.normalizeL2(frequencies);
  }

  /**
   * Calculates cosine distance between two style vectors.
   * Returns 0 for identical distributions, approaches 1 for orthogonal styles.
   */
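  public calculateDrift(source: StyleVector, target: StyleVector): number {
    if (source.dimensions.length !== target.dimensions.length) {
      throw new Error('Style vectors must be built from the same lexicon');
    }
    const dotProduct = source.dimensions.reduce((sum, val, i) => sum + val * target.dimensions[i], 0);
    const magnitudeProduct = source.length * target.length;

    // A text with no lexicon words produces a zero vector, so the angle is undefined.
    // This method reports 0 in that case; callers may prefer to filter such texts out first.
    if (magnitudeProduct === 0) return 0;

    const similarity = dotProduct / magnitudeProduct;
    // Clamp so floating-point error never yields a slightly negative distance.
    return Math.max(0, 1 - similarity);
  }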

  /**
   * Generates bootstrap confidence intervals for drift measurements.
   * Resamples the drift distribution to estimate uncertainty.
   */
  public computeBootstrapCI(
    driftSamples: number[],
    resamples: number = 5000,
    confidenceLevel: number = 0.95
  ): [number, number] {
    const bootstrapMeans: number[] = [];
    const n = driftSamples.length;

    for (let r = 0; r < resamples; r++) {
      let sampleSum = 0;
      for (let i = 0; i < n; i++) {
        const idx = Math.floor(Math.random() * n);
        sampleSum += driftSamples[idx];
      }
      bootstrapMeans.push(sampleSum / n);
    }

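    bootstrapMeans.sort((a, b) => a - b);
    // Clamp both percentile indices so extreme confidence levels stay inside the array.
    const lowerIdx = Math.max(0, Math.floor((1 - confidenceLevel) / 2 * resamples));
    const upperIdx = Math.min(resamples - 1, Math.floor((1 + confidenceLevel) / 2 * resamples));

    return [bootstrapMeans[lowerIdx], bootstrapMeans[upperIdx]];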
  }

  private normalizeL2(vector: number[]): StyleVector {
    const magnitude = Math.sqrt(vector.reduce((sum, val) => sum + val * val, 0));
    const normalized = magnitude === 0 ? vector : vector.map(v => v / magnitude);
    return { dimensions: normalized, length: magnitude === 0 ? 0 : 1 };
  }
}

// Usage Example: Measuring drift across a batch of rewrites
export async function runDriftAnalysis(
  analyzer: StylometricAnalyzer,
  sourceTexts: string[],
  rewrittenTexts: string[]
): Promise<DriftMeasurement[]> {
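  if (sourceTexts.length !== rewrittenTexts.length) {
    throw new Error('sourceTexts and rewrittenTexts must be aligned pairwise');
  }
  const results: DriftMeasurement[] = [];
  const rawDrifts: number[] = [];

  for (let i = 0; i < sourceTexts.length; i++) {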
    const srcVec = analyzer.buildStyleVector(sourceTexts[i]);
    const tgtVec = analyzer.buildStyleVector(rewrittenTexts[i]);
    const distance = analyzer.calculateDrift(srcVec, tgtVec);
    rawDrifts.push(distance);

    results.push({
      sourceId: createHash('sha256').update(sourceTexts[i]).digest('hex').slice(0, 8),
      targetId: createHash('sha256').update(rewrittenTexts[i]).digest('hex').slice(0, 8),
      cosineDistance: distance,
      bootstrapCI: [0, 0] // Placeholder, computed after batch
    });
  }

  // The interval describes the mean drift of the whole batch, so every result shares it.
  const ci = analyzer.computeBootstrapCI(rawDrifts);
  return results.map(r => ({ ...r, bootstrapCI: ci }));
}

Why This Architecture Works in Production

Stateless Vectorization: The buildStyleVector method is pure and idempotent. You can cache results per text hash, avoiding redundant computation in high-throughput pipelines.
Lexicon Swappability: The constructor accepts any language-specific function-word list. French, English, Spanish, or German lexicons can be injected without modifying the core math.
Statistical Rigor Built-In: Bootstrap CI calculation is decoupled from drift measurement, allowing you to run permutation tests or Bayesian estimation later without refactoring.
Memory Efficient: Vectors are fixed-size arrays. No matrix libraries required. Suitable for edge deployment or serverless functions with strict memory limits.
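
The caching idea in the first point might look like the wrapper below. It is a sketch under stated assumptions: `withVectorCache` and `hashText` are hypothetical names, the djb2-style hash is a lightweight stand-in for the SHA-256 digest used elsewhere in this article, and the eviction policy (clear everything at the cap) is deliberately crude.

```typescript
type Vectorizer = (text: string) => number[];

// djb2-style string hash; a lightweight stand-in for SHA-256 in this sketch.
function hashText(text: string): string {
  let hash = 5381;
  for (let i = 0; i < text.length; i++) {
    hash = (((hash << 5) + hash) ^ text.charCodeAt(i)) >>> 0;
  }
  return hash.toString(16);
}

interface CacheEntry {
  text: string;
  vector: number[];
}

// Wraps a pure vectorizer with an in-memory cache keyed by text hash.
function withVectorCache(vectorize: Vectorizer, maxEntries: number): Vectorizer {
  let cache: { [key: string]: CacheEntry } = {};
  let size = 0;

  return function (text: string): number[] {
    const key = hashText(text);
    const hit = cache[key];
    // Compare the stored text too, so a hash collision never returns the wrong vector.
    if (hit !== undefined && hit.text === text) return hit.vector;

    const vector = vectorize(text);
    if (size >= maxEntries) {
      cache = {}; // crude eviction: drop everything once the cap is reached
      size = 0;
    }
    if (cache[key] === undefined) size++;
    cache[key] = { text: text, vector: vector };
    return vector;
  };
}
```

Cached vectors are returned by reference, so callers should treat them as read-only. This only pays off because vectorization is deterministic for a fixed lexicon; swapping the lexicon means starting a fresh cache.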

Pitfall Guide

Pitfall	Explanation	Fix
Semantic Leakage via Embeddings	Using transformer embeddings (e.g., `text-embedding-3-small`) to measure style conflates meaning with syntax. Two texts with identical plot but different conjunctions will score near-zero distance.	Restrict analysis to function-word frequencies. Validate by checking that semantic similarity scores remain high while style distance varies.
Ignoring Statistical Overlap	Treating mean drift differences (e.g., 0.132 vs 0.139) as meaningful without confidence intervals or permutation tests. Small corpus variance can create false distinctions.	Always compute bootstrap CIs (≥5,000 resamples) and run Bonferroni-corrected permutation tests before routing decisions.
Lexicon Confirmation Bias	Pre-selecting function words that favor a specific model's known tendencies (e.g., including `tandis`, `néanmoins` to catch formal models). This artificially inflates drift for those models.	Use a linguistically grounded, closed-class lexicon (articles, prepositions, conjunctions, pronouns). Validate lexicon neutrality by testing on human-authored control texts.
Drift-Quality Conflation	Assuming higher stylistic displacement equals lower quality. Gemini's 0.230 shift isn't "worse"; it's analytically restructured. Creative tasks may require low drift; technical tasks may require high drift.	Decouple drift measurement from quality scoring. Use drift as a routing signal, not a pass/fail metric. Pair with task-specific evaluation (e.g., BLEU for technical, human preference for creative).
Prompt Sensitivity Blindness	Testing only one instruction and assuming the drift signature is universal. Models respond differently to "make it formal" vs "simplify for general audience."	Run multi-prompt validation. If drift varies >15% across prompts, the model's style is instruction-dependent, not intrinsic. Adjust routing thresholds accordingly.
Corpus Genre Lock-In	Training or validating on a single genre (e.g., 19th-century French literature) and applying thresholds to modern technical docs. Function-word distributions shift dramatically across registers.	Maintain genre-stratified baselines. Compute drift relative to the source register, not a universal human centroid.
Ignoring Directional Drift	Measuring only magnitude (cosine distance) without analyzing which function words drive the shift. Two models can have identical drift magnitude but opposite syntactic tendencies.	Implement directional analysis: track per-word frequency deltas. Route based on vector direction, not just distance.
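
The directional analysis suggested in the last row could be sketched as follows. The function names are hypothetical, and using each word's share of all counted function-word tokens (rather than the L2-normalized components) is an assumption made here for readability.

```typescript
interface WordDelta {
  word: string;
  delta: number; // share in the rewrite minus share in the source
}

// Converts raw counts into shares of all counted function-word tokens.
function toShares(counts: number[]): number[] {
  let total = 0;
  for (let i = 0; i < counts.length; i++) total += counts[i];
  const shares: number[] = [];
  for (let i = 0; i < counts.length; i++) {
    shares.push(total === 0 ? 0 : counts[i] / total);
  }
  return shares;
}

// Lists per-word changes, largest absolute change first.
function directionalDeltas(
  lexicon: string[],
  sourceCounts: number[],
  rewriteCounts: number[]
): WordDelta[] {
  if (lexicon.length !== sourceCounts.length || lexicon.length !== rewriteCounts.length) {
    throw new Error('lexicon and count vectors must have the same length');
  }
  const src = toShares(sourceCounts);
  const tgt = toShares(rewriteCounts);
  const deltas: WordDelta[] = [];
  for (let i = 0; i < lexicon.length; i++) {
    deltas.push({ word: lexicon[i], delta: tgt[i] - src[i] });
  }
  deltas.sort(function (x, y) { return Math.abs(y.delta) - Math.abs(x.delta); });
  return deltas;
}

// Hypothetical counts for a four-word French lexicon.
const drivers = directionalDeltas(['et', 'que', 'donc', 'car'], [6, 3, 0, 1], [2, 1, 5, 2]);
console.log(drivers[0]); // "donc" rises from 0% to 50% of counted tokens
```

Two rewrites with the same cosine distance can produce very different lists here, for example one adding causal connectors and another dropping coordination, which is the case the magnitude alone cannot distinguish.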

Production Bundle

Action Checklist

Define a closed-class function-word lexicon for your target language(s) using linguistic references, not model output.
Implement L2-normalized frequency vectorization with deterministic tokenization rules.
Establish baseline drift thresholds using a control set of human-authored rewrites in your target domain.
Compute bootstrap confidence intervals (5,000 resamples) for all model drift measurements before deployment.
Run permutation tests to verify statistical separation between candidate models.
Decouple drift routing from quality scoring; use drift as a style-preservation signal, not a correctness metric.
Cache vector representations per text hash to reduce latency in high-volume pipelines.
Validate lexicon neutrality by measuring drift on human-to-human rewrites; expected baseline should be <0.05.

Decision Matrix

Scenario	Recommended Approach	Why	Cost Impact
Creative localization / literary editing	Low-drift cluster (GPT-4, Mistral 7B) with drift threshold ≤0.15	Preserves syntactic density, rhythm, and authorial voice	Moderate API cost; higher human review overhead if drift exceeds threshold
Technical documentation / compliance summaries	High-drift tier (Gemini Pro) with drift threshold ≥0.20	Injects explicit causal/analytical structure; improves scannability	Lower review cost; predictable output structure reduces editing time
Multi-language pipeline	Language-specific lexicons + unified drift calculator	Function words are language-bound; math remains identical across locales	Higher initial lexicon curation cost; scales linearly with language count
Real-time user-facing rewriting	Embedding similarity for speed + function-word drift for async validation	Latency constraints prevent full vectorization; drift used for post-hoc quality gates	Minimal latency impact; drift calculation deferred to background workers

Configuration Template

{
  "stylometric_pipeline": {
    "lexicon_source": "./lexicons/fr_function_words.json",
    "normalization": "L2",
    "distance_metric": "cosine",
    "statistical_validation": {
      "bootstrap_resamples": 5000,
      "confidence_level": 0.95,
      "permutation_tests": 10000,
      "bonferroni_correction": true
    },
    "routing_thresholds": {
      "creative_preservation": { "max_drift": 0.15, "fallback_model": "mistral-7b" },
      "analytical_transformation": { "min_drift": 0.20, "preferred_model": "gemini-pro" },
      "ambiguous_zone": { "drift_range": [0.15, 0.20], "action": "human_review" }
    },
    "caching": {
      "enabled": true,
      "ttl_seconds": 86400,
      "hash_algorithm": "sha256"
    }
  }
}
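
Thresholds like these are easy to misconfigure, for instance by letting the creative ceiling overlap the analytical floor. The check below is a hedged sketch of a sanity pass over the `routing_thresholds` block; `validateRoutingThresholds` is a hypothetical helper, and the rule that the ambiguous zone must span exactly the gap between the two thresholds is an assumption of this example.

```typescript
interface RoutingThresholds {
  creative_preservation: { max_drift: number; fallback_model: string };
  analytical_transformation: { min_drift: number; preferred_model: string };
  ambiguous_zone: { drift_range: [number, number]; action: string };
}

// Returns a list of human-readable problems; an empty list means the block looks consistent.
function validateRoutingThresholds(t: RoutingThresholds): string[] {
  const problems: string[] = [];
  const maxDrift = t.creative_preservation.max_drift;
  const minDrift = t.analytical_transformation.min_drift;
  const low = t.ambiguous_zone.drift_range[0];
  const high = t.ambiguous_zone.drift_range[1];

  // Cosine distance between non-negative frequency vectors lies in [0, 1].
  const values = [maxDrift, minDrift, low, high];
  for (let i = 0; i < values.length; i++) {
    if (!isFinite(values[i]) || values[i] < 0 || values[i] > 1) {
      problems.push('threshold out of range [0, 1]: ' + values[i]);
    }
  }
  if (maxDrift >= minDrift) {
    problems.push('creative max_drift must be below analytical min_drift');
  }
  if (low !== maxDrift || high !== minDrift) {
    problems.push('ambiguous_zone.drift_range should span [max_drift, min_drift]');
  }
  return problems;
}
```

Running a check like this at startup surfaces a bad threshold edit before any text is routed.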

Quick Start Guide

Prepare your lexicon: Export a JSON array of 50–60 language-specific function words (articles, prepositions, conjunctions, pronouns). Ensure no content words are included.
Initialize the analyzer: Instantiate StylometricAnalyzer with your lexicon. Run a dry pass on 10 source texts to verify vector dimensions match lexicon length.
Generate baseline drift: Rewrite 20 texts using your target LLMs. Compute cosine distances and bootstrap CIs. Compare against human-to-human rewrites to establish your domain's natural drift floor.
Configure routing thresholds: Set max_drift and min_drift values based on your statistical clusters. Deploy the pipeline with fallback logic and async drift validation.
Monitor directional shifts: Log per-word frequency deltas alongside aggregate drift. If a model's vector direction shifts over time (e.g., due to model updates), recalibrate thresholds quarterly.

Stylistic displacement is no longer a subjective editorial concern. It's a quantifiable engineering parameter. By measuring function-word drift, validating statistical separation, and routing based on directional style profiles, you transform LLM rewriting from a black box into a predictable, style-aware pipeline.
