Difficulty: Intermediate

How LLMs Transform Writing Style: A Stylometric Experiment

By Codcompass Team · 9 min read

Quantifying Stylistic Displacement in LLM Rewrites: A Function-Word Approach

Current Situation Analysis

Engineering teams increasingly deploy large language models for text transformation tasks: tone adjustment, localization, summarization, and style normalization. The default quality metrics remain semantic similarity and human preference scoring. Both miss a critical dimension: stylistic displacement. When an LLM rewrites a passage, it doesn't just change words; it shifts the underlying syntactic and grammatical habits of the text. Treating style as a binary "AI vs human" classification problem is theoretically fragile and of little practical use in production pipelines. Style is a continuous, measurable distribution, not a switch.

The misunderstanding stems from conflating semantic content with syntactic structure. Modern embedding models (BERT, OpenAI embeddings, etc.) are optimized for meaning preservation. They compress text into dense vectors where stylistic markers are drowned out by semantic signals. A rewrite that preserves the same plot points but replaces literary subordination with analytical coordination will score nearly identically on semantic similarity, yet read completely differently to a human editor.

Empirical evidence indicates that stylistic drift is systematic, not random. In a controlled experiment using 80 French literary passages (40 from Émile Zola, 40 from Guy de Maupassant), four major models were instructed to rewrite each passage into a neutral, factual register while preserving meaning. Each text was mapped to a 57-dimensional vector of L2-normalized function-word frequencies. The stylistic shift was quantified as the cosine distance between the original and rewritten vectors. The results showed consistent, reproducible displacement patterns across all models. Gemini Pro averaged a shift of 0.230, Claude 3 averaged 0.170, Mistral 7B averaged 0.139, and GPT-4 averaged 0.132. These differences persisted across two distinct prompt variations, which suggests that models leave measurable stylistic traces that do not depend on the exact prompt phrasing.
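The measurement described above can be sketched in a few lines of Python. This is a minimal illustration, not the experiment's implementation: the 57-word list and the tokenizer used in the study are not given in this section, so `FUNCTION_WORDS` is a short hypothetical stand-in, and the names `function_word_vector`, `cosine_distance`, and `stylistic_shift` are assumptions made for the example.

```python
import math
import re
from collections import Counter

# Hypothetical, abbreviated list. The experiment used 57 function words
# that this section does not enumerate.
FUNCTION_WORDS = ["le", "la", "les", "de", "des", "et",
                  "que", "qui", "dans", "mais", "car", "donc"]


def function_word_vector(text, vocab=FUNCTION_WORDS):
    """Relative frequency of each function word, scaled to unit L2 norm."""
    # Assumed tokenizer: lowercase runs of Latin letters (elisions such as
    # "l'homme" are split at the apostrophe).
    tokens = re.findall(r"[a-zà-öø-ÿœ]+", text.lower())
    counts = Counter(tokens)
    total = len(tokens) or 1
    freqs = [counts[w] / total for w in vocab]
    norm = math.sqrt(sum(f * f for f in freqs))
    # Leave an all-zero vector untouched to avoid dividing by zero.
    return [f / norm for f in freqs] if norm else freqs


def cosine_distance(u, v):
    """1 - cosine similarity; defined as 1.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    if nu == 0 or nv == 0:
        return 1.0
    return 1.0 - dot / (nu * nv)


def stylistic_shift(original, rewrite):
    """Cosine distance between the function-word profiles of two texts."""
    return cosine_distance(function_word_vector(original),
                           function_word_vector(rewrite))
```

Because every frequency is non-negative, the shift always falls between 0 (identical function-word profile) and 1 (no function words in common).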

The industry pain point is clear: without a quantitative baseline for stylistic displacement, teams cannot guarantee style consistency in automated rewriting pipelines. You cannot route, validate, or fall back effectively if you cannot measure how far the output has drifted from the source register.

WOW Moment: Key Findings

The most actionable insight from the experiment is that the models do not each produce a unique fingerprint. The statistical structure reveals a low-drift cluster, one clear high-drift outlier, and one ambiguous model in between, not four distinct profiles.

Model        Mean Shift   95% Bootstrap CI   Statistical Group
Gemini Pro   0.230        [0.204, 0.256]     High-Drift Outlier
Claude 3     0.170        [0.147, 0.195]     Intermediate / Ambiguous
Mistral 7B   0.139        [0.124, 0.157]     Low-Drift Cluster
GPT-4        0.132        [0.113, 0.152]     Low-Drift Cluster
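For readers who want to reproduce this kind of interval on their own data, here is a minimal sketch of a percentile bootstrap for a mean shift. The article does not state which bootstrap variant or how many resamples were used, so the percentile method, the default of 10,000 resamples, and the function name `bootstrap_mean_ci` are all assumptions.

```python
import random
import statistics


def bootstrap_mean_ci(values, n_resamples=10_000, alpha=0.05, seed=0):
    """Return (mean, lower, upper) using a percentile bootstrap (assumed variant)."""
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    n = len(values)
    # Resample with replacement and record the mean of each resample.
    means = sorted(
        statistics.fmean(rng.choices(values, k=n)) for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return statistics.fmean(values), lower, upper
```

Here `values` would be the list of per-passage cosine distances for one model.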

Pairwise permutation tests (Bonferroni-corrected, 10,000 permutations) confirm the clustering:

  • GPT-4 vs Mistral 7B: p = 1.000 (statistically indistinguishable)
  • GPT-4 vs Claude 3: p = 0.103 (not significant after correction)
  • Claude 3 vs Gemini Pro: p = 0.007 (significant)
  • GPT-4 vs Gemini Pro: p < 0.001 (significant)
  • Mistral 7B vs Gemini Pro: p < 0.001 (significant)
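A pairwise comparison of this kind can be sketched as follows. This is an illustration under stated assumptions: it uses an unpaired label-shuffling test on the difference in means, whereas the article does not say whether its test was paired (a sign-flip test on per-passage differences would be a reasonable alternative, since all models rewrote the same passages). The names `permutation_p_value` and `bonferroni` are hypothetical.

```python
import random
import statistics


def permutation_p_value(a, b, n_permutations=10_000, seed=0):
    """Two-sided permutation p-value for a difference in means (unpaired, assumed)."""
    rng = random.Random(seed)
    observed = abs(statistics.fmean(a) - statistics.fmean(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # randomly reassign group labels
        diff = abs(statistics.fmean(pooled[:len(a)])
                   - statistics.fmean(pooled[len(a):]))
        if diff >= observed:
            hits += 1
    # The +1 terms keep the estimate away from an impossible p-value of zero.
    return (hits + 1) / (n_permutations + 1)


def bonferroni(p, n_comparisons):
    """Bonferroni correction: scale by the number of tests, capped at 1.0."""
    return min(1.0, p * n_comparisons)
```

Four models give six possible pairs, so `n_comparisons` would be 6 if every pair is tested. The cap at 1.0 is why a corrected value can appear as exactly p = 1.000.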

Why this matters: You don't need four separate style profiles. You can design routing logic around two behavioral tiers. The low-drift cluster (GPT-4, Mistral 7B) preserves syntactic density and literary rhythm, making them suitable for creative localization or tone-preserving edits. The high-drift tier (Gemini Pro) systematically injects explicit causal connectors and analytical framing, which is ideal for technical documentation, compliance summaries, or educational content. Claude 3 occupies a transitional zone, useful when you need moderate formalization without full analytical restructuring.

This finding enables predictable style-aware routing. Instead of guessing which model "sounds right," you can set a drift threshold (e.g., ≤0.15 for creative, ≥0.20 for analytical) and automatically select a model, or fall back to another, based on measured displacement.
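One way such threshold-based routing might look is sketched below. The thresholds come from the example values above; everything else is hypothetical: the `route` and `drift_ok` names, the `"creative"` and `"analytical"` target labels, and the idea of passing rewriters and the drift measure in as plain callables rather than calling any particular model API.

```python
from typing import Callable, Optional, Sequence, Tuple

CREATIVE_MAX_DRIFT = 0.15    # example threshold from the text
ANALYTICAL_MIN_DRIFT = 0.20  # example threshold from the text


def drift_ok(drift: float, target: str) -> bool:
    """Check a measured drift against the target register's threshold."""
    if target == "creative":
        return drift <= CREATIVE_MAX_DRIFT
    if target == "analytical":
        return drift >= ANALYTICAL_MIN_DRIFT
    raise ValueError(f"unknown target register: {target!r}")


def route(
    source: str,
    target: str,
    rewriters: Sequence[Tuple[str, Callable[[str], str]]],
    measure_drift: Callable[[str, str], float],
) -> Optional[Tuple[str, str]]:
    """Try rewriters in priority order; return the first acceptable (name, text)."""
    for name, rewrite in rewriters:
        candidate = rewrite(source)
        if drift_ok(measure_drift(source, candidate), target):
            return name, candidate
    return None  # nothing met the threshold; the caller decides what to do
```

Measuring the actual output, rather than trusting a model's average drift, lets the pipeline fall back when a particular rewrite lands outside the expected tier.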

Core Solution

Building a stylometric drift measure
