ment pipeline requires separating syntactic style from semantic content. The most robust approach uses function-word frequency vectors, cosine distance, and non-parametric statistical validation. Below is a production-ready TypeScript implementation that abstracts the measurement logic into a reusable pipeline.
Architecture Decisions & Rationale
- Function-Word Lexicon over Embeddings: Dense embeddings conflate meaning and structure. Function words (articles, prepositions, conjunctions, pronouns) carry syntactic habits with minimal semantic load. A curated 50–60 word lexicon captures style without noise.
- L2 Normalization: Texts vary in length. L2 normalization scales every count vector to unit length, so raw counts from long texts do not dominate, and cosine distance reduces to one minus the dot product: a pure measure of directional alignment in style-space.
- Cosine Distance: `1 - cosine_similarity` measures angular separation. A value of 0 means an identical function-word distribution; 0.23 indicates substantial structural reorganization.
- Bootstrap Confidence Intervals: Style distributions are rarely normal. Resampling (5,000 iterations) provides robust CIs without parametric assumptions.
- Permutation Testing: Validates whether observed drift differences are statistically meaningful or within corpus variance.
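The normalization and distance steps in the list above can be written compactly. The notation here is an assumption introduced for illustration and does not appear in the original text: $c_i(t)$ is the count of the $i$-th lexicon word in text $t$, $k$ is the lexicon size, and $d(s, t)$ is the drift between a source $s$ and a rewrite $t$.

```latex
\mathbf{v}(t) = \frac{\mathbf{c}(t)}{\lVert \mathbf{c}(t) \rVert_2},
\qquad
\lVert \mathbf{c}(t) \rVert_2 = \sqrt{\sum_{i=1}^{k} c_i(t)^2}

d(s, t) = 1 - \frac{\mathbf{c}(s) \cdot \mathbf{c}(t)}
                   {\lVert \mathbf{c}(s) \rVert_2 \, \lVert \mathbf{c}(t) \rVert_2}
        = 1 - \mathbf{v}(s) \cdot \mathbf{v}(t)
```

Because word counts are never negative, the dot product of two unit vectors lies in $[0, 1]$, so $d$ also stays in $[0, 1]$: 0 for proportional count vectors, 1 for texts that share no lexicon words.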
Implementation
```typescript
// style-drift-pipeline.ts
import { createHash } from 'crypto';

export interface FunctionWordLexicon {
  [word: string]: number; // frequency weight or binary flag
}

export interface StyleVector {
  dimensions: number[]; // L2-normalized function-word counts
  length: number;       // Euclidean norm after normalization: 1, or 0 if no lexicon word occurred
}

export interface DriftMeasurement {
  sourceId: string;
  targetId: string;
  cosineDistance: number;
  bootstrapCI: [number, number];
  permutationPValue?: number;
}

export class StylometricAnalyzer {
  private lexicon: string[];
  private vectorDimension: number;

  constructor(lexicon: string[]) {
    this.lexicon = lexicon.map(w => w.toLowerCase());
    this.vectorDimension = this.lexicon.length;
  }

  /**
   * Converts raw text into an L2-normalized function-word frequency vector.
   * Strips punctuation, lowercases, and counts occurrences of lexicon terms.
   */
  public buildStyleVector(text: string): StyleVector {
    // Apostrophes become spaces, so French elisions such as "l'" are counted as "l".
    const tokens = text
      .replace(/[^\w\sàâäéèêëïîôùûüÿçœæ]/gi, ' ')
      .toLowerCase()
      .split(/\s+/)
      .filter(Boolean);

    const wordCounts = new Map<string, number>();
    for (const token of tokens) {
      wordCounts.set(token, (wordCounts.get(token) || 0) + 1);
    }

    const frequencies = this.lexicon.map(lexeme => wordCounts.get(lexeme) || 0);
    return this.normalizeL2(frequencies);
  }

  /**
   * Calculates cosine distance between two style vectors.
   * Returns 0 for identical distributions and 1 for orthogonal styles.
   */
  public calculateDrift(source: StyleVector, target: StyleVector): number {
    if (source.dimensions.length !== target.dimensions.length) {
      throw new Error('Style vectors must be built from the same lexicon');
    }
    // A text with no lexicon hits has no measurable style. Report 0 only when
    // both vectors are empty; otherwise treat the pair as maximally distant
    // instead of silently reporting "identical".
    if (source.length === 0 || target.length === 0) {
      return source.length === target.length ? 0 : 1;
    }
    const dotProduct = source.dimensions.reduce(
      (sum, val, i) => sum + val * target.dimensions[i],
      0
    );
    const similarity = dotProduct / (source.length * target.length);
    return Math.max(0, 1 - similarity); // Cosine distance, clamped against floating-point error
  }

  /**
   * Generates a percentile bootstrap confidence interval for the mean drift.
   * Resamples the drift distribution with replacement to estimate uncertainty.
   */
  public computeBootstrapCI(
    driftSamples: number[],
    resamples: number = 5000,
    confidenceLevel: number = 0.95
  ): [number, number] {
    const n = driftSamples.length;
    if (n === 0) {
      throw new Error('Cannot bootstrap an empty sample');
    }

    const bootstrapMeans: number[] = [];
    for (let r = 0; r < resamples; r++) {
      let sampleSum = 0;
      for (let i = 0; i < n; i++) {
        const idx = Math.floor(Math.random() * n);
        sampleSum += driftSamples[idx];
      }
      bootstrapMeans.push(sampleSum / n);
    }
    bootstrapMeans.sort((a, b) => a - b);

    // Clamp the upper index so confidenceLevel = 1 cannot read past the array.
    const lowerIdx = Math.floor(((1 - confidenceLevel) / 2) * resamples);
    const upperIdx = Math.min(
      resamples - 1,
      Math.floor(((1 + confidenceLevel) / 2) * resamples)
    );
    return [bootstrapMeans[lowerIdx], bootstrapMeans[upperIdx]];
  }

  private normalizeL2(vector: number[]): StyleVector {
    const magnitude = Math.sqrt(vector.reduce((sum, val) => sum + val * val, 0));
    const normalized = magnitude === 0 ? vector : vector.map(v => v / magnitude);
    return { dimensions: normalized, length: magnitude === 0 ? 0 : 1 };
  }
}

// Usage Example: Measuring drift across a batch of rewrites
export async function runDriftAnalysis(
  analyzer: StylometricAnalyzer,
  sourceTexts: string[],
  rewrittenTexts: string[]
): Promise<DriftMeasurement[]> {
  if (sourceTexts.length !== rewrittenTexts.length) {
    throw new Error('sourceTexts and rewrittenTexts must be paired one-to-one');
  }

  const results: DriftMeasurement[] = [];
  const rawDrifts: number[] = [];

  for (let i = 0; i < sourceTexts.length; i++) {
    const srcVec = analyzer.buildStyleVector(sourceTexts[i]);
    const tgtVec = analyzer.buildStyleVector(rewrittenTexts[i]);
    const distance = analyzer.calculateDrift(srcVec, tgtVec);
    rawDrifts.push(distance);
    results.push({
      sourceId: createHash('sha256').update(sourceTexts[i]).digest('hex').slice(0, 8),
      targetId: createHash('sha256').update(rewrittenTexts[i]).digest('hex').slice(0, 8),
      cosineDistance: distance,
      bootstrapCI: [0, 0] // Placeholder, computed after batch
    });
  }

  // The CI describes the batch mean, so every result carries the same interval.
  const ci = analyzer.computeBootstrapCI(rawDrifts);
  return results.map(r => ({ ...r, bootstrapCI: ci }));
}
```
Why This Architecture Works in Production
- Stateless Vectorization: The `buildStyleVector` method is pure and idempotent. You can cache results per text hash, avoiding redundant computation in high-throughput pipelines.
- Lexicon Swappability: The constructor accepts any language-specific function-word list. French, English, Spanish, or German lexicons can be injected without modifying the core math.
- Statistical Rigor Built-In: Bootstrap CI calculation is decoupled from drift measurement, allowing you to run permutation tests or Bayesian estimation later without refactoring.
- Memory Efficient: Vectors are fixed-size arrays. No matrix libraries required. Suitable for edge deployment or serverless functions with strict memory limits.
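The permutation test mentioned in the list above is not part of the class, so here is a minimal sketch of how it might look. It assumes the question is whether two groups of drift scores (say, two models rewriting the same corpus) differ in mean drift. The names `permutationPValue`, `Rng`, and `mean` are hypothetical helpers, not part of the original pipeline, and the injectable random source exists only to make runs reproducible.

```typescript
// Hypothetical helper: not part of the original StylometricAnalyzer class.
type Rng = () => number;

function mean(xs: number[]): number {
  return xs.reduce((sum, x) => sum + x, 0) / xs.length;
}

/**
 * Two-sided permutation test on the difference in mean drift between
 * two groups of drift scores (e.g. model A vs. model B).
 * Returns the estimated p-value under the null hypothesis that group
 * labels are exchangeable.
 */
function permutationPValue(
  driftsA: number[],
  driftsB: number[],
  permutations: number = 10000,
  rng: Rng = Math.random
): number {
  if (driftsA.length === 0 || driftsB.length === 0) {
    throw new Error('Both groups need at least one drift score');
  }
  const observed = Math.abs(mean(driftsA) - mean(driftsB));
  const pooled = driftsA.concat(driftsB);
  const nA = driftsA.length;
  let extreme = 0;

  for (let p = 0; p < permutations; p++) {
    // Fisher-Yates shuffle of the pooled scores, i.e. a random relabeling.
    for (let i = pooled.length - 1; i > 0; i--) {
      const j = Math.floor(rng() * (i + 1));
      const tmp = pooled[i];
      pooled[i] = pooled[j];
      pooled[j] = tmp;
    }
    const diff = Math.abs(mean(pooled.slice(0, nA)) - mean(pooled.slice(nA)));
    // Small tolerance so floating-point ties count as "at least as extreme".
    if (diff >= observed - 1e-12) extreme++;
  }
  // Add-one correction keeps the estimate strictly above zero.
  return (extreme + 1) / (permutations + 1);
}
```

When several model pairs are compared at once, a Bonferroni correction compares each p-value against `alpha / m`, where `m` is the number of comparisons, rather than against `alpha` itself.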
Pitfall Guide
| Pitfall | Explanation | Fix |
|---|---|---|
| Semantic Leakage via Embeddings | Using transformer embeddings (e.g., text-embedding-3-small) to measure style conflates meaning with syntax. Two texts with identical plot but different conjunctions will score near-zero distance. | Restrict analysis to function-word frequencies. Validate by checking that semantic similarity scores remain high while style distance varies. |
| Ignoring Statistical Overlap | Treating mean drift differences (e.g., 0.132 vs 0.139) as meaningful without confidence intervals or permutation tests. Small corpus variance can create false distinctions. | Always compute bootstrap CIs (≥5,000 resamples) and run Bonferroni-corrected permutation tests before routing decisions. |
| Lexicon Confirmation Bias | Pre-selecting function words that favor a specific model's known tendencies (e.g., including tandis, néanmoins to catch formal models). This artificially inflates drift for those models. | Use a linguistically grounded, closed-class lexicon (articles, prepositions, conjunctions, pronouns). Validate lexicon neutrality by testing on human-authored control texts. |
| Drift-Quality Conflation | Assuming higher stylistic displacement equals lower quality. Gemini's 0.230 shift isn't "worse"; it's analytically restructured. Creative tasks may require low drift; technical tasks may require high drift. | Decouple drift measurement from quality scoring. Use drift as a routing signal, not a pass/fail metric. Pair with task-specific evaluation (e.g., BLEU for technical, human preference for creative). |
| Prompt Sensitivity Blindness | Testing only one instruction and assuming the drift signature is universal. Models respond differently to "make it formal" vs "simplify for general audience." | Run multi-prompt validation. If drift varies >15% across prompts, the model's style is instruction-dependent, not intrinsic. Adjust routing thresholds accordingly. |
| Corpus Genre Lock-In | Training or validating on a single genre (e.g., 19th-century French literature) and applying thresholds to modern technical docs. Function-word distributions shift dramatically across registers. | Maintain genre-stratified baselines. Compute drift relative to the source register, not a universal human centroid. |
| Ignoring Directional Drift | Measuring only magnitude (cosine distance) without analyzing which function words drive the shift. Two models can have identical drift magnitude but opposite syntactic tendencies. | Implement directional analysis: track per-word frequency deltas. Route based on vector direction, not just distance. |
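The directional analysis suggested in the last row can be sketched as a per-word delta table. This is an illustration under stated assumptions: `perWordDeltas`, `relativeFrequencies`, and `WordDelta` are hypothetical names, deltas are computed on relative frequencies (lexicon count divided by total tokens) so texts of different lengths stay comparable, and the tokenizer reuses the character class from the pipeline above.

```typescript
// Hypothetical helpers: names are illustrative, not from the original pipeline.
interface WordDelta {
  word: string;
  delta: number; // target relative frequency minus source relative frequency
}

function tokenize(text: string): string[] {
  return text
    .replace(/[^\w\sàâäéèêëïîôùûüÿçœæ]/gi, ' ')
    .toLowerCase()
    .split(/\s+/)
    .filter(Boolean);
}

/** Relative frequency of each lexicon word: count / total tokens in the text. */
function relativeFrequencies(text: string, lexicon: string[]): number[] {
  const tokens = tokenize(text);
  const counts = new Map<string, number>();
  for (const token of tokens) {
    counts.set(token, (counts.get(token) || 0) + 1);
  }
  const total = tokens.length || 1; // avoid division by zero on empty text
  return lexicon.map(word => (counts.get(word.toLowerCase()) || 0) / total);
}

/**
 * Per-word shift between a source text and its rewrite, sorted so the
 * function words that drive the drift come first.
 */
function perWordDeltas(source: string, target: string, lexicon: string[]): WordDelta[] {
  const src = relativeFrequencies(source, lexicon);
  const tgt = relativeFrequencies(target, lexicon);
  return lexicon
    .map((word, i) => ({ word, delta: tgt[i] - src[i] }))
    .sort((a, b) => Math.abs(b.delta) - Math.abs(a.delta));
}
```

Two rewrites with the same cosine distance can then be told apart by the sign pattern of their largest deltas, for example one adding adversative conjunctions while the other removes coordinating ones.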
Production Bundle
Action Checklist
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| Creative localization / literary editing | Low-drift cluster (GPT-4, Mistral 7B) with drift threshold ≤0.15 | Preserves syntactic density, rhythm, and authorial voice | Moderate API cost; higher human review overhead if drift exceeds threshold |
| Technical documentation / compliance summaries | High-drift tier (Gemini Pro) with drift threshold ≥0.20 | Injects explicit causal/analytical structure; improves scannability | Lower review cost; predictable output structure reduces editing time |
| Multi-language pipeline | Language-specific lexicons + unified drift calculator | Function words are language-bound; math remains identical across locales | Higher initial lexicon curation cost; scales linearly with language count |
| Real-time user-facing rewriting | Embedding similarity for speed + function-word drift for async validation | Latency constraints prevent full vectorization; drift used for post-hoc quality gates | Minimal latency impact; drift calculation deferred to background workers |
Configuration Template
```json
{
  "stylometric_pipeline": {
    "lexicon_source": "./lexicons/fr_function_words.json",
    "normalization": "L2",
    "distance_metric": "cosine",
    "statistical_validation": {
      "bootstrap_resamples": 5000,
      "confidence_level": 0.95,
      "permutation_tests": 10000,
      "bonferroni_correction": true
    },
    "routing_thresholds": {
      "creative_preservation": { "max_drift": 0.15, "fallback_model": "mistral-7b" },
      "analytical_transformation": { "min_drift": 0.20, "preferred_model": "gemini-pro" },
      "ambiguous_zone": { "drift_range": [0.15, 0.20], "action": "human_review" }
    },
    "caching": {
      "enabled": true,
      "ttl_seconds": 86400,
      "hash_algorithm": "sha256"
    }
  }
}
```
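One way the `routing_thresholds` section might be consumed is sketched below. The function `routeByDrift` and the `RoutingThresholds` type are hypothetical and only mirror the three tiers in the template. The template lists 0.15 and 0.20 both as tier limits and as the ends of the ambiguous range, so this sketch assumes the boundary values belong to the named tiers and only the open interval between them goes to human review.

```typescript
// Hypothetical types mirroring the routing_thresholds section of the template.
interface RoutingThresholds {
  creativeMaxDrift: number;   // creative_preservation.max_drift
  analyticalMinDrift: number; // analytical_transformation.min_drift
}

type RoutingDecision =
  | 'creative_preservation'
  | 'analytical_transformation'
  | 'human_review';

const DEFAULT_THRESHOLDS: RoutingThresholds = {
  creativeMaxDrift: 0.15,
  analyticalMinDrift: 0.20
};

/**
 * Maps a measured cosine drift onto one of the three routing tiers.
 * Assumption: boundary values (exactly 0.15 or 0.20 with the defaults)
 * go to the named tier, not to human review.
 */
function routeByDrift(
  drift: number,
  thresholds: RoutingThresholds = DEFAULT_THRESHOLDS
): RoutingDecision {
  if (!isFinite(drift) || drift < 0) {
    throw new RangeError('drift must be a finite, non-negative number');
  }
  if (thresholds.creativeMaxDrift > thresholds.analyticalMinDrift) {
    throw new RangeError('creativeMaxDrift must not exceed analyticalMinDrift');
  }
  if (drift <= thresholds.creativeMaxDrift) return 'creative_preservation';
  if (drift >= thresholds.analyticalMinDrift) return 'analytical_transformation';
  return 'human_review';
}
```

Keeping the thresholds in a plain object makes it straightforward to load them from the JSON file and to recalibrate them without touching the routing logic.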
Quick Start Guide
- Prepare your lexicon: Export a JSON array of 50–60 language-specific function words (articles, prepositions, conjunctions, pronouns). Ensure no content words are included.
- Initialize the analyzer: Instantiate `StylometricAnalyzer` with your lexicon. Run a dry pass on 10 source texts to verify vector dimensions match lexicon length.
- Generate baseline drift: Rewrite 20 texts using your target LLMs. Compute cosine distances and bootstrap CIs. Compare against human-to-human rewrites to establish your domain's natural drift floor.
- Configure routing thresholds: Set `max_drift` and `min_drift` values based on your statistical clusters. Deploy the pipeline with fallback logic and async drift validation.
- Monitor directional shifts: Log per-word frequency deltas alongside aggregate drift. Review thresholds quarterly, and recalibrate sooner if a model's vector direction shifts (e.g., after a model update).
Stylistic displacement is no longer a subjective editorial concern. It's a quantifiable engineering parameter. By measuring function-word drift, validating statistical separation, and routing based on directional style profiles, you transform LLM rewriting from a black box into a predictable, style-aware pipeline.