zation, and calculation into discrete, testable units.
Architecture Decisions
- Trigram Extraction: We use 3-character sequences rather than words or sentences. Trigrams balance contextual awareness with computational efficiency. They capture repetitive phrasing patterns without requiring language-specific tokenization or NLP dependencies.
- Shannon Entropy Formula: H = -Σ p(x) log2 p(x) measures the average information content per symbol. Applied to trigram distributions, it reveals how predictable or repetitive the text is. Lower values indicate high repetition; higher values indicate diverse, information-rich content.
- KB Normalization: Raw entropy scores scale with text length. Dividing by kilobytes enables size-independent comparison across files, modules, and repositories.
- Exclusion Pipeline: Executable code, minified assets, and binary files are filtered before analysis. Only comments, docstrings, and metadata headers are processed.
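As a worked illustration of the formula (the two sample strings are invented for this example), consider the 8-character string "abababab". It yields six overlapping trigrams, only two of which are distinct:

```latex
\begin{align*}
\text{trigrams} &: \texttt{aba},\ \texttt{bab},\ \texttt{aba},\ \texttt{bab},\ \texttt{aba},\ \texttt{bab} \\
p(\texttt{aba}) &= p(\texttt{bab}) = \tfrac{3}{6} = \tfrac{1}{2} \\
H &= -\left(\tfrac{1}{2}\log_2\tfrac{1}{2} + \tfrac{1}{2}\log_2\tfrac{1}{2}\right) = 1 \text{ bit}
\end{align*}
```

By contrast, "abcdefgh" also yields six trigrams, but all are distinct, so each has probability 1/6 and H = log2 6 ≈ 2.585 bits. The repetitive string scores lower, which is the signal the analyzer relies on.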
TypeScript Implementation
interface EntropyReport {
  filePath: string;
  entropyBitsPerKb: number;
  trigramCount: number;
  uniqueTrigrams: number;
  repetitionRatio: number;
}

class CitationEntropyAnalyzer {
  private readonly NGRAM_ORDER = 3;
  private readonly BYTES_PER_KB = 1024;

  analyzeFileContent(rawContent: string, filePath: string): EntropyReport {
    const cleanedText = this.stripExecutableSyntax(rawContent);
    const trigrams = this.extractTrigrams(cleanedText);
    if (trigrams.length === 0) {
      return this.createEmptyReport(filePath);
    }
    const frequencyMap = this.buildFrequencyMap(trigrams);
    const entropy = this.computeShannonEntropy(frequencyMap, trigrams.length);
    // Note: String.length counts UTF-16 code units, which equals the byte count only for ASCII text.
    const normalizedEntropy = entropy / (cleanedText.length / this.BYTES_PER_KB);
    return {
      filePath,
      // Round to two decimal places for reporting.
      entropyBitsPerKb: Math.round(normalizedEntropy * 100) / 100,
      trigramCount: trigrams.length,
      uniqueTrigrams: frequencyMap.size,
      // 0 when every trigram is unique; approaches 1 as trigrams repeat.
      repetitionRatio: 1 - (frequencyMap.size / trigrams.length)
    };
  }

  private stripExecutableSyntax(source: string): string {
    // Keep only line comments (//, #) and block comments (/* */).
    // Regex-based: it also matches these markers inside string literals,
    // and it does not match Python triple-quoted docstrings.
    const commentPattern = /(?:\/\/.*$|\/\*[\s\S]*?\*\/|#.*$)/gm;
    const matches = source.match(commentPattern) || [];
    return matches.join('\n').trim();
  }

  private extractTrigrams(text: string): string[] {
    // Lowercase and collapse whitespace runs so formatting does not affect the distribution.
    const normalized = text.toLowerCase().replace(/\s+/g, ' ').trim();
    const trigrams: string[] = [];
    // Slide a 3-character window across the text, one character at a time.
    for (let i = 0; i <= normalized.length - this.NGRAM_ORDER; i++) {
      trigrams.push(normalized.slice(i, i + this.NGRAM_ORDER));
    }
    return trigrams;
  }

  private buildFrequencyMap(trigrams: string[]): Map<string, number> {
    const freq = new Map<string, number>();
    for (const ng of trigrams) {
      freq.set(ng, (freq.get(ng) || 0) + 1);
    }
    return freq;
  }

  private computeShannonEntropy(freqMap: Map<string, number>, total: number): number {
    // H = -Σ p(x) log2 p(x), summed over distinct trigrams.
    let entropy = 0;
    for (const count of freqMap.values()) {
      const probability = count / total;
      if (probability > 0) {
        entropy -= probability * Math.log2(probability);
      }
    }
    return entropy;
  }

  private createEmptyReport(path: string): EntropyReport {
    // Returned when the file contains no comment text to analyze.
    return {
      filePath: path,
      entropyBitsPerKb: 0,
      trigramCount: 0,
      uniqueTrigrams: 0,
      repetitionRatio: 0
    };
  }
}

export { CitationEntropyAnalyzer, EntropyReport };
Why This Structure Works
The class encapsulates stateless analysis logic, making it trivial to unit test and integrate into CI runners. Separating syntax stripping from entropy calculation prevents executable code from contaminating the measurement. The repetitionRatio field provides an immediate heuristic for teams that prefer simpler thresholds alongside Shannon entropy. Normalization to bits/KB ensures that a 50KB file and a 200KB file can be compared directly without size bias.
Pitfall Guide
1. Measuring Entropy on Unfiltered Source Files
Explanation: Running entropy analysis on full source files mixes control flow patterns with attribution text. Code naturally contains repetitive syntax (braces, keywords, operators) that artificially lowers entropy scores.
Fix: Always apply a language-aware comment extractor before analysis. Use AST parsers or regex filters to isolate //, /* */, #, and docstring blocks.
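A minimal sketch of the regex-filter route, assuming a hypothetical `extractComments` helper and a small, illustrative extension-to-pattern table (neither is part of the analyzer above). Regexes cannot distinguish a comment marker from the same characters inside a string literal or URL, so an AST parser is the safer choice where one is available.

```typescript
// Hypothetical lookup table: comment patterns keyed by file extension.
// Illustrative only; extend it for the languages in your repository.
const COMMENT_PATTERNS: Record<string, RegExp> = {
  // Line comments and block comments for C-style languages.
  ".ts": /\/\/.*$|\/\*[\s\S]*?\*\//gm,
  ".go": /\/\/.*$|\/\*[\s\S]*?\*\//gm,
  // Hash comments and triple-quoted docstrings for Python.
  ".py": /#.*$|"""[\s\S]*?"""|'''[\s\S]*?'''/gm,
};

function extractComments(source: string, extension: string): string {
  const pattern = COMMENT_PATTERNS[extension];
  // Unknown language: return nothing rather than analyzing raw code.
  if (!pattern) return "";
  const matches = source.match(pattern) ?? [];
  return matches.join("\n").trim();
}
```

Returning an empty string for unknown extensions is a deliberate choice in this sketch: it keeps unrecognized code out of the measurement instead of silently scoring it.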
2. Treating Low Entropy as a Storage Win
Explanation: Teams sometimes celebrate high compression ratios as a performance benefit. While repetitive text compresses well, it indicates boilerplate pollution that degrades developer experience and search accuracy.
Fix: Track compression and entropy as separate metrics. Optimize for entropy thresholds, not storage savings. Use compression only for archival or CDN delivery.
3. Hardcoding Thresholds Without Baseline Calibration
Explanation: Applying the 4.0–6.0 bits/KB range universally ignores language differences. Python docstrings naturally contain more whitespace and structural markers than Go comments, skewing raw scores.
Fix: Establish per-language baselines during onboarding. Run a one-time scan across your tech stack, calculate median entropy, and set gates relative to your own distribution.
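One way to derive gates from your own distribution is sketched below. The `calibrateGates` function and the 0.8 and 0.7 multipliers are assumptions chosen for illustration; pick factors that match how much deviation from the median your team tolerates.

```typescript
interface LanguageSample {
  language: string; // e.g. "python", "go"
  entropyBitsPerKb: number;
}

interface LanguageGate {
  median: number;
  warning: number;
  critical: number;
}

function median(values: number[]): number {
  const sorted = [...values].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 0 ? (sorted[mid - 1] + sorted[mid]) / 2 : sorted[mid];
}

// Assumed policy: warn below 80% of the language median, fail below 70%.
function calibrateGates(
  samples: LanguageSample[],
  warnFactor = 0.8,
  criticalFactor = 0.7
): Map<string, LanguageGate> {
  // Bucket the baseline scan results by language.
  const byLanguage = new Map<string, number[]>();
  for (const sample of samples) {
    const bucket = byLanguage.get(sample.language) ?? [];
    bucket.push(sample.entropyBitsPerKb);
    byLanguage.set(sample.language, bucket);
  }
  // Derive each language's gates from its own median.
  const gates = new Map<string, LanguageGate>();
  for (const [language, scores] of byLanguage) {
    const m = median(scores);
    gates.set(language, { median: m, warning: m * warnFactor, critical: m * criticalFactor });
  }
  return gates;
}
```

The median is used rather than the mean so that a handful of unusually repetitive or unusually dense files does not shift the gate.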
4. Counting Mandatory License Headers as Boilerplate
Explanation: License blocks, SPDX identifiers, and auto-generated attribution banners are legally required but artificially depress entropy. Including them triggers false positives.
Fix: Configure exclusion patterns for known header templates. Use a skipPatterns array in your analyzer config to ignore files or line ranges matching standard boilerplate.
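A line-level version of the skipPatterns idea might look like the sketch below. The `ExclusionConfig` shape and the `stripBoilerplateLines` name are hypothetical, and the patterns are assumed to be regular-expression sources stored as strings.

```typescript
// Hypothetical config shape; `skipPatterns` holds regex sources as strings.
interface ExclusionConfig {
  skipPatterns: string[];
}

// Drops every line that matches at least one boilerplate pattern.
function stripBoilerplateLines(commentText: string, config: ExclusionConfig): string {
  // No "g" flag, so RegExp.test() carries no state between lines.
  const compiled = config.skipPatterns.map((p) => new RegExp(p, "i"));
  return commentText
    .split("\n")
    .filter((line) => !compiled.some((re) => re.test(line)))
    .join("\n");
}
```

This handles single-line markers such as SPDX identifiers. Multi-line license blocks would need range-based exclusion, which this sketch does not attempt.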
5. Ignoring Context-Window Truncation Effects
Explanation: Multi-agent systems operating near context limits often truncate or repeat citation blocks to fit within token budgets. This creates artificial entropy spikes or drops that don't reflect actual code quality.
Fix: Correlate entropy measurements with agent configuration logs. If entropy drops sharply after a prompt update, investigate context-window management before adjusting CI gates.
6. Neglecting Multi-Language Repository Normalization
Explanation: A monorepo mixing TypeScript, Python, and Rust will show wildly different entropy distributions. Aggregating scores without language stratification produces misleading averages.
Fix: Implement language-aware grouping in your reporting pipeline. Calculate entropy per extension, then weight by file count or lines of code for repository-wide summaries.
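The grouping and weighting step can be sketched as follows, here weighting by lines of code. The `FileScore` and `ExtensionSummary` shapes and both function names are assumptions made for this example, not part of the analyzer's report format.

```typescript
interface FileScore {
  extension: string; // e.g. ".ts", ".py"
  entropyBitsPerKb: number;
  linesOfCode: number;
}

interface ExtensionSummary {
  extension: string;
  fileCount: number;
  totalLines: number;
  meanEntropy: number;
}

// Groups per-file scores by extension and averages within each group.
function summarizeByExtension(files: FileScore[]): ExtensionSummary[] {
  const groups = new Map<string, FileScore[]>();
  for (const file of files) {
    const group = groups.get(file.extension) ?? [];
    group.push(file);
    groups.set(file.extension, group);
  }
  const summaries: ExtensionSummary[] = [];
  for (const [extension, group] of groups) {
    const totalEntropy = group.reduce((sum, f) => sum + f.entropyBitsPerKb, 0);
    const totalLines = group.reduce((sum, f) => sum + f.linesOfCode, 0);
    summaries.push({
      extension,
      fileCount: group.length,
      totalLines,
      meanEntropy: totalEntropy / group.length,
    });
  }
  return summaries;
}

// Repository-wide figure: each language's mean weighted by its share of lines.
function weightedRepositoryEntropy(summaries: ExtensionSummary[]): number {
  const totalLines = summaries.reduce((sum, s) => sum + s.totalLines, 0);
  if (totalLines === 0) return 0;
  return summaries.reduce((sum, s) => sum + s.meanEntropy * s.totalLines, 0) / totalLines;
}
```

Report the per-extension summaries alongside the weighted figure; the single number is only useful for trend lines, while the per-language rows are what reviewers act on.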
7. Overlooking Correlation With Defect Density
Explanation: Entropy measures documentation hygiene, not runtime correctness. Teams sometimes assume low entropy directly causes bugs, leading to misplaced optimization efforts.
Fix: Treat entropy as a leading indicator of review friction and search degradation. Cross-reference with issue tracking data to validate whether low-entropy modules actually experience higher defect rates.
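For the cross-referencing step, a plain Pearson correlation over per-module figures is one possible starting point. The sketch below assumes you have already exported two aligned arrays (entropy per module and defect rate per module); how you obtain the defect data from your issue tracker is outside its scope.

```typescript
// Pearson correlation coefficient between two equally sized samples.
// Returns a value in [-1, 1], or 0 when either sample has no variance.
function pearson(xs: number[], ys: number[]): number {
  if (xs.length !== ys.length || xs.length < 2) {
    throw new Error("pearson requires two arrays of equal length >= 2");
  }
  const n = xs.length;
  const meanX = xs.reduce((a, b) => a + b, 0) / n;
  const meanY = ys.reduce((a, b) => a + b, 0) / n;
  let covariance = 0;
  let varianceX = 0;
  let varianceY = 0;
  for (let i = 0; i < n; i++) {
    const dx = xs[i] - meanX;
    const dy = ys[i] - meanY;
    covariance += dx * dy;
    varianceX += dx * dx;
    varianceY += dy * dy;
  }
  const denominator = Math.sqrt(varianceX * varianceY);
  return denominator === 0 ? 0 : covariance / denominator;
}
```

A strongly negative coefficient would be consistent with low-entropy modules having more defects, but it does not establish causation, and small module counts make the figure unstable.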
Production Bundle
Action Checklist
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| Pure agent development shop | Strict CI gates (<4.0 bits/KB warning, <3.5 failure) | High boilerplate accumulation risks review paralysis and search degradation | Low engineering overhead; prevents long-term review debt |
| Hybrid human/agent team | Soft monitoring + prompt tuning | Human reviewers naturally introduce entropy variation; gates should guide agents, not block humans | Medium overhead; requires prompt iteration and agent configuration management |
| Legacy migration project | Baseline calibration + exclusion tuning | Existing codebases have established documentation patterns; new gates must not flag historical debt | Low cost; focuses on new agent contributions only |
| High-compliance environment | Entropy tracking + header exemption rules | Legal attribution blocks are mandatory; measuring them creates false positives | Medium cost; requires careful pattern matching and compliance validation |
Configuration Template
{
  "entropyAnalyzer": {
    "thresholds": {
      "critical": 3.5,
      "warning": 4.0,
      "target": 6.0
    },
    "exclusions": {
      "patterns": [
        "SPDX-License-Identifier",
        "Copyright.*All rights reserved",
        "Generated by agent framework"
      ],
      "extensions": [".min.js", ".map", ".json"]
    },
    "reporting": {
      "format": "json",
      "outputPath": "./reports/entropy-scan.json",
      "groupBy": "language",
      "includeRepetitionRatio": true
    },
    "ciIntegration": {
      "failOnCritical": true,
      "commentOnPR": true,
      "diffThreshold": 0.5
    }
  }
}
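Before wiring the template into a pipeline, it can help to sanity-check it at load time. The sketch below is an assumption about how such a check might look: the `validateConfig` function is hypothetical, it covers only the `thresholds` and `exclusions` sections, and it treats each exclusion pattern as a regular-expression source.

```typescript
interface Thresholds {
  critical: number;
  warning: number;
  target: number;
}

// Subset of the template; reporting and ciIntegration are omitted here.
interface AnalyzerConfig {
  thresholds: Thresholds;
  exclusions: { patterns: string[]; extensions: string[] };
}

// Returns a list of human-readable problems; an empty list means the config passed.
function validateConfig(config: AnalyzerConfig): string[] {
  const problems: string[] = [];
  const { critical, warning, target } = config.thresholds;
  if (!(critical < warning && warning < target)) {
    problems.push("thresholds must satisfy critical < warning < target");
  }
  for (const pattern of config.exclusions.patterns) {
    try {
      new RegExp(pattern); // Throws SyntaxError on a malformed pattern.
    } catch {
      problems.push(`invalid exclusion pattern: ${pattern}`);
    }
  }
  for (const ext of config.exclusions.extensions) {
    if (!ext.startsWith(".")) {
      problems.push(`extension should start with a dot: ${ext}`);
    }
  }
  return problems;
}
```

Collecting every problem instead of throwing on the first one lets a CI log show all misconfigurations in a single run.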
Quick Start Guide
- Install the analyzer: Add the TypeScript module to your repository and compile it with your existing build toolchain. No external NLP dependencies are required.
- Run a baseline scan: Execute the analyzer against your src/ directory. Generate the initial JSON report to establish per-language entropy baselines.
- Configure exclusions: Update the exclusions.patterns array to match your project's license headers and framework attribution blocks. Re-run the scan to verify false positives are eliminated.
- Add CI gate: Insert a post-build step that reads the JSON report, checks the entropyBitsPerKb against your thresholds, and fails the pipeline if critical limits are breached.
- Validate with PRs: Open a test pull request containing agent-generated code. Confirm that the CI step surfaces entropy deltas and that the report aligns with manual review observations.
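The CI gate step could be sketched as a pure function like the one below. It assumes the JSON report is an array of per-file entries carrying `filePath`, `entropyBitsPerKb`, and `trigramCount`; the report layout is not specified above, so adjust the parsing to whatever your scan actually writes. The `evaluateGate` name is hypothetical.

```typescript
// Assumed report shape: a JSON array of per-file entries (only the fields used here).
interface ReportEntry {
  filePath: string;
  entropyBitsPerKb: number;
  trigramCount: number;
}

interface GateOutcome {
  exitCode: 0 | 1;
  critical: string[];
  warnings: string[];
}

function evaluateGate(
  reportJson: string,
  thresholds: { critical: number; warning: number }
): GateOutcome {
  const entries = JSON.parse(reportJson) as ReportEntry[];
  const critical: string[] = [];
  const warnings: string[] = [];
  for (const entry of entries) {
    // Files with no comment text report 0 by construction; skip them instead of failing.
    if (entry.trigramCount === 0) continue;
    if (entry.entropyBitsPerKb < thresholds.critical) {
      critical.push(entry.filePath);
    } else if (entry.entropyBitsPerKb < thresholds.warning) {
      warnings.push(entry.filePath);
    }
  }
  return { exitCode: critical.length > 0 ? 1 : 0, critical, warnings };
}
```

In a Node-based pipeline step, you would read the report with `fs.readFileSync`, pass its contents to this function, and hand `exitCode` to `process.exit`. Keeping the decision logic free of I/O makes it straightforward to unit test.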