r splits a multi-codepoint character, which can cause rendering artifacts or invalid strings.
2. Regex Caching: Compiled regular expressions are cached to avoid repeated parsing overhead in hot paths.
3. Explicit Replacement Strategy: We enforce replaceAll for string patterns to eliminate ambiguity and require explicit global flags for regex patterns.
4. Locale Injection: Formatting and comparison methods accept locale parameters, defaulting to the runtime environment but allowing override for consistent server-side rendering.
Implementation
interface TextProcessorConfig {
locale: string;
maxSegmentCacheSize?: number;
}
interface SegmentCache {
[key: string]: string[];
}
class TextProcessor {
private config: TextProcessorConfig;
private segmentCache: SegmentCache;
private regexCache: Map<string, RegExp>;
constructor(config: TextProcessorConfig) {
this.config = {
locale: config.locale || 'en-US',
maxSegmentCacheSize: config.maxSegmentCacheSize || 100,
};
this.segmentCache = Object.create(null); // no prototype, so inputs like 'constructor' cannot hit inherited properties
this.regexCache = new Map();
}
/**
* Truncates text based on visual graphemes, not code units.
* Prevents splitting emojis or combined characters.
*/
truncateGraphemes(input: string, limit: number, suffix: string = '…'): string {
if (input.length <= limit) return input;
const segments = this.getGraphemes(input);
if (segments.length <= limit) return input;
const truncated = segments.slice(0, limit).join('');
return truncated + suffix;
}
/**
* Performs batch replacement with safety checks.
* Uses replaceAll so every occurrence is handled; a RegExp pattern
* without the g flag throws a TypeError instead of silently replacing once.
*/
batchReplace(
input: string,
replacements: Array<{
pattern: string | RegExp;
replacement: string | ((match: string) => string);
}>
): string {
let result = input;
for (const { pattern, replacement } of replacements) {
const search = typeof pattern === 'string' ? pattern : this.getCachedRegex(pattern);
// Narrow the replacement type so each call matches one replaceAll overload
result =
typeof replacement === 'string'
? result.replaceAll(search, replacement)
: result.replaceAll(search, replacement);
}
return result;
}
/**
* Normalizes text for accent-insensitive search or comparison.
* Decomposes characters to separate base letters from diacritics.
*/
normalizeForSearch(input: string): string {
return input
.normalize('NFD')
.replace(/[\u0300-\u036f]/g, '')
.toLowerCase();
}
/**
* Formats numbers with locale-aware grouping and decimal separators.
*/
formatNumber(value: number, options?: Intl.NumberFormatOptions): string {
return new Intl.NumberFormat(this.config.locale, options).format(value);
}
private getGraphemes(input: string): string[] {
if (this.segmentCache[input]) {
return this.segmentCache[input];
}
const segmenter = new Intl.Segmenter(this.config.locale, {
granularity: 'grapheme',
});
const segments = Array.from(segmenter.segment(input)).map((s) => s.segment);
// Simple FIFO-style eviction (not true LRU): drop the first key Object.keys returns
const cacheKeys = Object.keys(this.segmentCache);
if (cacheKeys.length >= this.config.maxSegmentCacheSize!) {
delete this.segmentCache[cacheKeys[0]];
}
this.segmentCache[input] = segments;
return segments;
}
private getCachedRegex(pattern: RegExp): RegExp {
const key = pattern.toString(); // "/source/flags" keeps source and flags from running together
if (!this.regexCache.has(key)) {
this.regexCache.set(key, new RegExp(pattern.source, pattern.flags));
}
return this.regexCache.get(key)!;
}
}
// Usage Example
const processor = new TextProcessor({ locale: 'de-DE' });
// Unicode-safe truncation
const bio = 'Hello 👨‍👩‍👧‍👦 World!';
const shortBio = processor.truncateGraphemes(bio, 10);
// Result: "Hello 👨‍👩‍👧‍👦 Wo…" (10 graphemes plus the suffix; the family emoji stays intact)
// Batch replacement
const text = 'Price: $100, Tax: $15';
const updated = processor.batchReplace(text, [
{ pattern: '$', replacement: '€' },
{ pattern: /\d+/g, replacement: (match) => processor.formatNumber(Number(match)) },
]);
// Result: "Price: €100, Tax: €15" (numbers pass through de-DE formatting; values below 1,000 are unchanged)
Rationale:
- truncateGraphemes uses Intl.Segmenter to ensure visual correctness. The cache mitigates performance overhead for repeated strings.
- batchReplace abstracts the difference between string and regex patterns, enforcing replaceAll for strings to prevent the common "first match only" bug.
- normalizeForSearch enables robust search functionality that ignores accents, a requirement for many international applications.
- Regex caching prevents the cost of recompiling patterns in loops or frequent calls.
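The eviction step in getGraphemes removes whichever key Object.keys lists first, which tracks insertion order and not how recently an entry was read. If recency matters, one option is a Map-based cache that re-inserts entries on every read. The sketch below is an illustration only; the LruCache name and its methods are hypothetical and not part of the class above.

```typescript
// Minimal LRU cache sketch built on Map, which iterates keys in insertion order.
class LruCache<K, V> {
  private readonly entries = new Map<K, V>();
  private readonly maxSize: number;

  constructor(maxSize: number) {
    this.maxSize = maxSize;
  }

  get(key: K): V | undefined {
    if (!this.entries.has(key)) return undefined;
    const value = this.entries.get(key) as V;
    // Delete and re-insert so this key becomes the most recently used.
    this.entries.delete(key);
    this.entries.set(key, value);
    return value;
  }

  set(key: K, value: V): void {
    if (this.entries.has(key)) {
      this.entries.delete(key);
    } else if (this.entries.size >= this.maxSize) {
      // The first key in iteration order is the least recently used.
      const oldest = this.entries.keys().next().value;
      if (oldest !== undefined) this.entries.delete(oldest);
    }
    this.entries.set(key, value);
  }
}

const cache = new LruCache<string, number>(2);
cache.set('a', 1);
cache.set('b', 2);
cache.get('a');    // touching 'a' makes 'b' the oldest entry
cache.set('c', 3); // evicts 'b'
```

After these calls, 'a' and 'c' remain cached while 'b' has been evicted.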
Pitfall Guide
1. The Emoji Length Trap
Explanation: str.length counts UTF-16 code units, not visual characters. Emojis like 👨‍👩‍👧‍👦 consist of multiple code points joined by zero-width joiners. For that emoji, str.length returns 11, while the visual length is 1.
Fix: Use Intl.Segmenter for grapheme counts. The spread operator ([...str].length) counts code points, so it still reports 7 for the same emoji. Never use length for UI constraints on user-generated content.
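To make the difference concrete, the sketch below counts the same family emoji three ways. It assumes a runtime where Intl.Segmenter is available; the variable names are illustrative.

```typescript
// Family emoji: four people joined by three zero-width joiners (U+200D).
const family = '\u{1F468}\u200D\u{1F469}\u200D\u{1F467}\u200D\u{1F466}';

const codeUnits = family.length;              // UTF-16 code units
const codePoints = Array.from(family).length; // Unicode code points

const graphemeSegmenter = new Intl.Segmenter('en', { granularity: 'grapheme' });
const graphemes = Array.from(graphemeSegmenter.segment(family)).length; // user-perceived characters

console.log(codeUnits, codePoints, graphemes); // 11 7 1
```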
2. Silent Single Replacement
Explanation: String.prototype.replace only replaces the first occurrence when given a string pattern. Developers often assume it replaces all, leading to incomplete data sanitization.
Fix: Use replaceAll for string patterns. For regex, ensure the g flag is present. Audit all replace calls for missing global flags.
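A short comparison of the three call shapes, as a sketch with made-up input:

```typescript
const raw = 'a-b-c';

const firstOnly = raw.replace('-', '+');     // "a+b-c": only the first match changes
const allString = raw.replaceAll('-', '+');  // "a+b+c"
const allRegex = raw.replace(/-/g, '+');     // "a+b+c"

// replaceAll refuses a regex without the g flag, so the mistake surfaces early.
let threwTypeError = false;
try {
  raw.replaceAll(/-/, '+');
} catch (err) {
  threwTypeError = err instanceof TypeError;
}
```

The thrown TypeError is why passing a regex to replaceAll can double as a guard against a forgotten g flag.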
3. Slice vs. Substring Index Behavior
Explanation: substring swaps arguments if start > end, while slice returns an empty string. slice also supports negative indices to count from the end; substring treats negatives as zero.
Fix: Prefer slice for its predictable behavior and negative index support. Use substring only when you specifically need the argument-swapping behavior, which is rare.
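The differences are easiest to see side by side. A small sketch with an arbitrary sample string:

```typescript
const word = 'JavaScript';

const sliceReversed = word.slice(4, 0);         // "": start > end yields an empty string
const substringReversed = word.substring(4, 0); // "Java": arguments are swapped to (0, 4)

const sliceNegative = word.slice(-6);           // "Script": counts back from the end
const substringNegative = word.substring(-6);   // "JavaScript": negative is treated as 0
```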
4. Regex Injection Vulnerabilities
Explanation: Constructing regex patterns from user input without escaping special characters can lead to catastrophic backtracking or logic errors.
Fix: Escape user input before embedding in regex, or use string methods like includes and replaceAll when regex features are unnecessary. Validate patterns against a whitelist if possible.
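One common way to escape input is to backslash-prefix every regex metacharacter. The helper below is a sketch; escapeRegExp is a hypothetical name, not a built-in used elsewhere in this article.

```typescript
// Escapes characters that have special meaning inside a regular expression.
function escapeRegExp(input: string): string {
  return input.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

const userQuery = 'a.c'; // imagine this came from a search box
const unsafePattern = new RegExp(userQuery);             // "." matches any character
const safePattern = new RegExp(escapeRegExp(userQuery)); // "." matches only a literal dot

const unsafeHit = unsafePattern.test('abc'); // true: unintended match
const safeHit = safePattern.test('abc');     // false
const literalHit = safePattern.test('a.c');  // true
```

When no regex features are needed at all, 'abc'.includes(userQuery) avoids the problem entirely.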
5. Implicit Type Coercion Side Effects
Explanation: Using +str or str * 1 for conversion can yield unexpected results with objects or empty strings. Number('') is 0, which may be confused with valid numeric input.
Fix: Use explicit Number() or parseInt() with a radix. Validate input types before conversion. Handle empty strings explicitly to avoid 0 defaults.
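A possible shape for explicit conversion is a helper that returns null for anything that is not a finite number, including empty or whitespace-only input. parseStrictNumber is a hypothetical name, and returning null is just one choice; throwing would work too.

```typescript
// Converts a string to a finite number, or returns null when the input
// is empty, whitespace-only, or not fully numeric.
function parseStrictNumber(input: string): number | null {
  const trimmed = input.trim();
  if (trimmed === '') return null; // Number('') would silently give 0
  const value = Number(trimmed);
  return Number.isFinite(value) ? value : null;
}

const fromEmpty = parseStrictNumber('');     // null
const fromBlank = parseStrictNumber('   ');  // null
const fromValid = parseStrictNumber(' 42 '); // 42
const fromMixed = parseStrictNumber('12px'); // null (Number('12px') is NaN)
```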
6. Template Literal XSS Risks
Explanation: Interpolating user data directly into HTML templates via backticks injects raw content, enabling XSS attacks.
Fix: Escape HTML entities before interpolation. Use a sanitization library or a custom escape function for any dynamic content rendered in the DOM.
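As a sketch of the custom escape function mentioned in the fix, the helper below replaces the five characters that matter in element text and quoted attribute values. escapeHtml is a hypothetical name, and this does not cover URLs, inline scripts, or CSS contexts, where a dedicated sanitization library is the safer choice.

```typescript
const HTML_ESCAPES: Record<string, string> = {
  '&': '&amp;',
  '<': '&lt;',
  '>': '&gt;',
  '"': '&quot;',
  "'": '&#39;',
};

// Replaces HTML-significant characters with their entity equivalents.
function escapeHtml(input: string): string {
  return input.replace(/[&<>"']/g, (ch) => HTML_ESCAPES[ch] ?? ch);
}

const userName = '<img src=x onerror=alert(1)>';
const unsafeHtml = `<p>Hello, ${userName}</p>`;           // raw markup reaches the DOM
const safeHtml = `<p>Hello, ${escapeHtml(userName)}</p>`; // markup is rendered as text
// safeHtml: <p>Hello, &lt;img src=x onerror=alert(1)&gt;</p>
```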
7. parseInt Radix Omission
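Explanation: parseInt without a radix argument may interpret strings starting with 0 as octal in older (pre-ES5) environments, though modern engines default to decimal. Strings starting with 0x are still parsed as hexadecimal when the radix is omitted. Relying on the default is fragile.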
Fix: Always provide the radix argument, e.g., parseInt(str, 10). This ensures consistent behavior across all environments and signals intent clearly.
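A few calls showing how the radix changes the outcome in a modern engine (sample inputs are arbitrary):

```typescript
const decimalWithZero = parseInt('08', 10);   // 8
const hexWithoutRadix = parseInt('0x1A');     // 26: the 0x prefix switches to base 16
const hexWithRadix10 = parseInt('0x1A', 10);  // 0: parsing stops at "x"
const binary = parseInt('101', 2);            // 5
```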
Production Bundle
Action Checklist
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| Check substring existence | includes() | Readable, ES6+, optimized | Low |
| Extract with negative indices | slice() | Supports negatives, predictable | Low |
| Replace all occurrences | replaceAll() | Explicit, no regex overhead | Low |
| Format currency | Intl.NumberFormat | Locale-aware, handles grouping | Medium |
| Truncate user text | Intl.Segmenter | Unicode-safe, prevents split chars | Medium |
| Regex in loop | Cached RegExp | Avoids recompilation cost | High (if missed) |
| Accent-insensitive search | normalize('NFD') + strip | Handles diacritics correctly | Low |
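For the currency row, a minimal sketch of Intl.NumberFormat with the currency style follows. formatCurrency is a hypothetical wrapper, and the exact output depends on the locale data shipped with the runtime.

```typescript
// Formats an amount as currency for the given locale.
function formatCurrency(amount: number, locale: string, currency: string): string {
  return new Intl.NumberFormat(locale, { style: 'currency', currency }).format(amount);
}

const usd = formatCurrency(1234.5, 'en-US', 'USD'); // "$1,234.50"
// de-DE groups with "." and uses "," as the decimal separator; the symbol
// placement and spacing characters come from the runtime's locale data.
const eur = formatCurrency(1234.5, 'de-DE', 'EUR');
```

Because the space before the currency symbol may be a non-breaking space, comparing formatted output against hand-typed strings in tests can fail unexpectedly.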
Configuration Template
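// text-processor.config.ts
// Assumed module path; TextProcessor must be exported from wherever the class above lives.
import { TextProcessor } from './text-processor';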
export interface TextProcessorOptions {
locale: string;
defaultSuffix: string;
enableSegmentCache: boolean;
maxCacheSize: number;
}
export const defaultConfig: TextProcessorOptions = {
locale: 'en-US',
defaultSuffix: '…',
enableSegmentCache: true,
maxCacheSize: 200,
};
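// Factory function for dependency injection
export function createTextProcessor(overrides?: Partial<TextProcessorOptions>) {
const config = { ...defaultConfig, ...overrides };
// Map onto the fields TextProcessor's constructor reads; defaultSuffix and
// enableSegmentCache are not consumed by the class as shown above.
return new TextProcessor({
locale: config.locale,
maxSegmentCacheSize: config.maxCacheSize,
});
}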
Quick Start Guide
- Install Polyfill (Optional): For older browsers, add the @formatjs/intl-segmenter polyfill to support Intl.Segmenter.
- Import and Configure:
import { createTextProcessor } from './text-processor.config';
const textEngine = createTextProcessor({ locale: 'fr-FR' });
- Process Text:
const result = textEngine.truncateGraphemes('Café 🥐', 6);
console.log(result); // "Café 🥐" (6 graphemes fit the limit even though length is 7)
- Batch Transform:
const cleaned = textEngine.batchReplace('ID: 123, Name: Test', [
{ pattern: 'ID:', replacement: 'Ref:' },
{ pattern: /\d+/g, replacement: '###' },
]);
- Verify Unicode Safety: Test truncation with complex emojis and combined characters to ensure no rendering artifacts. Use the WOW table metrics to validate performance in your environment.
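A verification pass along these lines could start from a couple of known-tricky inputs. The sketch below uses a standalone truncateByGraphemes function (a hypothetical stand-in for the class method, so the snippet runs on its own) and assumes Intl.Segmenter is available.

```typescript
// Standalone grapheme-aware truncation, mirroring the approach described above.
function truncateByGraphemes(input: string, limit: number, suffix = '\u2026'): string {
  const segmenter = new Intl.Segmenter(undefined, { granularity: 'grapheme' });
  const graphemes = Array.from(segmenter.segment(input), (s) => s.segment);
  return graphemes.length <= limit ? input : graphemes.slice(0, limit).join('') + suffix;
}

// Three flag emojis: each is a pair of regional indicator code points.
const flags = '\u{1F1E9}\u{1F1EA}\u{1F1EB}\u{1F1F7}\u{1F1EF}\u{1F1F5}';
// Three accented letters built from "e" plus a combining acute accent.
const accents = 'e\u0301e\u0301e\u0301';

const cutFlags = truncateByGraphemes(flags, 2);     // first two flags + suffix
const cutAccents = truncateByGraphemes(accents, 2); // first two accented letters + suffix
```

A code-unit slice at the same limit would cut the first flag in half and separate an accent from its base letter, which is the artifact this check is meant to catch.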