wareness of JavaScript's internal text representation. The following implementation demonstrates a production-grade utility that replaces scattered string operations with a cohesive, type-safe interface.
Architecture Decisions
- Prefer includes over indexOf: indexOf returns a numeric position, forcing developers to remember the !== -1 idiom. includes returns a boolean, aligning with modern control flow and reducing conditional complexity.
- Use slice exclusively for extraction: substring swaps its arguments when the start index exceeds the end index and treats negative values as 0, creating silent bugs. slice never swaps arguments, counts negative indices from the end of the string, and mirrors Array.prototype.slice.
- Handle Unicode with codePointAt and Intl.Segmenter: charCodeAt only reads 16-bit units, breaking on surrogate pairs. codePointAt returns full code points. For grapheme-aware splitting, Intl.Segmenter provides locale-aware boundaries.
- Leverage tagged templates for DSLs: Tagged template literals separate static text from dynamic values, enabling safe interpolation, validation, and transformation without manual concatenation.
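The first three decisions above can be checked with a few one-liners. This is a minimal sketch with made-up sample values (the `message` string and variable names are illustrative and do not come from the article):

```typescript
// Illustrative value only.
const message = 'status: ready';

// includes() yields a boolean directly; indexOf() needs the !== -1 comparison.
const viaIncludes = message.includes('ready');
const viaIndexOf = message.indexOf('ready') !== -1;

// substring() silently swaps its arguments when start > end; slice() does not.
const swapped = message.substring(6, 0); // 'status'
const strict = message.slice(6, 0);      // ''

// slice() counts negative offsets from the end of the string.
const tail = message.slice(-5);          // 'ready'

// U+1F600 (grinning face) is stored as the surrogate pair D83D DE00.
const emoji = '\u{1F600}';
const firstUnit = emoji.charCodeAt(0);   // 0xD83D, the high surrogate only
const codePoint = emoji.codePointAt(0);  // 0x1F600, the full code point
```

The substring call returns text even though the arguments are in the wrong order, which is the kind of silent bug the second decision is meant to prevent.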
Implementation
interface TextPipelineOptions {
  preserveWhitespace?: boolean;
  encoding?: 'utf8' | 'base64' | 'url';
}

class TextPipeline {
  private readonly source: string;
  private readonly config: Required<TextPipelineOptions>;

  constructor(input: string, options: TextPipelineOptions = {}) {
    this.source = input;
    this.config = {
      preserveWhitespace: options.preserveWhitespace ?? false,
      encoding: options.encoding ?? 'utf8',
    };
  }

  // Substring check that returns a boolean instead of an index.
  hasToken(target: string, caseSensitive = true): boolean {
    const normalizedSource = caseSensitive ? this.source : this.source.toLowerCase();
    const normalizedTarget = caseSensitive ? target : target.toLowerCase();
    return normalizedSource.includes(normalizedTarget);
  }

  // slice semantics: negative indices count from the end, arguments are never swapped.
  extractRange(start: number, end?: number): string {
    return this.source.slice(start, end);
  }

  applyTransform(transform: 'upper' | 'lower' | 'capitalize' | 'slug'): string {
    switch (transform) {
      case 'upper':
        return this.source.toUpperCase();
      case 'lower':
        return this.source.toLowerCase();
      case 'capitalize':
        return this.source.charAt(0).toUpperCase() + this.source.slice(1);
      case 'slug':
        return this.source
          .toLowerCase()
          .trim()
          .replace(/[^\w\s-]/g, '')   // drop everything except word chars, whitespace, hyphens
          .replace(/[\s_]+/g, '-')    // collapse whitespace and underscores into one hyphen
          .replace(/^-+|-+$/g, '');   // strip leading and trailing hyphens
      default:
        return this.source;
    }
  }

  encodePayload(): string {
    switch (this.config.encoding) {
      case 'url':
        return encodeURIComponent(this.source);
      case 'base64': {
        // Convert to UTF-8 bytes first: btoa only accepts code points up to U+00FF.
        // TextEncoder replaces the deprecated unescape(encodeURIComponent(...)) idiom.
        const bytes = new TextEncoder().encode(this.source);
        return btoa(Array.from(bytes, (byte) => String.fromCharCode(byte)).join(''));
      }
      case 'utf8':
      default:
        return this.source;
    }
  }

  normalizeUnicode(form: 'NFC' | 'NFD' | 'NFKC' | 'NFKD' = 'NFC'): string {
    return this.source.normalize(form);
  }

  // Counts user-perceived characters rather than UTF-16 code units.
  countGraphemes(): number {
    const segmenter = new Intl.Segmenter('en', { granularity: 'grapheme' });
    return Array.from(segmenter.segment(this.source)).length;
  }
}
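// Tagged template for safe HTML interpolation
function safeInterpolate(strings: TemplateStringsArray, ...values: unknown[]): string {
  return strings.reduce((accumulator, currentString, index) => {
    // strings has one more entry than values, so the final segment has no value to append.
    if (index >= values.length) {
      return accumulator + currentString;
    }
    // Escape every interpolated value; String(value) on a non-string may still contain markup.
    const escaped = String(values[index])
      .replace(/&/g, '&amp;')
      .replace(/</g, '&lt;')
      .replace(/>/g, '&gt;');
    return accumulator + currentString + escaped;
  }, '');
}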
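To make the mechanics of a tag function concrete, here is a small standalone sketch. The names `html` and `escapeHtml` are hypothetical and are not part of the utility above. They only illustrate that a tag receives the static segments and the dynamic values as separate arguments, with one more segment than values:

```typescript
// Hypothetical helper: escapes the characters that are significant in HTML.
function escapeHtml(text: string): string {
  return text
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;');
}

// Hypothetical tag: strings.length is always values.length + 1.
function html(strings: TemplateStringsArray, ...values: unknown[]): string {
  let out = strings[0];
  values.forEach((value, i) => {
    out += escapeHtml(String(value)) + strings[i + 1];
  });
  return out;
}

const userInput = '<script>alert("x")</script>';
const rendered = html`<p>Hello, ${userInput}!</p>`;
// rendered: '<p>Hello, &lt;script&gt;alert(&quot;x&quot;)&lt;/script&gt;!</p>'
```

Only the interpolated value is escaped. The literal `<p>` markup written by the developer passes through untouched, which is the separation of static and dynamic content that tagged templates provide.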
Rationale Behind Key Choices
- TextPipeline encapsulation: Wrapping string operations in a class avoids extending the global String.prototype and enables consistent configuration (e.g., encoding defaults, whitespace policies).
- Intl.Segmenter for grapheme counting: String.length counts UTF-16 code units, not visible characters. Intl.Segmenter respects locale rules and counts combined emojis, ZWJ sequences, and diacritic clusters as single graphemes.
- Tagged template escaping: The safeInterpolate function demonstrates how tagged templates can intercept dynamic values before concatenation, applying sanitization logic without manual string building. This pattern scales to query builders, markdown renderers, and template engines.
- Encoding fallbacks: Base64 encoding in browsers requires UTF-8 conversion before btoa, which throws InvalidCharacterError for any character above U+00FF. The pipeline handles this transparently.
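The Base64 point can be sketched as a round trip. The helper names `toBase64Utf8` and `fromBase64Utf8` are hypothetical, and the sketch assumes `btoa`, `atob`, `TextEncoder`, and `TextDecoder` are available as globals (modern browsers and recent Node.js versions):

```typescript
// Hypothetical helper: encode text as UTF-8 bytes, then Base64 those bytes.
function toBase64Utf8(text: string): string {
  const bytes = new TextEncoder().encode(text);
  let binary = '';
  for (const byte of bytes) {
    binary += String.fromCharCode(byte); // each byte becomes one char in 0..255
  }
  return btoa(binary);
}

// Hypothetical helper: reverse the steps above.
function fromBase64Utf8(encoded: string): string {
  const binary = atob(encoded);
  const bytes = Uint8Array.from(binary, (ch) => ch.charCodeAt(0));
  return new TextDecoder().decode(bytes);
}

const sample = 'caf\u00E9 \u2713'; // "café ✓", contains a character above U+00FF
const encodedSample = toBase64Utf8(sample);
const decodedSample = fromBase64Utf8(encodedSample);

const cafe = toBase64Utf8('caf\u00E9'); // 'Y2Fmw6k='
```

Calling `btoa(sample)` directly would throw because of the check mark (U+2713). Going through UTF-8 bytes first keeps every input character within the range `btoa` accepts.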
Pitfall Guide
1. Assuming String.length Equals Character Count
Explanation: JavaScript strings are UTF-16 encoded. Characters outside the Basic Multilingual Plane (like emojis or rare CJK characters) require two 16-bit code units. length counts units, not graphemes.
Fix: Use Intl.Segmenter for accurate visual character counts, or codePointAt when iterating. Never rely on length for truncation or validation of user-generated content.
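A short sketch of the difference between code units, code points, and graphemes. The helper `truncateGraphemes` is a hypothetical name, and the example assumes a runtime with `Intl.Segmenter` support:

```typescript
// Family emoji: four emoji code points joined by three zero-width joiners.
const family = '\u{1F468}\u200D\u{1F469}\u200D\u{1F467}\u200D\u{1F466}';

const codeUnits = family.length;              // 11 UTF-16 code units
const codePoints = Array.from(family).length; // 7 code points

const segmenter = new Intl.Segmenter('en', { granularity: 'grapheme' });
const graphemes = Array.from(segmenter.segment(family)).length; // 1 grapheme

// Hypothetical helper: keep at most `max` user-perceived characters.
function truncateGraphemes(text: string, max: number, locale = 'en'): string {
  const seg = new Intl.Segmenter(locale, { granularity: 'grapheme' });
  let out = '';
  let count = 0;
  for (const { segment } of seg.segment(text)) {
    if (count === max) break;
    out += segment;
    count += 1;
  }
  return out;
}

const safe = truncateGraphemes('a' + family + 'b', 2); // 'a' followed by the whole emoji
const broken = ('a' + family + 'b').slice(0, 2);       // 'a' plus a lone high surrogate
```

The `slice` version cuts the emoji in half and leaves an unpaired surrogate, which typically renders as a replacement character.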
2. Relying on the Legacy substr Method
Explanation: substr(start, length) is defined only in Annex B of ECMA-262 as a legacy feature kept for web compatibility, and it is flagged as deprecated. Its second parameter is a length rather than an end index, so it is easy to confuse with substring and slice, which leads to off-by-one errors.
Fix: Replace with slice(start, start + length) when start is non-negative. slice uses end indices, aligns with array methods, and is universally supported without deprecation warnings.
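A possible migration helper, sketched under the assumption that call sites want the old (start, length) calling convention. The name `sliceByLength` is hypothetical. It also covers negative start values, where a plain `slice(start, start + length)` would not match the legacy behavior:

```typescript
// Hypothetical helper: same (start, length) contract as the legacy method,
// built only on slice.
function sliceByLength(text: string, start: number, length: number): string {
  // A negative start counts from the end, clamped at the beginning of the string.
  const from = start < 0 ? Math.max(text.length + start, 0) : start;
  // A negative length yields an empty string.
  return text.slice(from, from + Math.max(length, 0));
}

const orderId = 'order-12345-eu';
const digits = sliceByLength(orderId, 6, 5);  // '12345'
const region = sliceByLength(orderId, -2, 2); // 'eu'
```

For non-negative offsets the helper is unnecessary and `text.slice(start, start + length)` reads more clearly.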
3. Forgetting replace Only Swaps the First Match
Explanation: String.prototype.replace with a string pattern only replaces the first occurrence. Developers often assume global replacement, leading to incomplete sanitization or formatting.
Fix: Use a regex with the g flag for global replacement, or use replaceAll (ES2021+) when available. Note that replaceAll throws a TypeError when given a regex without the g flag. Always verify replacement scope in unit tests.
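The difference in replacement scope is easy to see side by side. In this sketch the sample strings and the helpers `escapeRegExp` and `replaceLiteralAll` are illustrative assumptions. The latter shows one way to get global literal replacement on runtimes without `replaceAll`:

```typescript
const path = 'a/b/c';

const firstOnly = path.replace('/', '.');        // 'a.b/c' (first match only)
const viaRegex = path.replace(/\//g, '.');       // 'a.b.c'
const viaReplaceAll = path.replaceAll('/', '.'); // 'a.b.c'

// Hypothetical helper: escape regex metacharacters in a literal search term.
function escapeRegExp(literal: string): string {
  return literal.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Hypothetical fallback for environments that lack replaceAll.
function replaceLiteralAll(text: string, search: string, replacement: string): string {
  return text.replace(new RegExp(escapeRegExp(search), 'g'), () => replacement);
}

const decimals = replaceLiteralAll('1.5 + 2.5', '.', ','); // '1,5 + 2,5'
```

Passing the replacement through a function avoids special handling of `$` sequences in the replacement string.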
4. Case Conversion Breaking i18n
Explanation: toUpperCase() and toLowerCase() use locale-agnostic mappings. Turkish, Azerbaijani, and other languages have dotted/dotless i variants that break naive case conversion.
Fix: Use toLocaleUpperCase() and toLocaleLowerCase() with explicit locale parameters when handling user-facing text. For internal identifiers, enforce ASCII-only constraints or use Intl.Collator for comparisons.
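The Turkish case can be reproduced in a few lines. This sketch assumes a runtime with full ICU data (the default in current Node.js builds and browsers), and the sample strings are illustrative:

```typescript
const title = 'TITLE';

const generic = title.toLowerCase();              // 'title'
const turkish = title.toLocaleLowerCase('tr-TR'); // 't\u0131tle' (dotless i, U+0131)
const upperTr = 'i'.toLocaleUpperCase('tr-TR');   // '\u0130' (capital I with dot above)

// For comparisons, Intl.Collator can ignore case without converting either string.
// sensitivity: 'accent' treats 'a' and 'A' as equal but 'a' and 'á' as different.
const collator = new Intl.Collator('en', { sensitivity: 'accent' });
const sameIgnoringCase = collator.compare('Hello', 'HELLO') === 0; // true
const sameIgnoringAccent = collator.compare('resume', 'r\u00E9sum\u00E9') === 0; // false
```

The locale argument matters in both directions: applying Turkish rules to English identifiers is as wrong as applying the default rules to Turkish text, so the locale should come from the content, not from the server's environment.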
5. Overcomplicating Simple Checks with Regex
Explanation: Using /pattern/.test(str) for straightforward prefix/suffix or substring checks introduces regex compilation overhead and reduces readability.
Fix: Prefer startsWith, endsWith, and includes. Reserve regex for complex pattern matching, validation, or extraction where native methods lack expressiveness.
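Besides readability, the native methods avoid a class of escaping mistakes. A brief sketch with made-up file names:

```typescript
const filename = 'report.final.csv';

const isCsv = filename.endsWith('.csv');        // true
const isReport = filename.startsWith('report'); // true
const hasFinal = filename.includes('.final.');  // true

// An unescaped '.' in a regex matches any character, so this check is too loose.
const looseRegex = /.csv$/.test('reportXcsv');  // true (false positive)
const strictCheck = 'reportXcsv'.endsWith('.csv'); // false
```

The regex version needs `/\.csv$/` to be correct. The `endsWith` call has no metacharacters to get wrong.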
6. Ignoring trimStart and trimEnd for Parsing
Explanation: trim() removes whitespace from both ends, which can corrupt data when only leading or trailing whitespace needs removal (e.g., parsing CSV fields or log lines).
Fix: Use trimStart() or trimEnd() explicitly. This prevents accidental data loss and makes parsing intent clear to future maintainers.
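A small sketch of one-sided trimming, assuming an input format where leading indentation is meaningful (the sample line is illustrative):

```typescript
// A line from an indentation-sensitive file, with trailing spaces and a CRLF.
const rawLine = '    nested: value  \r\n';

const bothEnds = rawLine.trim();     // 'nested: value' (indentation lost)
const endOnly = rawLine.trimEnd();   // '    nested: value' (indentation kept)

// The indentation depth is still recoverable after trimEnd().
const indent = endOnly.length - endOnly.trimStart().length; // 4
```

Here `trim()` discards the four leading spaces that encode the nesting level, while `trimEnd()` removes only the line terminator and trailing padding.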
7. Mishandling Unicode Equivalence in Search
Explanation: Characters like é can be represented as a single composed code point (U+00E9) or as e + combining acute accent (U+0065 U+0301). Direct string comparison fails across these forms.
Fix: Apply normalize('NFC') before indexing, searching, or storing text. This ensures consistent representation and prevents duplicate entries in databases or search indexes.
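The equivalence problem and its fix, sketched with the two spellings of "café" (variable names are illustrative):

```typescript
const composed = 'caf\u00E9';    // é as a single code point (U+00E9)
const decomposed = 'cafe\u0301'; // e followed by combining acute accent (U+0301)

const rawEqual = composed === decomposed;          // false
const rawFound = decomposed.includes(composed);    // false
const lengths = [composed.length, decomposed.length]; // [4, 5]

const nfcEqual = composed.normalize('NFC') === decomposed.normalize('NFC'); // true

// Without normalization both spellings survive as separate keys.
const rawKeys = new Set([composed, decomposed]).size;                           // 2
const nfcKeys = new Set([composed, decomposed].map((s) => s.normalize('NFC'))).size; // 1
```

Both strings render identically, so the duplicate only shows up in comparisons, lookups, and unique constraints.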
Production Bundle
Action Checklist
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| Simple substring check | includes() / startsWith() | Engine-optimized, zero regex compilation | Low |
| Extracting text with negative offsets | slice() | Consistent negative indexing, no argument swapping | Low |
| Grapheme-aware truncation | Intl.Segmenter | Correctly handles emojis, ZWJ sequences, diacritics | Medium (modern browser/Node 16+) |
| Case-insensitive search with i18n | toLocaleLowerCase() + includes() | Respects locale rules, prevents Turkish i bugs | Low |
| Bulk text replacement | replaceAll() or regex with /g | Explicit scope, avoids first-match-only surprises | Low |
| Safe HTML interpolation | Tagged template literals | Separates static/dynamic content, enables sanitization | Medium |
| Unicode normalization before storage | normalize('NFC') | Prevents duplicate indexing, ensures consistent comparison | Low |
Configuration Template
// string-pipeline.config.ts
export const STRING_PIPELINE_DEFAULTS = {
  // Widened to the unions used by TextPipeline so overrides can pick another value.
  encoding: 'utf8' as 'utf8' | 'base64' | 'url',
  preserveWhitespace: false,
  normalizationForm: 'NFC' as 'NFC' | 'NFD' | 'NFKC' | 'NFKD',
  caseHandling: 'locale-aware' as const,
  graphemeAware: true,
};

export type StringPipelineConfig = typeof STRING_PIPELINE_DEFAULTS;

// overrides defaults to {} so createPipelineConfig() can be called without arguments.
export function createPipelineConfig(
  overrides: Partial<StringPipelineConfig> = {},
): StringPipelineConfig {
  return { ...STRING_PIPELINE_DEFAULTS, ...overrides };
}
Quick Start Guide
- Install/Verify Environment: Ensure Node.js 18+ or a modern browser that supports Intl.Segmenter, replaceAll, and trimStart/trimEnd.
- Import the Utility: Copy the TextPipeline class and safeInterpolate function into your shared utilities directory.
- Initialize with Defaults: Create a pipeline instance using createPipelineConfig() to enforce consistent behavior across your codebase.
- Replace Legacy Calls: Search for indexOf, substr, and manual regex checks. Swap them with hasToken, extractRange, and applyTransform.
- Validate Edge Cases: Run tests with emoji sequences, Turkish locale strings, and mixed Unicode forms to confirm grapheme and normalization handling.