d be excluded from the parent extraction scope to prevent column misalignment.
interface ExtractionContext {
targetTable: HTMLTableElement;
isTopLevel: boolean;
}
function resolveTopLevelTables(root: Document | Element): HTMLTableElement[] {
const candidates = Array.from(root.querySelectorAll('table'));
return candidates.filter(table => {
let ancestor: Node | null = table.parentElement;
while (ancestor) {
if (ancestor instanceof HTMLTableElement) return false;
ancestor = ancestor.parentElement;
}
return true;
});
}
Architecture Rationale: Traversing upward to detect parent <table> elements is O(n) per table but avoids false positives from <div> wrappers or shadow DOM boundaries. This keeps the extraction scope predictable.
Phase 2: Virtual Grid Reconstruction
Spans require a coordinate-aware matrix. Instead of iterating cells linearly, we allocate a 2D array and mark occupied coordinates.
type CellMatrix = (string | null)[][];
function buildVirtualMatrix(table: HTMLTableElement): CellMatrix {
const rows = Array.from(table.rows);
const matrix: CellMatrix = [];
rows.forEach((rowEl, rowIndex) => {
if (!matrix[rowIndex]) matrix[rowIndex] = [];
let colIndex = 0;
Array.from(rowEl.cells).forEach(cell => {
// Advance past cells already claimed by previous rowspans
while (matrix[rowIndex][colIndex] !== undefined) colIndex++;
const rowSpan = Math.max(1, parseInt(cell.getAttribute('rowspan') || '1', 10));
const colSpan = Math.max(1, parseInt(cell.getAttribute('colspan') || '1', 10));
// Reserve rectangular region in the matrix
for (let r = 0; r < rowSpan; r++) {
const targetRow = rowIndex + r;
if (!matrix[targetRow]) matrix[targetRow] = [];
for (let c = 0; c < colSpan; c++) {
const targetCol = colIndex + c;
if (matrix[targetRow][targetCol] === undefined) {
matrix[targetRow][targetCol] = null; // Placeholder for sanitization
}
}
}
colIndex += colSpan;
});
});
return matrix;
}
Why this works: The matrix reserves coordinates before populating values. This prevents column drift when rowspan pushes subsequent cells to the right. The placeholder null allows deferred sanitization without breaking alignment.
Phase 3: Content Sanitization & Normalization
Raw textContent captures invisible markup. We clone the cell, strip non-visual elements, and normalize whitespace.
function sanitizeCellContent(cell: HTMLTableCellElement): string {
if (!cell) return '';
const clone = cell.cloneNode(true) as HTMLElement;
const removalSelectors = 'style, script, noscript, template, link, meta';
clone.querySelectorAll(removalSelectors).forEach(el => el.remove());
// Collapse whitespace, remove zero-width characters
const raw = clone.textContent || '';
return raw
.replace(/[\u200B-\u200D\uFEFF]/g, '')
.replace(/\s+/g, ' ')
.trim();
}
Production Insight: Zero-width joiners and non-breaking spaces frequently corrupt CSV imports. Explicit removal prevents silent parsing failures in downstream tools like Excel or pandas.
Phase 4: Structured Serialization
CSV and JSON require format-specific escaping and key normalization.
function serializeToRFC4180(matrix: CellMatrix, delimiter: string = ','): string {
return matrix
.map(row =>
row.map(cell => {
const value = cell ?? '';
const needsQuoting = value.includes(delimiter) || /["\r\n]/.test(value);
const escaped = value.replace(/"/g, '""');
return needsQuoting ? `"${escaped}"` : escaped;
}).join(delimiter)
)
.join('\r\n');
}
function serializeToJSON(matrix: CellMatrix): string {
if (matrix.length < 2) return '[]';
const headers = matrix[0].map((raw, idx) => normalizeKey(raw, idx));
const dataRows = matrix.slice(1);
const records = dataRows.map(row => {
const record: Record<string, string> = {};
headers.forEach((key, idx) => {
record[key] = row[idx] ?? '';
});
return record;
});
return JSON.stringify(records, null, 2);
}
function normalizeKey(raw: string | null, fallbackIndex: number): string {
if (!raw || !raw.trim()) return `column_${fallbackIndex + 1}`;
return raw
.normalize('NFD')
.replace(/[\u0300-\u036f]/g, '')
.toLowerCase()
.replace(/[^a-z0-9]+/g, '_')
.replace(/^_+|_+$/g, '');
}
Architecture Rationale: RFC 4180 compliance prevents delimiter collision. Key normalization ensures JSON schemas remain stable across page updates. Fallback column naming prevents undefined keys when headers are missing or empty.
Phase 5: Client-Side Delivery
Blob URLs enable zero-server downloads without data URI size limits.
function triggerClientDownload(content: string, filename: string, mimeType: string): void {
const blob = new Blob([content], { type: mimeType });
const objectUrl = URL.createObjectURL(blob);
const anchor = document.createElement('a');
anchor.href = objectUrl;
anchor.download = filename;
anchor.style.display = 'none';
document.body.appendChild(anchor);
anchor.click();
// Cleanup to prevent memory leaks
setTimeout(() => {
document.body.removeChild(anchor);
URL.revokeObjectURL(objectUrl);
}, 100);
}
Why Blob over Data URI: Data URIs encode the entire payload in the URL string, triggering browser limits (~2MB in some environments) and blocking large exports. Blobs stream directly to the download manager.
Pitfall Guide
| Pitfall | Explanation | Fix |
|---|
| Span Alignment Drift | Linear DOM iteration ignores rowspan/colspan, causing columns to shift right after merged cells. | Use a virtual coordinate matrix. Reserve cells before populating values. |
| Invisible Markup Extraction | textContent captures <script>, <style>, and tracking pixels embedded in cells. | Clone the cell, remove non-visual selectors, then extract text. |
| Nested Table Flattening | Extracting all <table> elements merges child tables into parent rows, corrupting column counts. | Traverse upward to detect parent <table> tags. Filter to top-level only. |
| CSV Delimiter Collision | Unescaped commas, quotes, or newlines break spreadsheet parsers. | Implement RFC 4180 escaping: double internal quotes, wrap fields containing delimiters. |
| Blob URL Memory Leaks | Unreleased URL.createObjectURL references accumulate in browser memory. | Call URL.revokeObjectURL() after download triggers. Use setTimeout to ensure DOM cleanup. |
| Header Key Collisions | Duplicate or empty headers generate invalid JSON objects or overwrite values. | Normalize keys to snake_case. Append fallback indices for empty/duplicate headers. |
Assuming textContent Equals Visible Text | CSS display:none or visibility:hidden cells are still captured. | Check window.getComputedStyle(cell).display !== 'none' before inclusion, or filter via class/attribute markers. |
Production Bundle
Action Checklist
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|
| One-off extraction from static HTML | Client-side matrix extractor | Zero infrastructure, runs in console/bookmarklet | $0 |
| Recurring automated pipeline (50+ pages) | Server-side headless (Playwright/Puppeteer) | Handles JS-rendered tables, scheduled runs, error retry | Compute + licensing |
| Multi-table selection with column reordering | Browser extension or custom UI wrapper | User-driven filtering, preview, format switching | Development time |
| High-frequency internal dashboard | Cached API proxy + extraction worker | Decouples extraction from UI, enables rate limiting | Infrastructure overhead |
Configuration Template
// table-extractor.config.ts
export interface ExtractorConfig {
rootSelector: string;
delimiter: string;
outputFormat: 'csv' | 'json';
sanitizeHidden: boolean;
maxRows: number;
}
export const defaultConfig: ExtractorConfig = {
rootSelector: 'body',
delimiter: ',',
outputFormat: 'csv',
sanitizeHidden: true,
maxRows: 5000,
};
export function validateConfig(config: Partial<ExtractorConfig>): ExtractorConfig {
const merged = { ...defaultConfig, ...config };
if (merged.maxRows <= 0) throw new Error('maxRows must be positive');
if (!['csv', 'json'].includes(merged.outputFormat)) {
throw new Error('outputFormat must be csv or json');
}
return merged;
}
Quick Start Guide
- Open Browser Console: Navigate to the target page and open DevTools (F12).
- Inject Extraction Module: Paste the compiled TypeScript/JS pipeline into the console.
- Execute Extraction: Run
const tables = resolveTopLevelTables(document); followed by const matrix = buildVirtualMatrix(tables[0]);.
- Serialize & Download: Call
const csv = serializeToRFC4180(matrix); triggerClientDownload(csv, 'export.csv', 'text/csv;charset=utf-8');.
- Verify Output: Open the downloaded file in a spreadsheet or JSON viewer. Check column alignment and sanitization results.
This pipeline transforms fragile DOM traversal into a deterministic data extraction layer. By decoupling matrix reconstruction from serialization and enforcing strict sanitization, you eliminate the most common failure modes in web table automation. Deploy it as a console utility, bookmarklet, or integrated worker, and scale extraction without manual reconciliation.