servation and LaTeX passthrough. DOCX requires table reconstruction and font styling. JSON serves as the canonical intermediate format.
3. Local-Only Processing: All transformations run in Node.js on the local machine, and no data leaves it. This supports SOC 2, HIPAA, and internal compliance requirements because no third-party API has to be trusted with conversation content.
4. Incremental Updates: The processor tracks exported thread IDs in a local manifest. Re-running the script updates existing files instead of creating duplicates, solving the archival chaos problem.
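The incremental-update idea in item 4 can be sketched as a pure planning step. This is a minimal sketch under stated assumptions, not the processor's actual logic: it assumes a hypothetical manifest that maps each thread ID to the `update_time` recorded at the last export (a richer shape than a plain list of IDs), and the names `planExport` and `ExportAction` are illustrative.

```typescript
// Hypothetical manifest shape: thread ID -> update_time recorded at the last export.
type Manifest = Map<string, string>;

interface ThreadSummary {
  id: string;
  update_time: string;
}

type ExportAction = 'create' | 'update' | 'skip';

// Decide what to do with each thread without touching the filesystem.
function planExport(threads: ThreadSummary[], manifest: Manifest): Map<string, ExportAction> {
  const plan = new Map<string, ExportAction>();
  for (const thread of threads) {
    const previous = manifest.get(thread.id);
    if (previous === undefined) {
      plan.set(thread.id, 'create'); // never exported before
    } else if (previous !== thread.update_time) {
      plan.set(thread.id, 'update'); // changed since the last export: overwrite in place
    } else {
      plan.set(thread.id, 'skip'); // unchanged: nothing to write
    }
  }
  return plan;
}
```

Keeping the decision separate from the file writes makes the behavior easy to test, and a dry-run mode becomes a matter of printing the plan.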
Implementation
```typescript
// src/types.ts
export interface ChatMessage {
  id: string;
  role: 'user' | 'assistant' | 'system';
  content: string;
  timestamp: string;
  metadata?: Record<string, unknown>;
}

export interface ConversationThread {
  id: string;
  title: string;
  create_time: string;
  update_time: string;
  mapping: Record<string, { message?: ChatMessage; children: string[] }>;
}

export interface ExportConfig {
  inputPath: string;
  outputDir: string;
  formats: ('md' | 'json' | 'docx')[];
  filters?: {
    threadIds?: string[];
    dateRange?: { start: string; end: string };
  };
}
```
```typescript
// src/processor.ts
import fs from 'fs/promises';
import path from 'path';
// With "module": "NodeNext" and "type": "module", relative imports need the .js extension.
import type { ExportConfig, ConversationThread, ChatMessage } from './types.js';

export class ConversationProcessor {
  private config: ExportConfig;
  private manifest: Set<string>;

  constructor(config: ExportConfig) {
    this.config = config;
    this.manifest = new Set();
  }

  // Creates the output directory and loads the manifest from a previous run, if any.
  async initialize(): Promise<void> {
    await fs.mkdir(this.config.outputDir, { recursive: true });
    const manifestPath = path.join(this.config.outputDir, '.export-manifest.json');
    try {
      const raw = await fs.readFile(manifestPath, 'utf-8');
      this.manifest = new Set(JSON.parse(raw));
    } catch {
      // First run (or unreadable manifest): start with an empty set.
    }
  }

  // Reads the export, applies filters, and renders each thread in the configured formats.
  async run(): Promise<void> {
    const rawData = await fs.readFile(this.config.inputPath, 'utf-8');
    const threads: ConversationThread[] = JSON.parse(rawData);
    const filtered = threads.filter(t => this.matchesFilter(t));

    for (const thread of filtered) {
      const messages = this.flattenThread(thread);
      // A short ID suffix keeps threads that share a title from overwriting each other.
      const titlePart = this.sanitizeFilename(thread.title || 'untitled');
      const baseName = `${titlePart}_${this.sanitizeFilename(thread.id).slice(0, 8)}`;

      if (this.config.formats.includes('md')) {
        await this.renderMarkdown(messages, thread, baseName);
      }
      if (this.config.formats.includes('json')) {
        await this.renderJSON(messages, thread, baseName);
      }
      if (this.config.formats.includes('docx')) {
        await this.renderDocx(messages, thread, baseName);
      }
      this.manifest.add(thread.id);
    }
    await this.saveManifest();
  }

  // An explicit threadIds list takes precedence over dateRange.
  private matchesFilter(thread: ConversationThread): boolean {
    if (!this.config.filters) return true;
    if (this.config.filters.threadIds?.length) {
      return this.config.filters.threadIds.includes(thread.id);
    }
    if (this.config.filters.dateRange) {
      const t = new Date(thread.create_time).getTime();
      const start = new Date(this.config.filters.dateRange.start).getTime();
      const end = new Date(this.config.filters.dateRange.end).getTime();
      return t >= start && t <= end;
    }
    return true;
  }

  // Depth-first walk from the root (the first node without a message).
  // Every branch is visited, so edited or regenerated replies are all included.
  private flattenThread(thread: ConversationThread): ChatMessage[] {
    const messages: ChatMessage[] = [];
    const rootId = Object.keys(thread.mapping).find(k => !thread.mapping[k].message);
    if (!rootId) return [];

    const traverse = (nodeId: string) => {
      const node = thread.mapping[nodeId];
      if (!node) return; // tolerate child IDs that are missing from the mapping
      if (node.message) messages.push(node.message);
      for (const childId of node.children) {
        traverse(childId);
      }
    };
    traverse(rootId);
    return messages;
  }

  private async renderMarkdown(msgs: ChatMessage[], thread: ConversationThread, baseName: string): Promise<void> {
    // JSON.stringify escapes quotes and backslashes, yielding valid double-quoted YAML scalars.
    const frontmatter = `---\ntitle: ${JSON.stringify(thread.title)}\ncreated: ${JSON.stringify(thread.create_time)}\nupdated: ${JSON.stringify(thread.update_time)}\nthread_id: ${JSON.stringify(thread.id)}\n---\n\n`;
    const roleLabels: Record<ChatMessage['role'], string> = {
      user: '**You**',
      assistant: '**AI**',
      system: '**System**',
    };
    const body = msgs.map(m => {
      return `### ${roleLabels[m.role]} (${new Date(m.timestamp).toLocaleString()})\n\n${m.content}\n`;
    }).join('\n---\n\n');
    const content = frontmatter + body;
    await fs.writeFile(path.join(this.config.outputDir, `${baseName}.md`), content);
  }

  private async renderJSON(msgs: ChatMessage[], thread: ConversationThread, baseName: string): Promise<void> {
    const payload = {
      metadata: { id: thread.id, title: thread.title, created: thread.create_time },
      messages: msgs.map(m => ({ role: m.role, content: m.content, timestamp: m.timestamp }))
    };
    await fs.writeFile(
      path.join(this.config.outputDir, `${baseName}.json`),
      JSON.stringify(payload, null, 2)
    );
  }

  private async renderDocx(msgs: ChatMessage[], thread: ConversationThread, baseName: string): Promise<void> {
    // Placeholder for docx library integration.
    // In production, use the `docx` npm package to construct paragraphs,
    // preserve code blocks via monospace runs, and rebuild tables as Table objects.
    console.log(`[DOCX] Generating ${baseName}.docx (requires docx library setup)`);
  }

  // Collapses every run of characters outside a-z and 0-9 into a single underscore.
  private sanitizeFilename(title: string): string {
    return title.replace(/[^a-z0-9]+/gi, '_').toLowerCase().slice(0, 64);
  }

  private async saveManifest(): Promise<void> {
    const manifestPath = path.join(this.config.outputDir, '.export-manifest.json');
    await fs.writeFile(manifestPath, JSON.stringify([...this.manifest]));
  }
}
```
Why This Architecture Works
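- Deterministic Parsing: The processor reads the export's `mapping` object directly, so output does not depend on DOM rendering or extension injection. Traversing it from the root yields messages in conversation order. Threads with edited or regenerated replies contain sibling branches, so sort by timestamp if strict chronological order matters.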
- Format Isolation: Markdown preserves raw LaTeX and code fences for downstream renderers (Obsidian, Notion, static sites). JSON provides a machine-readable canonical form. DOCX generation is deferred to a dedicated library that understands table reconstruction and font styling.
- Idempotent Execution: The `.export-manifest.json` file tracks processed thread IDs, and output file names are derived deterministically from each thread. Re-running the script overwrites files in place, preventing version drift in knowledge bases.
- Zero Network Dependency: All I/O is local. This eliminates the privacy risk inherent in server-assisted PDF generators while maintaining full control over output formatting.
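To illustrate the traversal described under Deterministic Parsing, here is a minimal sketch that walks a mapping depth-first and then sorts the result by timestamp. The `Msg` and `MappingNode` types are simplified stand-ins for the interfaces in `src/types.ts`, and the explicit sort is an added assumption: a safeguard for threads where sibling branches mean traversal order alone may not be chronological.

```typescript
interface Msg {
  id: string;
  role: string;
  content: string;
  timestamp: string; // assumed to be an ISO 8601 string
}

interface MappingNode {
  message?: Msg;
  children: string[];
}

// Iterative depth-first walk (no recursion depth limit on very long threads),
// followed by a sort on the parsed timestamps.
function flattenChronological(mapping: Record<string, MappingNode>, rootId: string): Msg[] {
  const collected: Msg[] = [];
  const seen = new Set<string>();
  const stack: string[] = [rootId];

  while (stack.length > 0) {
    const id = stack.pop() as string;
    if (seen.has(id)) continue; // guard against malformed, cyclic mappings
    seen.add(id);
    const node = mapping[id];
    if (!node) continue; // child ID missing from the mapping
    if (node.message) collected.push(node.message);
    // Push children in reverse so they are popped in their original order.
    for (let i = node.children.length - 1; i >= 0; i--) {
      stack.push(node.children[i]);
    }
  }

  return collected.sort((a, b) => Date.parse(a.timestamp) - Date.parse(b.timestamp));
}
```

If only the final version of a conversation is wanted, the walk would instead need to follow a single branch; which branch is "current" is not something the types in this article record.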
Pitfall Guide
1. Assuming Browser PDF Generation is Local
Explanation: Some extensions advertise "client-side" processing yet route PDF compilation to external APIs. The browser's own print pipeline often handles complex layouts, syntax highlighting, and LaTeX poorly, which is why such tools fall back on external rendering services.
Fix: Verify the extension's privacy policy and network requests. If PDF export requires a server call, treat it as data exfiltration. Use local JSON/MD exports for sensitive threads.
2. Losing Code Fence Syntax via DOM Scraping
Explanation: Extensions that read the rendered page strip language identifiers from code blocks. A Python snippet becomes generic monospace text, breaking downstream syntax highlighting and copy-paste workflows.
Fix: Prefer tools that parse the underlying JSON payload. If using an extension, test exports with multi-language code blocks and verify language tags survive.
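The verification step in this fix can be automated. The sketch below is one possible check, not part of any exporter: the hypothetical helper `findUntaggedFences` scans exported Markdown and reports opening code fences that carry no language tag. It assumes triple-backtick fences and does not handle tilde fences or nested fences.

```typescript
// Built at runtime so this listing does not contain a literal fence marker.
const FENCE = '`'.repeat(3);

// Returns the 1-based line numbers of opening code fences that lack a language tag.
function findUntaggedFences(markdown: string): number[] {
  const untagged: number[] = [];
  let insideFence = false;

  markdown.split('\n').forEach((line, index) => {
    const trimmed = line.trim();
    if (trimmed.indexOf(FENCE) !== 0) return; // not a fence line
    const tag = trimmed.slice(FENCE.length).trim();
    if (!insideFence && tag === '') untagged.push(index + 1);
    insideFence = !insideFence; // each fence line toggles open/closed
  });

  return untagged;
}
```

Running this over a test export that contains blocks in several languages gives a quick signal: any reported line number is a block whose language identifier did not survive.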
3. Ignoring Reasoning Traces & Canvas Elements
Explanation: Thinking models and canvas features output structured metadata that DOM scrapers flatten into unstructured paragraphs. Critical reasoning steps, tool calls, and interactive elements disappear.
Fix: Use JSON or MD exporters that explicitly map reasoning fields and canvas payloads. Verify exports contain structured blocks rather than concatenated text.
4. Truncating Long Context Threads
Explanation: Browser extensions often hit memory limits or DOM rendering caps when processing threads approaching model context windows. Messages silently drop or export fails mid-thread.
Fix: Use offline ZIP processors or local JSON parsers. They operate on the raw export without rendering overhead, so thread length is bounded by available memory rather than by DOM limits.
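Silent truncation is easy to detect when the raw mapping is available. The following sketch compares the number of message-bearing nodes in a thread's mapping with the number of messages in an exported payload. It assumes the payload has a `messages` array, as in the JSON renderer shown earlier; `countMissingMessages` is an illustrative name.

```typescript
interface RawNode {
  message?: unknown;
  children: string[];
}

interface ExportedPayload {
  messages: unknown[];
}

// Returns how many messages present in the raw mapping are absent from the export.
// Zero means the export is complete; a positive number signals truncation.
function countMissingMessages(mapping: Record<string, RawNode>, exported: ExportedPayload): number {
  let expected = 0;
  for (const nodeId of Object.keys(mapping)) {
    if (mapping[nodeId].message != null) expected += 1; // skips root/placeholder nodes
  }
  return expected - exported.messages.length;
}
```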
5. Mishandling LaTeX/MathJax Rendering
Explanation: Some exporters convert LaTeX to images, others output raw strings, and some attempt MathML. Downstream tools (Obsidian, Markdown viewers, static site generators) expect consistent formatting.
Fix: Standardize on raw LaTeX passthrough in Markdown. Configure your downstream renderer to handle MathJax/KaTeX. Avoid image-based math exports unless printing is the sole use case.
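One way to standardize is to normalize math delimiters at export time. The sketch below rewrites `\( ... \)` to `$ ... $` and `\[ ... \]` to `$$ ... $$` while leaving fenced code untouched. It assumes the source messages use backslash delimiters and that the downstream renderer expects dollar delimiters; both are assumptions to check against your own data and tooling, and `normalizeMathDelimiters` is an illustrative name.

```typescript
// Built at runtime so this listing does not contain a literal fence marker.
const FENCE = '`'.repeat(3);
const FENCED_BLOCK = new RegExp('(' + FENCE + '[\\s\\S]*?' + FENCE + ')', 'g');

function normalizeMathDelimiters(markdown: string): string {
  return markdown
    .split(FENCED_BLOCK) // the capturing group keeps code blocks at odd indices
    .map((part, index) => {
      if (index % 2 === 1) return part; // fenced code: leave as-is
      return part
        // Display math: \[ ... \] becomes $$ ... $$
        .replace(/\\\[([\s\S]+?)\\\]/g, (_match, body: string) => '$$' + body + '$$')
        // Inline math: \( ... \) becomes $ ... $
        .replace(/\\\(([\s\S]+?)\\\)/g, (_match, body: string) => '$' + body + '$');
    })
    .join('');
}
```

The LaTeX source itself is never altered, only its delimiters, so the output remains raw passthrough in the sense used above.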
6. Overlooking Extension Permission Scope
Explanation: Many export extensions request broad host permissions or explicitly state they collect PII and user activity. Even if data isn't sold, telemetry and usage patterns are logged.
Fix: Audit Chrome Web Store listings for data collection clauses. Prefer extensions with minimal permissions and explicit local-processing claims. Use offline processors for compliance-heavy environments.
7. Duplicate Management in Archival Workflows
Explanation: Re-exporting conversations creates version chaos. Files accumulate with timestamps or incremental suffixes, breaking knowledge base links and search indexes.
Fix: Implement an idempotent export pipeline. Track processed thread IDs in a manifest file. Update existing files in place rather than creating new ones.
Production Bundle
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| One-off thread share with non-technical stakeholder | Server-Assisted Extension (PDF) | High-fidelity layout, universal readability | Free tier limits; paid per additional export |
| Compliance-heavy audit requiring data residency | Offline ZIP Processor | 100% local processing, structured MD/JSON output | One-time license (~$30) or free tier (30 convos) |
| Developer knowledge base (Obsidian/Notion) | Client-Side Extension or Local JSON Parser | Preserves code fences, LaTeX, and YAML frontmatter | Free; minimal infrastructure |
| Bulk archival with version control | Offline ZIP Processor with manifest tracking | Idempotent updates, no duplicates, full thread fidelity | One-time license; zero recurring cost |
| Programmatic re-processing (search index, plugins) | Local JSON Parser | Machine-readable structure, role/content/timestamp mapping | Free; requires basic Node.js setup |
Configuration Template
export.config.json:

```json
{
  "inputPath": "./downloads/conversations.json",
  "outputDir": "./exports/chat-archive",
  "formats": ["md", "json"],
  "filters": {
    "dateRange": {
      "start": "2025-01-01T00:00:00Z",
      "end": "2025-12-31T23:59:59Z"
    }
  }
}
```
package.json (minimal setup):

```json
{
  "name": "chat-archive-processor",
  "version": "1.0.0",
  "type": "module",
  "scripts": {
    "build": "tsc",
    "export": "node dist/processor.js"
  },
  "devDependencies": {
    "typescript": "^5.4.0",
    "@types/node": "^20.0.0"
  }
}
```
tsconfig.json:

```json
{
  "compilerOptions": {
    "target": "ES2022",
    "module": "NodeNext",
    "outDir": "./dist",
    "rootDir": "./src",
    "strict": true,
    "esModuleInterop": true,
    "skipLibCheck": true,
    "forceConsistentCasingInFileNames": true
  },
  "include": ["src/**/*"]
}
```
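A config file like `export.config.json` is worth validating before it reaches the processor, since a typo in `formats` or a reversed date range would otherwise fail silently. The following is a minimal sketch of such a check, assuming the `ExportConfig` shape from `src/types.ts` (restated here so the block is self-contained); `parseExportConfig` is a hypothetical helper, not part of the code above.

```typescript
type ExportFormat = 'md' | 'json' | 'docx';

interface ExportConfig {
  inputPath: string;
  outputDir: string;
  formats: ExportFormat[];
  filters?: {
    threadIds?: string[];
    dateRange?: { start: string; end: string };
  };
}

const KNOWN_FORMATS: readonly string[] = ['md', 'json', 'docx'];

// Parses the raw file contents and throws a descriptive error on the first problem found.
function parseExportConfig(raw: string): ExportConfig {
  const data: unknown = JSON.parse(raw);
  if (typeof data !== 'object' || data === null || Array.isArray(data)) {
    throw new Error('Config must be a JSON object');
  }
  const cfg = data as Record<string, unknown>;

  if (typeof cfg.inputPath !== 'string' || cfg.inputPath === '') {
    throw new Error('inputPath must be a non-empty string');
  }
  if (typeof cfg.outputDir !== 'string' || cfg.outputDir === '') {
    throw new Error('outputDir must be a non-empty string');
  }
  if (!Array.isArray(cfg.formats) || cfg.formats.length === 0) {
    throw new Error('formats must be a non-empty array');
  }
  for (const format of cfg.formats) {
    if (typeof format !== 'string' || KNOWN_FORMATS.indexOf(format) === -1) {
      throw new Error('Unknown format: ' + String(format));
    }
  }

  const filters = cfg.filters as ExportConfig['filters'] | undefined;
  const range = filters?.dateRange;
  if (range) {
    const start = Date.parse(range.start);
    const end = Date.parse(range.end);
    if (Number.isNaN(start) || Number.isNaN(end)) {
      throw new Error('dateRange values must be parseable dates');
    }
    if (start > end) {
      throw new Error('dateRange.start must not be after dateRange.end');
    }
  }

  return cfg as unknown as ExportConfig;
}
```

An entry point could read the file with `fs.readFile`, pass the text to this function, and hand the result to `new ConversationProcessor(config)`.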
Quick Start Guide
- Download Native Export: Navigate to OpenAI Settings → Data Controls → Request Export. Wait for the email (up to 7 days) and extract `conversations.json`.
- Initialize Project: Create a directory, copy the `package.json` and `tsconfig.json` templates, run `npm install`, and place the type definitions in `src/types.ts` and the processor code in `src/processor.ts`.
- Configure Filters: Edit `export.config.json` to specify the input path, output directory, target formats, and optional date/thread filters.
- Execute Pipeline: Run `npm run build && npm run export`. The class as listed only defines the pipeline, so this step assumes an entry point that loads the config, constructs `ConversationProcessor`, and calls `initialize()` followed by `run()`. Verify the output in the configured directory, and check `.export-manifest.json` to confirm that thread IDs are being tracked.
- Integrate Downstream: Link the output directory to your knowledge base, documentation pipeline, or archival storage. Re-run the script periodically to sync new conversations without duplicates.