ecisions
- Stateless Processing: Each article is processed independently. No session memory is retained between calls to prevent context bleed and ensure reproducible outputs.
- Strict Prompt Constraints: The system prompt enforces concept boundaries, title formatting, internal linking, and synthesis rules. Chain-of-thought reasoning is explicitly suppressed to reduce token waste and hallucination.
- Structured Output Parsing: Responses are parsed against a predefined schema to guarantee consistent markdown generation.
- Local File Generation: Output is written to disk as individual
.md files, preserving vault compatibility and enabling version control.
Implementation
import Anthropic from '@anthropic-ai/sdk';
import fs from 'fs/promises';
import path from 'path';
interface ConceptNode {
title: string;
body: string;
sourceCitation: string;
internalLinks: string[];
}
interface ExtractionResponse {
concepts: ConceptNode[];
}
class KnowledgeAtomizer {
private client: Anthropic;
private outputDir: string;
constructor(apiKey: string, outputDir: string) {
this.client = new Anthropic({ apiKey });
this.outputDir = outputDir;
}
private buildSystemPrompt(): string {
return `You are a knowledge extraction engine. Your task is to convert source material into atomic notes.
RULES:
- Extract exactly 3 to 7 distinct concepts. Do not summarize the article.
- Each concept must be self-contained and reusable.
- Titles must follow Wikipedia naming conventions (noun phrases, no questions, no verbs).
- Write the body in original phrasing. Never copy-paste source text.
- Create [[wikilinks]] between concepts within this batch.
- Include a source citation block at the end of each note.
- Output must be valid JSON matching the requested schema.`;
}
private buildUserPrompt(sourceText: string, sourceUrl: string): string {
return `Source URL: ${sourceUrl}
Source Text:
${sourceText}
Extract atomic concepts according to the system rules. Return JSON.`;
}
async extract(sourceText: string, sourceUrl: string): Promise<ConceptNode[]> {
const response = await this.client.messages.create({
model: 'claude-3-5-sonnet-20241022',
max_tokens: 4096,
system: this.buildSystemPrompt(),
messages: [
{ role: 'user', content: this.buildUserPrompt(sourceText, sourceUrl) }
]
});
const rawContent = response.content[0].type === 'text' ? response.content[0].text : '';
const jsonMatch = rawContent.match(/\{[\s\S]*\}/);
if (!jsonMatch) {
throw new Error('Failed to parse JSON from AI response');
}
const parsed: ExtractionResponse = JSON.parse(jsonMatch[0]);
return parsed.concepts;
}
private formatMarkdown(concept: ConceptNode, sourceUrl: string): string {
const links = concept.internalLinks.map(link => `[[${link}]]`).join(', ');
return `# ${concept.title}
${concept.body}
**Cross-References**: ${links}
---
**Source**: [${sourceUrl}](${sourceUrl})
**Extracted**: ${new Date().toISOString().split('T')[0]}
`;
}
async persist(concepts: ConceptNode[], sourceUrl: string): Promise<void> {
await fs.mkdir(this.outputDir, { recursive: true });
for (const concept of concepts) {
const fileName = `${concept.title.toLowerCase().replace(/\s+/g, '-')}.md`;
const filePath = path.join(this.outputDir, fileName);
const content = this.formatMarkdown(concept, sourceUrl);
await fs.writeFile(filePath, content, 'utf-8');
}
}
}
// Usage Example
async function runPipeline() {
const atomizer = new KnowledgeAtomizer(
process.env.ANTHROPIC_API_KEY!,
'./vault/inbox'
);
const sourceArticle = await fs.readFile('./sources/technical-article.md', 'utf-8');
const concepts = await atomizer.extract(sourceArticle, 'https://example.com/article');
await atomizer.persist(concepts, 'https://example.com/article');
console.log(`Generated ${concepts.length} atomic notes.`);
}
runPipeline().catch(console.error);
Why This Architecture Works
- Deterministic Extraction: By forcing JSON output and suppressing conversational filler, the pipeline guarantees parseable results. This eliminates the need for regex scraping or fragile string manipulation.
- Internal Linking Constraint: The prompt explicitly requires concepts to reference each other. This creates micro-clusters that mirror Luhmann's associative indexing, making the notes immediately useful upon import.
- Token Optimization:
max_tokens: 4096 caps output size, preventing runaway generation. The system prompt is concise and rule-bound, reducing input token count while maintaining constraint enforcement.
- Vault Compatibility: Output follows standard markdown conventions with explicit citation blocks and date stamps. Files are named using URL-safe slugs, ensuring compatibility with Git-based version control and static site generators.
Pitfall Guide
1. Summary Collapse
Explanation: AI models default to summarization when constraints are vague. The output becomes a condensed version of the source rather than discrete concepts.
Fix: Explicitly forbid summarization in the system prompt. Enforce noun-phrase titles and require each note to stand alone without referencing the source article's narrative.
2. Context Stripping
Explanation: Atomic notes lose the nuance that made the concept valuable. The AI extracts the "what" but drops the "why" or "under what conditions."
Fix: Mandate a context anchor in the body template. Require the AI to specify scope, limitations, or applicable domains within each note.
3. Phantom Wikilinks
Explanation: The AI generates links to concepts that don't exist in the current batch or vault. These become broken references that degrade trust in the system.
Fix: Implement a post-extraction validation step. Parse all [[wikilinks]] and cross-reference them against existing vault files. Flag unmatched links for manual review.
4. Vault Fragmentation
Explanation: Importing AI-generated notes without a directory strategy creates a flat, unsearchable mass. Notes get lost in the inbox.
Fix: Define a strict folder hierarchy before ingestion. Map concept domains to directories (e.g., vault/systems/architecture/, vault/cognitive/learning/). Enforce naming conventions via pre-commit hooks.
5. Cognitive Offloading Fallacy
Explanation: Treating AI output as final knowledge. The user skips review, assuming the extraction is accurate and complete.
Fix: Enforce a mandatory curation step. The pipeline should output to a staging directory. Human review must validate accuracy, adjust phrasing, and integrate notes into the active vault before they become referenceable.
6. Prompt Drift Over Time
Explanation: As the AI model updates or prompt templates are modified, extraction behavior changes. Notes generated months apart follow different structures.
Fix: Version control your prompt templates. Store them alongside your pipeline code. Run regression tests on a fixed corpus of articles to detect structural drift before deploying updates.
7. API Cost Mismanagement
Explanation: Unbounded input lengths or excessive max_tokens settings cause unexpected billing spikes, especially when processing long technical documents.
Fix: Chunk source text to 4,000–6,000 tokens before sending to the API. Implement token budgeting with hard limits. Cache responses for identical source hashes to avoid redundant API calls.
Production Bundle
Action Checklist
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|
| High-volume technical reading (10+ articles/week) | AI-Assisted + Staging Review | Manual atomization becomes unsustainable; AI handles scaffolding, human validates | ~$0.003/article |
| Deep academic research (complex, nuanced sources) | Manual Extraction + AI Summarization | Requires precise contextual capture; AI used only for initial concept mapping | $0.00 + human time |
| Rapid prototyping / Dev notes | AI-Assisted + Direct Import | Speed prioritized over precision; notes are disposable or quickly iterated | ~$0.003/article |
| Long-term knowledge base building | Hybrid Curation | Balances scalability with quality; AI extracts, human integrates and cross-links | ~$0.003/article + review time |
Configuration Template
// pipeline.config.ts
export const PIPELINE_CONFIG = {
model: 'claude-3-5-sonnet-20241022',
maxTokens: 4096,
temperature: 0.2, // Low temperature for deterministic extraction
systemPrompt: `You are a knowledge extraction engine. Extract 3-7 atomic concepts.
Titles: Wikipedia-style noun phrases.
Body: Original phrasing, self-contained, include scope/limitations.
Links: [[wikilinks]] between batch concepts only.
Output: Valid JSON. No markdown formatting in JSON values.`,
stagingDir: './vault/staging',
activeDir: './vault/active',
validationRules: {
requireSourceCitation: true,
maxConcepts: 7,
minConcepts: 3,
enforceInternalLinks: true
}
};
Quick Start Guide
- Install Dependencies: Run
npm install @anthropic-ai/sdk and ensure Node.js 18+ is available.
- Set Environment Variable: Export your Anthropic API key:
export ANTHROPIC_API_KEY="sk-ant-..."
- Prepare Source: Place a markdown or text file in
./sources/ and update the file path in the usage example.
- Execute Pipeline: Run
ts-node pipeline.ts. The script will extract concepts, generate JSON, parse it, and write .md files to ./vault/inbox/.
- Review & Integrate: Open the staging directory, verify note accuracy, adjust phrasing if needed, and move validated files to your active vault directory. Run your wikilink validator to ensure cross-references resolve.
This pipeline transforms knowledge ingestion from a linear time sink into a scalable, review-driven workflow. The AI handles the mechanical translation; you retain control over accuracy, integration, and long-term structure.