ps that span multiple source documents. The system stops acting as a search engine and starts functioning as a compounding knowledge asset.
Core Solution
Building an incremental knowledge system requires three architectural decisions: persistent storage, structured compilation, and constrained querying. The following implementation demonstrates how to orchestrate this workflow using a persistent workspace environment.
Step 1: Establish the Persistent Storage Schema
Raw documents, staged inputs, and compiled outputs must be isolated to prevent contamination during synthesis. A clean directory structure enforces separation of concerns:
/knowledge-archival/
  /raw-inputs/
    /pdfs/
    /markdown/
    /specs/
  /staging/
  /synthesized/
    /concepts/
    /relationships/
    /master-index.md
/raw-inputs/ holds source material. /staging/ acts as a buffer for new documents before compilation. /synthesized/ contains the structured output. This separation ensures that queries never accidentally read unprocessed files, and compilation prompts can target specific directories without ambiguity.
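The layout above could be created idempotently with a short script. This is a minimal sketch, assuming a Node.js environment and the directory names shown; `WORKSPACE_DIRS`, `workspaceDirs`, and `initWorkspace` are hypothetical names, not part of any existing tool.

```typescript
import { mkdirSync } from "fs";
import { join } from "path";

// Relative directories taken from the schema above (names are illustrative).
const WORKSPACE_DIRS: string[] = [
  "raw-inputs/pdfs",
  "raw-inputs/markdown",
  "raw-inputs/specs",
  "staging",
  "synthesized/concepts",
  "synthesized/relationships",
];

// Resolve every schema directory against a workspace root.
function workspaceDirs(root: string): string[] {
  return WORKSPACE_DIRS.map((dir) => join(root, dir));
}

// Create any missing directories; `recursive: true` makes reruns a no-op.
function initWorkspace(root: string): string[] {
  const dirs = workspaceDirs(root);
  for (const dir of dirs) {
    mkdirSync(dir, { recursive: true });
  }
  return dirs;
}
```

Calling `initWorkspace("/knowledge-archival")` once at setup should be enough; because the directories are created recursively, running it again leaves existing content untouched.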
Step 2: Orchestrate Incremental Compilation
Compilation runs in two phases: concept extraction and relationship mapping. Both phases use fixed prompt templates that request standardized markdown. The orchestration script below demonstrates how to trigger these phases programmatically:
import { existsSync, readdirSync } from 'fs';
import { join } from 'path';

interface CompilationConfig {
  workspaceRoot: string;
  modelEndpoint: string;
  apiKey: string;
}

class KnowledgeCompiler {
  private config: CompilationConfig;

  constructor(config: CompilationConfig) {
    this.config = config;
  }

  // Pass 1: request normalized concept notes for a single source file.
  async extractConcepts(sourceFile: string): Promise<string> {
    const prompt = `
      Analyze ${sourceFile} and extract core technical concepts.
      Output a markdown file with the following structure:
      - Concept name (H2)
      - Definition (1-2 sentences)
      - Key properties (bullet list)
      - Source reference (filename + section)
      Maintain consistent formatting across all concept notes.
    `;
    return this.invokeModel(prompt);
  }

  // Pass 2: request cross-links between concept notes that already exist.
  async mapRelationships(conceptDir: string): Promise<string> {
    const files = this.listMarkdownFiles(conceptDir);
    const prompt = `
      Review the following concept files: ${files.join(', ')}.
      Identify up to 5 implicit relationships between concepts.
      For each relationship, generate a markdown note containing:
      - Concept A and Concept B
      - Relationship type (dependency, contradiction, extension, alternative)
      - Evidence summary (1-2 sentences)
      Append all relationships to the master index.
    `;
    return this.invokeModel(prompt);
  }

  private async invokeModel(prompt: string): Promise<string> {
    // Placeholder for the actual API call to the persistent workspace model.
    // A real implementation would send the prompt to this.config.modelEndpoint
    // and return the compiled markdown string.
    return `# Compiled Output\n${prompt}\n// Model response placeholder`;
  }

  // Return the .md files directly inside dir, sorted for stable prompts.
  private listMarkdownFiles(dir: string): string[] {
    if (!existsSync(dir)) {
      return [];
    }
    return readdirSync(dir)
      .filter((name) => name.endsWith('.md'))
      .sort()
      .map((name) => join(dir, name));
  }
}
The compiler separates extraction from relationship mapping. This two-pass approach prevents context window overflow and ensures each concept is normalized before cross-referencing. The output is always structured markdown, which remains human-readable, version-controllable, and easily parsable by downstream query engines.
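To illustrate why structured markdown is easy for downstream tooling to consume, here is a small parser sketch. It assumes each concept note is an H2 heading followed by a `Definition:` line, `- ` property bullets, and a `Source:` line. That exact layout is an assumption made for the example; the extraction prompt only fixes the four fields, so the parser would need adjusting to whatever the model actually emits.

```typescript
interface ConceptNote {
  title: string;
  definition: string;
  properties: string[];
  source: string;
}

// Parse concept notes under the assumed layout:
//   ## Title
//   Definition: ...
//   - property
//   Source: ...
function parseConceptNotes(markdown: string): ConceptNote[] {
  const notes: ConceptNote[] = [];
  let current: ConceptNote | null = null;
  for (const rawLine of markdown.split("\n")) {
    const line = rawLine.trim();
    if (line.startsWith("## ")) {
      current = { title: line.slice(3).trim(), definition: "", properties: [], source: "" };
      notes.push(current);
    } else if (current === null) {
      continue; // ignore anything before the first concept heading
    } else if (line.startsWith("Definition:")) {
      current.definition = line.slice("Definition:".length).trim();
    } else if (line.startsWith("Source:")) {
      current.source = line.slice("Source:".length).trim();
    } else if (line.startsWith("- ")) {
      current.properties.push(line.slice(2).trim());
    }
  }
  return notes;
}
```

A note with an empty `definition` or `source` after parsing is a cheap signal that the model drifted from the template and the file should be recompiled or reviewed.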
Step 3: Constrain the Query Layer
Once compilation completes, the system must answer questions using only the synthesized layer. This requires a persona configuration that enforces citation, scope boundaries, and gap detection:
ROLE: Knowledge Synthesis Engine
SCOPE: /synthesized/concepts/ and /synthesized/relationships/
CONSTRAINTS:
- Answer exclusively from compiled notes
- Cite source concept file for every claim
- Flag unanswered questions as "knowledge gaps"
- Never speculate beyond documented relationships
OUTPUT_FORMAT:
1. Direct answer (2-3 sentences)
2. Supporting evidence (bullet list with citations)
3. Gap analysis (if applicable)
This configuration shifts the model from open-ended generation toward verification. The persona instructs it not to invent relationships, to ground every statement in compiled markdown, and to surface missing coverage explicitly. The architecture prioritizes accuracy over fluency, which is critical for technical and research workflows.
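Keeping the persona as data rather than a hand-edited string makes it easier to version and reuse. The sketch below renders the same text block from a typed object; `PersonaConfig` and `buildPersonaPrompt` are hypothetical names introduced for illustration.

```typescript
interface PersonaConfig {
  role: string;
  scope: string[];
  constraints: string[];
  outputFormat: string[];
}

// Render the persona as the plain-text block shown above.
function buildPersonaPrompt(persona: PersonaConfig): string {
  const lines: string[] = [
    `ROLE: ${persona.role}`,
    `SCOPE: ${persona.scope.join(" and ")}`,
    "CONSTRAINTS:",
  ];
  for (const constraint of persona.constraints) {
    lines.push(`- ${constraint}`);
  }
  lines.push("OUTPUT_FORMAT:");
  persona.outputFormat.forEach((item, i) => {
    lines.push(`${i + 1}. ${item}`);
  });
  return lines.join("\n");
}
```

The rendered string would typically be passed as the system prompt of whichever model API the workspace uses.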
Pitfall Guide
1. Aggressive Pre-Chunking Before Compilation
Explanation: Splitting documents into small fragments before running extraction prompts destroys semantic continuity. The model loses narrative context and outputs shallow concept lists.
Fix: Feed complete documents or logical sections to the extraction prompt. Let the model determine conceptual boundaries rather than imposing arbitrary token limits.
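One way to feed logical sections rather than fixed-size fragments is to split only at top-level headings. The following is a rough sketch for markdown sources; `splitByHeadings` is a hypothetical helper, and it does not special-case `#` lines inside fenced code blocks.

```typescript
interface Section {
  heading: string;
  body: string;
}

// Split a markdown document at H1/H2 headings so each extraction call
// receives a complete logical section. Deeper headings stay in the body,
// and sections with an empty body are skipped.
function splitByHeadings(markdown: string): Section[] {
  const sections: Section[] = [];
  let heading = "(preamble)";
  let buffer: string[] = [];
  const flush = (): void => {
    const body = buffer.join("\n").trim();
    if (body.length > 0) {
      sections.push({ heading, body });
    }
    buffer = [];
  };
  for (const line of markdown.split("\n")) {
    const match = /^#{1,2}\s+(.*)$/.exec(line);
    if (match) {
      flush();
      heading = match[1].trim();
    } else {
      buffer.push(line);
    }
  }
  flush();
  return sections;
}
```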
2. Skipping Relationship Mapping
Explanation: Compiling concepts in isolation creates a knowledge silo. Without explicit cross-linking, the system cannot answer comparative or synthesis queries.
Fix: Always run the relationship mapping pass after extraction. Aim for 3-5 relationship notes per compilation cycle to maintain graph density.
3. Weak Persona Constraints
Explanation: Allowing the model to draw from external training data during queries introduces hallucination and breaks citation integrity.
Fix: Lock the scope to /synthesized/ directories. Use negative constraints (never speculate, do not reference external knowledge) and require explicit source citations for every claim.
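Prompt-level constraints can be backed by a check in whatever code serves files to the model. This sketch assumes workspace-relative POSIX-style paths; `isInQueryScope` and `normalizePath` are hypothetical helpers.

```typescript
const ALLOWED_SCOPES: string[] = [
  "/synthesized/concepts/",
  "/synthesized/relationships/",
];

// Collapse "." and ".." segments so "/synthesized/../raw-inputs" cannot
// slip past a simple prefix check.
function normalizePath(path: string): string {
  const out: string[] = [];
  for (const segment of path.split("/")) {
    if (segment === "" || segment === ".") continue;
    if (segment === "..") out.pop();
    else out.push(segment);
  }
  return "/" + out.join("/");
}

// True only when the normalized path sits inside an allowed scope.
function isInQueryScope(path: string, scopes: string[] = ALLOWED_SCOPES): boolean {
  const normalized = normalizePath(path);
  return scopes.some((scope) => normalized.startsWith(scope));
}
```

Rejecting out-of-scope reads in code means a weak or ignored prompt constraint cannot by itself expose /raw-inputs/ or /staging/ to a query.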
4. Treating the Master Index as Static
Explanation: The index file drifts as new concepts are added. Stale indexes cause query engines to miss recently compiled material.
Fix: Automate index updates as part of every compilation cycle. Append new entries rather than rewriting the file, and maintain a last_updated timestamp for cache invalidation.
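An append-only index update might look like the sketch below. It assumes master-index.md begins with a `last_updated: <ISO timestamp>` line followed by one `- entry` line per item; that layout and the `updateIndex` helper are illustrative assumptions, not a format the source prescribes.

```typescript
// Append entries that are not already listed and refresh the timestamp.
// Existing lines are kept in order; only the last_updated line is replaced.
function updateIndex(indexText: string, newEntries: string[], now: Date): string {
  const kept = indexText
    .split("\n")
    .filter((line) => line.trim() !== "" && !line.startsWith("last_updated:"));
  const seen = new Set<string>(kept);
  for (const entry of newEntries) {
    const line = `- ${entry}`;
    if (!seen.has(line)) {
      kept.push(line);
      seen.add(line);
    }
  }
  return [`last_updated: ${now.toISOString()}`].concat(kept).join("\n") + "\n";
}
```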
5. Ignoring Model Version Drift
Explanation: Swapping models without version pinning changes compilation output formatting and relationship logic. Downstream parsers break.
Fix: Pin model versions in the orchestration script. Maintain a model_manifest.json that tracks which model version produced each compiled batch. Re-compile only when necessary.
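A possible shape for model_manifest.json is sketched here. The field names (`pinnedModel`, `batches`, `batchId`, `compiledAt`) are assumptions chosen for the example; the source only specifies that the manifest tracks which model version produced each batch.

```typescript
interface ManifestEntry {
  batchId: string;
  model: string;      // model identifier that produced this batch
  compiledAt: string; // ISO timestamp
}

interface ModelManifest {
  pinnedModel: string;
  batches: ManifestEntry[];
}

// Record a new batch against the currently pinned model (returns a copy).
function recordBatch(manifest: ModelManifest, batchId: string, compiledAt: Date): ModelManifest {
  const entry: ManifestEntry = {
    batchId,
    model: manifest.pinnedModel,
    compiledAt: compiledAt.toISOString(),
  };
  return { pinnedModel: manifest.pinnedModel, batches: manifest.batches.concat([entry]) };
}

// Batches compiled by a model other than the current pin; these are the
// candidates for re-compilation after a deliberate model upgrade.
function staleBatches(manifest: ModelManifest): string[] {
  return manifest.batches
    .filter((b) => b.model !== manifest.pinnedModel)
    .map((b) => b.batchId);
}
```

The manifest object can be written to disk with `JSON.stringify(manifest, null, 2)` after each cycle.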
6. Neglecting Citation Verification
Explanation: Models occasionally fabricate file paths or section references. Unverified citations erode trust in the knowledge base.
Fix: Implement a post-compilation validation step that checks if cited files exist and contain the referenced keywords. Flag mismatches for manual review.
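The validation step described above can be sketched as a pure function over already-loaded file contents. `Citation`, `CitationIssue`, and `validateCitations` are hypothetical names; how citations are parsed out of compiled notes, and how files are loaded into the map, is left to the caller.

```typescript
interface Citation {
  file: string;       // cited path
  keywords: string[]; // terms the cited file is expected to contain
}

interface CitationIssue {
  file: string;
  reason: string;
}

// `sources` maps file path to file text.
function validateCitations(citations: Citation[], sources: Map<string, string>): CitationIssue[] {
  const issues: CitationIssue[] = [];
  for (const citation of citations) {
    const text = sources.get(citation.file);
    if (text === undefined) {
      issues.push({ file: citation.file, reason: "cited file does not exist" });
      continue;
    }
    const haystack = text.toLowerCase();
    const missing = citation.keywords.filter(
      (keyword) => haystack.indexOf(keyword.toLowerCase()) === -1
    );
    if (missing.length > 0) {
      issues.push({ file: citation.file, reason: `missing keywords: ${missing.join(", ")}` });
    }
  }
  return issues;
}
```

Keyword matching is a coarse check: it catches fabricated paths and obviously wrong references, but a passing result does not prove the cited passage supports the claim, which is why mismatches go to manual review rather than automatic correction.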
7. Accumulating Obsolete Sources
Explanation: Old specifications and deprecated guides remain in /raw-inputs/, polluting future compilation cycles.
Fix: Archive deprecated sources to /raw-inputs/archived/ and exclude that directory from prompt scopes. Maintain a status: active/deprecated field in source metadata.
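Source selection for a compilation cycle could then filter on both the status field and the archive directory. This is a sketch; `SourceRecord`, `selectCompilableSources`, and `archivePath` are hypothetical, and where the status metadata is stored is not specified by the source.

```typescript
type SourceStatus = "active" | "deprecated";

interface SourceRecord {
  path: string;
  status: SourceStatus;
}

const RAW_PREFIX = "/raw-inputs/";
const ARCHIVE_PREFIX = "/raw-inputs/archived/";

// Only active sources outside the archive directory are compiled.
function selectCompilableSources(records: SourceRecord[]): string[] {
  return records
    .filter((r) => r.status === "active" && !r.path.startsWith(ARCHIVE_PREFIX))
    .map((r) => r.path);
}

// Map "/raw-inputs/specs/v1.md" to "/raw-inputs/archived/specs/v1.md".
// Paths already archived, or outside /raw-inputs/, are returned unchanged.
function archivePath(path: string): string {
  if (!path.startsWith(RAW_PREFIX) || path.startsWith(ARCHIVE_PREFIX)) {
    return path;
  }
  return ARCHIVE_PREFIX + path.slice(RAW_PREFIX.length);
}
```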
Production Bundle
Action Checklist
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| Small team (<5 users), static docs | Incremental Compilation | Low infra overhead, high synthesis quality | Medium (compute for initial build) |
| High-frequency updates, real-time sync | Traditional RAG + Vector DB | Faster ingestion, lower latency for fresh data | High (embedding + storage costs) |
| Compliance/audit requirements | Incremental Compilation | Full citation trail, version-controllable markdown | Low (storage only) |
| Cross-domain research synthesis | Incremental Compilation | Native relationship mapping across disparate sources | Medium (prompt engineering overhead) |
Configuration Template
# knowledge-system-config.yaml
workspace:
  root: /knowledge-archival
  scopes:
    raw: /raw-inputs/
    staging: /staging/
    synthesized: /synthesized/
compilation:
  extraction_prompt: |
    Analyze {source_file} and extract core technical concepts.
    Output markdown with H2 concept names, 1-2 sentence definitions,
    bullet-point properties, and source references.
  relationship_prompt: |
    Review {concept_files}. Identify implicit relationships.
    Output relationship notes with Concept A/B, relationship type,
    and evidence summary. Update master-index.md.
  max_concepts_per_run: 15
  relationship_pairs_per_run: 5
query:
  persona: |
    ROLE: Knowledge Synthesis Engine
    SCOPE: /synthesized/concepts/, /synthesized/relationships/
    CONSTRAINTS: Cite sources, flag gaps, never speculate.
  model: claude-sonnet-4-20250514
  temperature: 0.1
  max_tokens: 1024
validation:
  citation_check: true
  index_regenerate: true
  archive_deprecated: true
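The `{source_file}` and `{concept_files}` tokens in the prompts above need to be filled in before each call. A minimal substitution sketch follows; `renderPrompt` is a hypothetical helper, and parsing the YAML file itself is assumed to happen elsewhere.

```typescript
// Replace {placeholder} tokens with supplied values. Unknown placeholders
// raise an error so a typo does not silently reach the model.
function renderPrompt(template: string, values: { [key: string]: string }): string {
  return template.replace(/\{(\w+)\}/g, (_match: string, key: string) => {
    if (!Object.prototype.hasOwnProperty.call(values, key)) {
      throw new Error(`Missing value for placeholder {${key}}`);
    }
    return values[key];
  });
}
```

For example, `renderPrompt(config.compilation.extraction_prompt, { source_file: "/staging/api-spec.md" })` would produce the extraction prompt for one staged file, assuming `config` holds the parsed template and that file name is only a stand-in.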
Quick Start Guide
- Initialize Workspace: Create the directory schema (/raw-inputs/, /staging/, /synthesized/) in your persistent environment. Upload 3-5 source documents to /staging/.
- Run Extraction: Execute the extraction prompt against each staging file. Verify output matches the markdown schema and move results to /synthesized/concepts/.
- Map Relationships: Run the relationship prompt against the compiled concepts. Confirm relationship notes are generated and master-index.md is updated.
- Configure Query Persona: Apply the constrained persona configuration. Test with a cross-concept question and verify citations point to compiled files.
- Validate & Iterate: Run citation validation. Archive any stale sources. Add new documents to /staging/ and repeat the compilation cycle.