Building a Markdig Extension for DocFX: Remote Content Inclusion with AI Rewriting
Composing Distributed Documentation at Build Time: A Markdig-Based Approach
Current Situation Analysis
Modern engineering organizations rarely maintain documentation as a single monolithic repository. API references live alongside service code, troubleshooting guides reside in dedicated knowledge bases, and platform onboarding materials are owned by separate product teams. When these fragments need to converge into a unified static site, teams typically resort to one of three patterns: manual copy-pasting, Git submodules, or custom CI/CD orchestration scripts.
Each approach introduces measurable friction. Manual duplication guarantees content drift the moment a source file updates. Submodules complicate CI pipelines, require explicit credential propagation across repositories, and break deterministic builds when upstream branches shift. Custom shell scripts or Node-based preprocessors add fragile glue code that rarely survives framework upgrades.
DocFX, Microsoft's open-source documentation generator, solves the static compilation problem elegantly but deliberately excludes remote content resolution from its core pipeline. This is an architectural choice: DocFX prioritizes build determinism and offline reproducibility. However, this leaves a gap for organizations that need to compose authoritative content from distributed services without sacrificing build reliability.
The misunderstanding lies in treating remote inclusion as a build-time anti-pattern. In reality, remote resolution is only problematic when implemented naively: unbounded HTTP calls, missing cycle detection, and uncontrolled concurrency will destroy build performance and stability. When implemented through a controlled extension seam, remote inclusion becomes a deterministic composition layer that preserves offline fallbacks while enabling single-source-of-truth architecture.
DocFX exposes a public BuildOptions.ConfigureMarkdig extension point specifically for pipeline customization. Leveraging this seam allows teams to inject remote resolution logic without forking the core engine. The result is a maintainable, version-agnostic composition strategy that aligns with enterprise documentation scaling requirements.
WOW Moment: Key Findings
Moving from fragmented documentation management to build-time remote composition changes what a team has to maintain by hand. The following qualitative comparison contrasts a controlled remote inclusion strategy with the two most common traditional approaches.
| Approach | Maintenance Overhead | Build Determinism | Tone Consistency | Auth Complexity |
|---|---|---|---|---|
| Manual Duplication | High (drift after every update) | High (static files) | Low (manual sync required) | None |
| Git Submodules | Medium (branch tracking, CI sync) | Medium (depends on upstream refs) | Medium (requires post-processing) | High (repo-level tokens) |
| Remote Include Extension | Low (single source, declarative) | High (cached, capped, deterministic) | High (optional AI normalization) | Medium (scoped service credentials) |
This finding matters because it decouples content authoring from content delivery. Teams can maintain authoritative markdown in their native services while guaranteeing that the compiled documentation site reflects a consistent, unified voice. The optional AI rewriting layer transforms raw remote fragments into context-aware content that matches the target page's tone, tense, and technical depth. This eliminates the "patchwork documentation" syndrome where readers encounter abrupt shifts in terminology or formatting between sections.
Core Solution
The architecture centers on a Markdig extension that intercepts the markdown parsing pipeline, resolves remote fragments, and injects them as native AST nodes before final HTML generation. The implementation avoids framework coupling by adhering strictly to DocFX's public extension contract.
Step 1: Pipeline Integration
DocFX allows custom Markdig pipelines through the ConfigureMarkdig delegate. Instead of modifying core build logic, we register a remote resolution processor that scans for a custom directive syntax.
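using Markdig;

public static class DocumentationPipelineBuilder
{
    // Returns the builder so the call can be chained inside a ConfigureMarkdig delegate.
    public static MarkdownPipelineBuilder RegisterRemoteComposition(
        MarkdownPipelineBuilder pipeline,
        IRemoteContentResolver resolver,
        CompositionOptions settings)
    {
        pipeline.Extensions.Add(new RemoteFragmentProcessor(resolver, settings));
        return pipeline;
    }
}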
The RemoteFragmentProcessor implements Markdig's IMarkdownExtension interface. During the parsing phase, it scans for block and inline directives, fetches remote content, and replaces the directive node with the resolved markdown AST.
Step 2: Directive Syntax & Context Awareness
The directive uses a structured tag format that supports optional rewrite hints:
[!remoteinclude[Section Title](fragments/onboarding.md "align with developer guide tone")]
The processor evaluates context to determine injection strategy:
- Block Mode: When the directive occupies an entire line, the fetched content is parsed as full markdown and inserted as a block container. Headings, lists, and code fences render natively.
- Inline Mode: When embedded within a paragraph, only inline elements are spliced. Wrapping paragraph tags are stripped to prevent layout breaks.
This dual-mode parsing prevents common markdown rendering artifacts where remote content accidentally creates nested block structures.
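The mode decision above can be illustrated with a small standalone sketch. It is not the extension's actual parser: a real Markdig extension would normally register a block parser and an inline parser rather than match with a regular expression, and the names `RemoteDirective`, `IncludeMode`, and `RemoteDirectiveParser` are hypothetical. The sketch assumes the directive shape shown earlier and treats a directive as block-level only when nothing but whitespace surrounds it on the line.

```csharp
using System;
using System.Text.RegularExpressions;

public enum IncludeMode { Block, Inline }

// Hypothetical value type describing one parsed directive.
public sealed record RemoteDirective(string Title, string Path, string? RewriteHint, IncludeMode Mode);

public static class RemoteDirectiveParser
{
    // Assumed shape: [!remoteinclude[Title](path "optional hint")]
    private static readonly Regex Pattern = new(
        @"\[!remoteinclude\[(?<title>[^\]]*)\]\((?<path>[^\s)""]+)(?:\s+""(?<hint>[^""]*)"")?\)\]",
        RegexOptions.Compiled | RegexOptions.IgnoreCase);

    // Returns null when the line contains no directive.
    public static RemoteDirective? TryParse(string line)
    {
        var match = Pattern.Match(line);
        if (!match.Success) return null;

        // Block mode only when the directive is the whole line, ignoring surrounding whitespace.
        var isWholeLine = line.Trim().Length == match.Length;
        var hint = match.Groups["hint"].Success ? match.Groups["hint"].Value : null;

        return new RemoteDirective(
            match.Groups["title"].Value,
            match.Groups["path"].Value,
            hint,
            isWholeLine ? IncludeMode.Block : IncludeMode.Inline);
    }
}
```

Calling `RemoteDirectiveParser.TryParse` on a line that holds only the directive yields `IncludeMode.Block`; the same directive embedded in a sentence yields `IncludeMode.Inline`, which is the signal for splicing inline elements only.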
Step 3: Remote Resolution & Caching
Fetching remote content requires strict lifecycle management. The resolver implements an in-process cache keyed by URL, ensuring each fragment is retrieved exactly once per build execution.
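using System;
using System.Collections.Concurrent;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

public class CachedRemoteResolver : IRemoteContentResolver
{
    private readonly HttpClient _http;
    // Caching the in-flight task (not the string) lets concurrent callers share one request per URL.
    private readonly ConcurrentDictionary<string, Lazy<Task<string>>> _cache = new();
    private readonly SemaphoreSlim _concurrencyGate;

    public CachedRemoteResolver(HttpClient client, int maxConcurrent = 8)
    {
        _http = client;
        _concurrencyGate = new SemaphoreSlim(maxConcurrent, maxConcurrent);
    }

    public async Task<string> ResolveAsync(string path, CancellationToken ct)
    {
        // The first caller's token governs the shared request.
        var entry = _cache.GetOrAdd(path, p => new Lazy<Task<string>>(() => FetchAsync(p, ct)));
        try
        {
            return await entry.Value;
        }
        catch
        {
            // Evict failed fetches so a later request can retry instead of replaying the error.
            _cache.TryRemove(path, out _);
            throw;
        }
    }

    private async Task<string> FetchAsync(string path, CancellationToken ct)
    {
        await _concurrencyGate.WaitAsync(ct);
        try
        {
            using var response = await _http.GetAsync(path, ct);
            response.EnsureSuccessStatusCode();
            return await response.Content.ReadAsStringAsync(ct);
        }
        finally
        {
            _concurrencyGate.Release();
        }
    }
}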
The concurrency gate prevents upstream service overload during large documentation builds. The cache is intentionally scoped to the build process lifetime, avoiding stale state across CI runs.
Step 4: Optional AI Tone Normalization
When a rewrite hint is present, the resolved content passes through a pluggable normalization service. This service accepts raw markdown and a style directive, then returns context-aligned content.
using System.Threading;
using System.Threading.Tasks;
using Azure.AI.OpenAI; // Assumes the 1.0.0-beta client surface (OpenAIClient, ChatCompletionsOptions).

public interface IToneNormalizationService
{
    Task<string> NormalizeAsync(string sourceMarkdown, string styleHint, CancellationToken ct);
}

public class AzureOpenAiNormalizer : IToneNormalizationService
{
    private readonly OpenAIClient _client;
    private readonly string _deploymentId;

    public AzureOpenAiNormalizer(OpenAIClient client, string deployment)
    {
        _client = client;
        _deploymentId = deployment;
    }

    public async Task<string> NormalizeAsync(string source, string hint, CancellationToken ct)
    {
        // Instructions go in the system message; the fragment travels separately as user content.
        var instructions = $"Rewrite the documentation fragment supplied by the user to {hint}. Preserve all technical accuracy, code blocks, and markdown structure. Output only the rewritten markdown.";
        var options = new ChatCompletionsOptions
        {
            DeploymentName = _deploymentId,
            Temperature = 0f, // Lower variance between builds.
            Messages =
            {
                new ChatRequestSystemMessage(instructions),
                new ChatRequestUserMessage(source)
            }
        };
        var response = await _client.GetChatCompletionsAsync(options, ct);
        return response.Value.Choices[0].Message.Content;
    }
}
The normalization layer is entirely opt-in. When no hint is provided, content passes through verbatim. The service abstraction ensures zero vendor lock-in; teams can swap Azure OpenAI for local LLMs, self-hosted endpoints, or rule-based transformers without modifying the pipeline core.
Architecture Rationale
- Extension over Fork: DocFX's ConfigureMarkdig seam guarantees forward compatibility. Framework updates do not require pipeline rewrites.
- AsyncLocal Cycle Tracking: Remote fragments can reference other remote fragments. An AsyncLocal<Stack<string>> tracks the resolution path, throwing a deterministic exception when depth exceeds the configured threshold (default: 8).
- Hard Fail Default: Missing remote content triggers a build failure by default. This enforces content ownership accountability. Graceful degradation is available via explicit configuration flags.
- Credential Isolation: Authentication handlers are resolved at runtime from environment variables or host callbacks. No secrets are serialized into configuration files.
Pitfall Guide
1. Unbounded Recursion in Fragment Chains
Remote content that references itself or creates circular dependencies will cause stack overflow or infinite loops.
Fix: Implement an AsyncLocal resolution stack with a configurable depth limit. Log the cycle path before failing fast.
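A minimal sketch of such a resolution stack follows, under stated assumptions. It swaps the mutable `Stack<string>` for an `ImmutableStack<string>` so that fragments resolved in parallel never mutate a shared collection; that substitution is a choice of this sketch, not something the article prescribes. `ResolutionGuard` and `IncludeCycleException` are hypothetical names, and the exception message stands in for logging the cycle path.

```csharp
using System;
using System.Collections.Immutable;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

public sealed class IncludeCycleException : Exception
{
    public IncludeCycleException(string message) : base(message) { }
}

public static class ResolutionGuard
{
    // Flows with the async call chain; each branch sees only its own ancestors.
    private static readonly AsyncLocal<ImmutableStack<string>?> Chain = new();

    public static async Task<T> TrackAsync<T>(string fragment, int maxDepth, Func<Task<T>> resolve)
    {
        var current = Chain.Value ?? ImmutableStack<string>.Empty;
        // Stack enumeration is top-first, so reverse it to print the path in resolution order.
        var path = string.Join(" -> ", current.Reverse().Append(fragment));

        if (current.Contains(fragment))
            throw new IncludeCycleException($"Include cycle detected: {path}");
        if (current.Count() >= maxDepth)
            throw new IncludeCycleException($"Include depth exceeds {maxDepth}: {path}");

        Chain.Value = current.Push(fragment);
        try
        {
            // Nested TrackAsync calls made inside resolve() observe the pushed fragment.
            return await resolve();
        }
        finally
        {
            Chain.Value = current;
        }
    }
}
```

A resolver would wrap each fetch-and-parse step in `ResolutionGuard.TrackAsync(path, maxDepth, ...)`, so a fragment that includes itself through any number of hops fails with the full chain in the message.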
2. Credential Leakage in Configuration Files
Storing API keys or tenant IDs in remoteinclude.json committed to version control exposes secrets to repository history.
Fix: Resolve credentials exclusively through environment variables, CI/CD secret managers, or runtime callbacks. Validate configuration at startup to fail early if required secrets are absent.
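The startup validation could look roughly like the sketch below. `CredentialGuard` and the variable names passed to it are hypothetical; the optional `lookup` parameter exists only so the check can be exercised without touching the real process environment.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class CredentialGuard
{
    // Resolves every required secret up front and reports all missing names at once.
    public static IReadOnlyDictionary<string, string> Require(
        IEnumerable<string> names,
        Func<string, string?>? lookup = null)
    {
        if (lookup is null)
        {
            lookup = name => Environment.GetEnvironmentVariable(name);
        }

        var resolved = names.Distinct().ToDictionary(name => name, name => lookup(name));
        var missing = resolved
            .Where(pair => string.IsNullOrWhiteSpace(pair.Value))
            .Select(pair => pair.Key)
            .ToList();

        if (missing.Count > 0)
        {
            // Only names appear in the message; secret values are never logged.
            throw new InvalidOperationException(
                "Missing required secrets: " + string.Join(", ", missing));
        }

        return resolved.ToDictionary(pair => pair.Key, pair => pair.Value!);
    }
}
```

Calling `CredentialGuard.Require(new[] { "DOCS_API_CLIENT_ID" })` (an illustrative variable name) before the first fetch makes a misconfigured CI job fail in seconds rather than partway through a build.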
3. AI Rewrite Hallucination or Structural Breakage
LLMs may alter code syntax, strip markdown formatting, or introduce inaccurate technical claims when rewriting documentation.
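Fix: Use strict system prompts that mandate structural preservation. Implement a post-processing validation step that verifies markdown AST integrity after normalization. Pin a specific model snapshot (e.g., gpt-4o-mini-2024-07-18) rather than a floating alias such as gpt-4o-mini, so that output does not shift when the provider updates the alias. Pinning reduces drift between builds; it does not make the output deterministic.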
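As one narrow instance of that validation step, the sketch below checks only that fenced code blocks survive the rewrite byte for byte. It is deliberately simpler than a full AST comparison: it recognizes backtick fences only, ignores tilde fences and indented code, and drops an unterminated fence. `NormalizationValidator` is a hypothetical name.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class NormalizationValidator
{
    // Collects the body of each fenced block (the lines between two ``` markers).
    public static List<string> ExtractFencedBlocks(string markdown)
    {
        var blocks = new List<string>();
        List<string>? current = null;

        foreach (var line in markdown.Replace("\r\n", "\n").Split('\n'))
        {
            if (line.TrimStart().StartsWith("```", StringComparison.Ordinal))
            {
                if (current is null)
                {
                    current = new List<string>();   // opening fence
                }
                else
                {
                    blocks.Add(string.Join("\n", current)); // closing fence
                    current = null;
                }
                continue;
            }

            current?.Add(line);
        }

        return blocks;
    }

    // True when the rewrite kept the same code blocks, in the same order, unchanged.
    public static bool CodeBlocksPreserved(string original, string rewritten) =>
        ExtractFencedBlocks(original).SequenceEqual(ExtractFencedBlocks(rewritten));
}
```

When `CodeBlocksPreserved` returns false, a conservative pipeline would discard the rewritten fragment and fall back to the verbatim source rather than publish altered code.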
4. Concurrency Storms During Large Builds
Fetching dozens of fragments simultaneously can trigger rate limits or upstream service degradation.
Fix: Implement a semaphore-based concurrency cap (default: 8 concurrent requests). Add exponential backoff with jitter for transient HTTP failures.
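The backoff half of that fix might be sketched as follows. This uses the "full jitter" variant, where each delay is drawn uniformly between zero and an exponentially growing ceiling. `RetryPolicy` is a hypothetical helper, and treating every `HttpRequestException` as transient is a simplification; a production policy would inspect status codes and honor any Retry-After header.

```csharp
using System;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

public static class RetryPolicy
{
    // Full jitter: a random delay in [0, min(cap, baseDelay * 2^attempt)).
    public static TimeSpan NextDelay(int attempt, TimeSpan baseDelay, TimeSpan cap, Random rng)
    {
        var ceilingMs = Math.Min(
            cap.TotalMilliseconds,
            baseDelay.TotalMilliseconds * Math.Pow(2, attempt));
        return TimeSpan.FromMilliseconds(rng.NextDouble() * ceilingMs);
    }

    public static async Task<T> ExecuteAsync<T>(
        Func<CancellationToken, Task<T>> action,
        int maxAttempts,
        TimeSpan baseDelay,
        TimeSpan cap,
        CancellationToken ct)
    {
        var rng = new Random();
        for (var attempt = 0; ; attempt++)
        {
            try
            {
                return await action(ct);
            }
            catch (HttpRequestException) when (attempt < maxAttempts - 1)
            {
                // Retries remain: wait, then loop. The final failure propagates to the caller.
                await Task.Delay(NextDelay(attempt, baseDelay, cap, rng), ct);
            }
        }
    }
}
```

The retry wrapper belongs inside the semaphore-guarded section, so a fragment that is backing off still counts against the concurrency cap and does not let additional requests pile onto a struggling upstream.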
5. Inline vs Block Context Mismatch
Injecting block-level markdown (headings, lists) into an inline context breaks HTML rendering and creates nested paragraph artifacts.
Fix: Parse the directive's surrounding context before injection. Use Markdig's Block vs Inline node detection to route content to the appropriate AST insertion method.
6. Stale Cross-Build Caching
Persisting remote content cache across CI runs causes documentation to serve outdated fragments when upstream services update.
Fix: Scope the cache to the build process lifetime. Clear the dictionary on pipeline initialization. Use build metadata or content hashes for cache invalidation if cross-run persistence is required.
7. Silent 404 Failures
Remote services may return 404 or 500 responses that get swallowed by error handlers, resulting in missing documentation sections without build warnings.
Fix: Default to hard failure on non-2xx responses. Provide an explicit --allow-missing flag that renders a visible placeholder comment instead of silently omitting content.
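The two behaviours can be expressed as a single decision function. The sketch below is illustrative only: `MissingContentPolicy` is a hypothetical name, the `allowMissing` parameter stands in for the flag described above, and the placeholder is emitted as an HTML comment, which is visible in the generated page source rather than in the rendered page.

```csharp
using System;
using System.Net;

public static class MissingContentPolicy
{
    // Returns the fragment body on success; otherwise fails the build or emits a placeholder.
    public static string Resolve(string path, HttpStatusCode status, string? body, bool allowMissing)
    {
        var code = (int)status;
        if (code >= 200 && code < 300 && body is not null)
        {
            return body;
        }

        if (!allowMissing)
        {
            // Hard-fail default: the build stops and names the fragment that broke it.
            throw new InvalidOperationException(
                $"Remote include '{path}' failed with HTTP {code}.");
        }

        // Opt-in degradation: leave a trace in the output instead of dropping the section silently.
        return $"<!-- remoteinclude: '{path}' unavailable (HTTP {code}) -->";
    }
}
```

Teams that want the gap to be visible to readers could return a short warning paragraph instead of the comment; the important property is that the omission is never silent.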
Production Bundle
Action Checklist
- Verify DocFX version exposes BuildOptions.ConfigureMarkdig before integration
- Configure concurrency cap matching upstream service rate limits
- Implement AsyncLocal cycle detection with depth threshold ≤ 8
- Route all credentials through environment variables or secret managers
- Add post-AI validation to verify markdown AST integrity after normalization
- Enable hard-fail mode for missing content during initial rollout
- Scope in-process cache to build lifetime to prevent stale fragments
- Test inline vs block directive placement in representative documentation layouts
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| Public-facing API docs with frequent updates | Remote Include + Hard Fail | Ensures single source of truth; breaks the build if an upstream fragment is missing or unreachable | Low (HTTP fetch only) |
| Internal troubleshooting guides with varying team voices | Remote Include + AI Normalization | Aligns tone across fragmented authorship without manual editing | Medium (LLM inference cost) |
| Air-gapped or offline build environments | Pre-fetched Local Cache + Fallback | Maintains determinism when remote services are unreachable | Low (storage overhead) |
| High-compliance regulated documentation | Remote Include + Strict Validation | Prevents unauthorized content injection; enforces structural integrity | Low (validation compute) |
Configuration Template
{
"contentService": {
"baseUrl": "https://docs-api.internal.example.com/",
"urlTemplate": "v1/markdown/{path}",
"allowMissing": false,
"maxDepth": 8,
"concurrencyLimit": 8
},
"authentication": {
"mode": "managedIdentity",
"scope": "api://docs-service/.default"
},
"toneNormalization": {
"enabled": true,
"endpoint": "https://your-aoai.openai.azure.com/",
"deployment": "gpt-4o-mini",
"contextStrategy": "section",
"maxTokens": 4096
}
}
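To show how the `contentService` section might be consumed, here is a small sketch using `System.Text.Json`. The type and method names are hypothetical, and the interpretation of `urlTemplate` (substituting the directive's path for `{path}` and resolving the result against `baseUrl`) is an assumption about the template's intent, not documented behaviour. The path is inserted without additional escaping.

```csharp
using System;
using System.Text.Json;

// Hypothetical options type mirroring the "contentService" section of the template.
public sealed record ContentServiceOptions(
    string BaseUrl,
    string UrlTemplate,
    bool AllowMissing,
    int MaxDepth,
    int ConcurrencyLimit);

public static class CompositionConfig
{
    private static readonly JsonSerializerOptions JsonOptions =
        new() { PropertyNameCaseInsensitive = true };

    public static ContentServiceOptions LoadContentService(string json)
    {
        using var document = JsonDocument.Parse(json);
        var section = document.RootElement.GetProperty("contentService");
        var options = section.Deserialize<ContentServiceOptions>(JsonOptions)
            ?? throw new InvalidOperationException("The contentService section is empty.");

        // Fail early on values that would disable the safety limits.
        if (options.MaxDepth < 1 || options.ConcurrencyLimit < 1)
            throw new InvalidOperationException("maxDepth and concurrencyLimit must be at least 1.");

        return options;
    }

    // Assumed semantics: replace {path} in the template, then resolve against baseUrl.
    public static Uri BuildFragmentUri(ContentServiceOptions options, string fragmentPath)
    {
        var relative = options.UrlTemplate.Replace("{path}", fragmentPath.TrimStart('/'));
        return new Uri(new Uri(options.BaseUrl), relative);
    }
}
```

Note that the template contains no secrets: the `authentication` section names a mode and a scope only, which is consistent with resolving credentials at runtime.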
Quick Start Guide
- Register the extension: Add the remote composition package to your documentation project and hook it into DocFX's ConfigureMarkdig delegate during build initialization.
- Define the directive: Insert [!remoteinclude[Title](path/to/fragment.md)] tags into your markdown files where remote content should resolve.
- Configure credentials: Set environment variables or CI/CD secrets matching your authentication mode. Verify the resolver picks up credentials at startup.
- Run a validation build: Execute docfx build with hard-fail mode enabled. Confirm remote fragments resolve correctly and inline/block contexts render without layout breaks.
- Enable tone normalization (optional): Add rewrite hints to directives and configure the AI normalization service. Monitor output for structural preservation before enabling across the full documentation set.
