nv .venv
source .venv/bin/activate
pip install 'markitdown[pdf,docx,xlsx]'
### Step 2: Pipeline Architecture & Wrapper Design
Instead of scattering `.convert()` calls across your codebase, encapsulate the transformation logic. This centralizes error handling, logging, and plugin configuration.
```python
import logging
from pathlib import Path
from typing import Any, Optional

from markitdown import MarkItDown

logger = logging.getLogger(__name__)


class DocumentTransformer:
    """Single entry point for document-to-Markdown conversion."""

    def __init__(
        self,
        enable_vision: bool = False,
        vision_client: Optional[Any] = None,
        vision_model: Optional[str] = None,
        azure_endpoint: Optional[str] = None,
        plugin_enabled: bool = False,
    ):
        # Vision settings reach the engine only when explicitly enabled.
        self._engine = MarkItDown(
            enable_plugins=plugin_enabled,
            llm_client=vision_client if enable_vision else None,
            llm_model=vision_model if enable_vision else None,
            docintel_endpoint=azure_endpoint,
        )
        self._vision_active = enable_vision

    def process_file(self, source_path: Path) -> str:
        if not source_path.exists():
            raise FileNotFoundError(f"Target document not found: {source_path}")
        try:
            extraction_result = self._engine.convert(str(source_path))
            return extraction_result.text_content
        except Exception as exc:
            # Log with the file name for context, then re-raise for the caller.
            logger.error(f"Conversion failed for {source_path.name}: {exc}")
            raise
```
### Step 3: Cost-Aware Vision Routing

LLM-powered image descriptions and OCR are opt-in because they introduce latency and API costs. The wrapper above forwards `vision_client` and `vision_model` to the engine only when `enable_vision=True`. In production, enable vision only for the inputs that need it:
```python
from pathlib import Path

from openai import OpenAI

# DocumentTransformer is the wrapper defined in Step 2.
vision_adapter = OpenAI()  # reads OPENAI_API_KEY from the environment
transformer = DocumentTransformer(
    enable_vision=True,
    vision_client=vision_adapter,
    vision_model="gpt-4o",
)

# Only invoke when visual data impacts downstream logic
markdown_output = transformer.process_file(Path("architectural_diagram.jpg"))
```
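
One way to make the routing conditional is to decide per file whether the vision-enabled converter is needed. The sketch below is a minimal illustration and not part of MarkItDown: `VISION_SUFFIXES`, `needs_vision`, and `route_conversion` are hypothetical names, and the two converters are passed in as plain callables (for example, the `process_file` methods of two `DocumentTransformer` instances, one with vision and one without).

```python
from pathlib import Path
from typing import Callable

# Hypothetical policy: suffixes whose content is assumed to be purely visual.
VISION_SUFFIXES = frozenset({".jpg", ".jpeg", ".png"})

Converter = Callable[[Path], str]


def needs_vision(path: Path) -> bool:
    """Return True when the file suffix is on the vision allow-list."""
    return path.suffix.lower() in VISION_SUFFIXES


def route_conversion(path: Path, base: Converter, vision: Converter) -> str:
    """Send image files to the vision converter and everything else to the base one."""
    converter = vision if needs_vision(path) else base
    return converter(path)
```

A suffix check is only the simplest possible policy; MIME sniffing or per-tenant rules could replace `needs_vision` without changing the routing function.
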
### Step 4: Async Execution for High-Throughput Pipelines

The underlying `.convert()` method is synchronous. In event-driven architectures, wrap it to prevent event loop blocking:
```python
import asyncio
from pathlib import Path


async def batch_transform(file_paths: list[Path]) -> dict[str, str]:
    # DocumentTransformer is the wrapper defined in Step 2.
    transformer = DocumentTransformer()
    results: dict[str, str] = {}
    # Run each blocking conversion in a worker thread.
    tasks = [
        asyncio.to_thread(transformer.process_file, fp)
        for fp in file_paths
    ]
    # return_exceptions=True collects failures as values instead of raising the first one.
    completed = await asyncio.gather(*tasks, return_exceptions=True)
    for fp, outcome in zip(file_paths, completed):
        if isinstance(outcome, Exception):
            results[fp.name] = f"ERROR: {outcome}"
        else:
            results[fp.name] = outcome
    return results
```
### Architecture Rationale
- Markdown as the target format: LLMs are trained on heavily markdown-structured corpora. Preserving `#`, `|`, `-`, and `[]` syntax maintains semantic boundaries that chunking algorithms rely on, reducing context window fragmentation.
- Plugin isolation: OCR and Azure Document Intelligence are loaded separately to keep the base installation lightweight. Production systems should only import heavy dependencies when the use case demands them.
- Conditional vision routing: Image transcription and OCR via LLM vision are expensive. The wrapper design forces explicit opt-in, preventing accidental cost spikes when processing directories containing screenshots or scanned pages.
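- Thread-pool async wrapping: Synchronous I/O blocks async runtimes. `asyncio.to_thread` moves the blocking conversion onto worker threads, which keeps the event loop responsive without rewriting the library's internals. CPU-heavy parsing still contends for the GIL, so a process pool is the better fit when conversion itself is the bottleneck.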
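
The thread-pool wrapping described in the last point can be combined with a concurrency cap so that a large batch does not occupy every worker thread at once. This is a sketch under assumptions: `bounded_convert` and `max_workers` are hypothetical names, and `convert` stands in for any blocking callable such as `DocumentTransformer.process_file`.

```python
import asyncio
from pathlib import Path
from typing import Callable, Union


async def bounded_convert(
    paths: list[Path],
    convert: Callable[[Path], str],
    max_workers: int = 4,
) -> list[Union[str, BaseException]]:
    """Run a blocking converter in worker threads, at most max_workers at a time."""
    gate = asyncio.Semaphore(max_workers)

    async def run_one(path: Path) -> str:
        async with gate:
            # The blocking call runs off the event loop.
            return await asyncio.to_thread(convert, path)

    # Results come back in input order; failures are returned, not raised.
    return await asyncio.gather(*(run_one(p) for p in paths), return_exceptions=True)
```

Returning exceptions in place lets the caller decide per file whether to retry, skip, or report, instead of losing the whole batch to one bad document.
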
## Pitfall Guide
### 1. Passing Untrusted Input to `.convert()`

Explanation: The library executes with the privileges of the host process. Passing arbitrary user-uploaded files directly to `.convert()` can trigger unintended local file access or remote URI resolution.

Fix: Restrict input scope. Use `convert_local()` for filesystem-bound documents or `convert_stream()` for network payloads. Validate file extensions and run the conversion service in a sandboxed container with minimal filesystem permissions.
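
A minimal sketch of the input restriction described above, assuming uploads are stored under a single directory. `ALLOWED_SUFFIXES`, `validate_upload`, and the upload-root convention are hypothetical; the returned path is what would then be handed to `convert_local()`.

```python
from pathlib import Path

# Hypothetical allow-list; match it to the extras actually installed.
ALLOWED_SUFFIXES = frozenset({".pdf", ".docx", ".xlsx"})


def validate_upload(candidate: Path, upload_root: Path) -> Path:
    """Return the resolved path if it is an allowed file inside upload_root."""
    resolved = candidate.resolve()
    root = upload_root.resolve()
    # Reject paths that escape the upload directory (e.g. via ".." or symlinks).
    if not resolved.is_relative_to(root):
        raise ValueError(f"Path escapes upload root: {candidate}")
    if resolved.suffix.lower() not in ALLOWED_SUFFIXES:
        raise ValueError(f"Unsupported file type: {resolved.suffix or '(none)'}")
    if not resolved.is_file():
        raise ValueError(f"Not a regular file: {candidate}")
    return resolved
```

Extension checks are easy to spoof, so this complements the sandboxed container rather than replacing it.
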
### 2. LLM Vision Cost Bleed

Explanation: Enabling `llm_client` and `llm_model` globally causes every image, diagram, or embedded graphic to trigger a vision API call. Processing a 50-slide presentation with screenshots can quickly accumulate unexpected token costs.

Fix: Implement conditional vision routing. Parse file metadata first, enable vision only for specific MIME types or file patterns, and cache results for identical image hashes.
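
The caching half of this fix can be sketched as a content-addressed lookup in front of whatever function calls the vision model. `CachedDescriber` and the `describe` callable are hypothetical, and how such a cache would be wired into MarkItDown is left open; the block only illustrates the idea of keying results by image hash.

```python
import hashlib
from typing import Callable


class CachedDescriber:
    """Cache image descriptions by SHA-256 of the image bytes."""

    def __init__(self, describe: Callable[[bytes], str]):
        self._describe = describe  # hypothetical function that calls the vision API
        self._cache: dict[str, str] = {}
        self.api_calls = 0

    def __call__(self, image_bytes: bytes) -> str:
        key = hashlib.sha256(image_bytes).hexdigest()
        if key not in self._cache:
            # Cache miss: pay for one vision call and remember the result.
            self.api_calls += 1
            self._cache[key] = self._describe(image_bytes)
        return self._cache[key]
```

An in-memory dictionary is enough to show the pattern; a shared store would be needed for the cache to survive restarts or span multiple workers.
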
### 3. Dependency Bloat from `[all]`

Explanation: Installing `markitdown[all]` pulls in heavy optional packages (OCR engines, Azure SDKs, audio transcription libraries). This increases container image size, slows CI/CD pipelines, and introduces version conflicts.

Fix: Audit your actual format requirements. Install only the extras you need: `pip install 'markitdown[pdf,docx,youtube-transcription]'`. Pin versions in `requirements.txt` or `pyproject.toml`.
### 4. Assuming Pixel-Perfect Layout Preservation

Explanation: The tool is explicitly designed for LLM ingestion, not human-readable document rendering. Complex multi-column layouts, floating images, and intricate invoice forms will lose visual positioning.

Fix: Set clear expectations in downstream consumers. For documents where spatial layout is critical (legal contracts, financial forms), route them through Azure Document Intelligence via the `docintel_endpoint` parameter instead of relying on standard extraction.
### 5. Blocking the Event Loop in Async Services

Explanation: Calling `.convert()` directly inside an async FastAPI or aiohttp handler blocks the event loop, degrading throughput and increasing latency under concurrent load.

Fix: Always delegate conversion to `asyncio.to_thread()` or a process pool. Never run synchronous file I/O on the async event loop.
### 6. Ignoring Stream-Based Processing for Archives

Explanation: ZIP files and EPUBs contain multiple internal documents. Loading them entirely into memory before conversion can trigger out-of-memory errors on large archives.

Fix: Use `convert_stream()` with chunked reading or iterate through archive contents externally, converting individual entries sequentially to maintain a predictable memory footprint.
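
The external-iteration option can be sketched with the standard-library `zipfile` module, buffering one entry at a time so only a single document is in memory. `convert_zip_entries` is a hypothetical helper, and its `convert_stream` argument is any callable that accepts a binary stream and an entry name; it is a stand-in, not MarkItDown's own method signature.

```python
import io
import zipfile
from pathlib import Path
from typing import BinaryIO, Callable, Iterator


def convert_zip_entries(
    archive: Path,
    convert_stream: Callable[[BinaryIO, str], str],
) -> Iterator[tuple[str, str]]:
    """Yield (entry name, converted text) one archive member at a time."""
    with zipfile.ZipFile(archive) as zf:
        for info in zf.infolist():
            if info.is_dir():
                continue
            # Only the current entry is held in memory.
            with zf.open(info) as member:
                buffer = io.BytesIO(member.read())
            yield info.filename, convert_stream(buffer, info.filename)
```

Because the function is a generator, the caller can write each result to disk or a queue before the next entry is read, which keeps peak memory close to the size of the largest single entry.
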
### 7. Enabling Plugins Without Installing Them

Explanation: The OCR plugin (`markitdown-ocr`) requires separate installation and explicit `enable_plugins=True`. Forgetting to install the plugin package while enabling the flag results in silent failures or missing text extraction.

Fix: Document plugin requirements in your deployment manifest. Verify plugin availability at startup with a health check that attempts a lightweight conversion before accepting production traffic.
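
A startup check along these lines can catch the missing-plugin case before traffic arrives. The sketch only asks whether a distribution is installed, using the standard-library `importlib.metadata`; `missing_distributions` and `require_distributions` are hypothetical helpers, and a fuller health check would also run the small test conversion mentioned above.

```python
from importlib import metadata


def missing_distributions(names: list[str]) -> list[str]:
    """Return the subset of distribution names that are not installed."""
    missing = []
    for name in names:
        try:
            metadata.version(name)
        except metadata.PackageNotFoundError:
            missing.append(name)
    return missing


def require_distributions(names: list[str]) -> None:
    """Fail fast at startup if any required distribution is absent."""
    missing = missing_distributions(names)
    if missing:
        raise RuntimeError(f"Missing required packages: {', '.join(missing)}")
```

A service that enables plugins might call `require_distributions(["markitdown", "markitdown-ocr"])` during startup so that a misconfigured image fails immediately instead of silently skipping OCR.
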
## Production Bundle

### Action Checklist

### Decision Matrix

| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| Standard RAG ingestion (PDFs, DOCX, HTML) | Base MarkItDown conversion | Preserves semantic structure, zero external API calls | Near-zero |
| Scanned documents with embedded text | OCR plugin + LLM vision | Extracts text from images without custom ML pipelines | Moderate (vision API tokens) |
| Complex financial forms & invoices | Azure Document Intelligence endpoint | Superior table/form recognition, enterprise-grade accuracy | High (Azure service pricing) |
| High-throughput batch processing | Async wrapper + selective extras | Prevents I/O blocking, minimizes memory footprint | Low (compute scaling only) |
| User-uploaded untrusted files | `convert_local()` + sandboxed container | Prevents arbitrary file/URI access, enforces security boundaries | Low (infrastructure isolation) |

### Configuration Template

```python
# production_ingestion.py
import logging
import os
from pathlib import Path

from markitdown import MarkItDown
from openai import OpenAI

logging.basicConfig(level=os.getenv("LOG_LEVEL", "INFO"))
logger = logging.getLogger("doc_pipeline")


class ProductionDocumentPipeline:
    def __init__(self):
        # Vision is opt-in: client and model are set only when ENABLE_VISION=true.
        vision_enabled = os.getenv("ENABLE_VISION") == "true"
        self._vision_client = (
            OpenAI(api_key=os.getenv("OPENAI_API_KEY")) if vision_enabled else None
        )
        self._vision_model = (
            os.getenv("VISION_MODEL", "gpt-4o") if vision_enabled else None
        )
        self._azure_endpoint = os.getenv("AZURE_DOC_INTEL_ENDPOINT")
        self._engine = MarkItDown(
            enable_plugins=os.getenv("ENABLE_OCR_PLUGIN") == "true",
            llm_client=self._vision_client,
            llm_model=self._vision_model,
            docintel_endpoint=self._azure_endpoint,
        )

    def transform(self, file_path: Path) -> str:
        if not file_path.is_file():
            raise ValueError(f"Invalid file path: {file_path}")
        logger.info(f"Starting conversion: {file_path.name} | Format: {file_path.suffix}")
        result = self._engine.convert(str(file_path))
        logger.info(f"Conversion complete: {len(result.text_content)} characters extracted")
        return result.text_content

    def batch_transform(self, directory: Path) -> dict[str, str]:
        if not directory.is_dir():
            raise ValueError(f"Invalid directory: {directory}")
        outputs: dict[str, str] = {}
        for file in directory.iterdir():
            if file.is_file():
                try:
                    outputs[file.name] = self.transform(file)
                except Exception as e:
                    # Record the failure and continue with the remaining files.
                    logger.error(f"Failed to process {file.name}: {e}")
                    outputs[file.name] = f"CONVERSION_ERROR: {e}"
        return outputs
```
### Quick Start Guide
- Initialize environment: Create a virtual environment with Python 3.10+ and install only the formats you need: `pip install 'markitdown[pdf,docx]'`
- Verify installation: Run `markitdown --help` to confirm the CLI is accessible and dependencies are resolved.
- Test basic conversion: Execute `markitdown sample.pdf -o output.md` to validate extraction quality and markdown structure.
- Integrate into pipeline: Import `MarkItDown` in your Python service, instantiate the engine, and call `.convert()` with proper error handling and logging.
- Scale safely: Wrap synchronous calls in thread pools, validate and restrict inputs, and enable vision/OCR plugins only when downstream logic requires visual context.