ment variables, never hardcoded. For enterprise governance, consider implementing Azure RBAC with Managed Identity instead of account keys, enabling fine-grained role assignments without secret rotation.
The extraction service operates as a stateless HTTP-triggered function. It receives file identifiers, streams content from Blob Storage, normalizes format-specific parsing, and returns structured JSON. Python is selected for its mature document parsing ecosystem (pypdf, python-docx) and native Azure Functions runtime support.
Project Structure:
doc-pipeline/
├── src/
│ ├── __init__.py
│ ├── blob_accessor.py
│ ├── text_normalizer.py
│ └── api_routes.py
├── host.json
├── local.settings.json
├── requirements.txt
└── openapi_contract.json
Core Implementation (src/api_routes.py):
import azure.functions as func
import logging
from src.blob_accessor import BlobStoreClient
from src.text_normalizer import DocumentParser
app = func.FunctionApp(http_auth_level=func.AuthLevel.ANONYMOUS)
blob_client = BlobStoreClient()
parser = DocumentParser()
@app.route(route="catalog", methods=["GET"])
def list_available_files(req: func.HttpRequest) -> func.HttpResponse:
try:
files = blob_client.enumerate_blobs()
return func.HttpResponse(
body=str({"available_files": files}),
status_code=200,
mimetype="application/json"
)
except Exception as exc:
logging.error(f"Catalog enumeration failed: {exc}")
return func.HttpResponse(status_code=500, body="Internal catalog error")
@app.route(route="fetch/{filename}", methods=["GET"])
def retrieve_document_content(req: func.HttpRequest) -> func.HttpResponse:
filename = req.route_params.get("filename")
if not filename:
return func.HttpResponse(status_code=400, body="Filename required")
try:
raw_stream = blob_client.download_blob(filename)
extracted_text = parser.normalize(raw_stream, filename)
payload = {
"source": filename,
"content": extracted_text,
"status": "success"
}
return func.HttpResponse(
body=str(payload),
status_code=200,
mimetype="application/json"
)
except Exception as exc:
logging.warning(f"Extraction failed for {filename}: {exc}")
return func.HttpResponse(
body=str({"source": filename, "status": "failed", "error": str(exc)}),
status_code=422,
mimetype="application/json"
)
Extraction Logic (src/text_normalizer.py):
import io
import csv
from pypdf import PdfReader
from docx import Document as DocxDocument
class DocumentParser:
def normalize(self, stream: io.BytesIO, filename: str) -> str:
ext = filename.rsplit(".", 1)[-1].lower()
if ext == "pdf":
reader = PdfReader(stream)
return "\n".join(page.extract_text() or "" for page in reader.pages)
elif ext == "docx":
doc = DocxDocument(stream)
return "\n".join(para.text for para in doc.paragraphs)
elif ext in ("txt", "md", "csv"):
return stream.read().decode("utf-8")
else:
raise ValueError(f"Unsupported format: .{ext}")
Architecture Rationale:
- Class-based separation:
BlobStoreClient handles connectivity and streaming; DocumentParser manages format-specific logic. This prevents monolithic functions and enables unit testing.
- Stream-based processing: Files are downloaded as byte streams, not saved to disk. This eliminates I/O bottlenecks and aligns with serverless ephemeral storage limits.
- Explicit error mapping: Unsupported formats and extraction failures return
422 Unprocessable Entity with structured payloads, allowing Copilot Studio to handle degradation gracefully.
- Anonymous auth for prototyping:
func.AuthLevel.ANONYMOUS accelerates initial Copilot integration. Production deployments must migrate to OAuth 2.0 or Entra ID managed identities.
3. OpenAPI Contract Generation
Copilot Studio relies on OpenAPI specifications to discover tool capabilities, validate parameters, and map response schemas. The contract must strictly define endpoints, request/response models, and host routing.
Critical Formatting Rule: The host field must contain only the domain name. Protocol prefixes (https://) break Copilot's internal router.
{
"openapi": "3.0.1",
"info": { "title": "Document Retrieval Service", "version": "1.0.0" },
"servers": [{ "url": "https://your-function-app.azurewebsites.net" }],
"paths": {
"/api/catalog": {
"get": {
"operationId": "ListFiles",
"responses": {
"200": { "description": "Returns array of filenames" }
}
}
},
"/api/fetch/{filename}": {
"get": {
"operationId": "FetchContent",
"parameters": [
{ "name": "filename", "in": "path", "required": true, "schema": { "type": "string" } }
],
"responses": {
"200": { "description": "Returns extracted text payload" }
}
}
}
}
}
4. Copilot Studio Integration
Import the OpenAPI specification as a custom connector. Map the ListFiles and FetchContent operations to conversational triggers. Configure the agent to call ListFiles when users request document inventories, and chain FetchContent into a summarization prompt template. The agent receives plain text, applies system instructions for summarization, and returns structured outputs without parsing binary data.
Pitfall Guide
1. Windows Runtime Mismatch for Python Functions
Explanation: Azure Functions Python runtime is exclusively supported on Linux hosting plans. Selecting Windows during deployment causes silent runtime failures or missing extension bundles.
Fix: Always provision Linux-based App Service or Flex Consumption plans. Verify runtime stack in the Azure Portal before deployment.
2. OpenAPI Host Field Protocol Leakage
Explanation: Including https:// in the host or servers.url field breaks Copilot Studio's internal routing engine, resulting in 404 Not Found or connection timeouts.
Fix: Strip protocol prefixes. Use contoso-api.azurewebsites.net instead of https://contoso-api.azurewebsites.net.
3. Hardcoded Secrets in Local Settings
Explanation: Committing local.settings.json to version control exposes storage credentials. Automated scanners frequently flag these leaks, triggering security incidents.
Fix: Add local.settings.json to .gitignore. Use Azure Key Vault or Managed Identity for production secret injection.
4. Ignoring OCR Requirements for Scanned PDFs
Explanation: Image-based PDFs lack embedded text layers. Standard pypdf extraction returns empty strings, causing silent summarization failures.
Fix: Route scanned documents through Azure Document Intelligence or Tesseract OCR pipelines before normalization. Implement format detection to trigger OCR fallbacks.
5. Anonymous Auth in Production Environments
Explanation: func.AuthLevel.ANONYMOUS exposes extraction endpoints to unauthenticated traffic, violating zero-trust policies and enabling data exfiltration.
Fix: Migrate to OAuth 2.0, Entra ID, or Azure API Management with subscription keys. Enforce role-based access at the gateway layer.
6. Memory Exhaustion on Large Binary Files
Explanation: Loading multi-gigabyte documents into memory triggers OutOfMemoryException and function terminations. Serverless functions have strict memory ceilings (typically 1.5GB-4GB).
Fix: Implement chunked streaming, enforce file size limits (e.g., reject >50MB), or offload large files to Azure Batch/Logic Apps for asynchronous processing.
7. Missing Content-Type Validation
Explanation: Returning text/plain or untyped responses breaks Copilot's JSON schema parser, causing tool invocation failures.
Fix: Always set mimetype="application/json" in HTTP responses. Validate payloads against OpenAPI schemas before transmission.
Production Bundle
Action Checklist
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|
| Low-volume internal knowledge base | Flex Consumption | Scales to zero, pay-per-execution, minimal idle cost | $0.000016/GB-second |
| High-frequency enterprise AI agent | Premium Plan v3 | VNet integration, dedicated instances, predictable latency | Fixed hourly + execution |
| Batch document processing | Azure Logic Apps + Blob Triggers | Event-driven, built-in retry, visual orchestration | $0.000025/execution |
| Strict compliance/air-gapped | App Service Environment (ASE) | Fully isolated network, private endpoints, audit trails | High (dedicated infrastructure) |
Configuration Template
local.settings.json
{
"IsEncrypted": false,
"Values": {
"AzureWebJobsStorage": "",
"FUNCTIONS_WORKER_RUNTIME": "python",
"BLOB_CONNECTION_STRING": "DefaultEndpointsProtocol=https;AccountName=...;AccountKey=...;EndpointSuffix=core.windows.net",
"BLOB_CONTAINER_NAME": "corporate-knowledge-base"
}
}
requirements.txt
azure-functions
azure-storage-blob
azure-identity
pypdf
python-docx
host.json
{
"version": "2.0",
"logging": {
"applicationInsights": {
"samplingSettings": {
"isEnabled": true,
"excludedTypes": "Request"
}
}
},
"extensionBundle": {
"id": "Microsoft.Azure.Functions.ExtensionBundle",
"version": "[4.*, 5.0.0)"
}
}
Quick Start Guide
- Initialize Project: Run
func init doc-pipeline --worker-runtime python and create the directory structure outlined in the Core Solution.
- Configure Storage: Create a private Azure Blob container, generate a connection string, and populate
local.settings.json.
- Deploy Function: Execute
func azure functionapp publish <app-name> targeting a Linux Flex Consumption plan. Verify endpoints via browser or curl.
- Connect Copilot Studio: Import the OpenAPI contract, map operations to conversational triggers, and test document retrieval with sample files.
This pipeline transforms unstructured cloud storage into a deterministic, AI-ready knowledge layer. By enforcing strict separation between retrieval, normalization, and orchestration, teams achieve scalable, auditable, and cost-efficient document intelligence without compromising security or performance.