tract. This contract guarantees that the artifact is traceable, bounded, and verifiable.
interface ContextSnapshot {
  snapshotId: string;
  sourceSystem: 'notion' | 'slack' | 'gmail' | 'confluence';
  stableIdentity: string; // Original object ID + path
  contentHash: string; // SHA-256 of normalized content
  asOfTimestamp: string; // ISO 8601 UTC
  permissionEnvelope: PermissionEnvelope;
  provenance: {
    connectorVersion: string;
    normalizationVersion: string;
    ingestionTimestamp: string;
    excludedReasons?: string[];
  };
}

interface PermissionEnvelope {
  toolAccess: ('read' | 'write' | 'delete')[];
  pathGrants: Array<{
    identity: string;
    allowedPaths: string[];
    deniedPaths: string[];
  }>;
}
Rationale: The contract separates identity, content, permissions, and provenance, which enables deterministic diffing, query-time filtering, and precise audit reconstruction. Together, asOfTimestamp and contentHash let you check whether two sync runs over the same source state produced the same content, which is critical for debugging and rollback. Provenance fields such as ingestionTimestamp will still differ between runs.
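To make the hashing and ID scheme concrete, here is a minimal sketch. The `canonicalize` helper and its rules (Unicode NFC, LF line endings, trailing whitespace removed) are assumptions for illustration, not something the contract prescribes; the ID layout follows the `{source}_{stable_identity}_{timestamp}` format from the configuration template.

```typescript
import { createHash } from "node:crypto";

// Hypothetical helper: canonicalize text so cosmetic differences
// (line endings, trailing spaces, Unicode form) do not change the hash.
function canonicalize(content: string): string {
  return content
    .normalize("NFC")
    .replace(/\r\n?/g, "\n")
    .split("\n")
    .map((line) => line.trimEnd())
    .join("\n")
    .trim();
}

// SHA-256 of the canonical form, hex-encoded, for ContextSnapshot.contentHash.
function computeContentHash(content: string): string {
  return createHash("sha256").update(canonicalize(content), "utf8").digest("hex");
}

// Assumed ID layout: {source}_{stableIdentity}_{asOfTimestamp}.
function buildSnapshotId(source: string, stableIdentity: string, asOfTimestamp: string): string {
  return `${source}_${stableIdentity}_${asOfTimestamp}`;
}

// Two fetches of the same page that differ only in line endings and
// trailing whitespace hash to the same value.
const windowsStyle = computeContentHash("Release notes\r\nShip on Friday  \r\n");
const unixStyle = computeContentHash("Release notes\nShip on Friday");
console.log(windowsStyle === unixStyle);
```

Hashing the canonical form rather than the raw payload is what lets a diff job treat an unchanged hash as "nothing to re-embed".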
Step 2: Normalize SaaS Blobs into Agent-Readable Files
Raw API responses contain metadata noise, HTML artifacts, and platform-specific formatting that degrade agent reasoning. Normalization converts these into predictable, machine-traversable formats.
// Dispatches each raw payload to a source-specific normalizer.
function normalizeSourcePayload(raw: unknown, source: ContextSnapshot['sourceSystem']): string {
  switch (source) {
    case 'notion':
      return extractMarkdownBlocks(raw as NotionPage).join('\n\n');
    case 'slack':
      return formatThreadMessages(raw as SlackThread);
    case 'gmail':
      return sanitizeEmailThread(raw as GmailThread);
    default:
      // 'confluence' has no normalizer here and falls through to this error.
      throw new Error(`Unsupported source: ${source}`);
  }
}

// Keeps only text-bearing blocks; comments and unsupported blocks are dropped.
function extractMarkdownBlocks(page: NotionPage): string[] {
  return page.blocks
    .filter(b => b.type !== 'unsupported' && !isCommentBlock(b))
    .map(b => b.plain_text ?? b.content ?? '')
    .filter(Boolean);
}
Rationale: Deterministic normalization removes platform-specific noise and creates a stable input surface for embedding models. By stripping comments, unsupported blocks, and formatting artifacts, you reduce token waste and prevent agents from misinterpreting structural metadata as semantic content.
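The Slack branch above calls formatThreadMessages without showing it. One possible shape is sketched below. The `SlackThread` and `SlackMessage` fields are assumptions for illustration, since the article does not define them, and treating every message with a `subtype` as noise is a simplification.

```typescript
// Hypothetical shapes: the fields below are assumptions, not the Slack API contract.
interface SlackMessage {
  user: string;
  ts: string; // epoch seconds as a string
  text: string;
  subtype?: string; // e.g. join/leave notices, treated here as noise
}

interface SlackThread {
  channel: string;
  messages: SlackMessage[];
}

// Produces one line per human message, oldest first, in a fixed format.
function formatThreadMessages(thread: SlackThread): string {
  return [...thread.messages]
    .filter((m) => m.subtype === undefined && m.text.trim() !== "")
    .sort((x, y) => Number(x.ts) - Number(y.ts))
    .map((m) => {
      const when = new Date(Number(m.ts) * 1000).toISOString();
      return `[${when}] ${m.user}: ${m.text.trim()}`;
    })
    .join("\n");
}
```

Sorting by timestamp before formatting keeps the output independent of the order in which the API happened to return messages, which is what makes the content hash stable across runs.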
Step 3: Enforce Dual-Layer Authorization
Permissions must be evaluated at two distinct points: during ingestion (to scope what enters the workspace) and during retrieval (to validate what the agent actually receives).
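// Result shape inferred from the return values below.
interface AuthorizedPaths {
  allowed: string[];
  denied: string[];
  reason: string;
}

class AuthorizationGateway {
  async validateRetrieval(
    snapshot: ContextSnapshot,
    requestingIdentity: string,
    queryPaths: string[]
  ): Promise<AuthorizedPaths> {
    const envelope = snapshot.permissionEnvelope;
    const identityGrant = envelope.pathGrants.find(g => g.identity === requestingIdentity);
    // Fail closed: an identity with no grant gets nothing.
    if (!identityGrant) {
      return { allowed: [], denied: queryPaths, reason: 'IDENTITY_NOT_FOUND' };
    }
    // Note: startsWith is a raw string prefix test, so '/workspace/eng'
    // also matches '/workspace/engineering'.
    const allowed = queryPaths.filter(p =>
      identityGrant.allowedPaths.some(prefix => p.startsWith(prefix)) &&
      !identityGrant.deniedPaths.some(prefix => p.startsWith(prefix))
    );
    const denied = queryPaths.filter(p => !allowed.includes(p));
    // FULL_ACCESS is an added value so that a fully authorized query is not
    // reported as PARTIAL_ACCESS.
    const reason =
      allowed.length === 0 ? 'PATH_RESTRICTED'
      : denied.length === 0 ? 'FULL_ACCESS'
      : 'PARTIAL_ACCESS';
    return { allowed, denied, reason };
  }
}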
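Rationale: The envelope covers both tool operations and path visibility, and evaluating it at ingestion and again at retrieval prevents unauthorized content from leaking into agent context. Fail-closed behavior ensures that ambiguous or missing grants result in exclusion rather than exposure. Query-time validation catches permission drift that occurred after the initial sync, provided the envelope being evaluated reflects current source permissions rather than only the state captured at ingestion.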
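The gateway compares paths with a plain string prefix test, which treats `/workspace/engineering` as if it were inside `/workspace/eng`. A segment-aware comparison avoids that. The sketch below is one way to write it; the helper names are hypothetical, and it assumes `/`-separated paths where a root prefix such as `/` matches everything.

```typescript
// Split a path into non-empty segments, e.g. "/workspace/eng/" -> ["workspace", "eng"].
function segments(path: string): string[] {
  return path.split("/").filter((s) => s.length > 0);
}

// True when `path` equals `prefix` or lies underneath it, compared segment by segment.
function isUnder(path: string, prefix: string): boolean {
  const p = segments(path);
  const q = segments(prefix);
  return q.length <= p.length && q.every((seg, i) => seg === p[i]);
}

// Deny wins over allow, and a path with no matching allow entry is rejected.
function isPathAuthorized(path: string, allowedPaths: string[], deniedPaths: string[]): boolean {
  if (deniedPaths.some((d) => isUnder(path, d))) return false;
  return allowedPaths.some((a) => isUnder(path, a));
}

console.log(isPathAuthorized("/workspace/eng/runbook", ["/workspace/eng"], []));
console.log(isPathAuthorized("/workspace/engineering/roadmap", ["/workspace/eng"], []));
```

The first call is authorized and the second is not, because `engineering` is a different segment from `eng` even though it shares the same leading characters.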
Step 4: Version Artifacts and Log Provenance
Versioning extends beyond content. It must cover the normalization logic, embedding model, retrieval configuration, and connector scopes. Logs must separate ingestion events from retrieval events to support distinct investigation workflows.
interface IngestionEvent {
  eventType: 'SYNC_START' | 'SYNC_COMPLETE' | 'PERMISSION_REVOKED' | 'CONTENT_DELETED';
  snapshotId: string;
  connectorScopes: string[];
  artifactsProcessed: number;
  artifactsExcluded: number;
  exclusionReasons: Record<string, number>;
  timestamp: string;
}

interface RetrievalEvent {
  eventType: 'QUERY_EXECUTED';
  snapshotIds: string[];
  requestingIdentity: string;
  authorizedPaths: string[];
  deniedPaths: string[];
  embeddingModel: string;
  latencyMs: number;
  timestamp: string;
}
Rationale: Separating ingestion and retrieval logs allows security teams to trace permission changes independently from agent behavior. Versioning the pipeline components ensures that behavioral changes can be correlated with code or model updates, enabling precise rollback and regression testing.
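Correlating a behavioral change with a pipeline change is easier when the versioned components are compared as one record. The following sketch assumes a `PipelineVersion` shape that gathers the components named above; `retrievalConfigVersion` and both function names are hypothetical.

```typescript
// Assumed record of everything that should be versioned together.
interface PipelineVersion {
  connectorVersion: string;
  normalizationVersion: string;
  embeddingModel: string;
  retrievalConfigVersion: string; // hypothetical field
  connectorScopes: string[];
}

// Render a field so that scope order does not count as a change.
function render(value: string | string[]): string {
  return Array.isArray(value) ? [...value].sort().join(",") : value;
}

// Stable, human-readable fingerprint suitable for attaching to log entries.
function pipelineFingerprint(v: PipelineVersion): string {
  return [
    `connector=${v.connectorVersion}`,
    `normalization=${v.normalizationVersion}`,
    `embedding=${v.embeddingModel}`,
    `retrieval=${v.retrievalConfigVersion}`,
    `scopes=${render(v.connectorScopes)}`,
  ].join(";");
}

// Names of the components that differ between two pipeline versions.
function changedComponents(before: PipelineVersion, after: PipelineVersion): string[] {
  const keys: (keyof PipelineVersion)[] = [
    "connectorVersion",
    "normalizationVersion",
    "embeddingModel",
    "retrievalConfigVersion",
    "connectorScopes",
  ];
  return keys.filter((k) => render(before[k]) !== render(after[k]));
}
```

When a regression appears, calling `changedComponents` on the fingerprints recorded before and after narrows the investigation to the components that actually moved.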
Pitfall Guide
1. Treating Ingestion as One-Way Sync
Explanation: Many pipelines only handle additions and updates. They ignore deletions and permission revocations, causing the vector store to accumulate stale or restricted content.
Fix: Implement tombstone propagation. When a source object is deleted or restricted, emit a CONTENT_DELETED or PERMISSION_REVOKED event that triggers immediate artifact removal or access denial. Maintain a reconciliation job that diffs source state against the governed workspace weekly.
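A reconciliation pass of this kind can be sketched as a diff between what the source currently reports and what the workspace still holds. The `SourceObjectState` shape and its `readable` flag are assumptions for illustration; the event type names come from the IngestionEvent interface.

```typescript
type TombstoneType = "CONTENT_DELETED" | "PERMISSION_REVOKED";

interface Tombstone {
  eventType: TombstoneType;
  stableIdentity: string;
}

// Hypothetical view of the source: `readable` is false when the connector
// can see that the object exists but may no longer read it.
interface SourceObjectState {
  stableIdentity: string;
  readable: boolean;
}

// Emits a tombstone for every workspace artifact whose source object
// has disappeared or become unreadable.
function reconcile(sourceState: SourceObjectState[], workspaceIdentities: string[]): Tombstone[] {
  const byId = new Map<string, SourceObjectState>();
  for (const s of sourceState) byId.set(s.stableIdentity, s);

  const tombstones: Tombstone[] = [];
  for (const id of workspaceIdentities) {
    const current = byId.get(id);
    if (current === undefined) {
      tombstones.push({ eventType: "CONTENT_DELETED", stableIdentity: id });
    } else if (!current.readable) {
      tombstones.push({ eventType: "PERMISSION_REVOKED", stableIdentity: id });
    }
  }
  return tombstones;
}
```

The scheduled job acts as a safety net: event-driven tombstones handle the common case, and the diff catches anything a missed webhook left behind.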
2. Ignoring Cross-Source Timeline Drift
Explanation: Syncing Notion hourly, Slack every 5 minutes, and Gmail daily creates a fragmented reality. Agents stitching together conclusions from mismatched timestamps produce inconsistent or contradictory outputs.
Fix: Define explicit snapshot boundaries. Use a unified asOfTimestamp for each sync batch. For high-stakes workflows, schedule periodic read-consistent rebuilds that pause ingestion, capture a global timestamp, and generate a unified snapshot across all sources.
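As a rough illustration, a batch can be stamped with a single timestamp captured once, and a retrieval layer can refuse to combine sources whose snapshots are too far apart. The function names and the drift threshold are assumptions for this sketch.

```typescript
interface SourceSnapshotRef {
  sourceSystem: string;
  asOfTimestamp: string; // ISO 8601 UTC
}

// Stamp every item in a sync batch with the same as-of time, captured once by the caller.
function stampBatch<T extends object>(items: T[], asOfTimestamp: string): Array<T & { asOfTimestamp: string }> {
  return items.map((item) => ({ ...item, asOfTimestamp }));
}

// Spread between the oldest and newest snapshot across sources, in milliseconds.
function timelineDriftMs(refs: SourceSnapshotRef[]): number {
  if (refs.length === 0) return 0;
  const times = refs.map((r) => Date.parse(r.asOfTimestamp));
  if (times.some((t) => Number.isNaN(t))) throw new Error("Unparseable asOfTimestamp");
  return Math.max(...times) - Math.min(...times);
}

// Hypothetical guard: only combine sources whose snapshots fall within the allowed drift.
function isConsistentEnough(refs: SourceSnapshotRef[], maxDriftMs: number): boolean {
  return timelineDriftMs(refs) <= maxDriftMs;
}
```

With an hourly Notion sync and a daily Gmail sync, the measured drift can approach 24 hours, so a high-stakes workflow would either set `maxDriftMs` tightly and wait for a rebuild or surface the drift to the agent explicitly.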
3. Validating Permissions Only at Ingest Time
Explanation: SaaS permissions are dynamic. A channel that was public during ingestion may become private hours later. Relying solely on initial validation guarantees eventual data leakage.
Fix: Implement dual-check authorization. Store permission envelopes with each snapshot, but always re-evaluate access at retrieval time using current identity state. Fail closed when grants are ambiguous or missing.
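One way to express the second check is to intersect the grants stored with the snapshot against whatever the source reports right now. In this sketch, `CurrentGrantLookup` is a hypothetical callback standing in for a live permission query, and grants are compared by exact match for simplicity.

```typescript
// Hypothetical live lookup: returns the paths an identity may read right now,
// or undefined when the identity is unknown.
type CurrentGrantLookup = (identity: string) => string[] | undefined;

// Returns only the stored grants that are still valid according to the live lookup.
function authorizeAtRetrieval(
  storedAllowed: string[],
  identity: string,
  lookupCurrent: CurrentGrantLookup
): string[] {
  let current: string[] | undefined;
  try {
    current = lookupCurrent(identity);
  } catch {
    // A failed lookup is treated the same as a missing grant.
    current = undefined;
  }
  if (current === undefined) return []; // fail closed
  const live = new Set(current);
  return storedAllowed.filter((path) => live.has(path));
}
```

Because the result is an intersection, a grant added after ingestion does not widen access to an old snapshot, and a grant revoked after ingestion is removed immediately.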
4. Over-Indexing Unstructured SaaS Blobs
Explanation: Raw API responses contain HTML, metadata, comments, and platform-specific formatting that confuse embedding models and waste context windows.
Fix: Apply deterministic normalization with format breakers and content classifiers. Strip unsupported blocks, redact sensitive data, and route suspicious documents to human review before indexing. Preserve source pointers for traceability.
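A deliberately naive sketch of two such steps follows: removing markup and redacting email addresses. Regex-based tag stripping is an assumption made for brevity; it does not decode HTML entities or handle malformed markup, and a production pipeline would more likely use a real HTML parser and a dedicated classifier for sensitive data.

```typescript
// Remove script/style bodies, comments, and remaining tags, then collapse whitespace.
function stripTags(html: string): string {
  return html
    .replace(/<(script|style)\b[^>]*>[\s\S]*?<\/\1>/gi, " ")
    .replace(/<!--[\s\S]*?-->/g, " ")
    .replace(/<[^>]+>/g, " ")
    .replace(/\s+/g, " ")
    .trim();
}

// Replace anything that looks like an email address with a fixed placeholder.
function redactEmails(text: string): string {
  return text.replace(/[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/g, "[REDACTED_EMAIL]");
}

// Run the steps in a fixed order so the same input always yields the same output.
function cleanForIndexing(html: string): string {
  return redactEmails(stripTags(html));
}
```

Keeping the steps as small pure functions makes the normalization version easy to reason about: changing any one of them is a reason to bump it.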
5. Logging Answers Instead of Provenance
Explanation: Storing only the final agent response makes incident investigation impossible. You cannot determine which snapshot, permission state, or normalization version produced a problematic output.
Fix: Log structured ingestion and retrieval events with snapshot IDs, identity grants, and pipeline versions. Maintain a queryable event stream that links retrieval requests to the exact artifacts consumed.
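With retrieval events stored in that shape, the core incident question ("who received content from this snapshot?") becomes a filter over the event stream. The sketch below repeats the RetrievalEvent interface so it is self-contained and uses an in-memory array as a stand-in for whatever event store is actually in place; the function names are hypothetical.

```typescript
// Mirrors the RetrievalEvent interface from Step 4.
interface RetrievalEvent {
  eventType: "QUERY_EXECUTED";
  snapshotIds: string[];
  requestingIdentity: string;
  authorizedPaths: string[];
  deniedPaths: string[];
  embeddingModel: string;
  latencyMs: number;
  timestamp: string;
}

// All retrievals that consumed a given snapshot.
function eventsTouchingSnapshot(events: RetrievalEvent[], snapshotId: string): RetrievalEvent[] {
  return events.filter((e) => e.snapshotIds.includes(snapshotId));
}

// Distinct identities that received content from a given snapshot, sorted for stable output.
function identitiesExposedTo(events: RetrievalEvent[], snapshotId: string): string[] {
  const ids = eventsTouchingSnapshot(events, snapshotId).map((e) => e.requestingIdentity);
  return Array.from(new Set(ids)).sort();
}
```

If only final answers had been logged, neither function could be written, because nothing would tie a response back to the snapshots behind it.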
6. Assuming Vector Similarity Equals Relevance
Explanation: High cosine similarity does not guarantee contextual appropriateness or authorization. Agents may retrieve technically relevant but restricted or outdated content.
Fix: Combine vector retrieval with strict permission filtering and recency weighting. Apply a post-retrieval validation step that checks artifact freshness, permission status, and source reliability before passing context to the model.
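A post-retrieval step along these lines might look like the sketch below. The exponential half-life decay, the default values, and the `Candidate` shape are assumptions chosen for illustration; the article does not prescribe a particular weighting.

```typescript
interface Candidate {
  snapshotId: string;
  similarity: number; // e.g. cosine similarity from the vector store
  asOfTimestamp: string; // ISO 8601 UTC
  authorized: boolean; // result of the retrieval-time permission check
}

interface Ranked {
  snapshotId: string;
  score: number;
}

// Drops unauthorized, too-old, or undatable candidates, then weights similarity by age.
function rerank(candidates: Candidate[], now: Date, halfLifeDays = 30, maxAgeDays = 365): Ranked[] {
  const msPerDay = 86_400_000;
  return candidates
    .filter((c) => c.authorized)
    .map((c) => ({ c, ageDays: (now.getTime() - Date.parse(c.asOfTimestamp)) / msPerDay }))
    // An unparseable timestamp yields NaN, which fails both comparisons and is excluded.
    .filter(({ ageDays }) => ageDays >= 0 && ageDays <= maxAgeDays)
    .map(({ c, ageDays }) => ({
      snapshotId: c.snapshotId,
      score: c.similarity * Math.pow(0.5, ageDays / halfLifeDays),
    }))
    .sort((x, y) => y.score - x.score);
}
```

Filtering on authorization before scoring matters: a restricted document should never appear in the ranked list, no matter how similar it is to the query.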
Production Bundle
Action Checklist
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| Low-risk internal knowledge base | Batch snapshot sync with daily consistency checks | Simplifies pipeline, reduces API rate limit exposure, acceptable staleness | Low (infrequent syncs, minimal compute) |
| High-compliance regulated environment | Event-driven sync with dual-check auth and strict tombstone propagation | Guarantees real-time permission alignment, meets audit requirements | Medium-High (continuous monitoring, dual validation overhead) |
| Multi-agent collaborative workspace | Governed workspace with shared snapshot IDs and path-level grants | Prevents context leakage between agents, enables consistent reasoning boundaries | Medium (centralized storage, permission resolution layer) |
| Rapid prototyping / sandbox | Naive vector ingestion with manual permission checks | Fastest path to validation, acceptable for non-production testing | Low (minimal infrastructure, high technical debt) |
Configuration Template
ingestion_pipeline:
  connectors:
    - name: notion_sync
      scope: ['pages:read', 'databases:read']
      target_paths: ['/workspace/engineering', '/workspace/product']
      sync_interval: '1h'
      normalization_version: 'v2.1'
    - name: slack_sync
      scope: ['channels:history', 'groups:history']
      target_paths: ['/workspace/eng-ops', '/workspace/incidents']
      sync_interval: '5m'
      normalization_version: 'v1.4'
  snapshot_contract:
    id_format: '{source}_{stable_identity}_{timestamp}'
    hash_algorithm: 'sha256'
    as_of_granularity: 'iso8601_utc'
    permission_model: 'dual_layer'
  security:
    sensitive_data_filter: true
    adversarial_content_classifier: true
    fail_closed_on_ambiguous_auth: true
    revocation_propagation: 'immediate'
  logging:
    ingestion_events: 'stream_ingestion'
    retrieval_events: 'stream_retrieval'
    retention_days: 90
    queryable_fields: ['snapshotId', 'identity', 'connectorScopes', 'exclusionReasons']
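The template expresses `sync_interval` as short strings such as '1h' and '5m'. A loader needs to turn those into durations and reject anything malformed. The parser below is a small sketch; the accepted units (s, m, h, d) are an assumption, since the template only shows minutes and hours.

```typescript
// Parses interval strings like '5m' or '1h' into milliseconds.
function parseIntervalMs(interval: string): number {
  const match = /^(\d+)(s|m|h|d)$/.exec(interval.trim());
  if (match === null) {
    throw new Error(`Invalid sync_interval: ${interval}`);
  }
  const unitMs: Record<string, number> = {
    s: 1_000,
    m: 60_000,
    h: 3_600_000,
    d: 86_400_000,
  };
  return Number(match[1]) * unitMs[match[2]];
}

console.log(parseIntervalMs("5m")); // 300000
console.log(parseIntervalMs("1h")); // 3600000
```

Rejecting unknown formats at load time means a typo in the template stops the pipeline from starting instead of silently falling back to a default schedule.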
Quick Start Guide
- Initialize the snapshot contract: Create a TypeScript interface matching the ContextSnapshot structure. Define your PermissionEnvelope with tool and path grants. Generate a deterministic ID scheme using source system, stable identity, and timestamp.
- Deploy the normalizer: Write source-specific extraction functions that convert raw API responses into clean, agent-readable text. Implement format breakers to strip HTML, comments, and unsupported blocks. Hash the normalized output for version tracking.
- Configure least-privilege connectors: Register your SaaS integrations with minimal OAuth scopes. Restrict target paths to specific workspaces, channels, or folders. Enable short-lived tokens and implement revocation endpoints.
- Attach authorization and logging: Integrate the AuthorizationGateway to validate permissions at retrieval time. Emit structured ingestion and retrieval events to separate streams. Include snapshot IDs, identity grants, and exclusion reasons in every log entry.
- Validate with a consistency check: Run a test sync across two sources with different intervals. Verify that snapshot boundaries align, permission envelopes attach correctly, and retrieval filters block unauthorized paths. Confirm that deletion events trigger immediate artifact removal.