How I built a "Bot-Free" AI Super App using Electron, GitNexus, BullMQ, Qdrant & MCP
Architecting a Privacy-First Meeting Intelligence Pipeline: Local Capture, Async Orchestration, and Semantic Ticket Generation
Current Situation Analysis
The modern engineering workflow is heavily fragmented by meeting overhead. The industry standard for capturing meeting intelligence relies on injecting headless browser bots into Zoom, Google Meet, or Teams sessions. This approach introduces three critical failure points that most teams overlook until compliance audits or performance bottlenecks surface.
First, data residency and privacy compliance. Corporate IT security teams routinely block third-party transcription services because routing raw board-meeting audio through external cloud endpoints violates SOC2, HIPAA, or internal data governance policies. The audio stream leaves the corporate perimeter, creating an uncontrolled attack surface.
Second, resource exhaustion during capture. Developers attempting to build local recording solutions often stream uncompressed PCM/WAV data directly into Node.js memory. A two-hour meeting at 48kHz/16-bit stereo generates roughly 1.4GB of raw audio (48,000 samples/s × 2 bytes × 2 channels ≈ 192KB/s). Holding this in RAM while simultaneously encoding and uploading it guarantees heap exhaustion and process crashes in Electron or standard Node environments.
Third, the translation gap between conversation and execution. Even when a transcript is successfully generated, engineering teams spend an average of 30 minutes per session manually parsing dialogue, extracting action items, and formatting them into Jira, Linear, or GitHub Issues. The transcript lacks project continuity; it treats every meeting as an isolated event, forcing developers to repeatedly provide context to AI assistants or manually reference past architectural decisions.
The industry has normalized the bot-injection model because it offloads the heavy lifting to cloud providers. However, this trade-off sacrifices data sovereignty, introduces latency, and leaves the post-meeting workflow entirely manual. A privacy-first architecture requires decoupling capture from processing, leveraging local system APIs, and orchestrating a distributed AI pipeline that maintains semantic continuity across sessions.
WOW Moment: Key Findings
Shifting from cloud-injected bots to a local capture and asynchronous processing model fundamentally changes the cost, security, and accuracy profile of meeting intelligence. The following comparison isolates the architectural trade-offs between the legacy bot model and a local-first, queue-driven pipeline.
| Approach | Data Residency | Processing Latency | Contextual Accuracy | Infrastructure Overhead |
|---|---|---|---|---|
| Cloud Bot Injection | External (Third-party cloud) | High (Network-dependent) | Low (Isolated per session) | High (SaaS subscriptions + API markup) |
| Local Capture + Async Pipeline | Internal (On-prem/Local disk) | Medium (Background queued) | High (Vector-augmented continuity) | Low (BYOK routing + Open-source stack) |
This finding matters because it proves that meeting intelligence does not require surrendering audio data to external providers. By capturing system audio locally, chunking it to disk, and pushing work to a distributed queue, teams retain full data sovereignty while achieving higher ticket accuracy through semantic memory. The pipeline transforms meetings from passive recording events into active engineering workflow triggers.
Core Solution
Building a production-grade meeting intelligence system requires decoupling real-time capture from heavy AI processing. The architecture follows a four-stage pipeline: local audio ingestion, asynchronous job orchestration, multi-modal AI processing, and semantic ticket generation.
Stage 1: Local Audio Ingestion & Chunked Storage
Electron provides native access to system audio streams through `desktopCapturer` and `getUserMedia`. Instead of mixing streams in memory, the application captures the system output (speakers) and microphone input, routes them through a `MediaRecorder` instance, and writes compressed WebM/Opus chunks directly to the local filesystem. This prevents heap bloat and ensures graceful degradation if the network drops.
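A minimal renderer-side sketch of that capture loop follows. The preload bridge (`window.electronAPI.getCaptureSourceId`) is a hypothetical helper that resolves a `desktopCapturer` source id in the main process; constraint handling varies by platform.

```typescript
// renderer: start a capture session and hand off compressed chunks.
async function startMeetingCapture(onChunk: (chunk: Blob) => void): Promise<MediaRecorder> {
  const sourceId: string = await (window as any).electronAPI.getCaptureSourceId();
  const stream = await navigator.mediaDevices.getUserMedia({
    // Chromium-specific 'mandatory' constraints for desktop sources; they are not
    // in the standard TS lib types, hence the cast.
    audio: { mandatory: { chromeMediaSource: 'desktop' } },
    video: { mandatory: { chromeMediaSource: 'desktop', chromeMediaSourceId: sourceId } },
  } as any);
  stream.getVideoTracks().forEach((track) => track.stop()); // audio-only pipeline: discard video
  const recorder = new MediaRecorder(stream, { mimeType: 'audio/webm;codecs=opus' });
  recorder.ondataavailable = (event) => {
    if (event.data.size > 0) onChunk(event.data); // each timeslice arrives already Opus-compressed
  };
  recorder.start(10_000); // flush a chunk every 10s so nothing accumulates in renderer memory
  return recorder;
}
```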
Once a meeting concludes, the chunked files are packaged and uploaded via a multipart stream to the backend. The upload endpoint immediately acknowledges receipt and delegates processing to a background queue.
```typescript
// audio-ingestion.service.ts
import { v4 as uuidv4 } from 'uuid';
import { createWriteStream, openAsBlob } from 'fs';
import { Readable } from 'stream';
import { pipeline } from 'stream/promises';
import { join } from 'path';

export class AudioIngestionService {
  private readonly storageDir = join(process.cwd(), 'data', 'audio_chunks');

  // Append each compressed chunk to its segment file on disk rather than holding it in memory.
  async persistMeetingSegment(meetingId: string, chunkBuffer: Buffer): Promise<string> {
    const segmentId = uuidv4();
    const segmentPath = join(this.storageDir, `${meetingId}_${segmentId}.webm`);
    await pipeline(
      Readable.from(chunkBuffer),
      createWriteStream(segmentPath, { flags: 'a' })
    );
    return segmentPath;
  }

  // Upload the finalized recording as multipart form data; openAsBlob (Node 19.8+)
  // yields a lazily-read Blob backed by the file, so it is never buffered whole.
  async finalizeAndUpload(meetingId: string, filePath: string): Promise<unknown> {
    const uploadPayload = new FormData();
    uploadPayload.append('meetingId', meetingId);
    uploadPayload.append('audioFile', await openAsBlob(filePath), 'meeting.webm');
    const response = await fetch(`${process.env.BACKEND_URL}/api/v1/audio/ingest`, {
      method: 'POST',
      body: uploadPayload,
    });
    if (!response.ok) throw new Error(`Ingestion failed: ${response.statusText}`);
    return response.json();
  }
}
```
Stage 2: Asynchronous Job Orchestration
Processing multi-hour audio files, calling external transcription APIs, and running LLM inference are CPU- and network-intensive tasks. Handling them synchronously blocks the Node.js event loop and drops concurrent connections. A distributed queue solves this by decoupling ingestion from execution.
BullMQ paired with Redis provides job persistence, exponential backoff, dead-letter queues (DLQ), and rate limiting. The backend controller pushes a job descriptor to Redis and returns an immediate acknowledgment (202 Accepted). Workers consume jobs independently, allowing horizontal scaling during peak meeting hours.
```typescript
// meeting-queue.manager.ts
import { Queue, Worker, Job } from 'bullmq';
import IORedis from 'ioredis';
import { executeMeetingPipeline } from './meeting-pipeline'; // pipeline steps live elsewhere (path is illustrative)

// BullMQ requires maxRetriesPerRequest: null so blocking Redis commands are never interrupted.
const redisConnection = new IORedis(process.env.REDIS_URL!, { maxRetriesPerRequest: null });

export const audioProcessingQueue = new Queue('audio-pipeline', {
  connection: redisConnection,
  defaultJobOptions: {
    attempts: 4,
    backoff: { type: 'exponential', delay: 8000 },
    removeOnComplete: 100, // keep the last 100 completed jobs for inspection
    removeOnFail: 50,
  },
});

export const pipelineWorker = new Worker(
  'audio-pipeline',
  async (job: Job) => {
    const { meetingId, audioPath } = job.data;
    // Orchestrate transcription -> biometrics -> LLM routing -> vector storage
    await executeMeetingPipeline(meetingId, audioPath);
  },
  { connection: redisConnection, concurrency: 3 }
);
```
Stage 3: Multi-Modal AI Processing & Dynamic Routing
Transcription alone lacks speaker continuity across sessions. Deepgram handles real-time diarization efficiently, but relies on session-local voice profiles. For cross-meeting speaker verification, a Python microservice running FastAPI and SpeechBrain extracts voice embeddings from audio slices. The Node worker sends segmented audio to this service, which returns a verified speaker identity mapped to internal user records.
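A sketch of the worker-side call might look like the following. The `/verify` route, `BIOMETRICS_URL` variable, response shape, and 0.75 similarity threshold are illustrative assumptions, not a documented service contract.

```typescript
// speaker-verification.client.ts — worker-side call to the biometrics microservice.
import { openAsBlob } from 'fs';

interface SpeakerMatch {
  speakerId: string;  // internal user record id
  similarity: number; // cosine similarity against enrolled embeddings
}

export async function verifySpeaker(segmentPath: string): Promise<SpeakerMatch | null> {
  const form = new FormData();
  form.append('audio', await openAsBlob(segmentPath), 'segment.webm');
  const res = await fetch(`${process.env.BIOMETRICS_URL}/verify`, { method: 'POST', body: form });
  if (!res.ok) throw new Error(`Biometrics service error: ${res.status}`);
  const match = (await res.json()) as SpeakerMatch;
  // Treat below-threshold matches as unknown speakers rather than forcing a label.
  return match.similarity >= 0.75 ? match : null;
}
```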
LLM routing is handled through OpenRouter, acting as a unified gateway. Instead of hardcoding providers, the system routes tasks based on cognitive complexity:
- Context gathering and semantic search: `openai/gpt-4o-mini`
- Structured ticket extraction: `anthropic/claude-opus-4.7`
- Architecture diagram generation: `anthropic/claude-sonnet-4.6`
- Visual feature rendering: `black-forest-labs/flux.2-klein-4b`
Fallback logic uses OpenRouter's provider routing options (the `provider` request field) to automatically switch providers if rate limits or outages occur.
```typescript
// llm-router.service.ts
import OpenAI from 'openai';

export class LlmRouterService {
  private client: OpenAI;

  constructor() {
    // OpenRouter exposes an OpenAI-compatible API behind its own base URL.
    this.client = new OpenAI({
      baseURL: 'https://openrouter.ai/api/v1',
      apiKey: process.env.OPENROUTER_API_KEY,
    });
  }

  async routeTask(taskType: 'context' | 'extraction' | 'diagram' | 'visual', prompt: string) {
    const modelMap = {
      context: 'openai/gpt-4o-mini',
      extraction: 'anthropic/claude-opus-4.7',
      diagram: 'anthropic/claude-sonnet-4.6',
      visual: 'black-forest-labs/flux.2-klein-4b',
    } as const;
    const response = await this.client.chat.completions.create({
      model: modelMap[taskType],
      messages: [{ role: 'user', content: prompt }],
      // Deterministic output for structured extraction; looser sampling elsewhere.
      temperature: taskType === 'extraction' ? 0.2 : 0.7,
      // 'provider' is an OpenRouter routing extension that the OpenAI SDK types
      // don't know about, so the params object is asserted to the SDK's param type.
      provider: { allow_fallbacks: true, order: ['OpenAI', 'Anthropic', 'Google'] },
    } as OpenAI.Chat.Completions.ChatCompletionCreateParamsNonStreaming);
    return response.choices[0].message.content;
  }
}
```
Stage 4: Semantic Memory & Ticket Generation
Raw transcripts cannot generate accurate engineering tickets without project continuity. Qdrant stores chunked, embedded transcript segments. When the extraction LLM runs, it first queries Qdrant for semantically similar past decisions, architectural constraints, and unresolved bugs. This retrieval-augmented context is injected into the prompt, transforming vague dialogue into precise tickets with acceptance criteria.
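A minimal sketch of that retrieval step using the `@qdrant/js-client-rest` client is shown below; the `meeting_memory` collection name and the `embed()` helper are assumptions.

```typescript
// context-retrieval.service.ts — the retrieval-augmented step before extraction.
import { QdrantClient } from '@qdrant/js-client-rest';

const qdrant = new QdrantClient({ url: process.env.QDRANT_URL });

export async function fetchHistoricalContext(
  query: string,
  embed: (text: string) => Promise<number[]>
): Promise<string> {
  const results = await qdrant.search('meeting_memory', {
    vector: await embed(query), // embed the current meeting's topic or summary
    limit: 5,                   // top-k past decisions, constraints, and open bugs
    with_payload: true,
  });
  // Concatenate retrieved transcript chunks into a context block for the prompt.
  return results
    .map((hit) => hit.payload?.text as string | undefined)
    .filter((text): text is string => Boolean(text))
    .join('\n---\n');
}
```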
The final output is pushed to GitHub Issues, Linear, or Jira via their respective APIs. For AI coding assistants, Repomix packs the monorepo into a single Markdown file, allowing the assistant to ingest the entire codebase context alongside the newly generated issue.
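For the GitHub Issues path, the push might look like this Octokit sketch; the owner/repo values and the `Ticket` shape are illustrative placeholders.

```typescript
// ticket-publisher.ts — pushing an extracted ticket to GitHub Issues via Octokit.
import { Octokit } from '@octokit/rest';

interface Ticket {
  title: string;
  body: string;     // includes the acceptance criteria produced by the extraction LLM
  labels: string[];
}

const octokit = new Octokit({ auth: process.env.GITHUB_TOKEN });

export async function publishTicket(ticket: Ticket) {
  return octokit.rest.issues.create({
    owner: 'your-org',
    repo: 'your-repo',
    title: ticket.title,
    body: ticket.body,
    labels: ticket.labels,
  });
}
```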
Pitfall Guide
1. Event Loop Blocking During Audio Upload
Explanation: Processing large audio uploads inline in Express route handlers ties up requests and buffers entire files in memory, causing timeout errors for concurrent users. Fix: Return an immediate acknowledgment and push processing to a message queue. Use multipart streams with chunked reading so files are never held in memory whole.
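A sketch of that ack-then-queue controller, assuming Express with multer for disk-backed multipart handling and the queue from Stage 2:

```typescript
// ingest.controller.ts — acknowledge immediately, process in the background.
import { Router } from 'express';
import multer from 'multer';
import { audioProcessingQueue } from './meeting-queue.manager';

const upload = multer({ dest: 'data/uploads/' }); // multer streams the upload to disk
export const ingestRouter = Router();

ingestRouter.post('/api/v1/audio/ingest', upload.single('audioFile'), async (req, res) => {
  // Enqueue the heavy pipeline; workers pick the job up from Redis independently.
  await audioProcessingQueue.add('process-meeting', {
    meetingId: req.body.meetingId,
    audioPath: req.file?.path,
  });
  res.status(202).json({ status: 'queued', meetingId: req.body.meetingId });
});
```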
2. Speaker Drift in Diarization
Explanation: Relying solely on session-local diarization causes the same person to be labeled as different speakers across meetings, breaking continuity. Fix: Implement a voice biometrics microservice using SpeechBrain. Extract embeddings from known speaker samples and match them against incoming segments using cosine similarity thresholds.
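The matching step itself reduces to a plain vector comparison; a minimal dependency-free sketch:

```typescript
// Cosine similarity between two voice embeddings; the threshold check above
// is a comparison against this score.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
```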
3. Hardcoded LLM Provider Dependencies
Explanation: Direct API calls to single providers fail during rate limits, regional outages, or pricing changes, stalling the entire pipeline. Fix: Route all inference through a unified gateway like OpenRouter. Configure fallback chains and dynamic model selection based on task complexity rather than defaulting to the most expensive model.
4. Vector Chunk Boundary Misalignment
Explanation: Splitting transcripts at arbitrary character counts breaks semantic context, causing Qdrant to retrieve fragmented or irrelevant historical data. Fix: Chunk by speaker turns or natural pause intervals. Use a sliding window with overlap (e.g., 15% overlap) to preserve context boundaries before embedding.
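A sketch of speaker-turn chunking with a ~15% tail overlap; the `Turn` shape is an assumed simplification of the diarization output.

```typescript
// chunker.ts — chunk by speaker turns, carrying an overlap across boundaries.
interface Turn {
  speaker: string;
  text: string;
}

export function chunkTranscript(turns: Turn[], maxChars = 1200, overlapRatio = 0.15): string[] {
  const chunks: string[] = [];
  let current: string[] = [];
  let length = 0;
  for (const turn of turns) {
    const line = `${turn.speaker}: ${turn.text}`;
    if (length + line.length > maxChars && current.length > 0) {
      const chunk = current.join('\n');
      chunks.push(chunk);
      // Carry the tail of the previous chunk forward so boundaries stay in context.
      const tail = chunk.slice(-Math.floor(maxChars * overlapRatio));
      current = [tail];
      length = tail.length;
    }
    current.push(line);
    length += line.length;
  }
  if (current.length > 0) chunks.push(current.join('\n'));
  return chunks;
}
```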
5. Type Definition Drift Across Platforms
Explanation: Manually syncing interfaces between backend, web, mobile, and desktop apps leads to runtime type mismatches and silent failures.
Fix: Establish a single source of truth (the Prisma schema). Generate OpenAPI specs via TSOA, then run `openapi-typescript` in CI to auto-generate `api.d.ts` for all frontends. Fail builds if types diverge.
6. MCP Context Window Saturation
Explanation: Feeding an entire monorepo into an AI coding assistant via MCP overwhelms context windows, causing hallucination and degraded code quality.
Fix: Use impact-aware indexing. Tools like GitNexus should inject only files relevant to the requested change scope. Filter out `node_modules`, build artifacts, and test fixtures before packing.
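An illustrative scope filter applied before packing repository context; the predicate and patterns below are assumptions, not a GitNexus or Repomix API.

```typescript
// pack-filter.ts — drop noise, then keep only files under the change scope.
const EXCLUDED = [
  /(^|\/)node_modules\//,
  /(^|\/)(dist|build|coverage)\//,
  /(^|\/)__fixtures__\//,
  /\.(png|jpg|lock)$/,
];

export function isRelevantToScope(filePath: string, scopePrefixes: string[]): boolean {
  if (EXCLUDED.some((pattern) => pattern.test(filePath))) return false;
  // Impact-aware indexing: inject only files under the requested change scope.
  return scopePrefixes.some((prefix) => filePath.startsWith(prefix));
}
```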
7. Ignoring Rate Limit Backpressure
Explanation: Bursting API calls to Deepgram, Jira, or LLM providers without throttling triggers 429 errors and job failures. Fix: Configure BullMQ concurrency limits and implement token bucket rate limiting at the worker level. Route failed external calls to a DLQ with exponential backoff and alerting.
Production Bundle
Action Checklist
- Configure Electron `desktopCapturer` with chunked WebM/Opus output to prevent heap exhaustion
- Set up BullMQ with Redis connection pooling, exponential backoff, and DLQ routing
- Deploy the SpeechBrain FastAPI service for cross-session voice embedding verification
- Implement the OpenRouter gateway with provider fallback chains and BYOK workspace keys
- Initialize the Qdrant collection with speaker-turn chunking and 15% semantic overlap
- Automate the Prisma → TSOA → OpenAPI → `openapi-typescript` sync in the CI/CD pipeline
- Integrate Repomix monorepo packing with scope-aware filtering for AI coding assistants
- Add monitoring for queue depth, LLM token consumption, and vector search latency (see the sketch after this list)
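A sketch of queue-depth sampling via BullMQ's `getJobCounts`; wiring the numbers into a metrics backend (e.g., a Prometheus gauge) is left as an assumption.

```typescript
// queue-metrics.ts — periodic queue-depth check for the monitoring item above.
import { meetingQueue } from './bullmq.config';

export async function reportQueueDepth(): Promise<void> {
  const counts = await meetingQueue.getJobCounts('waiting', 'active', 'failed', 'delayed');
  // Alert when the waiting count grows faster than workers drain it.
  console.log(
    `waiting=${counts.waiting} active=${counts.active} failed=${counts.failed} delayed=${counts.delayed}`
  );
}
```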
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| High-compliance enterprise | Local capture + on-prem Redis/Qdrant | Keeps audio and embeddings within corporate perimeter | Higher infra, zero data egress fees |
| Startup / MVP | Cloud bot + direct OpenAI/Anthropic APIs | Faster implementation, lower initial engineering overhead | High SaaS markup, vendor lock-in risk |
| Multi-platform team | Prisma → TSOA → OpenAPI sync pipeline | Guarantees type safety across Electron, React, React Native | Minimal CI compute, eliminates runtime type bugs |
| Heavy AI coding workflow | MCP with impact-aware indexing + Repomix | Prevents context window saturation, improves PR quality | Low token waste, higher developer velocity |
Configuration Template
```typescript
// bullmq.config.ts
import { Queue, Worker } from 'bullmq';
import IORedis from 'ioredis';

const redis = new IORedis(process.env.REDIS_URL!, {
  maxRetriesPerRequest: null,
  retryStrategy: (times) => Math.min(times * 50, 2000),
});

export const meetingQueue = new Queue('meeting-pipeline', {
  connection: redis,
  defaultJobOptions: {
    attempts: 3,
    backoff: { type: 'exponential', delay: 10000 },
    removeOnComplete: { age: 86400, count: 500 }, // prune completed jobs after 24h / 500 entries
    removeOnFail: { age: 604800 },                // keep failures for 7 days of forensics
  },
});

export const pipelineWorker = new Worker(
  'meeting-pipeline',
  async (job) => {
    const { meetingId, audioPath } = job.data;
    // Pipeline execution logic
  },
  {
    connection: redis,
    concurrency: parseInt(process.env.WORKER_CONCURRENCY || '2', 10),
    limiter: { max: 10, duration: 60000 }, // throttle external calls: 10 jobs/minute
  }
);
```
```typescript
// openrouter.routing.ts
export const MODEL_ROUTES = {
  context_retrieval: { model: 'openai/gpt-4o-mini', temp: 0.3 },
  ticket_extraction: { model: 'anthropic/claude-opus-4.7', temp: 0.1 },
  architecture_diagram: { model: 'anthropic/claude-sonnet-4.6', temp: 0.4 },
  visual_generation: { model: 'black-forest-labs/flux.2-klein-4b', temp: 0.8 },
} as const;
```
Quick Start Guide
- Initialize Infrastructure: Run Redis and Qdrant via Docker Compose. Configure environment variables for `REDIS_URL`, `QDRANT_URL`, and `OPENROUTER_API_KEY`.
- Deploy Worker Service: Install BullMQ dependencies, configure the queue with exponential backoff and DLQ routing, and start the worker process with concurrency limits matching your CPU cores.
- Connect Transcription & Biometrics: Set up Deepgram API access for diarization. Deploy the SpeechBrain FastAPI microservice and verify voice embedding extraction against known speaker samples.
- Sync Type Definitions: Run `prisma generate`, trigger TSOA to output `swagger.json`, and execute `openapi-typescript swagger.json -o src/types/api.d.ts`. Commit the generated types to version control.
- Test End-to-End Flow: Record a local meeting segment, trigger the multipart upload, verify queue job creation, monitor LLM routing fallbacks, and confirm ticket generation in your issue tracker. Validate vector retrieval by querying past meeting context.
