Architecting Reliable Data Pipelines on the AT Protocol: A Production Guide to Bluesky Integration

Current Situation Analysis

Social data extraction has historically been a defensive engineering exercise. Teams routinely allocate significant development cycles to manage rotating proxy pools, bypass JavaScript rendering challenges, and navigate opaque enterprise pricing tiers. The underlying industry assumption has been that social platforms treat public content as a guarded commercial asset, requiring complex authentication flows and strict rate-limiting policies.

Bluesky breaks with this model by operating on the AT Protocol, an open architecture that treats public posts as machine-readable by default. With over 40 million accounts, the platform exposes a clean REST interface explicitly designed for third-party consumption. Despite this architectural shift, many engineering teams still approach Bluesky with legacy scraping tactics and miss the protocol's native routing rules and authentication boundaries. The oversight stems from ingrained workflows built around walled-garden APIs, where unauthenticated access is either restricted or heavily throttled.

In reality, the AT Protocol enforces a strict separation between public read operations and authenticated queries. This design eliminates the need for subscription tiers, legal-compliance overhead, or reverse-engineering DOM structures. However, the routing logic is non-negotiable: unauthenticated requests must target the public gateway, while authenticated calls require a separate base URL. Misunderstanding this split causes immediate 403 responses and pipeline failures. Recognizing and implementing this architectural boundary is the foundational step toward building sustainable, production-grade data extraction systems on the platform.

WOW Moment: Key Findings

The operational overhead of traditional social data pipelines contrasts sharply with the AT Protocol's design. When engineering teams migrate from legacy scraping stacks to Bluesky's native endpoints, the reduction in infrastructure complexity and cost becomes immediately measurable.

Approach	Auth Overhead	Rate Limit Predictability	Data Normalization Effort	Cost per 1k Records
Legacy Social Scraping	High (OAuth2 + token rotation + proxy auth)	Opaque (dynamic throttling, CAPTCHA triggers)	High (DOM parsing, HTML sanitization, layout drift)	$12–$45 (proxy + compute + enterprise API)
AT Protocol Extraction	Low (App Password → JWT, single endpoint split)	Transparent (HTTP headers, documented limits)	Low (structured JSON, consistent schema)	$3.25 for a single 1,000-post run ($0.25/run + $0.003/post)

This finding matters because it shifts data extraction from a maintenance-heavy operation to a deterministic pipeline. Engineers can allocate resources toward data transformation, enrichment, and downstream analytics rather than fighting anti-bot defenses or managing proxy infrastructure. The predictable cost structure and structured JSON responses also simplify compliance auditing and schema versioning, making the platform viable for enterprise-grade social listening, brand monitoring, and lead generation workflows.
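
The cost figure follows from the two pricing components quoted in the table: one run of 1,000 posts comes to 0.25 + 1000 × 0.003 = $3.25, and splitting the same volume across several runs adds one run fee per run. A minimal sketch of that arithmetic, assuming the flat per-run and per-post fees apply exactly as listed (the function name estimateRunCostUsd is hypothetical):

```typescript
// Pricing figures taken from the comparison table; treat them as assumptions.
const RUN_FEE_USD = 0.25;
const PER_POST_FEE_USD = 0.003;

// Estimated cost in USD of one run that returns `postCount` posts.
function estimateRunCostUsd(postCount: number): number {
  if (!Number.isInteger(postCount) || postCount < 0) {
    throw new RangeError('postCount must be a non-negative integer');
  }
  // Work in thousandths of a dollar to avoid floating-point drift.
  const runFee = Math.round(RUN_FEE_USD * 1000);
  const perPostFee = Math.round(PER_POST_FEE_USD * 1000);
  return (runFee + postCount * perPostFee) / 1000;
}
```

The helper is only useful for budgeting; actual billing depends on the deployment platform's current pricing.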

Core Solution

Building a reliable extraction pipeline on the AT Protocol requires three architectural decisions: endpoint routing, token lifecycle management, and output normalization. The following implementation demonstrates a production-ready TypeScript client that handles these concerns systematically.

1. Endpoint Routing & Token Management

The AT Protocol enforces a strict boundary between public and authenticated traffic. Unauthenticated reads target public.api.bsky.app, while authenticated requests must route through bsky.social. Authentication is performed against bsky.social, which returns a JWT. That token is then attached exclusively to subsequent authenticated calls.

// Requires Node 18+ for the global fetch API; no imports are needed.

interface AuthCredentials {
  identifier: string;
  appPassword: string;
}

interface TokenResponse {
  accessJwt: string;
  refreshJwt: string;
  handle: string;
}

class BlueskyApiClient {
  private readonly publicBase = 'https://public.api.bsky.app';
  private readonly authBase = 'https://bsky.social';
  private accessToken: string | null = null;

  constructor(private credentials?: AuthCredentials) {}

  async authenticate(): Promise<void> {
    if (!this.credentials) throw new Error('Credentials required for auth');
    
    const payload = {
      identifier: this.credentials.identifier,
      password: this.credentials.appPassword,
    };

    const response = await fetch(`${this.authBase}/xrpc/com.atproto.server.createSession`, {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify(payload),
    });

    if (!response.ok) {
      throw new Error(`Auth failed: ${response.status} ${response.statusText}`);
    }

    const data = await response.json() as TokenResponse;
    this.accessToken = data.accessJwt;
  }

  private getBaseUrl(requireAuth: boolean): string {
    if (requireAuth && !this.accessToken) {
      throw new Error('Authenticated endpoint requested without valid token');
    }
    return requireAuth ? this.authBase : this.publicBase;
  }

  async request<T>(endpoint: string, requireAuth: boolean = false, params?: Record<string, string>): Promise<T> {
    const baseUrl = this.getBaseUrl(requireAuth);
    const url = new URL(`${baseUrl}/xrpc/${endpoint}`);
    
    if (params) {
      Object.entries(params).forEach(([key, value]) => url.searchParams.append(key, value));
    }

    // GET requests carry no body, so declare the expected response type rather than a Content-Type.
    const headers: Record<string, string> = { Accept: 'application/json' };
    if (requireAuth && this.accessToken) {
      headers['Authorization'] = `Bearer ${this.accessToken}`;
    }

    const res = await fetch(url.toString(), { headers });
    if (!res.ok) {
      throw new Error(`Request failed: ${res.status} at ${url.pathname}`);
    }
    return res.json() as Promise<T>;
  }
}

Why this structure? Separating base URLs at the client level prevents accidental token leakage to the public gateway. The requireAuth flag enforces routing discipline at runtime: getBaseUrl throws before any authenticated request is sent without a token. The centralized request method standardizes error handling and header injection.
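
The client above stores only the access token, although the session response also carries a refresh token. A minimal sketch of the refresh step, assuming the com.atproto.server.refreshSession endpoint is called with the refresh token as the bearer credential; the names refreshSession, FetchLike, and SessionTokens are hypothetical, and the fetch implementation is passed in so the function can be exercised without network access:

```typescript
interface SessionTokens {
  accessJwt: string;
  refreshJwt: string;
}

// Minimal fetch signature so a stub can be injected in tests.
type FetchLike = (
  url: string,
  init: { method: string; headers: Record<string, string> },
) => Promise<{ ok: boolean; status: number; json(): Promise<unknown> }>;

async function refreshSession(
  refreshJwt: string,
  fetchImpl: FetchLike,
  authBase: string = 'https://bsky.social',
): Promise<SessionTokens> {
  const res = await fetchImpl(`${authBase}/xrpc/com.atproto.server.refreshSession`, {
    method: 'POST',
    // The refresh token, not the access token, authorizes this call.
    headers: { Authorization: `Bearer ${refreshJwt}` },
  });
  if (!res.ok) {
    throw new Error(`Refresh failed: ${res.status}`);
  }
  const data = (await res.json()) as SessionTokens;
  // Both tokens are replaced; the old refresh token should be discarded.
  return { accessJwt: data.accessJwt, refreshJwt: data.refreshJwt };
}
```

In production the global fetch can be passed as fetchImpl. How the client detects an expired access token (status code and error body) should be confirmed against the live API before wiring in an automatic retry.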

2. Data Extraction Patterns

The platform supports three primary extraction modes: keyword search, author timeline retrieval, and conversation thread resolution. Each requires distinct parameter handling and response parsing.

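// Shape of a post view as returned by searchPosts, getAuthorFeed, and getPostThread.
// The user-authored record sits under `record`; engagement counts and the
// resolved embed view live on the post view itself.
interface PostRecord {
  uri: string;
  cid: string;
  author: AuthorInfo;
  record: {
    text: string;
    createdAt: string;
    reply?: { root: { uri: string }; parent: { uri: string } };
  };
  embed?: {
    images?: Array<{ alt: string; thumb: string; fullsize: string }>;
    external?: { uri: string; title: string; description: string };
  };
  likeCount?: number;
  repostCount?: number;
  replyCount?: number;
}

interface AuthorInfo {
  did: string;
  handle: string;
  displayName?: string;
}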

class BlueskyDataExtractor {
  constructor(private client: BlueskyApiClient) {}

  async searchPosts(query: string, options: {
    lang?: string;
    since?: string;
    until?: string;
    sort?: 'top' | 'latest';
    limit?: number;
  }): Promise<PostRecord[]> {
    const params: Record<string, string> = { q: query, limit: String(options.limit || 25) };
    if (options.lang) params.lang = options.lang;
    if (options.since) params.since = options.since;
    if (options.until) params.until = options.until;
    if (options.sort) params.sort = options.sort;

    const data = await this.client.request<{ posts: PostRecord[] }>(
      'app.bsky.feed.searchPosts',
      true,
      params
    );
    return data.posts;
  }

  async fetchAuthorFeed(authorHandle: string, limit: number = 50): Promise<PostRecord[]> {
    const data = await this.client.request<{ feed: Array<{ post: PostRecord }> }>(
      'app.bsky.feed.getAuthorFeed',
      false,
      { actor: authorHandle, limit: String(limit) }
    );
    return data.feed.map(item => item.post);
  }

  async resolveThread(rootUri: string): Promise<PostRecord[]> {
    const data = await this.client.request<{ thread: any }>(
      'app.bsky.feed.getPostThread',
      false,
      { uri: rootUri }
    );
    return this.flattenThreadTree(data.thread);
  }

  private flattenThreadTree(node: any): PostRecord[] {
    const results: PostRecord[] = [];
    const traverse = (current: any) => {
      if (current.post) {
        results.push(current.post);
      }
      if (current.replies && Array.isArray(current.replies)) {
        current.replies.forEach((reply: any) => traverse(reply));
      }
    };
    traverse(node);
    return results;
  }
}

Why depth-first flattening? The AT Protocol returns conversation trees as nested JSON objects, while downstream analytics and database ingestion typically require linear records. A recursive depth-first traversal produces a flat array in which every reply appears after its parent, suitable for CSV/JSON export or stream processing. The hierarchy itself remains recoverable from each record's reply.parent reference, and strictly chronological output requires an additional sort on createdAt.
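
Depth-first output lists each reply after its parent, which is not the same as timestamp order. A minimal sketch of the extra sort, assuming each flattened post has been reduced to its uri and a valid ISO 8601 createdAt string (the TimedPost shape and the function name are hypothetical):

```typescript
interface TimedPost {
  uri: string;
  createdAt: string; // assumed to be a valid ISO 8601 timestamp
}

// Returns a new array ordered oldest-first; the input array is left untouched.
function sortChronologically<T extends TimedPost>(posts: T[]): T[] {
  return [...posts].sort(
    (a, b) => Date.parse(a.createdAt) - Date.parse(b.createdAt),
  );
}
```

Because Array.prototype.sort is stable, posts with identical timestamps keep their depth-first order.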

3. Monorepo Bundling Strategy

When operating within a TypeScript monorepo using npm workspaces, shared utilities (logging, retry logic, schema validators) are typically referenced via workspace aliases. However, deployment environments like Apify's build servers do not resolve workspace dependencies. They only package the target actor directory.

The solution is to copy shared source files into each actor's src/shared/ directory during the build phase, then bundle everything into a single executable file. This preserves a single source of truth in the repository while ensuring deployment isolation.

// tsup.config.ts
import { defineConfig } from 'tsup';

export default defineConfig({
  entry: ['src/main.ts'],
  format: ['cjs'],
  target: 'node18',
  outDir: 'dist',
  clean: true,
  bundle: true,
  splitting: false,
  noExternal: ['@apify-actors/shared'],
  banner: {
    js: '/* Bluesky Data Pipeline - Production Bundle */',
  },
});

Why tsup with noExternal? tsup leverages esbuild for fast compilation. Setting noExternal forces the bundler to inline workspace dependencies, eliminating runtime module resolution failures. The single dist/main.js output guarantees deterministic execution across isolated CI/CD runners.

Pitfall Guide

1. Cross-Endpoint Authentication Leakage

Explanation: Sending JWT tokens to public.api.bsky.app triggers immediate 403 responses. The public gateway is fronted by Cloudflare and explicitly rejects authenticated headers. Fix: Implement a routing guard that validates the target base URL against the requireAuth flag before attaching headers. Never reuse the same client instance for mixed traffic without explicit URL switching.
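
The routing guard can be isolated as a pure function that decides the base URL and the headers together, so a token can never be attached to a public route. This is a sketch under the assumption that the two base URLs from the client above are the only targets; routeRequest and RoutedRequest are hypothetical names.

```typescript
const PUBLIC_BASE = 'https://public.api.bsky.app';
const AUTH_BASE = 'https://bsky.social';

interface RoutedRequest {
  url: string;
  headers: Record<string, string>;
}

function routeRequest(
  endpoint: string,
  requireAuth: boolean,
  accessToken: string | null,
): RoutedRequest {
  if (requireAuth) {
    if (!accessToken) {
      throw new Error('Authenticated endpoint requested without a token');
    }
    return {
      url: `${AUTH_BASE}/xrpc/${endpoint}`,
      headers: { Authorization: `Bearer ${accessToken}` },
    };
  }
  // Public route: never attach the token, even when one is available.
  return { url: `${PUBLIC_BASE}/xrpc/${endpoint}`, headers: {} };
}
```

Keeping the URL and header decision in one place makes the invariant easy to unit test.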

2. Thread Depth Recursion Limits

Explanation: Deeply nested conversation trees can exhaust the call stack when traversed recursively. JavaScript engines generally do not apply tail-call optimization, and the traversal shown earlier is not tail-recursive in any case. Fix: Convert the recursive traversal to an iterative, stack-based approach when processing threads exceeding 50 replies. Monitor memory allocation and implement chunked processing for viral conversations.
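
A minimal sketch of the iterative variant, assuming thread nodes carry an optional post and an optional replies array as in the recursive version. The ThreadNode shape, the depth annotation, and the maxDepth default of 100 (mirroring the configuration template) are illustrative choices.

```typescript
interface ThreadNode {
  post?: { uri: string };
  replies?: ThreadNode[];
}

interface FlatEntry {
  uri: string;
  depth: number;
}

function flattenThreadIterative(root: ThreadNode, maxDepth: number = 100): FlatEntry[] {
  const results: FlatEntry[] = [];
  // Explicit stack replaces the call stack, so nesting depth cannot overflow it.
  const stack: Array<{ node: ThreadNode; depth: number }> = [{ node: root, depth: 0 }];

  while (stack.length > 0) {
    const { node, depth } = stack.pop()!;
    if (node.post) {
      results.push({ uri: node.post.uri, depth });
    }
    if (depth >= maxDepth || !Array.isArray(node.replies)) continue;
    // Push in reverse so the first reply is popped, and therefore visited, first.
    for (let i = node.replies.length - 1; i >= 0; i--) {
      stack.push({ node: node.replies[i], depth: depth + 1 });
    }
  }
  return results;
}
```

The output order matches the recursive depth-first traversal: each parent is followed by its replies in their original order.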

3. Workspace Resolution in Isolated CI Environments

Explanation: npm workspaces resolve dependencies at install time, but deployment platforms often skip npm install for workspace roots, breaking alias imports. Fix: Add a pre-build script that copies shared modules into the actor directory. Verify bundle output with node --check dist/main.js before deployment.
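
A minimal sketch of such a pre-build step, assuming Node 16.7 or later for fs.cpSync. The function name copySharedSources and any directory layout passed to it are hypothetical and would need to match the actual repository.

```typescript
import { cpSync, existsSync, mkdirSync, rmSync } from 'node:fs';
import { join } from 'node:path';

// Copies the shared source tree into <actorDir>/src/shared and returns the target path.
function copySharedSources(sharedSrcDir: string, actorDir: string): string {
  if (!existsSync(sharedSrcDir)) {
    throw new Error(`Shared source directory not found: ${sharedSrcDir}`);
  }
  const target = join(actorDir, 'src', 'shared');
  // Remove any stale copy first so deleted shared files do not linger.
  rmSync(target, { recursive: true, force: true });
  mkdirSync(join(actorDir, 'src'), { recursive: true });
  cpSync(sharedSrcDir, target, { recursive: true });
  return target;
}
```

The copied directory is a build artifact, so it is usually worth excluding from version control to keep the repository's single source of truth.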

4. Media Payload Bloat

Explanation: Fetching full-size image URLs alongside thread data increases payload size by 300-500%, slowing serialization and increasing storage costs. Fix: Request only thumbnail URLs during extraction. Resolve full-size assets asynchronously in a downstream enrichment pipeline, or omit media entirely if text analysis is the primary goal.
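
One way to apply this is to drop the full-size URL after the response arrives and before serialization. The sketch below assumes image views expose thumb, fullsize, and alt as flat string fields; the ImageView and SlimImage shapes and the function name slimImages are hypothetical.

```typescript
interface ImageView {
  thumb: string;
  fullsize: string;
  alt: string;
}

interface SlimImage {
  thumb: string;
  alt: string;
}

// Keeps only the thumbnail URL and alt text; a missing image list becomes an empty array.
function slimImages(images: ImageView[] | undefined): SlimImage[] {
  return (images ?? []).map(({ thumb, alt }) => ({ thumb, alt }));
}
```

A downstream enrichment job can still fetch full-size assets later if it stores the post URI alongside each slim record.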

5. URI Parsing Misalignment

Explanation: AT Protocol URIs follow the format at://did:plc:xxxxx/app.bsky.feed.post/xxxxx. Developers often attempt to extract handles directly from URIs, which is impossible without a DID resolution step. Fix: Always fetch the handle and displayName from the author object returned alongside the post. Never derive identity from the URI string.
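
What the URI does yield is the repository authority (normally a DID), the collection, and the record key. A minimal sketch of splitting a record-level URI into those parts, with the helper name parseAtUri and the AtUriParts shape being hypothetical:

```typescript
interface AtUriParts {
  authority: string; // usually a DID such as did:plc:...
  collection: string;
  rkey: string;
}

function parseAtUri(uri: string): AtUriParts {
  // Expect exactly three path segments after the at:// scheme.
  const match = /^at:\/\/([^/]+)\/([^/]+)\/([^/]+)$/.exec(uri);
  if (!match) {
    throw new Error(`Not a record-level AT URI: ${uri}`);
  }
  return { authority: match[1], collection: match[2], rkey: match[3] };
}
```

The authority is an identifier, not a handle, so display names still have to come from the author object returned with the post.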

6. Silent Schema Drift

Explanation: The AT Protocol evolves rapidly. Optional fields like embed or replyCount may be omitted in edge cases, causing downstream type errors. Fix: Implement runtime schema validation using Zod or io-ts. Mark such fields as optional in your interfaces, read them with optional chaining, and provide default values for missing metrics before serialization.
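
To stay dependency-free, the sketch below uses a hand-written validator in place of Zod or io-ts; a schema library would express the same checks declaratively. It assumes the post text and timestamp sit under a record field and that counts sit on the post itself. The NormalizedPost shape and function names are hypothetical.

```typescript
interface NormalizedPost {
  uri: string;
  text: string;
  createdAt: string;
  likeCount: number;
  repostCount: number;
  replyCount: number;
}

function isObject(value: unknown): value is Record<string, unknown> {
  return typeof value === 'object' && value !== null;
}

// Missing or malformed counts default to 0 instead of propagating undefined.
function asCount(value: unknown): number {
  return typeof value === 'number' && Number.isFinite(value) && value >= 0 ? value : 0;
}

// Returns null when required fields are absent, so callers can skip or log the record.
function normalizePost(raw: unknown): NormalizedPost | null {
  if (!isObject(raw) || typeof raw.uri !== 'string') return null;
  const record: Record<string, unknown> = isObject(raw.record) ? raw.record : {};
  if (typeof record.text !== 'string' || typeof record.createdAt !== 'string') return null;
  return {
    uri: raw.uri,
    text: record.text,
    createdAt: record.createdAt,
    likeCount: asCount(raw.likeCount),
    repostCount: asCount(raw.repostCount),
    replyCount: asCount(raw.replyCount),
  };
}
```

Returning null rather than throwing lets a batch continue past a single malformed record.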

7. Missing Backoff Headers

Explanation: Assuming static rate limits leads to pipeline crashes during traffic spikes. The platform returns Retry-After and X-RateLimit-Remaining headers. Fix: Parse response headers and implement exponential backoff with jitter. Cache Retry-After values and pause queue processing until the window expires.
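
A minimal sketch of the delay calculation, assuming Retry-After arrives in its delta-seconds form and that full jitter (a uniform random delay up to the exponential ceiling) is acceptable. The function name, default values, and injectable random source are illustrative.

```typescript
function computeDelayMs(
  attempt: number, // 0 for the first retry
  retryAfter: string | null, // raw Retry-After header value, if any
  baseMs: number = 1000,
  maxMs: number = 60_000,
  random: () => number = Math.random,
): number {
  // A numeric Retry-After from the server takes precedence over local backoff.
  if (retryAfter !== null && retryAfter.trim() !== '') {
    const seconds = Number(retryAfter);
    if (Number.isFinite(seconds) && seconds >= 0) {
      return seconds * 1000;
    }
    // HTTP-date values are not parsed here and fall through to backoff.
  }
  // Exponential ceiling: base * 2^attempt, capped at maxMs.
  const ceiling = Math.min(maxMs, baseMs * 2 ** attempt);
  // Full jitter spreads retries uniformly over [0, ceiling).
  return Math.floor(random() * ceiling);
}
```

A caller would read the header with response.headers.get('Retry-After') and wait for the returned number of milliseconds before retrying.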

Production Bundle

Action Checklist

Configure endpoint routing guard to separate public and authenticated traffic
Implement JWT lifecycle management with automatic refresh on 401 responses
Add iterative thread flattening to prevent stack overflow on viral conversations
Set up pre-build workspace copying script for monorepo deployment compatibility
Integrate runtime schema validation to handle optional AT Protocol fields
Implement header-aware rate limiting with exponential backoff and jitter
Validate bundle output locally before pushing to deployment platform
Configure downstream export pipeline (JSON/CSV) with streaming backpressure

Decision Matrix

Scenario	Recommended Approach	Why	Cost Impact
High-volume keyword monitoring	Authenticated search with date partitioning	Search endpoint requires JWT; partitioning reduces payload size	$0.25/run + $0.003/post
Competitor timeline tracking	Unauthenticated author feed	No token required; lower latency and simpler auth flow	$0.25/run + $0.003/post
Viral conversation analysis	Thread resolution with iterative flattening	Preserves reply hierarchy while enabling linear analytics	$0.25/run + $0.003/post
Monorepo with shared utilities	Pre-build copy + tsup bundling	Eliminates workspace resolution failures in isolated CI	Zero additional cost
Enterprise data warehouse sync	Stream-based JSON export with schema validation	Ensures type safety and prevents pipeline breaks on schema drift	Minimal compute overhead

Configuration Template

# .env.example
BSKY_IDENTIFIER=your-handle.bsky.social
BSKY_APP_PASSWORD=xxxx-xxxx-xxxx-xxxx
OUTPUT_FORMAT=json
MAX_THREAD_DEPTH=100
REQUEST_TIMEOUT_MS=15000
RETRY_ATTEMPTS=3

// pipeline.config.ts
export const PipelineConfig = {
  endpoints: {
    public: 'https://public.api.bsky.app',
    authenticated: 'https://bsky.social',
  },
  extraction: {
    searchLimit: 25,
    authorFeedLimit: 50,
    threadMaxDepth: 100,
  },
  output: {
    format: 'json' as const,
    includeMedia: false,
    flattenThreads: true,
  },
  resilience: {
    timeout: 15000,
    retries: 3,
    backoffBase: 1000,
    jitter: true,
  },
};

Quick Start Guide

Initialize the project: Create a new TypeScript directory, install tsup, zod, and dotenv. Configure tsconfig.json with module: NodeNext and target: ES2022.
Set up authentication: Generate an App Password from your Bluesky account settings. Store it in .env and wire it to the client's authentication method.
Build the extraction logic: Implement the routing guard, token manager, and data extraction methods. Add schema validation for post records and thread nodes.
Bundle and validate: Run tsup to generate dist/main.js. Execute node --check dist/main.js to verify syntax; this check does not execute the file, so confirm module resolution with a short local test run.
Deploy and monitor: Push the actor directory to your deployment platform. Configure environment variables, trigger a test run, and verify output schema alignment with downstream consumers.
