How I built a Bluesky scraper using the AT Protocol API (and published it on Apify)
Architecting Reliable Data Pipelines on the AT Protocol: A Production Guide to Bluesky Integration
Current Situation Analysis
Social data extraction has historically been a defensive engineering exercise. Teams routinely allocate significant development cycles to manage rotating proxy pools, bypass JavaScript rendering challenges, and navigate opaque enterprise pricing tiers. The underlying industry assumption has been that social platforms treat public content as a guarded commercial asset, requiring complex authentication flows and strict rate-limiting policies.
Bluesky fundamentally disrupts this paradigm by operating on the AT Protocol, an open architecture that treats public posts as machine-readable by default. With over 40 million accounts, the platform exposes a clean HTTP interface (XRPC endpoints under /xrpc/) explicitly designed for third-party consumption. Despite this architectural shift, many engineering teams still approach Bluesky with legacy scraping tactics, missing the protocol's native routing rules and authentication boundaries. The oversight stems from ingrained workflows built around walled-garden APIs, where unauthenticated access is either restricted or heavily throttled.
In reality, the AT Protocol enforces a strict separation between public read operations and authenticated queries. This design eliminates the need for subscription tiers, legal-compliance overhead, or reverse-engineering DOM structures. However, the routing logic is non-negotiable: unauthenticated requests must target the public gateway, while authenticated calls require a separate base URL. Misunderstanding this split causes immediate 403 responses and pipeline failures. Recognizing and implementing this architectural boundary is the foundational step toward building sustainable, production-grade data extraction systems on the platform.
WOW Moment: Key Findings
The operational overhead of traditional social data pipelines contrasts sharply with the AT Protocol's design. When engineering teams migrate from legacy scraping stacks to Bluesky's native endpoints, the reduction in infrastructure complexity and cost becomes immediately measurable.
| Approach | Auth Overhead | Rate Limit Predictability | Data Normalization Effort | Cost per 1k Records |
|---|---|---|---|---|
| Legacy Social Scraping | High (OAuth2 + token rotation + proxy auth) | Opaque (dynamic throttling, CAPTCHA triggers) | High (DOM parsing, HTML sanitization, layout drift) | $12-$45 (proxy + compute + enterprise API) |
| AT Protocol Extraction | Low (App Password → JWT, single endpoint split) | Transparent (HTTP headers, documented limits) | Low (structured JSON, consistent schema) | $3.00 plus a $0.25 run fee ($0.003/post) |
This finding matters because it shifts data extraction from a maintenance-heavy operation to a deterministic pipeline. Engineers can allocate resources toward data transformation, enrichment, and downstream analytics rather than fighting anti-bot defenses or managing proxy infrastructure. The predictable cost structure and structured JSON responses also simplify compliance auditing and schema versioning, making the platform viable for enterprise-grade social listening, brand monitoring, and lead generation workflows.
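The cost column reduces to a flat run fee plus a per-post fee. A minimal sketch of that arithmetic, assuming the $0.25/run and $0.003/post rates quoted in the table apply to a single run; the function name estimateRunCostUsd is hypothetical and the rates may change.

```typescript
// Hypothetical helper. The two rates are taken from the comparison table above.
const RUN_FEE_USD = 0.25;
const PER_POST_FEE_USD = 0.003;

function estimateRunCostUsd(postCount: number): number {
  if (!Number.isInteger(postCount) || postCount < 0) {
    throw new RangeError('postCount must be a non-negative integer');
  }
  // Work in tenths of a cent so the sum is done on integers, then convert back.
  const runFeeMills = Math.round(RUN_FEE_USD * 1000);
  const perPostMills = Math.round(PER_POST_FEE_USD * 1000);
  return (runFeeMills + perPostMills * postCount) / 1000;
}
```

Under these assumptions a single 1,000-post run comes to $3.25 once the run fee is included, and the run fee matters less as the batch grows.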
Core Solution
Building a reliable extraction pipeline on the AT Protocol requires three architectural decisions: endpoint routing, token lifecycle management, and output normalization. The following implementation demonstrates a production-ready TypeScript client that handles these concerns systematically.
1. Endpoint Routing & Token Management
The AT Protocol enforces a strict boundary between public and authenticated traffic. Unauthenticated reads target public.api.bsky.app, while authenticated requests must route through bsky.social. Authentication is performed against bsky.social, which returns a JWT. That token is then attached exclusively to subsequent authenticated calls.
interface AuthCredentials {
  identifier: string;
  appPassword: string;
}

interface TokenResponse {
  accessJwt: string;
  refreshJwt: string;
  handle: string;
}

class BlueskyApiClient {
  private readonly publicBase = 'https://public.api.bsky.app';
  private readonly authBase = 'https://bsky.social';
  private accessToken: string | null = null;

  constructor(private credentials?: AuthCredentials) {}

  // Exchange the App Password for a session JWT on the authenticated host.
  async authenticate(): Promise<void> {
    if (!this.credentials) throw new Error('Credentials required for auth');
    const payload = {
      identifier: this.credentials.identifier,
      password: this.credentials.appPassword,
    };
    const response = await fetch(`${this.authBase}/xrpc/com.atproto.server.createSession`, {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify(payload),
    });
    if (!response.ok) {
      throw new Error(`Auth failed: ${response.status} ${response.statusText}`);
    }
    const data = await response.json() as TokenResponse;
    this.accessToken = data.accessJwt;
  }

  // Pick the base URL for a call; fail fast if auth is required but missing.
  private getBaseUrl(requireAuth: boolean): string {
    if (requireAuth && !this.accessToken) {
      throw new Error('Authenticated endpoint requested without valid token');
    }
    return requireAuth ? this.authBase : this.publicBase;
  }

  // Single entry point for GET queries: routing, headers, and error handling.
  async request<T>(endpoint: string, requireAuth: boolean = false, params?: Record<string, string>): Promise<T> {
    const baseUrl = this.getBaseUrl(requireAuth);
    const url = new URL(`${baseUrl}/xrpc/${endpoint}`);
    if (params) {
      Object.entries(params).forEach(([key, value]) => url.searchParams.append(key, value));
    }
    // GET requests carry no body, so Accept is sent instead of Content-Type.
    const headers: Record<string, string> = { Accept: 'application/json' };
    if (requireAuth && this.accessToken) {
      headers['Authorization'] = `Bearer ${this.accessToken}`;
    }
    const res = await fetch(url.toString(), { headers });
    if (!res.ok) {
      throw new Error(`Request failed: ${res.status} at ${url.pathname}`);
    }
    return res.json() as Promise<T>;
  }
}
Why this structure? Separating base URLs at the client level prevents accidental token leakage to the public gateway. The requireAuth flag makes the routing decision explicit at every call site, getBaseUrl fails fast at runtime when an authenticated call is attempted without a token, and the centralized request method standardizes error handling and header injection.
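The client above stores only the access token, so a long-running pipeline also needs a way to renew it. The sketch below shows one possible shape, assuming the com.atproto.server.refreshSession endpoint is called with the refresh token as the bearer credential. The helper names buildRefreshRequest and callWithRefresh are hypothetical, and the status code that signals an expired token is left configurable because it is an assumption here (401 is used as the default only because the checklist later in this guide keys on it).

```typescript
// Hypothetical helpers, not part of the client above.
interface RefreshRequest {
  url: string;
  init: { method: 'POST'; headers: Record<string, string> };
}

// Build the refresh call. Note that the refresh JWT, not the access JWT,
// goes into the Authorization header.
function buildRefreshRequest(authBase: string, refreshJwt: string): RefreshRequest {
  return {
    url: `${authBase}/xrpc/com.atproto.server.refreshSession`,
    init: { method: 'POST', headers: { Authorization: `Bearer ${refreshJwt}` } },
  };
}

// Run a call, refresh once if the response looks like an expired session,
// then retry exactly once. `shouldRefresh` is an assumption to adjust.
async function callWithRefresh<T extends { status: number }>(
  call: () => Promise<T>,
  refresh: () => Promise<void>,
  shouldRefresh: (status: number) => boolean = (status) => status === 401,
): Promise<T> {
  const first = await call();
  if (!shouldRefresh(first.status)) return first;
  await refresh();
  return call();
}
```

In use, refresh would pass the result of buildRefreshRequest to fetch and store the returned accessJwt and refreshJwt before the retry. Retrying only once keeps a revoked session from looping forever.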
2. Data Extraction Patterns
The platform supports three primary extraction modes: keyword search, author timeline retrieval, and conversation thread resolution. Each requires distinct parameter handling and response parsing.
interface AuthorInfo {
  did: string;
  handle: string;
  displayName?: string;
}

// Shape of a post as returned by the feed endpoints: the author, engagement
// counts, and hydrated embed sit on the post itself, the text lives in `record`.
interface PostRecord {
  uri: string;
  cid: string;
  author: AuthorInfo;
  record: {
    text: string;
    createdAt: string;
    reply?: { root: { uri: string }; parent: { uri: string } };
  };
  embed?: {
    images?: Array<{ alt: string; thumb: string; fullsize: string }>;
    external?: { uri: string; title: string; description: string };
  };
  likeCount?: number;
  repostCount?: number;
  replyCount?: number;
}

class BlueskyDataExtractor {
  constructor(private client: BlueskyApiClient) {}

  // Keyword search; routed through the authenticated host.
  async searchPosts(query: string, options: {
    lang?: string;
    since?: string;
    until?: string;
    sort?: 'top' | 'latest';
    limit?: number;
  }): Promise<PostRecord[]> {
    const params: Record<string, string> = { q: query, limit: String(options.limit || 25) };
    if (options.lang) params.lang = options.lang;
    if (options.since) params.since = options.since;
    if (options.until) params.until = options.until;
    if (options.sort) params.sort = options.sort;
    const data = await this.client.request<{ posts: PostRecord[] }>(
      'app.bsky.feed.searchPosts',
      true,
      params
    );
    return data.posts;
  }

  // Author timeline; served by the public gateway without a token.
  async fetchAuthorFeed(authorHandle: string, limit: number = 50): Promise<PostRecord[]> {
    const data = await this.client.request<{ feed: Array<{ post: PostRecord }> }>(
      'app.bsky.feed.getAuthorFeed',
      false,
      { actor: authorHandle, limit: String(limit) }
    );
    return data.feed.map(item => item.post);
  }

  // Conversation thread; returned as a nested tree and flattened below.
  async resolveThread(rootUri: string): Promise<PostRecord[]> {
    const data = await this.client.request<{ thread: any }>(
      'app.bsky.feed.getPostThread',
      false,
      { uri: rootUri }
    );
    return this.flattenThreadTree(data.thread);
  }

  // Depth-first (pre-order) walk: each post is emitted before its replies.
  private flattenThreadTree(node: any): PostRecord[] {
    const results: PostRecord[] = [];
    const traverse = (current: any) => {
      if (current.post) {
        results.push(current.post);
      }
      if (current.replies && Array.isArray(current.replies)) {
        current.replies.forEach((reply: any) => traverse(reply));
      }
    };
    traverse(node);
    return results;
  }
}
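Why depth-first flattening? The AT Protocol returns conversation trees as nested JSON objects, while downstream analytics and database ingestion typically require flat records. A recursive depth-first (pre-order) traversal emits each reply after its parent, producing a flat array suitable for CSV/JSON export or stream processing. The array follows tree order rather than strict chronological order; the reply hierarchy can be rebuilt from each post's reply.parent.uri field, and records can be sorted by createdAt when a timeline is needed.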
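When a consumer does need a timeline, the flattened array can be re-sorted without losing the hierarchy. A minimal sketch, assuming each flattened post has already been reduced to a row holding its URI, its createdAt timestamp, and its parent URI; the FlatRow type and toChronological name are hypothetical.

```typescript
// Hypothetical row type: the parent URI is kept so the tree can be rebuilt.
interface FlatRow {
  uri: string;
  createdAt: string; // ISO 8601 timestamp
  parentUri?: string;
}

function toChronological(rows: FlatRow[]): FlatRow[] {
  // Copy first so the original traversal order stays available to the caller.
  return [...rows].sort(
    (a, b) => Date.parse(a.createdAt) - Date.parse(b.createdAt),
  );
}
```

Because Array.prototype.sort is stable, rows with identical timestamps keep their traversal order, so a parent emitted before its reply stays ahead of it on ties.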
3. Monorepo Bundling Strategy
When operating within a TypeScript monorepo using npm workspaces, shared utilities (logging, retry logic, schema validators) are typically referenced via workspace aliases. However, deployment environments like Apify's build servers do not resolve workspace dependencies. They only package the target actor directory.
The solution is to copy shared source files into each actor's src/shared/ directory during the build phase, then bundle everything into a single executable file. This preserves a single source of truth in the repository while ensuring deployment isolation.
// tsup.config.ts
import { defineConfig } from 'tsup';

export default defineConfig({
  entry: ['src/main.ts'],
  format: ['cjs'],
  target: 'node18',
  outDir: 'dist',
  clean: true,
  bundle: true,
  splitting: false,
  noExternal: ['@apify-actors/shared'],
  banner: {
    js: '/* Bluesky Data Pipeline - Production Bundle */',
  },
});
Why tsup with noExternal? tsup leverages esbuild for fast compilation. Setting noExternal forces the bundler to inline workspace dependencies, eliminating runtime module resolution failures. The single dist/main.js output guarantees deterministic execution across isolated CI/CD runners.
Pitfall Guide
1. Cross-Endpoint Authentication Leakage
Explanation: Sending JWT tokens to public.api.bsky.app triggers immediate 403 responses. The public gateway is fronted by Cloudflare and explicitly rejects authenticated headers.
Fix: Implement a routing guard that validates the target base URL against the requireAuth flag before attaching headers. Never reuse the same client instance for mixed traffic without explicit URL switching.
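One way to express the routing guard is a header builder that inspects the target host before it attaches anything. This is a sketch under the assumptions stated in this guide (public.api.bsky.app is the public gateway and must never receive a token); the function name buildRequestHeaders is hypothetical.

```typescript
// Host taken from the endpoint split described earlier in this guide.
const PUBLIC_GATEWAY_HOST = 'public.api.bsky.app';

function buildRequestHeaders(
  targetUrl: string,
  requireAuth: boolean,
  accessToken: string | null,
): Record<string, string> {
  const host = new URL(targetUrl).hostname;
  if (requireAuth && host === PUBLIC_GATEWAY_HOST) {
    throw new Error('Refusing to send a token to the public gateway');
  }
  if (requireAuth && !accessToken) {
    throw new Error('Authenticated call requested without a token');
  }
  // Unauthenticated calls get no Authorization header at all.
  return requireAuth ? { Authorization: `Bearer ${accessToken}` } : {};
}
```

Checking the hostname of the final URL, rather than trusting the flag alone, catches the case where a caller builds the URL by hand and passes the wrong base.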
2. Thread Depth Recursion Limits
Explanation: Deep conversation trees can exceed call stack limits if traversed synchronously without tail-call optimization or iterative conversion.
Fix: Convert recursive traversal to an iterative stack-based approach when processing threads exceeding 50 replies. Monitor memory allocation and implement chunked processing for viral conversations.
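The iterative conversion can be sketched with an explicit stack. The node shape below (an optional post plus an optional replies array) is an assumption based on the recursive version shown earlier, and the maxDepth cutoff and flattenThreadIterative name are hypothetical.

```typescript
interface ThreadNode {
  post?: { uri: string };
  replies?: ThreadNode[];
}

interface FlatEntry {
  uri: string;
  depth: number;
}

function flattenThreadIterative(root: ThreadNode, maxDepth = 100): FlatEntry[] {
  const out: FlatEntry[] = [];
  const stack: Array<{ node: ThreadNode; depth: number }> = [{ node: root, depth: 0 }];
  while (stack.length > 0) {
    const { node, depth } = stack.pop()!;
    if (node.post) out.push({ uri: node.post.uri, depth });
    // Stop descending once the depth budget is used up.
    if (depth >= maxDepth || !Array.isArray(node.replies)) continue;
    // Push in reverse so replies are popped in their original order,
    // which reproduces the pre-order of the recursive version.
    for (let i = node.replies.length - 1; i >= 0; i--) {
      stack.push({ node: node.replies[i], depth: depth + 1 });
    }
  }
  return out;
}
```

The stack lives on the heap, so a reply chain tens of thousands of levels deep is bounded by memory rather than by the call stack.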
3. Workspace Resolution in Isolated CI Environments
Explanation: npm workspaces resolve dependencies at install time, but deployment platforms often skip npm install for workspace roots, breaking alias imports.
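Fix: Add a pre-build script that copies shared modules into the actor directory. Run node --check dist/main.js before deployment to catch syntax errors; because that flag does not execute the file, also run the bundle once locally to confirm every import was inlined.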
4. Media Payload Bloat
Explanation: Fetching full-size image URLs alongside thread data increases payload size by 300-500%, slowing serialization and increasing storage costs.
Fix: Request only thumbnail URLs during extraction. Resolve full-size assets asynchronously in a downstream enrichment pipeline, or omit media entirely if text analysis is the primary goal.
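Keeping only thumbnails can be done as a small projection step before serialization. The sketch assumes each image entry exposes alt, thumb, and fullsize as flat string fields, which may differ from the shape your responses actually carry; slimImages is a hypothetical name.

```typescript
// Assumed input shape; adjust the field paths to match real responses.
interface ImageView {
  alt: string;
  thumb: string;
  fullsize: string;
}

interface SlimImage {
  alt: string;
  thumb: string;
}

function slimImages(images: ImageView[] | undefined): SlimImage[] {
  // Drop the fullsize URL; it can be resolved later in an enrichment step.
  return (images ?? []).map(({ alt, thumb }) => ({ alt, thumb }));
}
```

Returning an empty array for posts without images keeps the output column type stable for CSV and warehouse loads.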
5. URI Parsing Misalignment
Explanation: AT Protocol URIs follow the format at://did:plc:xxxxx/app.bsky.feed.post/xxxxx. Developers often attempt to extract handles directly from URIs, which is impossible without a DID resolution step.
Fix: Always fetch the handle and displayName from the author object returned alongside the post. Never derive identity from the URI string.
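A small parser makes the point concrete: the first segment of a record URI is the repository identifier (a DID in the format above), followed by the collection and the record key, and no handle appears anywhere. This is a sketch for record-level URIs only; parseAtUri is a hypothetical helper, not a library call.

```typescript
interface AtUriParts {
  authority: string;  // e.g. did:plc:xxxxx
  collection: string; // e.g. app.bsky.feed.post
  rkey: string;       // record key
}

function parseAtUri(uri: string): AtUriParts {
  const match = /^at:\/\/([^/]+)\/([^/]+)\/([^/]+)$/.exec(uri);
  if (!match) throw new Error(`Not a record-level AT URI: ${uri}`);
  const [, authority, collection, rkey] = match;
  return { authority, collection, rkey };
}
```

The parsed authority is useful as a stable join key, while the display identity should still come from the author object on the post.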
6. Silent Schema Drift
Explanation: The AT Protocol evolves rapidly. Optional fields like embed or replyCount may be omitted in edge cases, causing downstream type errors.
Fix: Implement runtime schema validation using Zod or io-ts. Define strict interfaces with optional chaining and provide default values for missing metrics before serialization.
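For pipelines that prefer no extra dependency, the same idea can be sketched with a hand-written guard instead of Zod or io-ts. The function below is a stand-in for a schema library, not a replacement recommendation: it accepts the text under either record or value, reads counts from the post or from that nested object, and defaults missing metrics to zero. All names are hypothetical.

```typescript
interface NormalizedPost {
  uri: string;
  text: string;
  createdAt: string;
  likeCount: number;
  repostCount: number;
  replyCount: number;
}

function asCount(value: unknown): number {
  return typeof value === 'number' && Number.isFinite(value) && value >= 0 ? value : 0;
}

function isObject(value: unknown): value is Record<string, unknown> {
  return typeof value === 'object' && value !== null;
}

// Returns null for records that cannot be used, so callers can count rejects.
function normalizePost(raw: unknown): NormalizedPost | null {
  if (!isObject(raw) || typeof raw.uri !== 'string') return null;
  const body = isObject(raw.record) ? raw.record : isObject(raw.value) ? raw.value : null;
  if (!body || typeof body.text !== 'string' || typeof body.createdAt !== 'string') return null;
  return {
    uri: raw.uri,
    text: body.text,
    createdAt: body.createdAt,
    likeCount: asCount(raw.likeCount ?? body.likeCount),
    repostCount: asCount(raw.repostCount ?? body.repostCount),
    replyCount: asCount(raw.replyCount ?? body.replyCount),
  };
}
```

Returning null rather than throwing lets a batch continue past one malformed record while still making the reject rate observable.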
7. Missing Backoff Headers
Explanation: Assuming static rate limits leads to pipeline crashes during traffic spikes. The platform returns Retry-After and X-RateLimit-Remaining headers.
Fix: Parse response headers and implement exponential backoff with jitter. Cache Retry-After values and pause queue processing until the window expires.
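The delay calculation can be kept as a pure function so it is easy to test. This sketch honors a Retry-After value given in seconds when one is present and otherwise falls back to exponential backoff with full jitter. It does not handle the HTTP-date form of Retry-After, and the parameter names and defaults are assumptions.

```typescript
function retryDelayMs(
  attempt: number,                  // 0 for the first retry
  retryAfterHeader: string | null,  // raw header value, if any
  baseMs = 1000,
  maxMs = 60_000,
  random: () => number = Math.random,
): number {
  if (retryAfterHeader !== null && retryAfterHeader.trim() !== '') {
    const seconds = Number(retryAfterHeader);
    // A server-provided wait takes priority over the local schedule.
    if (Number.isFinite(seconds) && seconds >= 0) return seconds * 1000;
  }
  // Full jitter: pick uniformly between 0 and the capped exponential ceiling.
  const ceiling = Math.min(maxMs, baseMs * 2 ** attempt);
  return Math.floor(random() * ceiling);
}
```

A caller would read the header with res.headers.get('Retry-After') and sleep for the returned number of milliseconds before retrying. Injecting random makes the jitter deterministic in tests.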
Production Bundle
Action Checklist
- Configure endpoint routing guard to separate public and authenticated traffic
- Implement JWT lifecycle management with automatic refresh on 401 responses
- Add iterative thread flattening to prevent stack overflow on viral conversations
- Set up pre-build workspace copying script for monorepo deployment compatibility
- Integrate runtime schema validation to handle optional AT Protocol fields
- Implement header-aware rate limiting with exponential backoff and jitter
- Validate bundle output locally before pushing to deployment platform
- Configure downstream export pipeline (JSON/CSV) with streaming backpressure
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| High-volume keyword monitoring | Authenticated search with date partitioning | Search endpoint requires JWT; partitioning reduces payload size | $0.25/run + $0.003/post |
| Competitor timeline tracking | Unauthenticated author feed | No token required; lower latency and simpler auth flow | $0.25/run + $0.003/post |
| Viral conversation analysis | Thread resolution with iterative flattening | Preserves reply hierarchy while enabling linear analytics | $0.25/run + $0.003/post |
| Monorepo with shared utilities | Pre-build copy + tsup bundling | Eliminates workspace resolution failures in isolated CI | Zero additional cost |
| Enterprise data warehouse sync | Stream-based JSON export with schema validation | Ensures type safety and prevents pipeline breaks on schema drift | Minimal compute overhead |
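Date partitioning for the keyword-monitoring row can be sketched as a helper that splits one long range into day-sized since/until windows, each of which becomes a separate search call. The one-day window size and the dailyWindows name are assumptions; busier queries may need smaller windows.

```typescript
interface SearchWindow {
  since: string; // ISO 8601, inclusive start
  until: string; // ISO 8601, end of the window
}

const DAY_MS = 24 * 60 * 60 * 1000;

function dailyWindows(sinceIso: string, untilIso: string): SearchWindow[] {
  const start = Date.parse(sinceIso);
  const end = Date.parse(untilIso);
  if (Number.isNaN(start) || Number.isNaN(end) || start >= end) {
    throw new RangeError('since must be a valid timestamp earlier than until');
  }
  const windows: SearchWindow[] = [];
  for (let t = start; t < end; t += DAY_MS) {
    windows.push({
      since: new Date(t).toISOString(),
      // The final window is clipped so it never runs past the requested end.
      until: new Date(Math.min(t + DAY_MS, end)).toISOString(),
    });
  }
  return windows;
}
```

Each window maps directly onto the since and until options of the search method shown earlier, so a failed window can be retried on its own without repeating the whole range.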
Configuration Template
# .env.example
BSKY_IDENTIFIER=your-handle.bsky.social
BSKY_APP_PASSWORD=xxxx-xxxx-xxxx-xxxx
OUTPUT_FORMAT=json
MAX_THREAD_DEPTH=100
REQUEST_TIMEOUT_MS=15000
RETRY_ATTEMPTS=3

// pipeline.config.ts
export const PipelineConfig = {
  endpoints: {
    public: 'https://public.api.bsky.app',
    authenticated: 'https://bsky.social',
  },
  extraction: {
    searchLimit: 25,
    authorFeedLimit: 50,
    threadMaxDepth: 100,
  },
  output: {
    format: 'json' as const,
    includeMedia: false,
    flattenThreads: true,
  },
  resilience: {
    timeout: 15000,   // milliseconds
    retries: 3,
    backoffBase: 1000, // milliseconds
    jitter: true,
  },
};
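The environment file and the config object overlap on three values (thread depth, timeout, retries). One way to keep them in sync is a loader that reads the environment and falls back to the config defaults. This is a sketch; loadResilienceSettings and readPositiveInt are hypothetical names, and the variable names are the ones from the example file above.

```typescript
interface ResilienceSettings {
  maxThreadDepth: number;
  timeoutMs: number;
  retries: number;
}

type Env = Record<string, string | undefined>;

function readPositiveInt(env: Env, key: string, fallback: number): number {
  const raw = env[key];
  if (raw === undefined || raw.trim() === '') return fallback;
  const parsed = Number(raw);
  if (!Number.isInteger(parsed) || parsed <= 0) {
    // Fail at startup rather than running with a silently broken limit.
    throw new Error(`${key} must be a positive integer, got "${raw}"`);
  }
  return parsed;
}

function loadResilienceSettings(env: Env): ResilienceSettings {
  return {
    maxThreadDepth: readPositiveInt(env, 'MAX_THREAD_DEPTH', 100),
    timeoutMs: readPositiveInt(env, 'REQUEST_TIMEOUT_MS', 15000),
    retries: readPositiveInt(env, 'RETRY_ATTEMPTS', 3),
  };
}
```

In a Node process the loader would be called as loadResilienceSettings(process.env) after dotenv has populated the environment.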
Quick Start Guide
- Initialize the project: Create a new TypeScript directory, install tsup, zod, and dotenv. Configure tsconfig.json with module: NodeNext and target: ES2022.
- Set up authentication: Generate an App Password from your Bluesky account settings. Store it in .env and wire it to the client's authentication method.
- Build the extraction logic: Implement the routing guard, token manager, and data extraction methods. Add schema validation for post records and thread nodes.
- Bundle and validate: Run tsup to generate dist/main.js. Execute node --check dist/main.js to verify syntax, then run the bundle once locally to confirm that module resolution succeeds.
- Deploy and monitor: Push the actor directory to your deployment platform. Configure environment variables, trigger a test run, and verify output schema alignment with downstream consumers.
