Next.js · 2026-05-11 · 78 min read

Three months of running a Next.js aggregator on a CSV: what broke and what did not.

By SweepBase

Flat-File Data Architectures in Next.js: Scaling Small Aggregators Without a Database

Current Situation Analysis

The modern web development stack defaults to relational or document databases for nearly every data-driven project. This inertia stems from a widespread belief that flat files are inherently fragile, unscalable, and unsuitable for production environments. In reality, this mindset creates unnecessary infrastructure overhead for projects with fewer than 1,000 records. Developers routinely provision connection pools, write migration scripts, and maintain ORM layers that yield zero performance or operational ROI at small scale.

The problem is overlooked because flat-file architectures are frequently dismissed as "prototyping" solutions. Teams assume that moving to a database is a mandatory step toward production readiness. This assumption ignores how modern edge networks, static site generation, and incremental static regeneration (ISR) can treat CSV or JSON files as first-class, cacheable data sources. When the dataset remains under a few hundred rows, the actual bottlenecks are rarely data volume. They are cache invalidation timing, parsing edge cases, and third-party asset delivery limits.

Operational data from a 141-row aggregator running on Next.js 15 for three months confirms this. The system maintained sub-100ms response times across thousands of requests without a single database query. The architecture never strained under read load. Instead, failures emerged from misconfigured cache lifecycles, malformed CSV quoting breaking server components, and external image proxy rate limits. These are solvable engineering problems, not architectural dead ends. Treating a CSV as a version-controlled data contract, rather than a temporary placeholder, shifts focus from data plumbing to business logic and editorial velocity.

WOW Moment: Key Findings

When comparing flat-file architectures against traditional database-driven setups for small-scale aggregators, the operational trade-offs become stark. The table below contrasts three common approaches across deployment complexity, data synchronization latency, editorial velocity, and monthly infrastructure cost.

| Approach | Deployment Complexity | Data Sync Latency | Editorial Velocity | Monthly Cost |
| --- | --- | --- | --- | --- |
| Flat-File (CSV) + ISR | Low (Git push triggers build) | 1–5 minutes (with webhook) | High (PR diffs, no SQL client) | $0–$20 (Vercel free/Pro tier) |
| Relational DB (PostgreSQL) | Medium (migrations, connection pooling) | Real-time (but requires API layer) | Low (requires admin UI or raw queries) | $15–$50 (managed instance) |
| Headless CMS | Medium (schema sync, webhooks) | Real-time (CDN propagation) | Medium (UI-driven, but rigid schemas) | $25–$100 (tiered plans) |

This finding matters because it decouples data storage from application complexity. A flat-file approach eliminates connection management, ORM mapping, and migration pipelines entirely. Editorial teams can modify data through standard pull requests, leveraging Git diff tools and CI validation. The architecture scales linearly with team size rather than infrastructure complexity. When the dataset stays under 500 rows, the flat-file model delivers faster iteration cycles, lower operational overhead, and predictable caching behavior that traditional databases struggle to match without additional tooling.

Core Solution

Building a production-ready flat-file aggregator requires treating the CSV as a strict data contract, not a loose spreadsheet. The architecture revolves around four pillars: schema enforcement, ingestion validation, predicate-based querying, and intelligent cache invalidation.

1. Schema Enforcement as the Single Source of Truth

Define the data contract using a validation library before writing any ingestion logic. This contract dictates types, defaults, and transformation rules. When the schema changes, the entire pipeline fails fast, preventing silent data corruption.

// src/lib/schema/record-schema.ts
import { z } from 'zod';

export const RecordSchema = z.object({
  id: z.string().uuid(),
  title: z.string().min(2).max(120),
  category: z.enum(['payments', 'staking', 'lending', 'nft']),
  metrics: z.object({
    tps: z.number().nonnegative(),
    latency_ms: z.number().positive(),
    fee_usd: z.number().min(0)
  }),
  status: z.enum(['active', 'deprecated', 'beta']).default('active'),
  image_url: z.string().url()
});

export type RecordContract = z.infer<typeof RecordSchema>;

2. CSV Ingestion & Validation Pipeline

Parse the flat file using a streaming parser, then validate each row against the schema. Failures must be logged with context before throwing, ensuring server components never crash silently.

// src/lib/data/ingestion.ts
import { readFile } from 'fs/promises';
import { parse } from 'papaparse';
import { RecordSchema, type RecordContract } from '../schema/record-schema';

export class DatasetLoader {
  async loadFromPath(filePath: string): Promise<RecordContract[]> {
    const raw = await readFile(filePath, 'utf-8');

    // dynamicTyping converts numeric strings into numbers so the rows can
    // satisfy the z.number() fields in the schema
    const results = parse(raw, {
      header: true,
      skipEmptyLines: true,
      dynamicTyping: true
    });

    if (results.errors.length > 0) {
      console.error('[DatasetLoader] CSV parsing errors:', results.errors);
      throw new Error('Malformed CSV structure detected');
    }

    const validated: RecordContract[] = [];

    for (const row of results.data as Record<string, unknown>[]) {
      // Note: the nested `metrics` object assumes flat CSV columns are reshaped
      // into this structure (or stored as a JSON column) before validation
      const parseResult = RecordSchema.safeParse(row);

      if (!parseResult.success) {
        // Log the raw row and the Zod error tree before throwing so the
        // offending record is visible in server logs
        console.error('[DatasetLoader] Validation failed for row:', JSON.stringify(row));
        console.error('[DatasetLoader] Zod errors:', parseResult.error.format());
        throw new Error(`Invalid record at row: ${JSON.stringify(row)}`);
      }

      validated.push(parseResult.data);
    }

    return validated;
  }
}

3. Predicate-Based Query Layer

Replace complex SQL joins or ORM queries with composable boolean functions. Each category or filter becomes a pure function that accepts a record and returns a boolean. This pattern keeps filtering logic transparent, testable, and trivial to extend.

// src/lib/data/filters.ts
import type { RecordContract } from '../schema/record-schema';

export type FilterPredicate = (record: RecordContract) => boolean;

export const Filters = {
  isActive: (r: RecordContract) => r.status === 'active',
  isPayments: (r: RecordContract) => r.category === 'payments',
  hasLowLatency: (r: RecordContract) => r.metrics.latency_ms < 50,
  isHighThroughput: (r: RecordContract) => r.metrics.tps > 1000,
  
  // Compose filters dynamically
  combine: (...predicates: FilterPredicate[]) => (r: RecordContract) =>
    predicates.every(p => p(r))
};
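
As a quick illustration of how these predicates compose, the sketch below filters the loaded dataset down to active, low-latency payment records. The file name and the data/records.csv path are assumptions for the example, not part of the original post.

// src/lib/data/example-usage.ts (illustrative sketch)
import { DatasetLoader } from './ingestion';
import { Filters } from './filters';

export async function activeFastPayments() {
  const records = await new DatasetLoader().loadFromPath('data/records.csv');
  // Every predicate passed to combine() must return true for a record to be kept
  return records.filter(
    Filters.combine(Filters.isActive, Filters.isPayments, Filters.hasLowLatency)
  );
}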

4. ISR & Cache Invalidation Strategy

Next.js 15 App Router handles static rendering efficiently, but ISR requires explicit invalidation when the underlying CSV changes. Cold pages will serve stale data until traffic triggers regeneration. A lightweight webhook endpoint solves this without external services.

// src/app/api/revalidate/route.ts
import { NextRequest, NextResponse } from 'next/server';
import { revalidateTag } from 'next/cache';

export async function POST(request: NextRequest) {
  const secret = request.nextUrl.searchParams.get('secret');
  
  if (secret !== process.env.REVALIDATION_SECRET) {
    return NextResponse.json({ error: 'Invalid secret' }, { status: 401 });
  }

  revalidateTag('dataset');
  
  return NextResponse.json({ revalidated: true, now: Date.now() });
}

Apply the tag in route segments:

// src/app/records/[id]/page.tsx
export const revalidate = 86400; // 24 hours
export const dynamicParams = true;

export default async function RecordPage({ params }: { params: Promise<{ id: string }> }) {
  const { id } = await params; // Next.js 15: params is a Promise and must be awaited

  // Tagging the fetch ties this page to revalidateTag('dataset'); fetch() in a
  // server component needs an absolute URL (NEXT_PUBLIC_SITE_URL is an assumed env var)
  const res = await fetch(`${process.env.NEXT_PUBLIC_SITE_URL}/api/dataset`, {
    next: { tags: ['dataset'] }
  });
  const records = await res.json();
  // ... render logic
}
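
The page above reads from an /api/dataset route that the post references but does not list. A minimal sketch of such a handler, assuming the CSV lives at data/records.csv (the path used in the CI workflow later) and that the @/ import alias from create-next-app is configured:

// src/app/api/dataset/route.ts (minimal sketch; file path and alias are assumptions)
import path from 'path';
import { NextResponse } from 'next/server';
import { DatasetLoader } from '@/lib/data/ingestion';

export async function GET() {
  // Load and validate the CSV on each request; ISR and CDN caching absorb the cost
  const filePath = path.join(process.cwd(), 'data', 'records.csv');
  const records = await new DatasetLoader().loadFromPath(filePath);

  return NextResponse.json(records);
}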

5. Asset Routing & Optimization

External image proxies frequently hit rate limits when served from Vercel egress IPs. The solution is to download new assets locally, convert them to WebP, and serve them through Next.js Image Optimization. Legacy external URLs should only be used as fallbacks.

// src/app/api/image-proxy/route.ts
import { NextRequest, NextResponse } from 'next/server';

export async function GET(request: NextRequest) {
  const url = request.nextUrl.searchParams.get('url');
  if (!url) return NextResponse.json({ error: 'Missing url' }, { status: 400 });

  const response = await fetch(url, {
    headers: { 'User-Agent': 'NextJS-ImageProxy/1.0' }
  });

  if (!response.ok) {
    return NextResponse.json({ error: 'Image fetch failed' }, { status: 502 });
  }

  const buffer = await response.arrayBuffer();
  
  return new NextResponse(buffer, {
    headers: {
      'Content-Type': response.headers.get('content-type') || 'image/webp',
      'Cache-Control': 'public, max-age=604800, immutable',
      'CDN-Cache-Control': 'public, max-age=604800'
    }
  });
}
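
The post does not include the conversion step itself. A minimal build-time sketch using the sharp package (an assumption; any image tool works) might look like the following, with the source and output directories also assumed:

// scripts/convert-assets.ts (illustrative sketch; assumes sharp is installed)
import { mkdir, readdir } from 'fs/promises';
import path from 'path';
import sharp from 'sharp';

const SOURCE_DIR = 'assets/raw';     // assumed location of downloaded originals
const OUTPUT_DIR = 'public/assets';  // served by Next.js as static files

async function convertAll() {
  await mkdir(OUTPUT_DIR, { recursive: true });
  const files = await readdir(SOURCE_DIR);

  for (const file of files) {
    const name = path.parse(file).name;
    // Cap oversized originals and re-encode as WebP
    await sharp(path.join(SOURCE_DIR, file))
      .resize({ width: 1200, withoutEnlargement: true })
      .webp({ quality: 80 })
      .toFile(path.join(OUTPUT_DIR, `${name}.webp`));
  }
}

convertAll().catch((err) => {
  console.error('[convert-assets] conversion failed:', err);
  process.exit(1);
});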

Architecture Rationale:

  • Zod as contract: Prevents schema drift. Every refactoring passes through a single validation layer.
  • Predicates over queries: Eliminates ORM complexity. Filtering becomes O(n) array operations, which is negligible for <500 records.
  • ISR + webhook: Balances static performance with dynamic updates. The 24-hour revalidation window reduces edge compute costs, while the webhook ensures immediate propagation after CSV commits.
  • Local asset hosting: Removes third-party dependency risk. WebP conversion cuts bandwidth by ~30% compared to PNG/JPG.

Pitfall Guide

1. ISR Cold-Start Staleness

Explanation: Setting revalidate = 86400 without an invalidation trigger means cold pages serve stale data for up to 24 hours. Vercel only regenerates pages when traffic hits them. Fix: Implement a /api/revalidate endpoint secured with a secret. Trigger it via a post-commit Git hook or CI pipeline immediately after CSV changes.

2. Malformed CSV Quoting Crashes Server Components

Explanation: CSV parsers fail silently or throw when encountering unescaped commas inside quoted fields. If validation isn't logged before throwing, the entire route crashes with a generic 500 error. Fix: Always log the raw failing row and Zod error output before throwing. Wrap ingestion in a try/catch that returns a structured error response instead of bubbling up to the framework.
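Applied to the dataset route sketched earlier, that fix might look like this (an illustrative variant, not the post's exact code):

// src/app/api/dataset/route.ts (error-handling variant, illustrative)
import path from 'path';
import { NextResponse } from 'next/server';
import { DatasetLoader } from '@/lib/data/ingestion';

export async function GET() {
  try {
    const filePath = path.join(process.cwd(), 'data', 'records.csv');
    const records = await new DatasetLoader().loadFromPath(filePath);
    return NextResponse.json(records);
  } catch (error) {
    // The loader has already logged the offending row and the Zod output;
    // return a structured error instead of letting a generic 500 bubble up
    console.error('[api/dataset] ingestion failed:', error);
    return NextResponse.json(
      { error: 'Dataset ingestion failed; see server logs for the failing row' },
      { status: 500 }
    );
  }
}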

3. Third-Party Image Proxy Throttling

Explanation: External CDNs (Google Drive, S3 public buckets) often rate-limit Vercel egress IPs. Cache hit rates drop, latency spikes, and proxy endpoints return 429 errors. Fix: Migrate new assets to local storage or a dedicated CDN. Convert to WebP during build time. Use external URLs only as fallbacks for legacy records, and implement exponential backoff in the proxy route.
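A minimal backoff helper for the proxy route, assuming it is acceptable to retry on 429 and 5xx responses from the upstream host:

// src/lib/fetch-with-backoff.ts (illustrative sketch)
export async function fetchWithBackoff(
  url: string,
  init: RequestInit = {},
  retries = 3,
  baseDelayMs = 500
): Promise<Response> {
  for (let attempt = 0; attempt <= retries; attempt++) {
    const response = await fetch(url, init);

    // Retry only on throttling or transient upstream failures
    if (response.status !== 429 && response.status < 500) {
      return response;
    }
    if (attempt === retries) {
      return response;
    }
    // Exponential backoff: 500ms, 1s, 2s, ...
    await new Promise((resolve) => setTimeout(resolve, baseDelayMs * 2 ** attempt));
  }
  // Unreachable, but keeps the return type honest
  throw new Error('fetchWithBackoff exhausted retries');
}

In the proxy route shown earlier, the direct fetch(url, ...) call can then be swapped for fetchWithBackoff(url, ...).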

4. Schema Drift Without Contract Enforcement

Explanation: Adding or renaming columns in the CSV without updating the validation schema causes silent type mismatches or runtime crashes downstream. Fix: Treat the Zod schema as the authoritative source. Run schema validation in CI. Fail builds if CSV columns don't map to schema keys. Use TypeScript's z.infer to propagate types automatically.
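A sketch of such a CI check, assuming it runs against data/records.csv with a TypeScript runner such as tsx; the script name and relative import path are assumptions:

// scripts/check-schema-drift.ts (illustrative sketch)
import { readFile } from 'fs/promises';
import { parse } from 'papaparse';
import { RecordSchema } from '../src/lib/schema/record-schema';

async function main() {
  const raw = await readFile('data/records.csv', 'utf-8');
  // preview: 1 parses only the first row; meta.fields exposes the header columns
  const { meta } = parse(raw, { header: true, preview: 1 });

  // Top-level schema keys; nested objects like `metrics` need their own mapping
  const schemaKeys = Object.keys(RecordSchema.shape);
  const csvColumns = meta.fields ?? [];

  const missing = schemaKeys.filter((k) => !csvColumns.includes(k));
  const unknown = csvColumns.filter((c) => !schemaKeys.includes(c));

  if (missing.length > 0 || unknown.length > 0) {
    console.error('[check-schema-drift] missing columns:', missing);
    console.error('[check-schema-drift] unknown columns:', unknown);
    process.exit(1);
  }
  console.log('[check-schema-drift] CSV columns match schema keys');
}

main();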

5. Silent Validation Failures

Explanation: Using .parse() instead of .safeParse() throws exceptions that halt the entire ingestion pipeline. One bad row kills the dataset load. Fix: Always use .safeParse() in production. Collect errors, log them, and either skip the row (with a warning) or fail the build explicitly. Never swallow validation errors.
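A sketch of the collect-and-report variant, as opposed to the fail-fast loop in the DatasetLoader example above; names and placement mirror that example and are otherwise assumptions:

// src/lib/data/validate-rows.ts (illustrative variant)
import { RecordSchema, type RecordContract } from '../schema/record-schema';

export function validateRows(rows: Record<string, unknown>[]): RecordContract[] {
  const valid: RecordContract[] = [];
  const failures: { row: Record<string, unknown>; issues: string }[] = [];

  for (const row of rows) {
    const result = RecordSchema.safeParse(row);
    if (result.success) {
      valid.push(result.data);
    } else {
      failures.push({ row, issues: JSON.stringify(result.error.format()) });
    }
  }

  if (failures.length > 0) {
    // Log every bad row, then fail explicitly rather than shipping a partial dataset
    console.error(`[validateRows] ${failures.length} invalid rows`, failures);
    throw new Error('Dataset validation failed');
  }
  return valid;
}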

6. Over-Engineering Filter Logic

Explanation: Developers often reach for SQL-like query builders or ORM filters for flat-file datasets. This adds unnecessary abstraction layers and obscures simple boolean logic. Fix: Stick to predicate functions. Compose them with .every() or .some(). Keep filters in a single module. Test them as pure functions with mock records.

7. Missing Cache Invalidation Automation

Explanation: Manual webhook calls are error-prone. Teams forget to trigger revalidation after CSV updates, leading to stale production pages. Fix: Automate invalidation in CI/CD. Add a post-commit hook or GitHub Actions step that calls the revalidation endpoint after successful CSV merges. Log the response to verify success.

Production Bundle

Action Checklist

  • Define Zod schema as the single source of truth before writing ingestion logic
  • Implement .safeParse() with detailed row-level error logging
  • Replace ORM queries with composable predicate functions
  • Configure ISR with revalidate and attach cache tags to fetch calls
  • Build a secret-protected /api/revalidate endpoint for immediate cache busting
  • Migrate new images to local WebP storage; proxy external URLs only as fallbacks
  • Automate revalidation triggers in CI/CD or Git hooks
  • Add CSV column validation to CI pipeline to catch schema drift early

Decision Matrix

| Scenario | Recommended Approach | Why | Cost Impact |
| --- | --- | --- | --- |
| < 500 records, frequent editorial updates | Flat-File (CSV) + ISR | Git diffs, zero DB overhead, fast PR reviews | $0–$20/mo |
| > 5,000 records, complex relational queries | Relational DB (PostgreSQL) | Indexing, joins, ACID compliance required | $15–$50/mo |
| Non-technical editors, rich media management | Headless CMS | UI-driven workflows, asset optimization built-in | $25–$100/mo |
| Real-time analytics, user-generated content | Document DB + Edge Functions | Low-latency writes, flexible schema evolution | $20–$80/mo |

Configuration Template

// next.config.mjs
/** @type {import('next').NextConfig} */
const nextConfig = {
  images: {
    formats: ['image/webp', 'image/avif'],
    remotePatterns: [
      {
        protocol: 'https',
        hostname: 'cdn.yourdomain.com',
        pathname: '/assets/**'
      }
    ]
  },
  experimental: {
    optimizePackageImports: ['zod', 'papaparse']
  }
};

export default nextConfig;

# .github/workflows/revalidate.yml
name: Revalidate Cache on CSV Update
on:
  push:
    paths: ['data/records.csv']
jobs:
  revalidate:
    runs-on: ubuntu-latest
    steps:
      - name: Trigger Revalidation
        run: |
          # --fail exits non-zero on HTTP errors so the job surfaces a failed revalidation
          curl --fail -sS -X POST "https://yourdomain.com/api/revalidate?secret=${{ secrets.REVALIDATION_SECRET }}"

Quick Start Guide

  1. Initialize the project: Run npx create-next-app@latest aggregator --typescript --tailwind --app. Install dependencies: npm install zod papaparse.
  2. Create the data contract: Add src/lib/schema/record-schema.ts with your Zod definitions. Export the inferred TypeScript type.
  3. Build the ingestion pipeline: Create src/lib/data/ingestion.ts with the DatasetLoader class. Add a /api/dataset route that loads and caches the parsed records.
  4. Configure ISR & revalidation: Add export const revalidate = 86400 to route segments. Implement the /api/revalidate endpoint and secure it with an environment variable.
  5. Deploy & automate: Push to Vercel. Add the GitHub Actions workflow to trigger revalidation on CSV commits. Verify cache behavior by updating a row and checking the webhook response.