What it took to put six cities' affordable housing data on one map
Unifying Fragmented Municipal Data: Spatial Analytics and Schema Normalization at Scale
Current Situation Analysis
Cross-jurisdictional civic data comparison is a foundational requirement for urban policy analysis, but it remains one of the most under-engineered areas in modern data pipelines. Municipal housing departments operate on independent timelines, legacy systems, and policy-driven reporting requirements. The result is a fragmented data landscape where two cities tracking the same metric will rarely use the same column names, categorization logic, or temporal granularity.
This problem is frequently overlooked because most developer tutorials and open-source dashboards assume clean, standardized APIs or single-city deployments. In production, civic data is inherently heterogeneous. Income thresholds are bucketed differently based on local cost-of-living adjustments. Construction phases use jurisdiction-specific terminology. Historical datasets often lack completion timestamps due to legacy record-keeping practices. When analysts attempt to compare supply metrics across metros, they hit a normalization wall that breaks standard ETL assumptions.
Empirical evidence from multi-city housing datasets confirms the scale of the challenge. Across six major metropolitan areas, approximately 6,500 affordable housing projects can be aggregated, but the column-by-column overlap rarely exceeds forty percent. One city may track five income tiers while another uses three. Construction types might be split into preservation versus new development in one jurisdiction, and rehabilitation versus ground-up in another. Some datasets bundle rental and ownership units, while others separate them entirely. Temporal gaps are equally common; certain municipalities publish project rosters without completion dates, rendering time-series analysis impossible without explicit caveats.
The core bottleneck is not spatial computation or visualization. It is schema alignment. Until data is normalized into a canonical structure with explicit provenance tracking, any cross-city metric will produce misleading rankings. The engineering effort required to bridge these gaps is substantial, but it unlocks reliable gap analysis, per-capita normalization, and defensible policy comparisons.
WOW Moment: Key Findings
The most critical insight from multi-city normalization is that raw aggregation masks true supply-demand imbalances. Large metropolitan areas naturally dominate raw counts, obscuring density-level shortages in smaller or more constrained jurisdictions. Normalizing against population fundamentally reorders the data, but the choice of population denominator introduces its own analytical trade-offs.
| Normalization Approach | Ranking Behavior | Urban Core Accuracy | Production Viability |
|---|---|---|---|
| Raw Unit Count | Favors largest metros by absolute volume | Low (inflated by scale) | High for inventory tracking, low for policy comparison |
| Per-Capita (Residential) | Highlights dense, underserved neighborhoods | Medium (fails in commercial/tourist districts) | High for residential planning, requires daytime adjustment |
| Daytime-Adjusted Ratio | Balances worker/tourist influx with housing supply | High (reflects actual service demand) | Medium (requires additional ACS/employment datasets) |
This finding matters because it shifts the analytical focus from volume to density. Per-capita normalization exposes tracts where affordable inventory is critically misaligned with resident demand. However, relying exclusively on residential population creates blind spots in central business districts, hospital corridors, and transit hubs where daytime populations vastly exceed overnight counts. The production solution is to expose both metrics, document the methodological boundaries, and allow stakeholders to toggle between residential and daytime-adjusted views. Transparency about normalization assumptions consistently outperforms forced precision in policy-facing applications.
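To make the reordering concrete, here is a minimal sketch of how a raw-count ranking and a per-capita ranking can diverge. The city names and figures are invented for illustration and are not drawn from the six-city dataset; `unitsPerThousand` and `rankBy` are hypothetical helper names.

```typescript
// Hypothetical figures: invented to illustrate the effect, not real city data.
interface CitySupply {
  city: string;
  affordableUnits: number;
  residentPopulation: number;
}

// Affordable units per 1,000 residents; null when population is missing or zero.
function unitsPerThousand(c: CitySupply): number | null {
  if (c.residentPopulation <= 0) return null;
  return (c.affordableUnits / c.residentPopulation) * 1000;
}

// Rank cities in descending order of a metric, placing null values last.
function rankBy(
  cities: CitySupply[],
  metric: (c: CitySupply) => number | null
): string[] {
  return [...cities]
    .sort((a, b) => {
      const ma = metric(a);
      const mb = metric(b);
      if (ma === null && mb === null) return 0;
      if (ma === null) return 1;
      if (mb === null) return -1;
      return mb - ma;
    })
    .map((c) => c.city);
}

const sampleCities: CitySupply[] = [
  { city: 'Metro A', affordableUnits: 40000, residentPopulation: 8000000 },
  { city: 'Metro B', affordableUnits: 12000, residentPopulation: 900000 },
  { city: 'Metro C', affordableUnits: 9000, residentPopulation: 3000000 },
];

const rawRanking = rankBy(sampleCities, (c) => c.affordableUnits);
const perCapitaRanking = rankBy(sampleCities, unitsPerThousand);
console.log(rawRanking, perCapitaRanking);
```

In this invented sample the largest metro leads on raw units, while the smallest one leads once supply is divided by population, which is the reordering described above.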
Core Solution
Building a reliable cross-city housing analytics pipeline requires three architectural pillars: a canonical target schema, idempotent city-specific loaders, and database-native spatial computation. Each layer addresses a specific failure mode in municipal data integration.
Step 1: Define a Canonical Target Schema
The target schema must be broad enough to absorb jurisdictional variations but strict enough to enforce query consistency. The design prioritizes nullable fields over forced defaults, preserving data lineage through audit columns.
// src/schema/canonical.ts
export const CANONICAL_SCHEMA = {
  table: 'muni_housing_inventory',
  columns: {
    id: 'SERIAL PRIMARY KEY',
    city_code: 'VARCHAR(3) NOT NULL',
    external_ref: 'VARCHAR(64) NOT NULL',
    project_name: 'VARCHAR(255)',
    geom: 'GEOGRAPHY(Point, 4326)',
    total_units: 'INTEGER',
    unit_type: 'VARCHAR(32)',          // rental, ownership, mixed
    income_band: 'VARCHAR(32)',        // normalized tier
    income_band_raw: 'VARCHAR(64)',    // original jurisdiction label
    construction_phase: 'VARCHAR(32)', // new, rehab, preservation
    funding_source: 'VARCHAR(128)',
    status_date: 'DATE',
    data_quality_note: 'TEXT'
  },
  uniqueConstraint: 'UNIQUE(city_code, external_ref)'
};
The income_band_raw column is critical. It preserves the source jurisdiction's exact categorization, enabling auditors to trace normalized values back to their origin. Nullable fields prevent schema rigidity from rejecting valid records.
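The mapping from income_band_raw to income_band could look something like the sketch below. The article does not show the implementation of normalizeIncomeTier, so everything here is an assumption: the tier names, the raw labels, the version string, and the config shape are placeholders, and the four tiers do not represent the actual standard used in the pipeline.

```typescript
// Hypothetical mapping config: labels, tier names, and version are placeholders.
type CanonicalTier = 'extremely_low' | 'very_low' | 'low' | 'moderate';

interface TierMapping {
  version: string; // bump whenever a jurisdiction changes its buckets
  tiers: { [cityCode: string]: { [rawLabel: string]: CanonicalTier } };
}

const TIER_MAPPING: TierMapping = {
  version: 'example-v1',
  tiers: {
    SFO: {
      '30% AMI': 'extremely_low',
      '50% AMI': 'very_low',
      '80% AMI': 'low',
      '120% AMI': 'moderate',
    },
  },
};

// Returns the canonical tier, or null when the label is missing or unmapped,
// so unknown categories surface as gaps instead of being forced into a bucket.
function normalizeIncomeTier(
  rawLabel: string | null | undefined,
  cityCode: string,
  mapping: TierMapping = TIER_MAPPING
): CanonicalTier | null {
  if (!rawLabel) return null;
  const cityTiers = mapping.tiers[cityCode];
  if (!cityTiers) return null;
  const key = rawLabel.trim();
  // Own-property check avoids matching inherited keys such as "constructor".
  return Object.prototype.hasOwnProperty.call(cityTiers, key) ? cityTiers[key] : null;
}

console.log(normalizeIncomeTier('50% AMI', 'SFO'));
```

Returning null for an unmapped label, rather than a default tier, keeps the normalized column honest: the raw label is still stored, and the gap is visible to anyone auditing the mapping.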
Step 2: Build Idempotent City Loaders
Each municipality requires a dedicated extraction script. The loader must map source fields to the canonical structure, handle missing values gracefully, and perform upserts to prevent duplication on re-runs.
// src/loaders/sf-mohcd.loader.ts
import { Pool } from 'pg';
import { normalizeIncomeTier } from '../utils/tier-mapper';

export async function loadSanFranciscoData(pool: Pool, rawRecords: any[]) {
  const client = await pool.connect();
  try {
    await client.query('BEGIN');
    for (const record of rawRecords) {
      // Only build a point when both coordinates are present; otherwise store NULL
      // instead of an invalid "POINT(undefined undefined)" string.
      const hasCoords = record.longitude != null && record.latitude != null;
      const normalized = {
        city_code: 'SFO',
        external_ref: record.project_id,
        project_name: record.project_title,
        geom: hasCoords ? `SRID=4326;POINT(${record.longitude} ${record.latitude})` : null,
        total_units: record.total_units ?? null,
        // Keep unknown tenure as NULL rather than defaulting it to ownership.
        unit_type: record.is_rental == null ? null : record.is_rental ? 'rental' : 'ownership',
        income_band: normalizeIncomeTier(record.ami_bracket, 'SFO'),
        income_band_raw: record.ami_bracket,
        construction_phase: record.project_type === 'ground_up' ? 'new' : 'rehab',
        funding_source: record.funding_program,
        // Pass the date string through (assumes the source delivers ISO 8601 dates).
        // Converting to a JS Date first can shift the day by one when the server's
        // local time zone is behind UTC.
        status_date: record.completion_date || null,
        data_quality_note: 'AMI brackets mapped to 6-tier standard. Original preserved in income_band_raw.'
      };
      // Upsert keyed on (city_code, external_ref) so re-runs update rather than duplicate.
      await client.query(
        `INSERT INTO muni_housing_inventory
           (city_code, external_ref, project_name, geom, total_units, unit_type,
            income_band, income_band_raw, construction_phase, funding_source,
            status_date, data_quality_note)
         VALUES ($1, $2, $3, $4, $5, $6, $7, $8, $9, $10, $11, $12)
         ON CONFLICT (city_code, external_ref)
         DO UPDATE SET
           total_units = EXCLUDED.total_units,
           status_date = EXCLUDED.status_date,
           data_quality_note = EXCLUDED.data_quality_note`,
        [
          normalized.city_code, normalized.external_ref, normalized.project_name,
          normalized.geom, normalized.total_units, normalized.unit_type,
          normalized.income_band, normalized.income_band_raw, normalized.construction_phase,
          normalized.funding_source, normalized.status_date, normalized.data_quality_note
        ]
      );
    }
    await client.query('COMMIT');
  } catch (err) {
    // Any failure rolls back the whole batch.
    await client.query('ROLLBACK');
    throw err;
  } finally {
    client.release();
  }
}
The ON CONFLICT clause ensures idempotency. Re-running the loader updates mutable fields (unit counts, dates) without duplicating records. Transaction wrapping guarantees atomicity across batch inserts.
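Handling missing values gracefully also applies to coordinates, since the loader builds its point string from raw longitude and latitude fields. The helper below is a sketch of one way to validate them before insertion; `toFiniteNumber` and `toPointEwkt` are hypothetical names that are not part of the original loader, and the assumption is that source portals may deliver coordinates as numbers, numeric strings, or blanks.

```typescript
// Hypothetical helpers, not part of the original loader.

// Accepts numbers or numeric strings; rejects blanks, NaN, and other types.
function toFiniteNumber(value: unknown): number | null {
  if (typeof value === 'number') return Number.isFinite(value) ? value : null;
  if (typeof value === 'string' && value.trim() !== '') {
    const parsed = Number(value);
    return Number.isFinite(parsed) ? parsed : null;
  }
  return null;
}

// Builds an EWKT point for a GEOGRAPHY(Point, 4326) column, or returns null
// when either coordinate is missing or outside the valid WGS 84 range.
function toPointEwkt(longitude: unknown, latitude: unknown): string | null {
  const lon = toFiniteNumber(longitude);
  const lat = toFiniteNumber(latitude);
  if (lon === null || lat === null) return null;
  if (lon < -180 || lon > 180 || lat < -90 || lat > 90) return null;
  return `SRID=4326;POINT(${lon} ${lat})`;
}

console.log(toPointEwkt(-122.4194, 37.7749));
```

The range check also catches one common source error: when longitude and latitude are swapped for a location like San Francisco, the latitude lands outside the range of -90 to 90 and the record is stored without a geometry instead of being plotted in the wrong place.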
Step 3: Database-Native Spatial Gap Analysis
Spatial filtering must occur inside PostgreSQL. Pulling geometries to the application layer for radius checks introduces network overhead, bypasses spatial indexes, and fails to scale with dynamic user inputs. PostGIS handles meter-native distance calculations efficiently when combined with conditional aggregation.
-- src/queries/supply_gap_analysis.sql
WITH tract_supply AS (
  SELECT
    t.tract_id,
    t.tract_name,
    t.resident_population,
    t.daytime_population,
    t.rent_burdened_hh,
    -- Sum units (not project rows) within 1,000 m of the tract centroid
    COALESCE(SUM(h.total_units) FILTER (
      WHERE ST_DWithin(t.centroid::geography, h.geom::geography, 1000)
    ), 0) AS affordable_units_1km
  FROM census_tracts t
  LEFT JOIN muni_housing_inventory h
    ON h.city_code = t.city_code
    -- Note: this predicate also drops undated projects from the spatial count
    AND h.status_date IS NOT NULL
  WHERE t.city_code = $1
  GROUP BY t.tract_id, t.tract_name, t.resident_population,
           t.daytime_population, t.rent_burdened_hh, t.centroid
)
SELECT
  tract_id,
  tract_name,
  rent_burdened_hh,
  affordable_units_1km,
  -- NULLIF turns a zero denominator into NULL instead of a division error
  ROUND((rent_burdened_hh::numeric / NULLIF(affordable_units_1km, 0)), 2)
    AS burden_ratio_residential,
  ROUND((rent_burdened_hh::numeric / NULLIF(affordable_units_1km, 0))
        * (resident_population::numeric / NULLIF(daytime_population, 0)), 2)
    AS burden_ratio_daytime_adjusted
FROM tract_supply
ORDER BY burden_ratio_residential DESC NULLS LAST
LIMIT 25;
Architecture Rationale:
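- ST_DWithin with a ::geography cast performs accurate meter-based distance calculations on the spheroid, avoiding planar distortion errors.
- The FILTER clause applies the spatial constraint directly within the aggregate, eliminating subqueries or application-side joins.
- Daytime adjustment multiplies the residential ratio by resident_population / daytime_population, so a tract's ratio shrinks where the daytime population exceeds the overnight count and grows where it falls below it. Reading the two columns side by side surfaces commercial districts where overnight counts alone misstate demand.
- Execution time is reported to remain under 50ms on full metro datasets, with GIST spatial indexes on h.geom and t.centroid (both created in the configuration template). Because a predicate inside FILTER is evaluated per joined row, confirm with EXPLAIN ANALYZE that the planner actually uses the index; if it does not, moving ST_DWithin into the join condition makes it index-eligible.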
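For readers who want to sanity-check the two ratio columns outside the database, the arithmetic in the final SELECT can be mirrored in a few lines. This is a sketch only: the `TractSupply` interface and `burdenRatios` function are hypothetical names, the sample numbers are invented, and `Math.round` is used as a stand-in for SQL's ROUND.

```typescript
// Hypothetical mirror of the query's ratio arithmetic, for spot-checking results.
interface TractSupply {
  rentBurdenedHh: number;
  affordableUnits1km: number;
  residentPopulation: number;
  daytimePopulation: number;
}

interface BurdenRatios {
  residential: number | null;
  daytimeAdjusted: number | null;
}

// Rounds to two decimals, approximating ROUND(x, 2).
function round2(value: number): number {
  return Math.round(value * 100) / 100;
}

function burdenRatios(t: TractSupply): BurdenRatios {
  // Equivalent of NULLIF(affordable_units_1km, 0): no nearby supply gives null.
  if (t.affordableUnits1km === 0) {
    return { residential: null, daytimeAdjusted: null };
  }
  const residential = t.rentBurdenedHh / t.affordableUnits1km;
  // Equivalent of NULLIF(daytime_population, 0).
  const daytimeAdjusted =
    t.daytimePopulation === 0
      ? null
      : residential * (t.residentPopulation / t.daytimePopulation);
  return {
    residential: round2(residential),
    daytimeAdjusted: daytimeAdjusted === null ? null : round2(daytimeAdjusted),
  };
}

// Invented tract: daytime population is four times the resident population.
const downtownTract = burdenRatios({
  rentBurdenedHh: 600,
  affordableUnits1km: 200,
  residentPopulation: 4000,
  daytimePopulation: 16000,
});
console.log(downtownTract); // { residential: 3, daytimeAdjusted: 0.75 }
```

In this invented tract the residential ratio is 3 burdened households per unit, and the daytime factor of 0.25 brings the adjusted figure to 0.75, which shows how strongly the choice of denominator moves the metric.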
Pitfall Guide
1. Application-Layer Spatial Filtering
Explanation: Developers frequently fetch all project geometries and census tracts, then compute distances in Node.js or Python. This approach ignores database spatial indexes, transfers unnecessary payload over the network, and recomputes every tract-project distance each time the radius parameter changes.
Fix: Push all spatial predicates into PostgreSQL using ST_DWithin or ST_Intersects. For radius checks, prefer ST_DWithin over intersecting against an ST_Buffer result, since buffering constructs a new geometry for every row. Use EXPLAIN ANALYZE to verify index usage.
2. Forcing Exact Category Equivalence
Explanation: Mapping five income tiers to three without preserving the source labels loses information irreversibly. Policy auditors cannot validate the normalized outputs, and cross-city comparisons remain legally defensible only when provenance is tracked.
Fix: Always retain a _raw or _source column alongside normalized fields. Document mapping logic in a versioned configuration file, not hardcoded conditionals.
3. Ignoring Temporal Gaps in Historical Records
Explanation: Some municipalities publish project lists without completion dates. Silently dropping these records or imputing dates introduces bias. Time-series charts will show artificial drops or spikes.
Fix: Flag records with missing temporal data in a data_quality_note column. Exclude them from time-based aggregations but retain them for spatial and inventory queries. Expose the limitation in UI footnotes.
4. Residential-Only Population Normalization
Explanation: Normalizing housing supply against overnight population accurately reflects resident burden but fails in central business districts, medical campuses, and transit corridors where daytime workers and visitors create unmet demand.
Fix: Integrate ACS daytime population estimates or local employment density data. Provide toggleable metrics and document the methodological boundary on methodology pages.
5. Silent Data Overwrites
Explanation: Running loaders without conflict handling duplicates records or overwrites historical snapshots. This breaks audit trails and inflates unit counts.
Fix: Use composite unique constraints (city_code, external_ref) with ON CONFLICT DO UPDATE. Log overwrite events to an audit table for compliance tracking.
6. Hiding Normalization Assumptions
Explanation: Presenting normalized data as authoritative without disclosing mapping logic erodes trust. Policy stakeholders will question discrepancies between source portals and aggregated dashboards.
Fix: Implement a data quality footnote system. Each city view should display last-updated timestamps, missing field warnings, and tier-mapping assumptions. Transparency increases adoption more than forced precision.
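The fix for pitfall 3 (flag undated records, keep them in inventory totals, exclude them from time-based aggregations) can be sketched as follows. The `InventoryRow` shape, the function names, and the note text are assumptions made for illustration, and the sample rows are invented.

```typescript
// Hypothetical row shape and helpers illustrating the pitfall 3 fix.
interface InventoryRow {
  externalRef: string;
  totalUnits: number | null;
  statusDate: string | null; // ISO date string, null when the source omits it
  dataQualityNote: string | null;
}

const MISSING_DATE_NOTE = 'Completion date not published by source.';

// Appends a quality note to undated rows instead of dropping or imputing them.
function flagMissingDates(rows: InventoryRow[]): InventoryRow[] {
  return rows.map((row) => {
    if (row.statusDate !== null) return row;
    const note = row.dataQualityNote
      ? `${row.dataQualityNote} ${MISSING_DATE_NOTE}`
      : MISSING_DATE_NOTE;
    return { ...row, dataQualityNote: note };
  });
}

// Time-based aggregation: undated rows are excluded.
function unitsByYear(rows: InventoryRow[]): Map<number, number> {
  const totals = new Map<number, number>();
  for (const row of rows) {
    if (row.statusDate === null) continue;
    const year = Number(row.statusDate.slice(0, 4)); // avoids time zone parsing
    totals.set(year, (totals.get(year) ?? 0) + (row.totalUnits ?? 0));
  }
  return totals;
}

// Inventory aggregation: every row counts, dated or not.
function totalInventory(rows: InventoryRow[]): number {
  return rows.reduce((sum, row) => sum + (row.totalUnits ?? 0), 0);
}

const sampleRows: InventoryRow[] = flagMissingDates([
  { externalRef: 'P-1', totalUnits: 40, statusDate: '2019-05-01', dataQualityNote: null },
  { externalRef: 'P-2', totalUnits: 25, statusDate: '2021-11-15', dataQualityNote: null },
  { externalRef: 'P-3', totalUnits: 60, statusDate: null, dataQualityNote: null },
]);
console.log(unitsByYear(sampleRows), totalInventory(sampleRows));
```

The difference between the yearly totals and the overall inventory is exactly the set of undated units, which is the figure a UI footnote would need to disclose.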
Production Bundle
Action Checklist
- Define canonical schema with nullable fields and source-preservation columns
- Implement composite unique constraints to guarantee idempotent upserts
- Create GIST spatial indexes on all geometry columns before loading data
- Build city-specific loaders with explicit tier-mapping configuration files
- Push all spatial filtering and aggregation into PostgreSQL using FILTER clauses
- Integrate daytime population estimates for commercial district correction
- Document normalization assumptions and data gaps in UI footnotes
- Run EXPLAIN ANALYZE on spatial queries to verify index utilization
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| Single-city dashboard | Application-layer filtering acceptable | Low volume, simpler stack | Low infrastructure cost |
| Multi-city comparison | Database-native PostGIS queries | Scales with radius changes, leverages indexes | Moderate compute, high accuracy |
| Policy audit required | Preserve raw source columns + normalized tiers | Enables traceability and compliance | Low storage overhead |
| Commercial district analysis | Daytime-adjusted per-capita metric | Reflects actual service demand | Requires additional ACS/employment datasets |
| Legacy data with missing dates | Flag and exclude from time-series, retain for spatial | Prevents bias while preserving inventory | Minimal query complexity increase |
Configuration Template
-- Enable PostGIS and create canonical tables
CREATE EXTENSION IF NOT EXISTS postgis;

CREATE TABLE census_tracts (
  tract_id VARCHAR(15) PRIMARY KEY,
  city_code VARCHAR(3) NOT NULL,
  tract_name VARCHAR(128),
  centroid GEOGRAPHY(Point, 4326),
  resident_population INTEGER,
  daytime_population INTEGER,
  rent_burdened_hh INTEGER
);

CREATE TABLE muni_housing_inventory (
  id SERIAL PRIMARY KEY,
  city_code VARCHAR(3) NOT NULL,
  external_ref VARCHAR(64) NOT NULL,
  project_name VARCHAR(255),
  geom GEOGRAPHY(Point, 4326),
  total_units INTEGER,
  unit_type VARCHAR(32),
  income_band VARCHAR(32),
  income_band_raw VARCHAR(64),
  construction_phase VARCHAR(32),
  funding_source VARCHAR(128),
  status_date DATE,
  data_quality_note TEXT,
  UNIQUE(city_code, external_ref)
);

-- Spatial indexes for query performance
CREATE INDEX idx_tracts_centroid ON census_tracts USING GIST (centroid);
CREATE INDEX idx_housing_geom ON muni_housing_inventory USING GIST (geom);
CREATE INDEX idx_housing_city_status ON muni_housing_inventory(city_code, status_date);
Quick Start Guide
- Initialize the database: Run the configuration template SQL against a PostgreSQL instance with the postgis extension enabled.
- Configure city loaders: Copy the tier-mapping JSON schema, populate jurisdiction-specific bucket translations, and set API credentials for each municipal data portal.
- Execute initial load: Run the Node.js loader scripts sequentially. Verify upsert behavior by re-running a loader and confirming zero duplicate inserts.
- Validate spatial queries: Execute the supply gap analysis SQL with a test city_code. Use EXPLAIN ANALYZE to confirm GIST index usage and sub-50ms execution times.
- Deploy metric toggles: Wire the residential and daytime-adjusted ratio columns to your frontend visualization. Add data quality footnotes sourced from the data_quality_note column.
