independent lifecycles.
Step 2: Service Scaffolding with Explicit Contracts
Each service must declare its interface contract before implementation. Use OpenAPI for synchronous endpoints and AsyncAPI for event-driven communication. Contracts live in a shared repository, versioned independently from implementation. This enables parallel development and contract testing.
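As a minimal sketch of what a contract can look like at the code level, the snippet below hand-writes a versioned event contract as a TypeScript type plus a runtime guard. The `OrderCreatedV1` name, its fields, and the `schemaVersion` marker are illustrative assumptions; in practice such types would more likely be generated from the OpenAPI or AsyncAPI document than written by hand.

```typescript
// Hypothetical v1 contract for an `order.created` event (field names are illustrative).
interface OrderCreatedV1 {
  schemaVersion: 1;
  orderId: string;
  amount: number;
  correlationId: string;
  timestamp: string; // ISO 8601
}

// Runtime guard: rejects payloads that do not satisfy the contract at the service boundary.
function isOrderCreatedV1(value: unknown): value is OrderCreatedV1 {
  if (typeof value !== 'object' || value === null) return false;
  const v = value as Record<string, unknown>;
  return (
    v.schemaVersion === 1 &&
    typeof v.orderId === 'string' &&
    typeof v.amount === 'number' &&
    Number.isFinite(v.amount) &&
    typeof v.correlationId === 'string' &&
    typeof v.timestamp === 'string' &&
    !Number.isNaN(Date.parse(v.timestamp))
  );
}
```

Because producer and consumer both compile against the same versioned type, they can be developed in parallel, and a payload that drifts from the contract is rejected at the boundary instead of deep inside business logic.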
Step 3: Data Isolation Strategy
Enforce database-per-service. Shared databases violate boundary independence and create implicit coupling through schema changes. Use schema migrations scoped to each service. For cross-service queries, implement CQRS with materialized views or event-sourced projections. Never query another service's primary datastore directly.
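The cross-service read path can be sketched as an event-sourced projection. The example below is a simplified illustration, not a production design: the event shapes, the `CustomerOrderProjection` class, and the in-memory `Map` (standing in for a table owned by the consuming service) are all assumptions.

```typescript
// Hypothetical events published by the order service (shapes are illustrative).
type OrderEvent =
  | { type: 'order.created'; orderId: string; customerId: string; amount: number }
  | { type: 'order.cancelled'; orderId: string };

interface CustomerOrderSummary {
  customerId: string;
  openOrders: number;
  openAmount: number;
}

// Read model owned by the consuming service; Maps stand in for its own tables.
class CustomerOrderProjection {
  private readonly orders = new Map<string, { customerId: string; amount: number; open: boolean }>();
  private readonly summaries = new Map<string, CustomerOrderSummary>();

  apply(event: OrderEvent): void {
    if (event.type === 'order.created') {
      if (this.orders.has(event.orderId)) return; // duplicate delivery: ignore
      this.orders.set(event.orderId, { customerId: event.customerId, amount: event.amount, open: true });
      const summary = this.summaryFor(event.customerId);
      summary.openOrders += 1;
      summary.openAmount += event.amount;
    } else {
      const order = this.orders.get(event.orderId);
      if (!order || !order.open) return; // unknown order or already cancelled
      order.open = false;
      const summary = this.summaryFor(order.customerId);
      summary.openOrders -= 1;
      summary.openAmount -= order.amount;
    }
  }

  // Queries read the local view, never the order service's datastore.
  get(customerId: string): CustomerOrderSummary | undefined {
    return this.summaries.get(customerId);
  }

  private summaryFor(customerId: string): CustomerOrderSummary {
    let summary = this.summaries.get(customerId);
    if (!summary) {
      summary = { customerId, openOrders: 0, openAmount: 0 };
      this.summaries.set(customerId, summary);
    }
    return summary;
  }
}
```

The view is eventually consistent: it reflects only the events consumed so far, which is the trade-off accepted in exchange for not coupling to another service's schema.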
Step 4: Communication Topology
Use synchronous REST/gRPC for user-facing, low-latency requests within a single bounded context. Use asynchronous messaging (Kafka, RabbitMQ, NATS) for cross-boundary workflows. Implement idempotent consumers, dead-letter queues, and retry policies with exponential backoff. Avoid synchronous cross-service chains that create cascading failure modes.
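The consumer-side policies named above can be illustrated with a small, broker-agnostic sketch. Everything here is an assumption made for illustration: the `Message` shape, the in-memory dedupe set and dead-letter array (stand-ins for durable storage and a real dead-letter queue), and the default limits. To keep the example synchronous, it records the backoff delay it would wait instead of sleeping.

```typescript
// Delay before retry number `attempt` (1-based): base * 2^(attempt - 1), capped.
function backoffDelayMs(attempt: number, baseMs = 200, capMs = 10000): number {
  return Math.min(capMs, baseMs * 2 ** (attempt - 1));
}

interface Message {
  id: string; // unique message ID used for deduplication
  payload: unknown;
}

class IdempotentConsumer {
  private readonly processed = new Set<string>(); // stand-in for a durable dedupe store
  readonly deadLetters: Message[] = [];           // stand-in for a dead-letter queue
  readonly scheduledDelaysMs: number[] = [];      // delays a real consumer would wait

  constructor(
    private readonly handler: (msg: Message) => void,
    private readonly maxAttempts = 3,
  ) {}

  consume(msg: Message): 'processed' | 'duplicate' | 'dead-lettered' {
    if (this.processed.has(msg.id)) return 'duplicate'; // redelivery has no side effects
    for (let attempt = 1; attempt <= this.maxAttempts; attempt++) {
      try {
        this.handler(msg);
        this.processed.add(msg.id);
        return 'processed';
      } catch {
        if (attempt < this.maxAttempts) {
          this.scheduledDelaysMs.push(backoffDelayMs(attempt));
        }
      }
    }
    this.deadLetters.push(msg); // retries exhausted: park the message for inspection
    return 'dead-lettered';
  }
}
```

A production consumer would also add random jitter to the delay so that many consumers retrying at once do not synchronize, and would persist the dedupe record in the same transaction as the handler's side effects.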
Step 5: Observability and Correlation
Integrate OpenTelemetry from day one. Propagate trace context (the W3C traceparent header) and any correlation IDs across HTTP headers and message metadata. Structure logs as JSON with service name, version, and trace ID. Export metrics to Prometheus (with Grafana for dashboards) and traces to Jaeger or Tempo. Observability is not an afterthought; it is the primary debugging surface for distributed systems.
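To make the wire format concrete, here is a small sketch that parses and re-emits a W3C `traceparent` header and writes a structured log line. In a real service the OpenTelemetry SDK's propagators would do this; the function names and log field names below are illustrative assumptions, and the parser only accepts version `00` of the header.

```typescript
import { randomBytes } from 'node:crypto';

interface TraceContext {
  traceId: string;  // 32 lowercase hex characters
  parentId: string; // 16 lowercase hex characters (the caller's span ID)
  sampled: boolean;
}

// Version 00 layout: 00-<trace-id>-<parent-id>-<trace-flags>
const TRACEPARENT = /^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

function parseTraceparent(header: string | undefined): TraceContext | undefined {
  const match = header ? TRACEPARENT.exec(header.trim()) : null;
  if (!match) return undefined;
  const [, traceId, parentId, flags] = match;
  // All-zero IDs are invalid.
  if (/^0+$/.test(traceId) || /^0+$/.test(parentId)) return undefined;
  return { traceId, parentId, sampled: (parseInt(flags, 16) & 1) === 1 };
}

// Header for an outgoing call or message: same trace ID, new span ID.
function childTraceparent(ctx: TraceContext): string {
  const spanId = randomBytes(8).toString('hex');
  return `00-${ctx.traceId}-${spanId}-${ctx.sampled ? '01' : '00'}`;
}

// One JSON object per log line, carrying service metadata and the trace ID.
function logLine(service: string, version: string, ctx: TraceContext, message: string): string {
  return JSON.stringify({
    timestamp: new Date().toISOString(),
    level: 'info',
    service,
    version,
    traceId: ctx.traceId,
    message,
  });
}
```

The same header value can travel as an HTTP header on synchronous calls and as a message header on the broker, so that a trace survives the hop through a queue.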
Step 6: CI/CD and Contract Testing
Automate contract validation using tools like Pact or Schemathesis. Run consumer-driven contract tests in CI before merging. Deploy services independently as containerized artifacts. Use progressive delivery (canary, blue-green) to validate service interactions against production traffic before a full rollout.
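The idea behind a consumer-driven contract test can be shown with a hand-rolled check. This is deliberately not Pact's or Schemathesis's API; the `ConsumerExpectation` type and `verifyContract` function are hypothetical names used only to illustrate the principle that the consumer states what it needs and the provider's build fails when it stops delivering it.

```typescript
type FieldType = 'string' | 'number' | 'boolean';

// What a (hypothetical) consumer needs from the provider's response.
type ConsumerExpectation = Record<string, FieldType>;

// Returns a list of violations; an empty list means the provider satisfies the consumer.
function verifyContract(
  expectation: ConsumerExpectation,
  providerResponse: Record<string, unknown>,
): string[] {
  const violations: string[] = [];
  for (const [field, expected] of Object.entries(expectation)) {
    if (!(field in providerResponse)) {
      violations.push(`missing field: ${field}`);
    } else if (typeof providerResponse[field] !== expected) {
      violations.push(`wrong type for ${field}: expected ${expected}, got ${typeof providerResponse[field]}`);
    }
  }
  // Extra provider fields are allowed, so additive changes never break consumers.
  return violations;
}

// Example: the billing consumer only cares about two fields of the order response.
const billingExpectation: ConsumerExpectation = { id: 'string', amount: 'number' };
const providerSample = { id: 'o-1', amount: 99.99, status: 'PENDING' };
const violations = verifyContract(billingExpectation, providerSample);
```

In CI, the provider pipeline would run every registered consumer expectation against a real or recorded response and fail the merge when `violations` is non-empty.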
TypeScript Implementation Example
The following example sketches a bounded context service (Order Processing) with async event publishing, a repository scoped to its own database, and correlation ID propagation.
// src/contexts/order/order.service.ts
import { Injectable, Logger } from '@nestjs/common';
import { InjectRepository } from '@nestjs/typeorm';
import { Repository } from 'typeorm';
import { Order } from './order.entity';
import { CreateOrderDto } from './dto/create-order.dto';
// Two levels up: this file lives in src/contexts/order, the producer in src/infrastructure.
import { KafkaProducer } from '../../infrastructure/kafka.producer';

@Injectable()
export class OrderService {
  private readonly logger = new Logger(OrderService.name);

  constructor(
    @InjectRepository(Order)
    private readonly orderRepo: Repository<Order>,
    private readonly kafka: KafkaProducer,
  ) {}

  async createOrder(payload: CreateOrderDto, correlationId: string): Promise<Order> {
    const order = this.orderRepo.create({
      ...payload,
      status: 'PENDING',
      createdAt: new Date(),
    });
    const saved = await this.orderRepo.save(order);

    // Publish the domain event with the correlation ID so consumers can join the trace.
    // Note: the save and the publish are not atomic; if the publish fails, the order
    // exists without its event.
    await this.kafka.publish('order.created', {
      orderId: saved.id,
      amount: saved.amount,
      correlationId,
      timestamp: new Date().toISOString(),
    });

    this.logger.log(`Order ${saved.id} created. Trace: ${correlationId}`);
    return saved;
  }
}
// src/infrastructure/kafka.producer.ts
import { Injectable, OnModuleDestroy, OnModuleInit } from '@nestjs/common';
import { Kafka, Producer, logLevel } from 'kafkajs';

@Injectable()
export class KafkaProducer implements OnModuleInit, OnModuleDestroy {
  // Assigned in onModuleInit, hence the definite-assignment assertion.
  private producer!: Producer;

  async onModuleInit(): Promise<void> {
    const kafka = new Kafka({
      brokers: [process.env.KAFKA_BROKER || 'localhost:9092'],
      logLevel: logLevel.WARN,
    });
    this.producer = kafka.producer({
      retry: { retries: 3, initialRetryTime: 200 },
    });
    await this.producer.connect();
  }

  async onModuleDestroy(): Promise<void> {
    // Close the broker connection cleanly on shutdown.
    await this.producer.disconnect();
  }

  async publish(topic: string, message: Record<string, unknown>): Promise<void> {
    await this.producer.send({
      topic,
      messages: [{ value: JSON.stringify(message) }],
    });
  }
}
// src/contexts/order/order.controller.ts
import { randomUUID } from 'node:crypto';
import { Body, Controller, Headers, HttpCode, HttpStatus, Post } from '@nestjs/common';
import { OrderService } from './order.service';
import { CreateOrderDto } from './dto/create-order.dto';

@Controller('orders')
export class OrderController {
  constructor(private readonly orderService: OrderService) {}

  @Post()
  @HttpCode(HttpStatus.CREATED)
  async create(
    @Body() dto: CreateOrderDto,
    @Headers('x-correlation-id') correlationId?: string,
  ) {
    // Reuse the caller's correlation ID when present; otherwise start a new one.
    const traceId = correlationId || randomUUID();
    return this.orderService.createOrder(dto, traceId);
  }
}
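The listing does not show a health endpoint, although the Docker healthcheck in the configuration template probes `GET /health`. The framework-free sketch below shows one way the response could be assembled; `buildHealthReport` and the dependency names are hypothetical, and in a NestJS service a library such as `@nestjs/terminus` would typically provide this instead.

```typescript
type DependencyState = 'up' | 'down';

interface HealthReport {
  httpStatus: 200 | 503;
  body: {
    status: 'ok' | 'unavailable';
    dependencies: Record<string, DependencyState>;
  };
}

// Takes the outcome of each dependency probe (true = reachable) and builds the response.
function buildHealthReport(checks: Record<string, boolean>): HealthReport {
  const dependencies: Record<string, DependencyState> = {};
  let allUp = true;
  for (const [name, healthy] of Object.entries(checks)) {
    dependencies[name] = healthy ? 'up' : 'down';
    if (!healthy) allUp = false;
  }
  return {
    // 503 makes `curl -f` exit non-zero, which marks the container unhealthy.
    httpStatus: allUp ? 200 : 503,
    body: { status: allUp ? 'ok' : 'unavailable', dependencies },
  };
}

// Example: the database responds but the broker does not.
const report = buildHealthReport({ database: true, kafka: false });
```

A controller would run the actual probes (for example a `SELECT 1` and a broker metadata request), pass the results in, and return `report.body` with `report.httpStatus`.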
Architecture Decisions and Rationale
- Database-per-service: Prevents schema coupling, enables independent scaling, and forces explicit data ownership. Trade-off: requires eventual consistency and CQRS for cross-service reads.
- Async-first cross-boundary communication: Reduces blast radius. Synchronous calls should never span multiple bounded contexts unless latency SLAs are strict and retries/circuit breakers are implemented.
- Correlation ID propagation: Enables trace reconstruction across services, queues, and databases. Mandatory for MTTR reduction.
- Contract versioning: Backward-compatible changes are deployed freely. Breaking changes require consumer migration windows and deprecation policies.
- Platform standardization: Shared libraries for logging, metrics, tracing, and retry policies reduce cognitive load and enforce consistency without coupling business logic.
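Where a synchronous cross-context call is unavoidable, the circuit breaker mentioned above is the usual safeguard. The class below is a minimal synchronous sketch with assumed defaults (three failures to open, five seconds before a trial call) and an injectable clock for testing; a real implementation would wrap promises and would normally live in the shared platform library.

```typescript
type BreakerState = 'closed' | 'open' | 'half-open';

class CircuitBreaker {
  private state: BreakerState = 'closed';
  private failures = 0;
  private openedAt = 0;

  constructor(
    private readonly failureThreshold = 3,
    private readonly resetTimeoutMs = 5000,
    private readonly now: () => number = Date.now,
  ) {}

  call<T>(fn: () => T, fallback: () => T): T {
    if (this.state === 'open') {
      // Fail fast while the breaker is open; do not touch the downstream service.
      if (this.now() - this.openedAt < this.resetTimeoutMs) return fallback();
      this.state = 'half-open'; // timeout elapsed: allow one trial call
    }
    try {
      const result = fn();
      this.failures = 0;
      this.state = 'closed';
      return result;
    } catch {
      this.failures += 1;
      if (this.state === 'half-open' || this.failures >= this.failureThreshold) {
        this.state = 'open';
        this.openedAt = this.now();
      }
      return fallback();
    }
  }

  getState(): BreakerState {
    return this.state;
  }
}
```

Failing fast while the breaker is open is what limits the blast radius: callers get the fallback immediately instead of holding threads or connections while a struggling dependency times out.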
Pitfall Guide
- Splitting by technology layer instead of domain
  - Mistake: Creating "Auth Service", "Email Service", "Payment Service" that share data models and require synchronous coordination.
  - Impact: Distributed monolith with network latency, partial failures, and no deployment independence.
  - Fix: Align services with business capabilities and bounded contexts. Group data and behavior that change together.
- Distributed transactions without sagas or orchestration
  - Mistake: Using 2PC or synchronous REST chains to maintain consistency across services.
  - Impact: Cascading failures, lock contention, and degraded availability.
  - Fix: Implement choreography (event-driven) or orchestration (workflow engine) sagas. Accept eventual consistency and design compensating actions.
- Ignoring schema evolution and contract versioning
  - Mistake: Modifying event payloads or API responses without backward compatibility.
  - Impact: Consumer crashes, data loss, and deployment rollbacks.
  - Fix: Use schema registries (Avro/Protobuf), additive-only changes, and explicit versioning. Deprecate fields before removal.
- Over-relying on synchronous REST for everything
  - Mistake: Building request-response chains that span 5+ services for a single user action.
  - Impact: Latency multiplication, timeout storms, and fragile dependency graphs.
  - Fix: Reserve sync for within-boundary calls. Use async messaging, command/query separation, and materialized views for cross-boundary data needs.
- Neglecting observability as a first-class concern
  - Mistake: Adding logging and tracing after deployment. Missing correlation IDs. Unstructured logs.
  - Impact: MTTR exceeds 2 hours. Debugging requires log grepping and guesswork.
  - Fix: Integrate OpenTelemetry at scaffolding. Enforce JSON logs, trace propagation, and service metadata. Treat observability as non-negotiable.
- Treating services as isolated codebases without platform engineering
  - Mistake: Each team reinvents deployment scripts, health checks, retry policies, and configuration management.
  - Impact: Inconsistent SLAs, security gaps, and operational debt.
  - Fix: Provide internal developer platforms (IDP), shared SDKs, standardized CI/CD templates, and service mesh or sidecar patterns for cross-cutting concerns.
- Underestimating network reliability and partial failure modes
  - Mistake: Assuming services are always reachable. No circuit breakers, no fallbacks, no idempotency.
  - Impact: Thread pool exhaustion, duplicate processing, and data corruption during outages.
  - Fix: Implement circuit breakers, bulkheads, idempotent consumers, and explicit timeout/retry policies. Design for partial failure.
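The saga fix from the distributed-transactions pitfall can be sketched as a tiny orchestrator that runs steps in order and, on failure, undoes the completed ones in reverse. The `SagaStep` shape, `runSaga` function, and step names are illustrative assumptions; real steps would be asynchronous calls or messages, and compensations would need to be idempotent and retried.

```typescript
interface SagaStep {
  name: string;
  action: () => void;     // forward step, e.g. reserve inventory
  compensate: () => void; // undo for a completed step, e.g. release inventory
}

interface SagaResult {
  ok: boolean;
  failedStep?: string;
  log: string[];
}

function runSaga(steps: SagaStep[]): SagaResult {
  const completed: SagaStep[] = [];
  const log: string[] = [];
  for (const step of steps) {
    try {
      step.action();
      completed.push(step);
      log.push(`done:${step.name}`);
    } catch {
      log.push(`failed:${step.name}`);
      // Compensate only the steps that completed, most recent first.
      for (const done of [...completed].reverse()) {
        done.compensate();
        log.push(`compensated:${done.name}`);
      }
      return { ok: false, failedStep: step.name, log };
    }
  }
  return { ok: true, log };
}
```

Unlike 2PC, no locks are held across services: each step commits locally, and consistency is restored by compensation rather than by rollback.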
Production Bundle
Action Checklist
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| Startup MVP / Rapid Validation | Modular Monolith | Faster iteration, single deployment surface, lower infra overhead | Low (baseline) |
| Regulated Enterprise / High Compliance | Well-Bounded Microservices + Service Mesh | Strict audit trails, independent scaling, mesh enforces mTLS/policies | Medium-High (+20-30%) |
| High-Scale E-Commerce / Event-Driven | Microservices + Kafka/NATS + CQRS | Handles traffic spikes, async workflows, read/write separation | High (+35-50%) |
| Legacy Modernization | Strangler Fig Pattern + API Gateway | Gradual migration, risk containment, preserves existing SLAs | Medium (+15-25%) |
Configuration Template
# docker-compose.yml
version: '3.8'
services:
  order-service:
    build: ./services/order
    ports:
      - "3000:3000"
    environment:
      - DATABASE_URL=postgres://user:pass@db:5432/orders
      # Containers reach the broker on its internal listener.
      - KAFKA_BROKER=kafka:29092
      - NODE_ENV=production
    depends_on:
      db:
        condition: service_healthy
      kafka:
        condition: service_started
    healthcheck:
      # Requires curl to be present in the service image.
      test: ["CMD", "curl", "-f", "http://localhost:3000/health"]
      interval: 10s
      timeout: 5s
      retries: 3
  db:
    image: postgres:15-alpine
    ports:
      # Published so a service running on the host (see .env.example) can connect.
      - "5432:5432"
    environment:
      POSTGRES_USER: user
      POSTGRES_PASSWORD: pass
      POSTGRES_DB: orders
    volumes:
      - pgdata:/var/lib/postgresql/data
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U user -d orders"]
      interval: 5s
      timeout: 3s
      retries: 5
  kafka:
    image: confluentinc/cp-kafka:7.5.0
    ports:
      - "9092:9092"
    environment:
      KAFKA_BROKER_ID: 1
      KAFKA_ZOOKEEPER_CONNECT: zookeeper:2181
      # Two listeners: kafka:29092 for other containers, localhost:9092 for clients on the host.
      KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: PLAINTEXT:PLAINTEXT,PLAINTEXT_HOST:PLAINTEXT
      KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://kafka:29092,PLAINTEXT_HOST://localhost:9092
      KAFKA_INTER_BROKER_LISTENER_NAME: PLAINTEXT
      KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 1
    depends_on:
      - zookeeper
  zookeeper:
    image: confluentinc/cp-zookeeper:7.5.0
    environment:
      ZOOKEEPER_CLIENT_PORT: 2181
volumes:
  pgdata:
# .env.example
DATABASE_URL=postgres://user:pass@localhost:5432/orders
KAFKA_BROKER=localhost:9092
NODE_ENV=development
OTEL_SERVICE_NAME=order-service
OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4317
LOG_LEVEL=info
Quick Start Guide
- Clone the service repository and copy `.env.example` to `.env`. Update `DATABASE_URL` and `KAFKA_BROKER` to match your environment.
- Run `docker compose up -d db kafka zookeeper` to provision PostgreSQL, Kafka, and Zookeeper without starting the containerized order service, which would otherwise occupy port 3000. Verify health with `docker compose ps`.
- Install dependencies: `npm ci`. Apply database migrations: `npm run migration:run`.
- Start the service: `npm run start:dev`. Verify with `curl http://localhost:3000/health`, then submit a test order via `curl -X POST http://localhost:3000/orders -H "Content-Type: application/json" -H "x-correlation-id: $(uuidgen)" -d '{"customerId":"c1","amount":99.99}'`.
- Open your tracing dashboard (Tempo/Jaeger) and verify the correlation ID propagates through the HTTP request and Kafka message. Confirm JSON logs contain `traceId` and `service.name`.