attempts through a secondary processor. This mitigates provider-specific outages.
- Card Updater Integration: Enable card updater programs (e.g., Visa Account Updater, Mastercard Automatic Billing Updater) to refresh expired card details without customer intervention. This addresses the 25% of failures attributed to card expiration.
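As a rough illustration of the fallback idea, the sketch below decides which processor should handle the next retry. The failure categories, the `chooseProcessor` function, and the threshold of two failures are all hypothetical names and values chosen for this example, not part of any provider's API.

```typescript
// Hypothetical failure categories; real providers expose richer error data.
type FailureKind = 'processor_error' | 'issuer_decline';
type ProcessorName = 'primary' | 'secondary';

interface RoutingInput {
  failureKind: FailureKind;
  primaryFailures: number; // consecutive failures seen on the primary processor
}

// Assumed threshold for this sketch; tune it against your own data.
const MAX_PRIMARY_FAILURES = 2;

function chooseProcessor(input: RoutingInput): ProcessorName {
  // Processor-side errors (timeouts, 5xx responses) suggest an outage,
  // so the next attempt goes straight to the secondary processor.
  if (input.failureKind === 'processor_error') {
    return 'secondary';
  }
  // Issuer declines stay on the primary until the threshold is reached.
  return input.primaryFailures >= MAX_PRIMARY_FAILURES ? 'secondary' : 'primary';
}
```

Keeping the decision in a pure function like this makes the routing rule easy to unit test separately from the code that actually submits the charge.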
2. Implementation: TypeScript Recovery Service
The following example demonstrates a robust webhook handler and retry scheduler. This implementation uses a strategy pattern for retries, includes signature verification, and handles idempotency checks.
```typescript
import Stripe from 'stripe';
import { Request, Response } from 'express';
import { Logger } from './logger';
import { EmailService } from './email-service';
import { SubscriptionManager } from './subscription-manager';

// Configuration for recovery attempts
interface RecoveryStep {
  // Delay measured from the moment the failure event is processed
  delayHours: number;
  action: 'retry' | 'notify_and_retry' | 'escalate';
  emailTemplateId?: string;
}

const RECOVERY_SCHEDULE: RecoveryStep[] = [
  { delayHours: 0, action: 'retry' },
  { delayHours: 72, action: 'retry' },
  { delayHours: 168, action: 'notify_and_retry', emailTemplateId: 'payment_failed_day7' },
  { delayHours: 360, action: 'escalate', emailTemplateId: 'payment_failed_day15' },
];

export class PaymentRecoveryController {
  constructor(
    private stripe: Stripe,
    private emailService: EmailService,
    private subManager: SubscriptionManager,
    private logger: Logger
  ) {}

  async handleWebhook(req: Request, res: Response): Promise<void> {
    const sig = req.headers['stripe-signature'];
    const webhookSecret = process.env.STRIPE_WEBHOOK_SECRET;

    if (!webhookSecret) {
      // Server-side misconfiguration
      res.status(500).json({ error: 'Missing configuration' });
      return;
    }
    if (!sig) {
      // The caller did not send a signature header
      res.status(400).json({ error: 'Missing signature' });
      return;
    }

    let event: Stripe.Event;
    try {
      // req.body must be the raw, unparsed request body
      // (e.g. mount express.raw({ type: 'application/json' }) on this route).
      event = this.stripe.webhooks.constructEvent(req.body, sig, webhookSecret);
    } catch (err) {
      const message = err instanceof Error ? err.message : String(err);
      this.logger.error(`Webhook signature verification failed: ${message}`);
      res.status(400).json({ error: 'Invalid signature' });
      return;
    }

    if (event.type === 'invoice.payment_failed') {
      const invoice = event.data.object as Stripe.Invoice;
      await this.processRecovery(invoice);
    }

    res.json({ received: true });
  }

  private async processRecovery(invoice: Stripe.Invoice): Promise<void> {
    const subscriptionId = invoice.subscription as string;
    const attemptCount = invoice.attempt_count;
    const customerId = invoice.customer as string;

    // Idempotency check: ensure we haven't already processed this attempt.
    // This only works if SubscriptionManager records each processed attempt
    // as lastProcessedAttempt; that bookkeeping is not shown here.
    const currentStatus = await this.subManager.getRecoveryStatus(subscriptionId);
    if (currentStatus.lastProcessedAttempt >= attemptCount) {
      this.logger.info(`Duplicate event ignored for ${subscriptionId}`);
      return;
    }

    // attempt_count is 1-based, the schedule array is 0-based
    const strategy = RECOVERY_SCHEDULE[attemptCount - 1];
    if (!strategy) {
      // Exhausted recovery attempts
      await this.subManager.suspendSubscription(subscriptionId);
      await this.emailService.send(customerId, 'subscription_suspended');
      this.logger.warn(`Recovery exhausted for ${subscriptionId}. Subscription suspended.`);
      return;
    }

    // Schedule the next action
    const retryTimestamp = Date.now() + strategy.delayHours * 60 * 60 * 1000;
    await this.subManager.scheduleRecoveryAction({
      subscriptionId,
      customerId,
      scheduledAt: retryTimestamp,
      action: strategy.action,
      emailTemplateId: strategy.emailTemplateId,
      attemptCount: attemptCount + 1,
    });

    this.logger.info(`Recovery scheduled for ${subscriptionId} at ${new Date(retryTimestamp).toISOString()}`);
  }
}
```
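To make the signature check less of a black box, here is a minimal sketch of the scheme that Stripe documents for webhook signatures: an HMAC-SHA256 over `<timestamp>.<raw body>`, compared against the `v1` values in the `stripe-signature` header. The function name `verifyStripeSignature` and the five-minute tolerance are assumptions for illustration; in production, prefer the library's `constructEvent`.

```typescript
import { createHmac, timingSafeEqual } from 'node:crypto';

// Assumed tolerance for this sketch: reject events older or newer than 5 minutes.
const TOLERANCE_SECONDS = 300;

function verifyStripeSignature(
  rawBody: string,
  signatureHeader: string,
  secret: string,
  nowSeconds: number = Math.floor(Date.now() / 1000)
): boolean {
  // Header format: "t=<unix timestamp>,v1=<hex signature>[,v1=<hex signature>]"
  let timestamp = '';
  const signatures: string[] = [];
  for (const part of signatureHeader.split(',')) {
    const idx = part.indexOf('=');
    if (idx === -1) continue;
    const key = part.slice(0, idx).trim();
    const value = part.slice(idx + 1).trim();
    if (key === 't') timestamp = value;
    if (key === 'v1') signatures.push(value);
  }
  const ts = Number(timestamp);
  if (timestamp === '' || !Number.isFinite(ts) || signatures.length === 0) return false;

  // A timestamp outside the tolerance window is treated as a possible replay.
  if (Math.abs(nowSeconds - ts) > TOLERANCE_SECONDS) return false;

  // The signed payload is the timestamp, a dot, and the exact raw body.
  const expected = Buffer.from(
    createHmac('sha256', secret).update(`${timestamp}.${rawBody}`, 'utf8').digest('hex'),
    'utf8'
  );

  // Constant-time comparison against every v1 signature in the header.
  return signatures.some((candidate) => {
    const given = Buffer.from(candidate, 'utf8');
    return given.length === expected.length && timingSafeEqual(given, expected);
  });
}
```

Because the HMAC covers the exact bytes of the body, any middleware that parses and re-serializes JSON before verification will cause valid events to fail the check.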
3. Rationale
- Strategy Array: The `RECOVERY_SCHEDULE` array maps attempt counts to specific actions. This makes the dunning logic configurable and testable without modifying core business logic.
- Idempotency Guard: The check against `lastProcessedAttempt` prevents duplicate processing if webhooks are retried by the provider, a common occurrence in distributed systems.
- Decoupled Scheduling: The `scheduleRecoveryAction` method delegates to a job queue (e.g., BullMQ, AWS SQS). This ensures retries happen asynchronously and can survive application restarts.
- Action Types: Distinguishing between `retry`, `notify_and_retry`, and `escalate` allows for nuanced handling. Early attempts retry silently; later attempts engage the customer.
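The worker side of the queue is not shown above. One possible shape, sketched here with hypothetical names (`ScheduledRecovery`, `takeDueJobs`, `planOperations`), is to select the jobs that are due and translate each action type into concrete operations. The meaning given to `escalate` (send an email and flag the account for manual review) is an assumption for this example.

```typescript
type RecoveryAction = 'retry' | 'notify_and_retry' | 'escalate';

// Hypothetical shape of a queued job, mirroring the fields scheduled above.
interface ScheduledRecovery {
  subscriptionId: string;
  scheduledAt: number; // epoch milliseconds
  action: RecoveryAction;
  emailTemplateId?: string;
}

type Operation =
  | { kind: 'charge'; subscriptionId: string }
  | { kind: 'email'; subscriptionId: string; templateId: string }
  | { kind: 'flag_for_review'; subscriptionId: string };

// Returns the jobs whose scheduled time has passed, oldest first.
function takeDueJobs(queue: ScheduledRecovery[], now: number): ScheduledRecovery[] {
  return queue
    .filter((job) => job.scheduledAt <= now)
    .sort((a, b) => a.scheduledAt - b.scheduledAt);
}

// Maps one job to the operations a worker would carry out, in order.
function planOperations(job: ScheduledRecovery): Operation[] {
  const { subscriptionId, emailTemplateId } = job;
  const email: Operation[] = emailTemplateId
    ? [{ kind: 'email', subscriptionId, templateId: emailTemplateId }]
    : [];
  switch (job.action) {
    case 'retry':
      return [{ kind: 'charge', subscriptionId }]; // silent retry
    case 'notify_and_retry':
      return [...email, { kind: 'charge', subscriptionId }];
    case 'escalate':
      // Assumption: escalation means contacting the customer and a human review.
      return [...email, { kind: 'flag_for_review', subscriptionId }];
  }
}
```

With a real queue such as BullMQ or SQS, the delay handling in `takeDueJobs` would be replaced by the queue's own delayed-delivery feature, and only the action dispatch would remain in application code.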
Pitfall Guide
Production dunning systems often fail due to subtle implementation errors. The following pitfalls are derived from real-world deployment experience.
| Pitfall | Explanation | Mitigation Strategy |
|---|---|---|
| Aggressive Retry Loops | Retrying immediately and repeatedly triggers bank fraud detection algorithms, causing valid cards to be blocked. | Implement exponential backoff. Respect the `delayHours` in the schedule. Never retry more than once per day. |
| Ignoring Card Updaters | Failing to enable card updater services results in unnecessary retries on expired cards, wasting API calls and angering customers. | Enable updater programs in the payment provider dashboard. Monitor the provider's card-update webhook (on Stripe, `payment_method.automatically_updated`) to track automatic fixes. |
| Webhook Signature Neglect | Processing webhooks without verifying signatures exposes the system to replay attacks and spoofed events. | Always verify the `stripe-signature` header using the webhook secret. Reject unverified events immediately. |
| The "Lost/Stolen" Trap | Continuously retrying cards reported as lost or stolen generates decline fees and provides no recovery value. | Check the `failure_code` or `failure_message` (on Stripe, decline codes such as `lost_card` and `stolen_card`). If the card is reported lost/stolen, bypass retries and force a credential update flow. |
| Tone-Deaf Communication | Emails that blame the customer or use aggressive language increase churn probability even if payment succeeds. | Use neutral, helpful language. Frame failures as temporary issues. Provide clear, one-click update links. |
| Metric Blindness | Deploying a recovery system without tracking outcomes prevents optimization. | Implement dashboards for Recovery Rate, Recovery Revenue, and Time-to-Recovery. Review weekly. |
| Suspension Timing Errors | Suspending access too early loses the customer; suspending too late increases risk of bad debt. | Define a clear grace period. Suspend access only after the final escalation attempt fails. |
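Two of the mitigations in the table, skipping retries on lost or stolen cards and capping retries at one per day, can be combined into a single guard. The sketch below assumes the decline code arrives as a plain string; `lost_card`, `stolen_card`, `pickup_card`, and `restricted_card` are Stripe decline codes, while the `decideRetry` function and its return values are hypothetical.

```typescript
// Decline codes treated as permanent in this sketch; extend as needed.
const HARD_DECLINE_CODES = new Set<string>([
  'lost_card',
  'stolen_card',
  'pickup_card',
  'restricted_card',
]);

// At most one retry per 24 hours, per the table above.
const MIN_RETRY_INTERVAL_MS = 24 * 60 * 60 * 1000;

type RetryDecision = 'retry' | 'wait' | 'request_new_card';

function decideRetry(
  declineCode: string | null,
  lastAttemptAt: number, // epoch milliseconds of the previous charge attempt
  now: number
): RetryDecision {
  // Permanent declines never recover by retrying; ask for a new card instead.
  if (declineCode !== null && HARD_DECLINE_CODES.has(declineCode)) {
    return 'request_new_card';
  }
  // Too soon since the last attempt: hold off to avoid tripping fraud checks.
  if (now - lastAttemptAt < MIN_RETRY_INTERVAL_MS) {
    return 'wait';
  }
  return 'retry';
}
```

Running this guard in the worker immediately before each charge keeps the rule enforced even if the schedule configuration is later edited carelessly.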
Production Bundle
Action Checklist
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| Early-Stage Startup | Use Payment Provider's Built-in Dunning | Low engineering overhead; sufficient for low volume. | Low (included in base fees). |
| High-Volume SaaS | Custom Recovery Engine | Granular control over scheduling, routing, and metrics; optimizes recovery rate. | Medium (engineering time + queue infrastructure). |
| Enterprise/Regulated | Hybrid with Manual Review | Automated retries for common failures; manual review for high-value accounts or complex fraud blocks. | High (operational overhead). |
| Global Customer Base | Multi-Processor Routing | Mitigates regional processor outages; improves authorization rates via local acquirers. | Medium (additional processor fees). |
Configuration Template
Use this JSON structure to externalize your recovery strategy, allowing updates without code deployments.
```json
{
  "recovery": {
    "enabled": true,
    "schedule": [
      {
        "attempt": 1,
        "delay_hours": 0,
        "action": "retry",
        "notify_customer": false
      },
      {
        "attempt": 2,
        "delay_hours": 72,
        "action": "retry",
        "notify_customer": false
      },
      {
        "attempt": 3,
        "delay_hours": 168,
        "action": "retry",
        "notify_customer": true,
        "email_template": "payment_failed_day7"
      },
      {
        "attempt": 4,
        "delay_hours": 360,
        "action": "retry",
        "notify_customer": true,
        "email_template": "payment_failed_day15"
      }
    ],
    "suspension": {
      "trigger_after_attempt": 4,
      "grace_period_hours": 24
    },
    "card_updater": {
      "enabled": true,
      "auto_retry_on_update": true
    }
  }
}
```
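Since an externalized configuration can be edited without a code review, it may be worth validating it at load time. The following sketch checks a few plausible invariants; the `validateRecoveryConfig` function and the rules themselves (consecutive attempt numbers, non-decreasing delays, a template whenever the customer is notified) are assumptions for illustration rather than requirements stated by the template.

```typescript
interface ScheduleEntry {
  attempt: number;
  delay_hours: number;
  action: string;
  notify_customer: boolean;
  email_template?: string;
}

interface RecoveryConfig {
  enabled: boolean;
  schedule: ScheduleEntry[];
  suspension?: { trigger_after_attempt: number; grace_period_hours: number };
}

// Returns a list of human-readable problems; an empty list means the config passed.
function validateRecoveryConfig(json: string): string[] {
  const parsed = JSON.parse(json) as { recovery?: RecoveryConfig };
  const config = parsed.recovery;
  if (!config || !Array.isArray(config.schedule) || config.schedule.length === 0) {
    return ['recovery.schedule must be a non-empty array'];
  }

  const problems: string[] = [];
  config.schedule.forEach((entry, index) => {
    if (entry.attempt !== index + 1) {
      problems.push(`schedule[${index}]: attempt should be ${index + 1}`);
    }
    if (typeof entry.delay_hours !== 'number' || entry.delay_hours < 0) {
      problems.push(`schedule[${index}]: delay_hours must be a non-negative number`);
    }
    const previous = index > 0 ? config.schedule[index - 1] : undefined;
    if (previous && entry.delay_hours < previous.delay_hours) {
      problems.push(`schedule[${index}]: delay_hours should not decrease`);
    }
    if (entry.notify_customer && !entry.email_template) {
      problems.push(`schedule[${index}]: notify_customer requires email_template`);
    }
  });

  // The suspension trigger must refer to an attempt that exists in the schedule.
  const trigger = config.suspension?.trigger_after_attempt;
  if (typeof trigger !== 'number' || trigger < 1 || trigger > config.schedule.length) {
    problems.push('suspension.trigger_after_attempt must reference a scheduled attempt');
  }
  return problems;
}
```

Failing fast on a non-empty result at startup is usually safer than discovering a malformed schedule when the first payment fails.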
Quick Start Guide
- Enable Updaters: Log into your payment provider dashboard and enable the Card Updater program. This requires no code changes but immediately improves recovery on expired cards.
- Deploy Webhook Handler: Implement the webhook handler code provided in the Core Solution. Ensure signature verification is active.
- Configure Queue: Set up a background job processor to handle the `scheduleRecoveryAction` calls. Verify that jobs execute at the correct delays.
- Verify Metrics: Confirm that your system logs recovery events. Check that `invoice.payment_succeeded` events are correlated with previous failures to calculate recovery rates.
- Iterate: After two weeks, review the Recovery Rate. If below 15%, analyze failure codes and adjust the schedule or email templates. Consider adding a secondary payment processor if decline rates remain high.
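The correlation step in the metrics check can be sketched as a small pure function over logged events. The `InvoiceEvent` shape and `computeRecoveryStats` name are hypothetical; the definition used here, counting an invoice as recovered when a success follows at least one failure for the same invoice, is one reasonable reading of Recovery Rate rather than a standard formula.

```typescript
// Hypothetical log record; adapt the fields to whatever your system stores.
interface InvoiceEvent {
  invoiceId: string;
  type: 'invoice.payment_failed' | 'invoice.payment_succeeded';
  amountCents: number;
  occurredAt: number; // epoch milliseconds
}

interface RecoveryStats {
  failedInvoices: number;
  recoveredInvoices: number;
  recoveredCents: number;
  recoveryRate: number; // recoveredInvoices / failedInvoices, 0 when nothing failed
}

function computeRecoveryStats(invoiceEvents: InvoiceEvent[]): RecoveryStats {
  // Process events chronologically so a success only counts after a failure.
  const ordered = [...invoiceEvents].sort((a, b) => a.occurredAt - b.occurredAt);
  const failed = new Set<string>();
  const recovered = new Set<string>();
  let recoveredCents = 0;

  for (const entry of ordered) {
    if (entry.type === 'invoice.payment_failed') {
      failed.add(entry.invoiceId);
    } else if (failed.has(entry.invoiceId) && !recovered.has(entry.invoiceId)) {
      // First success after a failure: this invoice was recovered.
      recovered.add(entry.invoiceId);
      recoveredCents += entry.amountCents;
    }
  }

  return {
    failedInvoices: failed.size,
    recoveredInvoices: recovered.size,
    recoveredCents,
    recoveryRate: failed.size === 0 ? 0 : recovered.size / failed.size,
  };
}
```

Invoices that succeed on the first attempt never enter the `failed` set, so they do not inflate the rate.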