Deduplication & Meta Mapping
Duplicates are inevitable in recurring syncs. Instead of checking post titles (which can change), we anchor records to a stable external identifier stored in post meta. Before inserting, we query for an existing post ID matching that identifier. If found, we update; if not, we create. This idempotent pattern ensures safe re-runs without orphaned records.
WordPress performs heavy background work during post creation: term counting, comment counting, and object cache population. During bulk operations, these features degrade performance. We will suspend non-essential WordPress hooks during the import loop and flush targeted caches upon completion.
Implementation Code
<?php
/**
* Bulk Data Sync Engine
* Streams CSV/Excel exports into WordPress Custom Post Types via WP-CLI.
* Designed for high-volume, idempotent synchronization.
*/
namespace Codcompass\Importer;
use WP_CLI;
use WP_Error;
class InventorySync {
private string $post_type;
private string $meta_key;
private array $column_map;
private int $batch_size;
private int $processed = 0;
private int $updated = 0;
private int $skipped = 0;
public function __construct( string $post_type, string $meta_key, array $column_map, int $batch_size = 500 ) {
$this->post_type = $post_type;
$this->meta_key = $meta_key;
$this->column_map = $column_map;
$this->batch_size = $batch_size;
}
public function execute( string $file_path ): void {
if ( ! file_exists( $file_path ) ) {
WP_CLI::error( "Source file not found: {$file_path}" );
return;
}
WP_CLI::log( 'Starting inventory sync. Suspending background WordPress processes...' );
$this->optimize_wp_environment();
$file_handle = new \SplFileObject( $file_path, 'r' );
$file_handle->setFlags( \SplFileObject::READ_CSV | \SplFileObject::SKIP_EMPTY );
$file_handle->setCsvControl( ',', '"', '\\' );
// Skip header row
$file_handle->current();
$file_handle->next();
WP_CLI::log( "Processing feed. Batch size: {$this->batch_size}" );
while ( ! $file_handle->eof() ) {
$row = $file_handle->current();
if ( ! is_array( $row ) || count( $row ) < 3 ) {
$file_handle->next();
continue;
}
$this->process_row( $row );
$this->processed++;
if ( $this->processed % $this->batch_size === 0 ) {
WP_CLI::log( "Checkpoint: {$this->processed} rows processed." );
$this->optimize_wp_environment(); // Re-apply optimizations after batch
}
$file_handle->next();
}
$this->finalize_sync();
}
private function process_row( array $raw_data ): void {
$external_id = trim( $raw_data[ $this->column_map['id'] ] ?? '' );
$title = trim( $raw_data[ $this->column_map['title'] ] ?? '' );
if ( empty( $external_id ) || empty( $title ) ) {
$this->skipped++;
return;
}
$existing = $this->locate_existing_record( $external_id );
$post_id = $existing ? $existing : $this->create_record( $title );
if ( $post_id && ! is_wp_error( $post_id ) ) {
$this->sync_metadata( $post_id, $raw_data, $external_id );
$existing ? $this->updated++ : null;
}
}
private function locate_existing_record( string $identifier ): ?int {
$results = \get_posts( [
'post_type' => $this->post_type,
'meta_key' => $this->meta_key,
'meta_value' => $identifier,
'fields' => 'ids',
'posts_per_page' => 1,
'no_found_rows' => true,
] );
return ! empty( $results ) ? (int) $results[0] : null;
}
private function create_record( string $title ): int {
return \wp_insert_post( [
'post_title' => \sanitize_text_field( $title ),
'post_status' => 'publish',
'post_type' => $this->post_type,
] );
}
private function sync_metadata( int $post_id, array $raw_data, string $external_id ): void {
\update_post_meta( $post_id, $this->meta_key, $external_id );
$price = $raw_data[ $this->column_map['price'] ] ?? '';
$region = $raw_data[ $this->column_map['region'] ] ?? '';
\update_post_meta( $post_id, 'listing_price', \sanitize_text_field( $price ) );
\update_post_meta( $post_id, 'listing_region', \sanitize_text_field( $region ) );
}
private function optimize_wp_environment(): void {
\wp_suspend_cache_addition( true );
\wp_defer_term_counting( true );
\wp_defer_comment_counting( true );
}
private function finalize_sync(): void {
\wp_suspend_cache_addition( false );
\wp_defer_term_counting( false );
\wp_defer_comment_counting( false );
\wp_cache_flush();
WP_CLI::success( sprintf(
'Sync complete. Processed: %d | Updated: %d | Skipped: %d',
$this->processed,
$this->updated,
$this->skipped
) );
}
}
Architecture Rationale
SplFileObject over fopen: Provides iterator semantics, built-in CSV flag handling, and automatic line boundary detection. Reduces boilerplate and improves error resilience.
- Idempotent Upsert Pattern: Checking via
meta_key/meta_value with fields => 'ids' minimizes query overhead. Returning only IDs avoids loading full post objects into memory.
- Batch Checkpoints: Processing in configurable batches allows memory garbage collection and provides progress visibility. Re-applying
wp_suspend_cache_addition() after each batch prevents WordPress from re-enabling background hooks mid-stream.
- Cache Suspension:
wp_suspend_cache_addition() prevents the object cache from growing unbounded during bulk writes. Flushing at the end ensures subsequent frontend requests read fresh data.
Pitfall Guide
1. Eager File Loading
Explanation: Using file() or file_get_contents() loads the entire dataset into PHP memory. A 50MB CSV can easily consume 200MB+ of RAM due to string duplication and array overhead.
Fix: Always use streaming iterators (SplFileObject, fgetcsv, or generators). Process one row at a time and unset variables after use.
2. N+1 Duplicate Queries
Explanation: Running a full WP_Query with default parameters for every row triggers expensive COUNT() queries and joins. At scale, this dominates execution time.
Fix: Use fields => 'ids', no_found_rows => true, and posts_per_page => 1. For datasets under 10k rows, pre-load all existing identifiers into a PHP array and check in_array() instead of querying the database per row.
3. Silent Validation Failures
Explanation: Raw CSV data often contains hidden characters, mismatched delimiters, or type mismatches. Writing unvalidated data corrupts post meta and breaks frontend rendering.
Fix: Implement strict column mapping with fallback defaults. Apply sanitize_text_field(), absint(), or floatval() based on expected types. Log malformed rows to a separate error file instead of skipping silently.
4. Object Cache Bloat
Explanation: WordPress caches every update_post_meta() call. During bulk imports, the object cache grows linearly, eventually exhausting available memory or triggering cache eviction thrashing.
Fix: Wrap the import loop in wp_suspend_cache_addition( true ). Re-enable it after completion and call wp_cache_flush() or targeted cache clears for the affected post type.
5. Ignoring Execution Time Limits
Explanation: Even in CLI environments, max_execution_time may be enforced by PHP-FPM or hosting control panels. Long imports will terminate abruptly without cleanup.
Fix: Call set_time_limit( 0 ) at the start of the command. Verify server php.ini settings. Implement checkpoint logging so interrupted runs can resume from the last processed row.
Explanation: WordPress postmeta values are stored as LONGTEXT, but large JSON payloads or serialized arrays degrade query performance and exceed practical limits for indexing.
Fix: Keep meta values under 8,000 characters. For complex nested data, normalize into separate custom tables or use a dedicated headless CMS/database. Avoid storing raw API responses in post meta.
7. Cache Inertia on Frontend
Explanation: After a bulk sync, frontend requests may still serve stale data from page caches, CDN edge nodes, or persistent object caches.
Fix: Trigger cache invalidation immediately after sync completion. Use plugin-specific purge functions (e.g., WP Rocket, LiteSpeed) or clear CDN caches via API. Never assume the database write alone updates the delivery layer.
Production Bundle
Action Checklist
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|
| < 5,000 rows, occasional sync | WP-CLI streaming command | Minimal overhead, deterministic execution, zero plugin dependency | Low (dev time only) |
| 5,000–50,000 rows, daily sync | WP-CLI + batch checkpoints + cache suspension | Prevents memory leaks, maintains stable execution time, safe for shared hosting | Low-Medium |
| > 50,000 rows, real-time API feed | External queue (Redis/RabbitMQ) + background worker | Decouples ingestion from web requests, handles backpressure, scales horizontally | High (infrastructure) |
| Non-technical team requires UI | Lightweight admin form + fgetcsv + AJAX chunking | Balances usability with performance, avoids heavy plugin bloat | Medium |
Configuration Template
<?php
/**
* WP-CLI Command Registration
* Place in your custom plugin or mu-plugin directory.
*/
namespace Codcompass\CLI;
use WP_CLI;
use Codcompass\Importer\InventorySync;
WP_CLI::add_command( 'inventory sync', function( $args, $assoc_args ) {
$file_path = $assoc_args['file'] ?? '';
$post_type = $assoc_args['type'] ?? 'listing';
$batch = (int) ( $assoc_args['batch'] ?? 500 );
if ( empty( $file_path ) ) {
WP_CLI::error( 'Usage: wp inventory sync --file=/path/to/feed.csv [--type=listing] [--batch=500]' );
return;
}
$column_mapping = [
'id' => 0,
'title' => 1,
'price' => 2,
'region' => 3,
];
$sync_engine = new InventorySync(
post_type: $post_type,
meta_key: '_ext_inventory_id',
column_map: $column_mapping,
batch_size: $batch
);
$sync_engine->execute( $file_path );
} );
Quick Start Guide
- Prepare the feed: Export your dataset to CSV. Ensure the first row contains headers, and verify UTF-8 encoding without BOM. Place the file in a readable directory (e.g.,
wp-content/uploads/feeds/inventory.csv).
- Register the command: Drop the configuration template into your active plugin or
mu-plugins directory. Verify WP-CLI recognizes it by running wp help inventory sync.
- Execute the sync: Run
wp inventory sync --file=/absolute/path/to/inventory.csv --type=property --batch=1000 from your server terminal. Monitor the checkpoint logs for progress.
- Validate & purge: Check
wp_posts and wp_postmeta for correct mapping. Flush your object cache and trigger page cache invalidation to ensure frontend reflects the new data.