onversion & Binarization:** Reduce color complexity and apply adaptive thresholding to isolate structural boundaries.
2. Morphological Kernels: Apply horizontal and vertical line detection kernels to extract grid infrastructure.
3. Contour Extraction & Filtering: Identify closed regions, discard page margins and noise, and retain only product-sized bounding boxes.
When catalogs lack visible grid lines (floating imagery on white backgrounds), the pipeline falls back to projection profiling. By scanning rows and columns for whitespace gaps, we compute virtual grid lanes dynamically. This ensures the engine handles both bordered and borderless layouts without configuration changes.
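The gap scan described above can be sketched in a few lines. This is a minimal illustration rather than the pipeline's exact code: `content_spans` and `min_ink` are hypothetical names, and each profile is assumed to be a plain list of dark-pixel counts per row or per column.

```python
from typing import List, Tuple


def content_spans(profile: List[int], min_ink: int = 5) -> List[Tuple[int, int]]:
    """Return (start, end) pairs for runs where profile >= min_ink.

    `end` is exclusive. Anything below `min_ink` is treated as whitespace.
    """
    spans = []
    start = None
    for i, ink in enumerate(profile):
        if ink >= min_ink and start is None:
            start = i                     # entering a content run
        elif ink < min_ink and start is not None:
            spans.append((start, i))      # leaving a content run
            start = None
    if start is not None:                 # run touches the last row/column
        spans.append((start, len(profile)))
    return spans


# Dark-pixel counts for 10 rows and 8 columns of a toy page.
row_profile = [0, 0, 40, 42, 38, 1, 0, 35, 37, 0]
col_profile = [0, 30, 30, 0, 0, 28, 31, 33]

row_spans = content_spans(row_profile)
col_spans = content_spans(col_profile)

# Every row lane crossed with every column lane gives one virtual cell (x, y, w, h).
cells = [(x1, y1, x2 - x1, y2 - y1)
         for y1, y2 in row_spans
         for x1, x2 in col_spans]
```

The cells come from the content runs between gaps, not from the gap indices themselves. A real page would also need a minimum-size filter so that individual text lines are not reported as separate lanes.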
Phase 3: Localized OCR & Data Normalization
Once bounding boxes are established, we crop each cell. The top region contains the product image; the bottom region contains metadata. We run OCR only on the text region. This keeps product imagery out of the OCR input, prevents text from neighboring cells bleeding in, and improves confidence scores.
Raw OCR output requires normalization. Strings like "Volacas Wt (Pol) 60x120cm - SKU9087" are parsed using context-aware regex patterns and a lightweight dictionary matcher. Measurements are standardized to integer millimetres, finishes are mapped to an enumeration, and SKUs are extracted by pattern, with any field that cannot be matched recorded as None.
Implementation Architecture
import fitz
import cv2
import numpy as np
import pytesseract
import re
from dataclasses import dataclass
from typing import List, Tuple, Optional
from pathlib import Path

@dataclass
class CellRegion:
    bbox: Tuple[int, int, int, int]  # (x, y, width, height) in page pixels
    image_crop: np.ndarray
    text_crop: np.ndarray
    raw_text: str
    normalized_data: dict


class SpatialCatalogExtractor:
    def __init__(self, dpi: int = 300, min_cell_area: int = 5000):
        self.dpi = dpi
        self.min_cell_area = min_cell_area
        self.ocr_config = "--psm 6 -c tessedit_char_whitelist=0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz ()-.:/"

    def rasterize_page(self, pdf_path: Path, page_index: int) -> np.ndarray:
        # Render one page at a fixed DPI; PDF user space is 72 units per inch.
        # The context manager closes the document once the page is rendered.
        with fitz.open(pdf_path) as doc:
            page = doc[page_index]
            zoom = self.dpi / 72.0
            mat = fitz.Matrix(zoom, zoom)
            pix = page.get_pixmap(matrix=mat)
            img_data = pix.tobytes("png")
        nparr = np.frombuffer(img_data, np.uint8)
        return cv2.imdecode(nparr, cv2.IMREAD_COLOR)
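
    def detect_grid_cells(self, page_image: np.ndarray) -> List[Tuple[int, int, int, int]]:
        gray = cv2.cvtColor(page_image, cv2.COLOR_BGR2GRAY)
        # Inverted Otsu binarization: ink becomes white (255) on a black background.
        _, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
        # Opening with long, thin kernels keeps only strokes at least 40 px long.
        h_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (40, 1))
        v_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (1, 40))
        h_lines = cv2.morphologyEx(binary, cv2.MORPH_OPEN, h_kernel)
        v_lines = cv2.morphologyEx(binary, cv2.MORPH_OPEN, v_kernel)
        grid_mask = cv2.add(h_lines, v_lines)
        # RETR_CCOMP also returns the holes enclosed by the lines; those holes are the cells.
        # RETR_EXTERNAL would return only the outer border of a connected grid.
        contours, hierarchy = cv2.findContours(grid_mask, cv2.RETR_CCOMP, cv2.CHAIN_APPROX_SIMPLE)
        valid_cells = []
        if hierarchy is None:
            return valid_cells  # no lines found; the caller falls back to projection
        for cnt, (_, _, _, parent) in zip(contours, hierarchy[0]):
            if parent == -1:
                continue  # outer boundary of a line component, not a cell interior
            x, y, w, h = cv2.boundingRect(cnt)
            if w * h > self.min_cell_area:
                valid_cells.append((x, y, w, h))
        return valid_cells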
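
    def fallback_projection_grid(self, page_image: np.ndarray) -> List[Tuple[int, int, int, int]]:
        gray = cv2.cvtColor(page_image, cv2.COLOR_BGR2GRAY)
        _, binary = cv2.threshold(gray, 200, 255, cv2.THRESH_BINARY)
        # Count dark pixels per row and per column; fewer than 5 counts as whitespace.
        proj_h = np.sum(binary == 0, axis=1)
        proj_v = np.sum(binary == 0, axis=0)
        # Cells are the content runs between gaps, not pairs of adjacent gap indices.
        row_spans = self._content_spans(proj_h >= 5)
        col_spans = self._content_spans(proj_v >= 5)
        cells = []
        for y1, y2 in row_spans:
            for x1, x2 in col_spans:
                # Same area filter as the contour path, to drop thin slivers.
                if (x2 - x1) * (y2 - y1) > self.min_cell_area:
                    cells.append((x1, y1, x2 - x1, y2 - y1))
        return cells

    @staticmethod
    def _content_spans(has_ink: np.ndarray) -> List[Tuple[int, int]]:
        # Pad with zeros so runs touching the page edge still get a start and an end.
        padded = np.concatenate(([0], has_ink.astype(np.int8), [0]))
        edges = np.flatnonzero(np.diff(padded))
        return [(int(start), int(end)) for start, end in zip(edges[::2], edges[1::2])]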

    def extract_and_normalize(self, page_image: np.ndarray, cells: List[Tuple[int, int, int, int]]) -> List[CellRegion]:
        results = []
        for x, y, w, h in cells:
            cell_img = page_image[y:y+h, x:x+w]
            h_cell, w_cell = cell_img.shape[:2]
            # OCR only the lower half of the cell, where the metadata is expected.
            text_region = cell_img[int(h_cell*0.5):, :]
            raw_text = pytesseract.image_to_string(text_region, config=self.ocr_config).strip()
            normalized = self._parse_metadata(raw_text)
            results.append(CellRegion(
                bbox=(x, y, w, h),
                image_crop=cell_img,
                text_crop=text_region,
                raw_text=raw_text,
                normalized_data=normalized
            ))
        return results

    def _parse_metadata(self, text: str) -> dict:
        sku_match = re.search(r'(?:SKU|ITEM|CODE)[:\s-]*([A-Z0-9]{4,12})', text, re.IGNORECASE)
        # Up to four digits per side so that values such as 600x1200mm are captured in full.
        dims_match = re.search(r'(\d{1,4})\s*[xX×]\s*(\d{1,4})\s*(cm|mm)?', text, re.IGNORECASE)
        finish_match = re.search(r'\((Pol|Hon|Fla|Bla|Ant)\)', text, re.IGNORECASE)
        finish_map = {"pol": "Polished", "hon": "Honed", "fla": "Flamed", "bla": "Blazed", "ant": "Antique"}
        width_mm = height_mm = None
        if dims_match:
            # Centimetres are scaled to millimetres; values without a unit are kept as-is.
            scale = 10 if (dims_match.group(3) or "").lower() == "cm" else 1
            width_mm = int(dims_match.group(1)) * scale
            height_mm = int(dims_match.group(2)) * scale
        return {
            "sku": sku_match.group(1) if sku_match else None,
            "width_mm": width_mm,
            "height_mm": height_mm,
            # Look up the finish code captured by finish_match, lower-cased to match finish_map.
            "finish": finish_map.get(finish_match.group(1).lower()) if finish_match else None,
            "raw": text
        }
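
    def process_catalog(self, pdf_path: Path) -> List[CellRegion]:
        # Open the document only to read the page count, then close it again;
        # rasterize_page opens the file itself for each page.
        with fitz.open(pdf_path) as doc:
            page_count = len(doc)
        all_cells = []
        for idx in range(page_count):
            page_img = self.rasterize_page(pdf_path, idx)
            cells = self.detect_grid_cells(page_img)
            if not cells:
                # No visible grid lines on this page: use whitespace lanes instead.
                cells = self.fallback_projection_grid(page_img)
            extracted = self.extract_and_normalize(page_img, cells)
            all_cells.extend(extracted)
        return all_cells
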
Architecture Rationale:
- Fixed DPI Rasterization: Eliminates vector/text layer inconsistencies. OCR engines perform predictably on uniform raster inputs.
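- Morphological Grid Detection: More robust than edge detection alone. Long, thin kernels isolate horizontal and vertical rules regardless of stroke thickness. Dashed or dotted rules whose segments are shorter than the kernel length (40 px here) need a closing pass or a shorter kernel first.
- Projection Fallback: Keeps the pipeline producing cells on minimalist, borderless catalogs. Whitespace gap analysis is computationally cheap and needs no visible rules, though it assumes an axis-aligned grid.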
- Localized OCR: Restricting text extraction to the lower 50% of each cell removes product imagery noise, reducing false positives by ~60%.
- Regex + Dictionary Normalization: Keeps parsing deterministic. LLMs introduce latency and hallucination risks for structured metadata extraction.
Pitfall Guide
1. Ignoring DPI Scaling Artifacts
Explanation: Rasterizing at 72 DPI or variable DPI causes character fragmentation and contour bleeding. OCR confidence drops sharply below 200 DPI.
Fix: Enforce 300 DPI conversion. Validate output image dimensions before processing. Reject pages where rasterization fails or produces empty arrays.
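One way to validate the raster is to compare its shape with the size the page should have at the target DPI. The sketch below is an assumption-laden illustration: `expected_raster_size`, `raster_is_valid`, and the 2 px tolerance are hypothetical, and the page size in points is assumed to come from the PDF library (for example the page rectangle).

```python
from typing import Tuple


def expected_raster_size(width_pt: float, height_pt: float, dpi: int = 300) -> Tuple[int, int]:
    """Pixel size (width, height) of a page rendered at `dpi`; PDF units are 1/72 inch."""
    zoom = dpi / 72.0
    return round(width_pt * zoom), round(height_pt * zoom)


def raster_is_valid(shape: Tuple[int, ...], width_pt: float, height_pt: float,
                    dpi: int = 300, tolerance_px: int = 2) -> bool:
    """Check a NumPy-style (rows, cols, ...) shape against the expected size."""
    if len(shape) < 2 or shape[0] == 0 or shape[1] == 0:
        return False  # failed or empty rasterization
    exp_w, exp_h = expected_raster_size(width_pt, height_pt, dpi)
    # shape[0] is the row count (height), shape[1] the column count (width).
    return (abs(shape[1] - exp_w) <= tolerance_px
            and abs(shape[0] - exp_h) <= tolerance_px)


# An A4 page is 595 x 842 points.
a4_at_300 = expected_raster_size(595, 842)
ok = raster_is_valid((3508, 2479, 3), 595, 842)        # rendered at 300 DPI
too_small = raster_is_valid((842, 595, 3), 595, 842)   # rendered at 72 DPI
```

The tolerance absorbs rounding differences between renderers. A page that fails the check can be skipped or re-rendered before any segmentation runs.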
2. Overfitting Contour Area Thresholds
Explanation: Hardcoding minimum/maximum contour areas breaks when catalogs change layout density or use different paper sizes.
Fix: Calculate thresholds dynamically based on page dimensions. Use min_area = page_width * page_height * 0.002 and max_area = page_width * page_height * 0.15.
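As a minimal sketch of that fix, the filter below derives both bounds from the page size using the ratios given above. The function name `filter_boxes_by_page_ratio` is hypothetical, and boxes are assumed to be (x, y, width, height) tuples like those returned by the contour step.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (x, y, width, height)


def filter_boxes_by_page_ratio(boxes: List[Box], page_width: int, page_height: int,
                               min_ratio: float = 0.002,
                               max_ratio: float = 0.15) -> List[Box]:
    """Keep boxes whose area lies between min_ratio and max_ratio of the page area."""
    page_area = page_width * page_height
    min_area = page_area * min_ratio
    max_area = page_area * max_ratio
    return [box for box in boxes if min_area <= box[2] * box[3] <= max_area]


# A4 at 300 DPI is roughly 2480 x 3508 px, so the bounds are about
# 17,400 px^2 (noise floor) and 1,305,000 px^2 (page-sized frames).
candidates = [
    (0, 0, 2480, 3508),     # whole page: too large
    (100, 100, 700, 900),   # product-sized cell: kept
    (50, 50, 20, 20),       # speck of noise: too small
]
kept = filter_boxes_by_page_ratio(candidates, 2480, 3508)
```

Because the bounds scale with the page, the same ratios apply whether the catalog is rendered at 200 or 300 DPI or printed on a different paper size.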
3. Naive Regex Normalization
Explanation: Single-pass regex fails on inconsistent formatting (60x120, 60 x 120 cm, 600x1200mm). Hardcoded patterns miss edge cases.
Fix: Implement a multi-stage parser. First normalize spacing and separators, then apply dimension extraction, then map finishes via a lookup table. Add fallback logging for unmatched patterns.
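A staged parser along those lines might look like the sketch below. It is an illustration under stated assumptions, not the pipeline's parser: `normalize_caption`, the logger name, and the `default_unit="cm"` choice for captions without a unit are all hypothetical.

```python
import logging
import re
from typing import Optional

logger = logging.getLogger("catalog.normalize")  # hypothetical logger name

FINISHES = {"pol": "Polished", "hon": "Honed", "fla": "Flamed",
            "bla": "Blazed", "ant": "Antique"}
UNIT_TO_MM = {"mm": 1, "cm": 10}


def normalize_caption(text: str, default_unit: str = "cm") -> dict:
    # Stage 1: unify the separator between digits and collapse whitespace,
    # so "60 X 120 cm" becomes "60x120 cm". Only an x between digits is touched.
    cleaned = re.sub(r"(?<=\d)\s*[xX×]\s*(?=\d)", "x", text)
    cleaned = re.sub(r"\s+", " ", cleaned).strip()

    # Stage 2: extract width, height, and the optional unit.
    width_mm: Optional[int] = None
    height_mm: Optional[int] = None
    dims = re.search(r"(\d{1,4})x(\d{1,4})\s*(cm|mm)?\b", cleaned, re.IGNORECASE)
    if dims:
        # Assumption: a caption without a unit is in `default_unit`.
        factor = UNIT_TO_MM[(dims.group(3) or default_unit).lower()]
        width_mm = int(dims.group(1)) * factor
        height_mm = int(dims.group(2)) * factor
    else:
        logger.warning("no dimensions matched in %r", text)

    # Stage 3: map the three-letter finish code through the lookup table.
    finish = None
    code = re.search(r"\(([A-Za-z]{3})\)", cleaned)
    if code:
        finish = FINISHES.get(code.group(1).lower())
        if finish is None:
            logger.warning("unknown finish code %r in %r", code.group(1), text)

    return {"width_mm": width_mm, "height_mm": height_mm, "finish": finish}


full = normalize_caption("Volacas Wt (Pol) 60x120cm - SKU9087")
spaced = normalize_caption("60 X 120 cm")
millimetres = normalize_caption("600x1200mm")
no_size = normalize_caption("Onyx Gold (Hon)")
```

Keeping the stages separate means a new separator or unit only touches one stage, and the warnings give a list of captions to review instead of silently returning None.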
4. Assuming Visible Grid Infrastructure
Explanation: Modern catalogs use whitespace and alignment instead of borders. Morphological line detection returns empty contours, causing pipeline failure.
Fix: Always implement a projection profile fallback. Scan horizontal/vertical pixel sums, identify gaps below a density threshold, and reconstruct virtual cells.
5. Memory Leaks in Batch Processing
Explanation: Loading entire PDFs into memory and accumulating NumPy arrays causes OOM crashes on catalogs exceeding 200 pages.
Fix: Process page-by-page. Explicitly delete intermediate arrays (del page_img, cells). Use generators for streaming extraction. Monitor RSS memory and implement checkpointing.
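The streaming pattern can be sketched with a generator. Here `stream_records`, `render_page`, and `extract` are hypothetical names: the two callables stand in for the rasterization and per-page extraction steps, so the sketch runs without any PDF or OCR library.

```python
from typing import Callable, Iterator, List, TypeVar

Page = TypeVar("Page")
Record = TypeVar("Record")


def stream_records(page_count: int,
                   render_page: Callable[[int], Page],
                   extract: Callable[[Page], List[Record]]) -> Iterator[Record]:
    """Yield records one page at a time instead of building one large list."""
    for index in range(page_count):
        page = render_page(index)
        records = extract(page)
        # Drop our reference to the full-page raster before yielding, so it can
        # be freed before the next page is rendered.
        del page
        yield from records


# Stand-ins that record which pages were rendered.
rendered: List[int] = []


def fake_render(index: int) -> List[int]:
    rendered.append(index)
    return [index, index]


def fake_extract(page: List[int]) -> List[int]:
    return [value * 10 for value in page]


stream = stream_records(3, fake_render, fake_extract)
first = next(stream)          # only page 0 has been rendered at this point
pages_after_first = list(rendered)
rest = list(stream)
```

Note that records holding crops of the page (NumPy slices are views) still keep the underlying raster alive until the consumer releases them, so writing each record to disk or a queue as it arrives is what actually bounds memory.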
6. Color Space Mismatch Before Thresholding
Explanation: Applying thresholding directly on BGR images produces inconsistent binarization due to color channel interference.
Fix: Always convert to grayscale first. Use Otsu's method to choose the global threshold automatically, or a locally adaptive threshold on unevenly lit scans. Apply morphological opening/closing to clean noise before contour detection.
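For intuition about what Otsu's method does with a grayscale image, here is a small pure-Python version of the textbook algorithm. It is a teaching sketch only; in the pipeline the threshold comes from OpenCV, and `otsu_threshold` is a hypothetical helper name.

```python
from typing import List, Sequence


def otsu_threshold(pixels: Sequence[int]) -> int:
    """Return the 0-255 threshold that maximizes between-class variance.

    Pixels <= threshold form one class, pixels > threshold the other.
    """
    hist = [0] * 256
    for value in pixels:
        hist[value] += 1

    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_threshold = 0
    best_variance = -1.0
    weight_low = 0      # number of pixels at or below the candidate threshold
    sum_low = 0         # sum of their intensities
    for level in range(256):
        weight_low += hist[level]
        sum_low += level * hist[level]
        if weight_low == 0:
            continue
        weight_high = total - weight_low
        if weight_high == 0:
            break
        mean_low = sum_low / weight_low
        mean_high = (sum_all - sum_low) / weight_high
        variance = weight_low * weight_high * (mean_low - mean_high) ** 2
        if variance > best_variance:
            best_variance = variance
            best_threshold = level
    return best_threshold


# Grayscale samples: dark ink around 10-30, bright paper around 220-250.
gray_pixels: List[int] = [10, 20, 30, 25, 15, 220, 240, 250, 230, 245]
threshold = otsu_threshold(gray_pixels)
is_paper = [value > threshold for value in gray_pixels]
```

The method needs a single intensity per pixel, which is why the grayscale conversion has to come first: a BGR image has three histograms and no single threshold that separates ink from paper.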
7. Ignoring Page Rotation & Skew
Explanation: Scanned catalogs or misaligned PDFs introduce rotation. Contour detection and projection profiling fail on rotated grids.
Fix: Implement a deskew step using Hough line detection or Fourier transform rotation estimation. Correct orientation before segmentation.
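The angle-estimation half of a deskew step can be sketched without any imaging library. The function below assumes the line segments were already detected, for example by a probabilistic Hough transform, and are given as (x1, y1, x2, y2) tuples; `estimate_skew_degrees` and the 15 degree cutoff are hypothetical choices.

```python
import math
from statistics import median
from typing import Iterable, Optional, Tuple

Segment = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels


def estimate_skew_degrees(segments: Iterable[Segment],
                          max_abs_angle: float = 15.0) -> Optional[float]:
    """Median angle of near-horizontal segments, or None if there are none."""
    angles = []
    for x1, y1, x2, y2 in segments:
        if x1 == x2 and y1 == y2:
            continue  # zero-length segment carries no direction
        angle = math.degrees(math.atan2(y2 - y1, x2 - x1))
        # Fold into (-90, 90] so the drawing direction of a segment is irrelevant.
        if angle > 90:
            angle -= 180
        elif angle <= -90:
            angle += 180
        # Ignore vertical rules and diagonal strokes; only grid rows vote.
        if abs(angle) <= max_abs_angle:
            angles.append(angle)
    return median(angles) if angles else None


segments = [
    (0, 0, 100, 2),      # row rule tilted by about 1.15 degrees
    (0, 50, 100, 52),    # second row rule, same tilt
    (100, 3, 0, 1),      # same tilt, drawn right to left
    (0, 0, 0, 100),      # vertical rule, ignored
]
skew = estimate_skew_degrees(segments)
```

The median keeps a few stray strokes from dominating the estimate. Image coordinates have y pointing down, so the sign convention of the rotation applied afterwards should be checked on a test page before it is trusted in batch runs.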
Production Bundle
Action Checklist
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| High-volume standardized catalogs (bordered grids) | Morphological contour detection + localized OCR | Deterministic, fast, low compute overhead | Low infrastructure cost, high throughput |
| Modern minimalist catalogs (borderless, floating layouts) | Projection profiling + whitespace gap analysis | Adapts to alignment-based layouts without explicit lines | Moderate compute cost, requires fallback logic |
| Mixed-format catalogs (varying layouts per page) | Hybrid pipeline with auto-detection | Handles both bordered and borderless pages dynamically | Higher development cost, maximizes coverage |
| Unstructured marketing PDFs (no grid, freeform text) | LLM-based layout parsing or document AI services | CV methods fail without spatial containers | High API cost, slower processing, lower precision |
Configuration Template
extraction:
  dpi: 300
  ocr_psm: 6
  ocr_whitelist: "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz ()-.:/"
segmentation:
  min_cell_area_ratio: 0.002
  max_cell_area_ratio: 0.15
  morph_kernel_h: [40, 1]
  morph_kernel_v: [1, 40]
  projection_gap_threshold: 5
normalization:
  dimension_units: ["cm", "mm"]
  finish_mapping:
    pol: "Polished"
    hon: "Honed"
    fla: "Flamed"
    bla: "Blazed"
    ant: "Antique"
  sku_pattern: "(?:SKU|ITEM|CODE)[:\\s-]*([A-Z0-9]{4,12})"
  dimension_pattern: "(\\d{1,4})\\s*[xX×]\\s*(\\d{1,4})\\s*(cm|mm)?"
processing:
  batch_size: 1
  memory_limit_mb: 2048
  enable_deskew: true
  fallback_to_projection: true
Quick Start Guide
- Install Dependencies:
pip install pymupdf opencv-python pytesseract numpy pyyaml
- Configure Tesseract: Ensure Tesseract OCR is installed on your system and accessible via PATH. Download language data if processing non-English catalogs.
- Initialize Pipeline: Load the configuration template, instantiate SpatialCatalogExtractor, and point it to a target PDF.
- Run Extraction: Call process_catalog(pdf_path). The engine returns a list of CellRegion objects containing cropped images, raw text, and normalized metadata.
- Validate Output: Inspect the first 50 extracted cells. Check normalized_data for SKU/dimension accuracy. Adjust projection_gap_threshold or regex patterns if mismatch rates exceed 5%.