Stage 2: Norm-based Spatial Pooling (NSP)
Instead of dropping tokens randomly or by fixed strides, NSP ranks spatial tokens within each frame by their normalized magnitude. We apply a soft threshold that preserves the top-K spatial regions while maintaining a minimum coverage floor to prevent spatial fragmentation. The pooling operation aggregates selected tokens using weighted averaging, where weights are proportional to their norm scores. This ensures high-signal regions dominate the compressed representation.
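Written out for a single frame, the selection and aggregation can be summarized as below. The symbols are introduced here for illustration only: $e_i$ is a token embedding, $N$ the number of tokens in the frame, $r$ the top-K ratio, $c$ the minimum coverage ratio, and $S$ the set of the $K$ highest-norm tokens.

```latex
K = \max\bigl(\lceil rN \rceil,\ \lceil cN \rceil\bigr), \qquad
w_i = \frac{\lVert e_i \rVert_2}{\sum_{j \in S} \lVert e_j \rVert_2}, \qquad
\hat{e} = \sum_{i \in S} w_i \, e_i
```

Because the weights are non-negative and sum to one, the pooled embedding $\hat{e}$ is a convex combination of the selected tokens, with higher-norm tokens contributing more.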
Stage 3: Pyramid Temporal Gridding (PTG)
Motion in video occurs at varying frequencies. A single temporal stride cannot capture both slow camera pans and rapid cuts. PTG constructs hierarchical temporal windows across multiple scales. At level 1, we group adjacent frames. At level 2, we skip every other group, and so on. Within each grid cell, we aggregate tokens using temporal attention-weighted pooling. The pyramid structure ensures that both fine-grained motion and long-range dependencies are preserved without dense computation.
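To make the grid layout concrete, the sketch below enumerates which frames fall into each cell. It assumes the window size doubles at every level (1, 2, 4, ...), which is one possible reading of the scheme; `pyramidWindows` and `pyramidCellCount` are hypothetical helper names used only for illustration.

```typescript
// Hypothetical helper: list the frame indices covered by each grid cell,
// assuming the window size doubles at every pyramid level (1, 2, 4, ...).
function pyramidWindows(frameCount: number, levels: number): number[][][] {
  const pyramid: number[][][] = [];
  for (let level = 0; level < levels; level++) {
    const stride = Math.pow(2, level);
    const cells: number[][] = [];
    for (let start = 0; start < frameCount; start += stride) {
      const cell: number[] = [];
      const end = Math.min(start + stride, frameCount);
      for (let f = start; f < end; f++) cell.push(f);
      cells.push(cell);
    }
    pyramid.push(cells);
  }
  return pyramid;
}

// Total cells across all levels, i.e. the number of tokens the grid emits
// when each cell is pooled to a single token.
function pyramidCellCount(frameCount: number, levels: number): number {
  return pyramidWindows(frameCount, levels).reduce((n, cells) => n + cells.length, 0);
}
```

For 8 frames and 3 levels this gives 8 + 4 + 2 = 14 cells, so the length of the gridded output depends on both the frame count and the number of levels.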
Stage 4: Sequence Reconstruction and Projection
The pooled spatial tokens and gridded temporal tokens are flattened into a 1D sequence. Positional encodings are recalculated to reflect the new sequence length and temporal spacing. The sequence is then passed to the LLM's projection layer. Because the compression is deterministic and gradient-free, it plugs directly into existing inference pipelines without retraining.
TypeScript Implementation
interface VisualToken {
  id: number;
  embedding: Float32Array;
  frameIndex: number;
  spatialIndex: number;
  norm: number;
}

interface CompressionConfig {
  spatialTopKRatio: number;
  temporalPyramidLevels: number;
  minSpatialCoverage: number;
  normThresholdSigma: number;
}
class SpatiotemporalTokenProcessor {
  private config: CompressionConfig;

  constructor(config: CompressionConfig) {
    this.config = config;
  }

  computeTokenNorms(tokens: VisualToken[]): VisualToken[] {
    return tokens.map(token => ({
      ...token,
      norm: Math.sqrt(token.embedding.reduce((sum, val) => sum + val * val, 0))
    }));
  }
  // Stage 2: pool each frame's highest-norm tokens into one norm-weighted token.
  applyNormSpatialPooling(tokens: VisualToken[]): VisualToken[] {
    if (tokens.length === 0) return [];
    const frameGroups = new Map<number, VisualToken[]>();
    tokens.forEach(t => {
      if (!frameGroups.has(t.frameIndex)) frameGroups.set(t.frameIndex, []);
      frameGroups.get(t.frameIndex)!.push(t);
    });
    const pooled: VisualToken[] = [];
    // Global threshold: mean + normThresholdSigma * std.
    // Currently unused: the selection below relies on per-frame ranking only.
    const meanNorm = tokens.reduce((s, t) => s + t.norm, 0) / tokens.length;
    const stdNorm = Math.sqrt(tokens.reduce((s, t) => s + Math.pow(t.norm - meanNorm, 2), 0) / tokens.length);
    const threshold = meanNorm + this.config.normThresholdSigma * stdNorm;
    for (const [frameIdx, frameTokens] of frameGroups) {
      const sorted = [...frameTokens].sort((a, b) => b.norm - a.norm);
      // Keep the larger of the top-K count and the coverage floor, and at least one token.
      const keepCount = Math.max(
        1,
        Math.ceil(sorted.length * this.config.spatialTopKRatio),
        Math.ceil(sorted.length * this.config.minSpatialCoverage)
      );
      const selected = sorted.slice(0, keepCount);
      // Weighted aggregation
      const totalWeight = selected.reduce((s, t) => s + t.norm, 0);
      const aggregatedEmbedding = new Float32Array(frameTokens[0].embedding.length);
      selected.forEach(t => {
        // Fall back to uniform weights when every selected norm is zero.
        const weight = totalWeight > 0 ? t.norm / totalWeight : 1 / selected.length;
        for (let i = 0; i < aggregatedEmbedding.length; i++) {
          aggregatedEmbedding[i] += t.embedding[i] * weight;
        }
      });
      pooled.push({
        id: frameIdx * 10000,
        embedding: aggregatedEmbedding,
        frameIndex: frameIdx,
        spatialIndex: 0,
        // Mean norm of the selected tokens, not the norm of the pooled embedding.
        norm: totalWeight / selected.length
      });
    }
    return pooled;
  }
  // Stage 3: build a temporal pyramid; the window size doubles at each level.
  applyPyramidTemporalGridding(tokens: VisualToken[]): VisualToken[] {
    const frames = Array.from(new Set(tokens.map(t => t.frameIndex))).sort((a, b) => a - b);
    const gridded: VisualToken[] = [];
    let globalId = 0;
    for (let level = 0; level < this.config.temporalPyramidLevels; level++) {
      const stride = Math.pow(2, level);
      for (let i = 0; i < frames.length; i += stride) {
        const windowFrames = frames.slice(i, i + stride);
        const windowSet = new Set(windowFrames);
        const windowTokens = tokens.filter(t => windowSet.has(t.frameIndex));
        if (windowTokens.length === 0) continue;
        // Unweighted mean over the window (no attention weighting in this version).
        const aggregated = new Float32Array(windowTokens[0].embedding.length);
        windowTokens.forEach(t => {
          for (let d = 0; d < aggregated.length; d++) {
            aggregated[d] += t.embedding[d];
          }
        });
        const count = windowTokens.length;
        for (let d = 0; d < aggregated.length; d++) {
          aggregated[d] /= count;
        }
        gridded.push({
          id: globalId++,
          embedding: aggregated,
          frameIndex: windowFrames[0],
          spatialIndex: level, // reused here to record the pyramid level
          norm: Math.sqrt(aggregated.reduce((s, v) => s + v * v, 0))
        });
      }
    }
    return gridded;
  }
  process(tokens: VisualToken[]): VisualToken[] {
    const normed = this.computeTokenNorms(tokens);
    const spatiallyPooled = this.applyNormSpatialPooling(normed);
    const temporallyGridded = this.applyPyramidTemporalGridding(spatiallyPooled);
    return temporallyGridded;
  }
}
Architecture Rationale
- Norm-driven selection over attention weights: Computing attention maps requires extracting weights from the vision encoder's self-attention layers, adding latency. Token norms serve as an inexpensive proxy for semantic density in frozen encoders and can be computed directly from the encoder's output embeddings in a single pass.
- Pyramid over uniform stride: Video motion is non-stationary. A fixed stride either oversamples static scenes or undersamples action sequences. Hierarchical gridding adapts to temporal complexity without dynamic routing overhead.
- Training-free design: By operating on post-encoder embeddings, the method avoids gradient computation, checkpoint management, and dataset alignment. It functions as a deterministic preprocessing layer compatible with any MLLM that accepts token sequences.
Pitfall Guide
1. Treating Raw Norms as Absolute Semantic Scores
Explanation: Embedding norms vary across layers and model architectures. A high norm in one checkpoint may represent noise in another. Blindly thresholding raw values causes over-pruning in low-magnitude models.
Fix: Normalize norms per-batch using z-score standardization. Apply relative ranking within frames rather than absolute thresholds. Calibrate normThresholdSigma on a validation subset.
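A minimal sketch of this fix, assuming norms arrive as plain number arrays. `zScoreNorms` and `topRatioIndices` are hypothetical helper names, not part of the processor class.

```typescript
// Hypothetical helper: standardize raw norms to z-scores across one batch.
function zScoreNorms(norms: number[]): number[] {
  if (norms.length === 0) return [];
  const mean = norms.reduce((s, v) => s + v, 0) / norms.length;
  const variance = norms.reduce((s, v) => s + (v - mean) * (v - mean), 0) / norms.length;
  const std = Math.sqrt(variance);
  // A constant batch has zero spread; return zeros instead of dividing by zero.
  if (std === 0) return norms.map(() => 0);
  return norms.map(v => (v - mean) / std);
}

// Hypothetical helper: indices of the top `ratio` fraction of scores within
// one frame, chosen by relative rank rather than by an absolute cutoff.
function topRatioIndices(scores: number[], ratio: number): number[] {
  const keep = Math.min(scores.length, Math.max(1, Math.ceil(scores.length * ratio)));
  return scores
    .map((score, index) => ({ score, index }))
    .sort((a, b) => b.score - a.score)
    .slice(0, keep)
    .map(entry => entry.index);
}
```

Rank-based selection is unaffected by a positive rescaling of the norms, which is why it tends to transfer across checkpoints better than a fixed cutoff.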
2. Fixed Temporal Grids Across Variable Frame Rates
Explanation: PTG assumes uniform temporal spacing. Processing 24 FPS and 60 FPS video with identical grid strides distorts motion continuity, causing temporal aliasing.
Fix: Scale grid strides by frame rate. Convert FPS to a temporal resolution factor before constructing pyramid levels. Use duration-based windows instead of frame-count windows when metadata is available.
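One way to scale strides by frame rate is sketched below. The function name `fpsAwareStrides` and the rounding rule are assumptions; the parameters mirror the `stride_base` and `reference_fps` fields of the configuration template.

```typescript
// Hypothetical helper: per-level frame strides that cover roughly the same
// wall-clock duration regardless of the source frame rate.
function fpsAwareStrides(
  levels: number,
  strideBase: number,
  fps: number,
  referenceFps: number
): number[] {
  if (fps <= 0 || referenceFps <= 0) throw new RangeError("fps values must be positive");
  const scale = fps / referenceFps; // e.g. 60 / 24 = 2.5
  const strides: number[] = [];
  for (let level = 0; level < levels; level++) {
    const referenceStride = Math.pow(strideBase, level);
    // Round to whole frames and never drop below one frame per window.
    strides.push(Math.max(1, Math.round(referenceStride * scale)));
  }
  return strides;
}
```

With a 24 FPS reference, a 60 FPS clip gets strides of 3, 5, and 10 frames instead of 1, 2, and 4, so each level spans a comparable duration.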
3. Over-Aggressive Spatial Pruning Without Coverage Guarantees
Explanation: Aggressive top-K selection can fragment spatial context, removing background tokens that provide scene layout and object grounding.
Fix: Enforce minSpatialCoverage as a hard constraint. Use soft masking (weighted averaging) instead of hard dropping. Maintain a background anchor token per frame to preserve spatial reference.
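The sketch below shows one possible shape for these guarantees. Building the background anchor as the mean of the dropped embeddings is an assumption made here for illustration; `FrameToken`, `AnchoredSelection`, and `selectWithBackgroundAnchor` are hypothetical names.

```typescript
interface FrameToken {
  embedding: number[];
  norm: number;
}

interface AnchoredSelection {
  kept: FrameToken[];                // tokens retained by norm ranking
  backgroundAnchor: number[] | null; // mean of dropped embeddings, null if nothing was dropped
}

// Hypothetical helper: top-K selection with a hard coverage floor, plus one
// anchor embedding summarizing everything that was pruned (an assumption).
function selectWithBackgroundAnchor(
  tokens: FrameToken[],
  topKRatio: number,
  minCoverage: number
): AnchoredSelection {
  const n = tokens.length;
  if (n === 0) return { kept: [], backgroundAnchor: null };
  // The coverage ratio always wins over a smaller top-K ratio.
  const keepCount = Math.min(n, Math.max(1, Math.ceil(n * Math.max(topKRatio, minCoverage))));
  const ranked = [...tokens].sort((a, b) => b.norm - a.norm);
  const kept = ranked.slice(0, keepCount);
  const dropped = ranked.slice(keepCount);
  if (dropped.length === 0) return { kept, backgroundAnchor: null };
  // Element-wise sum of the dropped embeddings (assumes equal dimensions).
  const sum = dropped.reduce<number[]>(
    (acc, t) => t.embedding.map((v, d) => v + (acc[d] ?? 0)),
    []
  );
  return { kept, backgroundAnchor: sum.map(v => v / dropped.length) };
}
```

The anchor can be appended to the frame's kept tokens so that pruned background regions still leave a trace in the compressed sequence.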
4. Ignoring Positional Encoding Recalculation
Explanation: Compressing tokens changes sequence length and temporal spacing. Feeding compressed tokens into an LLM with static positional encodings causes misalignment and degraded attention patterns.
Fix: Recompute rotary or absolute positional encodings after compression. Map new token indices to continuous time coordinates. Ensure the projection layer receives consistent dimensionality.
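As a rough illustration, the sketch below maps each compressed token to a continuous timestamp and builds a standard sinusoidal absolute encoding from it. `GridCell`, `cellTimestamps`, and `sinusoidalEncoding` are hypothetical names, and using seconds as the position value is an assumption; a real pipeline has to match whatever positional scheme the target LLM was trained with.

```typescript
// Hypothetical description of one temporal grid cell.
interface GridCell {
  startFrame: number; // first frame index covered by the cell
  frameSpan: number;  // number of frames covered by the cell
}

// Center timestamp (in seconds) of each cell.
function cellTimestamps(cells: GridCell[], fps: number): number[] {
  if (fps <= 0) throw new RangeError("fps must be positive");
  return cells.map(c => (c.startFrame + (c.frameSpan - 1) / 2) / fps);
}

// Standard sinusoidal encoding: sin on even dimensions, cos on odd ones,
// evaluated at a continuous position instead of an integer index.
function sinusoidalEncoding(position: number, dim: number): number[] {
  const encoding: number[] = [];
  for (let i = 0; i < dim; i++) {
    const pairIndex = Math.floor(i / 2);
    const angle = position / Math.pow(10000, (2 * pairIndex) / dim);
    encoding.push(i % 2 === 0 ? Math.sin(angle) : Math.cos(angle));
  }
  return encoding;
}
```

Tokens from coarser pyramid levels then receive positions at the center of the span they summarize rather than at their index in the flattened sequence.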
5. Static Compression Ratios for Heterogeneous Scenes
Explanation: Applying a fixed 4:1 ratio to both static interviews and high-action sports wastes budget on simple scenes and starves complex ones.
Fix: Implement dynamic budget allocation. Use scene complexity metrics (e.g., frame difference variance, motion vector magnitude) to adjust spatialTopKRatio and temporalPyramidLevels per-video.
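A possible budget rule is sketched below, assuming frames are available as flat numeric vectors (pixels or pooled embeddings). `frameDiffVariance` and `budgetRatio` are hypothetical helpers, and the `saturation` parameter (the complexity value at which the budget reaches its maximum) is an assumption that would need tuning.

```typescript
// Variance of the mean absolute difference between consecutive frames.
function frameDiffVariance(frames: number[][]): number {
  const diffs: number[] = [];
  let previous: number[] | null = null;
  for (const frame of frames) {
    if (previous !== null && frame.length > 0) {
      const prev = previous;
      const total = frame.reduce((s, v, i) => s + Math.abs(v - (prev[i] ?? 0)), 0);
      diffs.push(total / frame.length);
    }
    previous = frame;
  }
  if (diffs.length === 0) return 0;
  const mean = diffs.reduce((s, v) => s + v, 0) / diffs.length;
  return diffs.reduce((s, v) => s + (v - mean) * (v - mean), 0) / diffs.length;
}

// Linear map from complexity to a keep ratio, clamped to [minRatio, maxRatio].
function budgetRatio(
  complexity: number,
  minRatio: number,
  maxRatio: number,
  saturation: number
): number {
  if (saturation <= 0) throw new RangeError("saturation must be positive");
  const t = Math.min(1, Math.max(0, complexity / saturation));
  return minRatio + (maxRatio - minRatio) * t;
}
```

A static interview would score near zero and receive the minimum ratio, while a clip with erratic frame differences would be pushed toward the maximum.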
6. Projection Layer Dimension Mismatch
Explanation: The LLM's visual projector expects a specific sequence length or feature dimension. Aggressive pooling can output sequences that break linear projection layers or cause shape errors.
Fix: Add a learnable or fixed adapter layer post-compression that maps pooled tokens to the expected projector input shape. Validate tensor shapes during pipeline integration.
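Shape validation can be as simple as the check below, run once when the pipeline is wired up. `validateProjectorInput` is a hypothetical helper, and the expected dimension and token limit are assumed to come from the target model's documentation.

```typescript
// Hypothetical check: every token has the projector's input width and the
// sequence fits within the token limit.
function validateProjectorInput(
  tokens: Float32Array[],
  expectedDim: number,
  maxTokens: number
): void {
  if (tokens.length === 0) {
    throw new Error("compressed sequence is empty");
  }
  if (tokens.length > maxTokens) {
    throw new Error(`sequence has ${tokens.length} tokens, limit is ${maxTokens}`);
  }
  tokens.forEach((embedding, i) => {
    if (embedding.length !== expectedDim) {
      throw new Error(`token ${i} has dim ${embedding.length}, expected ${expectedDim}`);
    }
  });
}
```

Failing fast here gives a readable error at integration time instead of a shape mismatch deep inside the projector.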
7. Assuming Norm-Semantic Correlation Transfers Across Domains
Explanation: Medical imaging, satellite footage, and UI screens exhibit different norm distributions. A threshold tuned on natural video fails on domain-specific data.
Fix: Run a lightweight calibration pass on target domain samples. Adjust normThresholdSigma and spatialTopKRatio based on domain-specific norm histograms. Log norm distributions in production to detect drift.
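For drift logging, one lightweight option is to compare norm histograms between a calibration set and live traffic. The sketch below uses total variation distance as the drift score; that choice, along with the helper names `normHistogram` and `histogramDrift`, is an assumption rather than part of the method.

```typescript
// Fixed-width histogram of norms over [min, max]; out-of-range values are
// clamped into the first or last bin.
function normHistogram(norms: number[], binCount: number, min: number, max: number): number[] {
  if (binCount < 1 || max <= min) throw new RangeError("invalid histogram range");
  const counts = new Array<number>(binCount).fill(0);
  const width = (max - min) / binCount;
  for (const norm of norms) {
    const raw = Math.floor((norm - min) / width);
    const bin = Math.min(binCount - 1, Math.max(0, raw));
    counts[bin] = (counts[bin] ?? 0) + 1;
  }
  return counts;
}

// Total variation distance between two histograms after normalization:
// 0 means identical distributions, 1 means no overlap.
function histogramDrift(reference: number[], current: number[]): number {
  if (reference.length !== current.length) {
    throw new RangeError("histograms must have the same bin count");
  }
  const refTotal = reference.reduce((s, v) => s + v, 0);
  const curTotal = current.reduce((s, v) => s + v, 0);
  if (refTotal === 0 || curTotal === 0) throw new RangeError("histograms must be non-empty");
  const l1 = reference.reduce(
    (s, v, i) => s + Math.abs(v / refTotal - (current[i] ?? 0) / curTotal),
    0
  );
  return l1 / 2;
}
```

A drift score that climbs over time suggests normThresholdSigma and spatialTopKRatio should be recalibrated on fresh samples.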
Production Bundle
Action Checklist
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| Real-time streaming inference | PTG + NSP with dynamic budget | Adapts to scene complexity, maintains low latency | +4.5% compute, -30% context memory |
| Offline batch analysis | Fixed 4:1 PTG + NSP | Deterministic, reproducible, maximizes fidelity | +5.2% compute, -40% context memory |
| Edge/Low-resource deployment | Norm-spatial only + coarse temporal stride | Reduces pyramid overhead, preserves core semantics | +2.8% compute, -25% context memory |
| High-fidelity archival/research | 2:1 PTG + NSP with full pyramid | Maximizes spatiotemporal retention for downstream fine-tuning | +6.1% compute, -15% context memory |
Configuration Template
{
  "spatiotemporal_processor": {
    "enabled": true,
    "spatial_pooling": {
      "strategy": "norm_weighted",
      "top_k_ratio": 0.45,
      "min_coverage_ratio": 0.15,
      "norm_threshold_sigma": 1.2,
      "normalize_per_batch": true
    },
    "temporal_gridding": {
      "strategy": "pyramid",
      "levels": 3,
      "stride_base": 2,
      "fps_aware_scaling": true,
      "reference_fps": 24.0
    },
    "post_processing": {
      "recompute_positional_encodings": true,
      "projection_adapter_dim": 4096,
      "dynamic_budget": {
        "enabled": true,
        "complexity_metric": "frame_diff_variance",
        "min_ratio": 0.3,
        "max_ratio": 0.6
      }
    }
  }
}
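The template uses snake_case keys, while the CompressionConfig interface uses camelCase fields, so some mapping step is needed. The loader below is a sketch under that assumption: `loadCompressionConfig` and `ProcessorJson` are hypothetical names, the field correspondence is inferred from the key names, and the interface is repeated so the snippet is self-contained.

```typescript
interface CompressionConfig {
  spatialTopKRatio: number;
  temporalPyramidLevels: number;
  minSpatialCoverage: number;
  normThresholdSigma: number;
}

// Only the fields this sketch reads from the JSON template.
interface ProcessorJson {
  spatiotemporal_processor: {
    spatial_pooling: {
      top_k_ratio: number;
      min_coverage_ratio: number;
      norm_threshold_sigma: number;
    };
    temporal_gridding: { levels: number };
  };
}

// Hypothetical loader: parse the JSON text, validate ranges, and map the
// values onto CompressionConfig.
function loadCompressionConfig(jsonText: string): CompressionConfig {
  const parsed = JSON.parse(jsonText) as ProcessorJson;
  const spatial = parsed.spatiotemporal_processor.spatial_pooling;
  const temporal = parsed.spatiotemporal_processor.temporal_gridding;
  const inUnitInterval = (v: number): boolean => v > 0 && v <= 1;
  if (!inUnitInterval(spatial.top_k_ratio) || !inUnitInterval(spatial.min_coverage_ratio)) {
    throw new RangeError("ratios must be in (0, 1]");
  }
  if (!Number.isInteger(temporal.levels) || temporal.levels < 1) {
    throw new RangeError("levels must be a positive integer");
  }
  return {
    spatialTopKRatio: spatial.top_k_ratio,
    temporalPyramidLevels: temporal.levels,
    minSpatialCoverage: spatial.min_coverage_ratio,
    normThresholdSigma: spatial.norm_threshold_sigma
  };
}
```

The remaining template fields (FPS scaling, post-processing, dynamic budget) are not consumed by the processor class shown earlier and would need their own handling.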
Quick Start Guide
- Extract Tokens: Run your video through the frozen vision encoder and flatten the output to (N, D) token arrays. Attach frame and spatial indices.
- Initialize Processor: Load the configuration template, instantiate SpatiotemporalTokenProcessor, and pass your token array to the process() method.
- Reconstruct Sequence: Flatten the returned tokens, regenerate positional encodings matching the new sequence length, and feed the result into your MLLM's visual projector.
- Validate: Run a small batch through your target benchmark. Compare token count, inference latency, and accuracy against your baseline pooling method. Adjust top_k_ratio and norm_threshold_sigma if fidelity drops.
- Deploy: Wrap the processor in a middleware layer between your video loader and LLM inference engine. Enable dynamic budget logging to monitor compression efficiency in production.