3, 768, 768)
dummy_decoder_input = torch.randint(0, 50000, (1, 1))
# Export the vision encoder as its own graph.
torch.onnx.export(
    model.vision_encoder,
    dummy_image,
    "florence_vision_encoder.onnx",
    input_names=["pixel_values"],
    output_names=["vision_features"],
    dynamic_axes={"pixel_values": {0: "batch", 2: "height", 3: "width"}},
    opset_version=17,
)

# Export the decoder separately; the random tensor stands in for cached vision features.
torch.onnx.export(
    model.language_decoder,
    (dummy_decoder_input, torch.randn(1, 576, 768)),
    "florence_language_decoder.onnx",
    input_names=["input_ids", "encoder_hidden_states"],
    output_names=["logits", "past_key_values"],
    dynamic_axes={"input_ids": {1: "seq_len"}, "encoder_hidden_states": {1: "vision_seq"}},
    opset_version=17,
)
```
**Rationale**: Caching `vision_features` eliminates redundant matrix multiplications, because the vision encoder runs once per frame instead of once per generated token. The decoder's computational cost shifts from O(V + T) to O(T), where V is the vision sequence length and T is the token count. This reduction is critical for maintaining consistent frame rates during camera streaming.
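The control flow behind this caching can be sketched without a real model. The snippet below is a minimal illustration, not the article's inference code: `generate`, `fake_encode`, and `fake_decode_step` are hypothetical names, and the stand-in callables only exist to show that the encoder is invoked once while the decoder runs once per token.

```python
from typing import Callable, List, Sequence

Features = Sequence[float]


def generate(
    encode: Callable[[object], Features],
    decode_step: Callable[[List[int], Features], int],
    image: object,
    max_new_tokens: int,
    bos_id: int = 0,
    eos_id: int = 1,
) -> List[int]:
    """Greedy decoding that runs the vision encoder exactly once per image."""
    vision_features = encode(image)  # computed once, reused for every token
    tokens = [bos_id]
    for _ in range(max_new_tokens):
        next_id = decode_step(tokens, vision_features)
        tokens.append(next_id)
        if next_id == eos_id:
            break
    return tokens


# Stand-in callables so the loop can be exercised without a model.
encoder_calls = 0


def fake_encode(image: object) -> Features:
    global encoder_calls
    encoder_calls += 1
    return [0.0] * 4


def fake_decode_step(tokens: List[int], features: Features) -> int:
    # Emit ids 2, 3, 4 and then the end-of-sequence id (1).
    return 1 if len(tokens) > 3 else len(tokens) + 1


output_tokens = generate(fake_encode, fake_decode_step, image=None, max_new_tokens=16)
```

With real sessions, `encode` would wrap the encoder graph and `decode_step` the decoder graph; the structure of the loop stays the same.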
### 2. Domain-Calibrated INT8 Quantization
Dynamic quantization recalculates scaling factors at runtime, introducing overhead and accuracy variance. Static quantization requires a calibration dataset that mirrors the target domain's activation distributions.
```python
from onnxruntime.quantization import (
    CalibrationDataReader,
    QuantFormat,
    QuantType,
    quantize_static,
)
import numpy as np


class DomainCalibrationReader(CalibrationDataReader):
    """Feeds preprocessed domain images (saved as .npy arrays) to the calibrator."""

    def __init__(self, image_paths, batch_size=1):
        self.paths = image_paths
        self.batch_size = batch_size
        self.index = 0

    def get_next(self):
        # Returning None tells the calibrator that the dataset is exhausted.
        if self.index >= len(self.paths):
            return None
        # Slice rather than index so a short final batch cannot run past the end.
        batch_paths = self.paths[self.index:self.index + self.batch_size]
        self.index += len(batch_paths)
        batch = [np.load(path).astype(np.float32) for path in batch_paths]
        return {"pixel_values": np.stack(batch, axis=0)}


# "..." stands for the remaining calibration files.
calibration_data = DomainCalibrationReader(["cal_001.npy", "cal_002.npy", ...])

quantize_static(
    model_input="florence_vision_encoder.onnx",
    model_output="florence_vision_encoder_int8.onnx",
    calibration_data_reader=calibration_data,
    quant_format=QuantFormat.QOperator,
    per_channel=True,  # per-channel weight quantization, as described in the rationale
    weight_type=QuantType.QUInt8,
    activation_type=QuantType.QUInt8,
    extra_options={"EnableSubgraph": True},
)
```
**Rationale**: Per-channel weight quantization combined with activation calibration minimizes quantization error in attention heads. The `EnableSubgraph` option extends quantization to subgraphs nested inside control-flow operators, so those regions are not left in floating point.
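To make the idea of calibration concrete, here is a minimal min-max sketch of how fixed `uint8` parameters can be derived from sample activations ahead of time. It illustrates the principle only and is not ONNX Runtime's internal implementation; `calibrate_minmax`, `quantize`, and `dequantize` are hypothetical helper names.

```python
from typing import Iterable, Tuple


def calibrate_minmax(activations: Iterable[float]) -> Tuple[float, int]:
    """Derive a fixed (scale, zero_point) for uint8 from calibration activations."""
    values = list(activations)
    # The representable range must include 0.0 so that zero maps to an exact integer.
    lo = min(min(values), 0.0)
    hi = max(max(values), 0.0)
    scale = (hi - lo) / 255.0 or 1.0  # guard against an all-zero calibration set
    zero_point = int(round(-lo / scale))
    return scale, zero_point


def quantize(x: float, scale: float, zero_point: int) -> int:
    """Map a float to uint8, saturating values outside the calibrated range."""
    q = int(round(x / scale)) + zero_point
    return max(0, min(255, q))


def dequantize(q: int, scale: float, zero_point: int) -> float:
    return (q - zero_point) * scale


# Four sample activations stand in for a real calibration set.
scale, zero_point = calibrate_minmax([-1.0, 0.25, 0.5, 3.0])
```

Because `scale` and `zero_point` are frozen after calibration, any activation outside the observed range saturates at 0 or 255, which is why the calibration images need to resemble production inputs.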
### 3. NNAPI Delegate Routing
ONNX Runtime Mobile delegates quantized operations to the device's NPU, GPU, or CPU based on capability flags. Explicit configuration prevents fallback to unoptimized CPU paths.
```kotlin
import ai.onnxruntime.OrtEnvironment
import ai.onnxruntime.OrtLoggingLevel
import ai.onnxruntime.OrtSession
import ai.onnxruntime.providers.NNAPIFlags
import android.content.Context
import java.util.EnumSet

object ModelSessionProvider {
    private val env = OrtEnvironment.getEnvironment()
    private lateinit var encoderSession: OrtSession
    private lateinit var decoderSession: OrtSession

    fun initialize(context: Context) {
        val options = OrtSession.SessionOptions().apply {
            // The Kotlin/Java binding takes an EnumSet of NNAPIFlags, not a string map.
            // CPU_DISABLED maps to NNAPI_FLAG_CPU_DISABLED; leaving USE_FP16 out of the
            // set keeps FP16 relaxation off.
            addNnapi(EnumSet.of(NNAPIFlags.CPU_DISABLED))
            setIntraOpNumThreads(4)
            setOptimizationLevel(OrtSession.SessionOptions.OptLevel.ALL_OPT)
            setSessionLogLevel(OrtLoggingLevel.ORT_LOGGING_LEVEL_WARNING)
        }
        encoderSession = env.createSession(
            readAsset(context, "florence_vision_encoder_int8.onnx"),
            options
        )
        decoderSession = env.createSession(
            readAsset(context, "florence_language_decoder_int8.onnx"),
            options
        )
    }

    // Read a model from assets/ and close the stream afterwards.
    private fun readAsset(context: Context, name: String): ByteArray =
        context.assets.open(name).use { it.readBytes() }
}
```
**Rationale**: Leaving FP16 relaxation off keeps execution on the INT8 paths the quantized model was built for. Limiting intra-op threads to 4 prevents CPU saturation on devices with mixed core architectures, reducing thermal throttling during sustained inference.
### 4. Zero-Copy Image Preprocessing
CameraX delivers frames as `ImageProxy` objects in `YUV_420_888` format. Converting to `Bitmap` introduces intermediate allocations and GC pressure. Direct YUV-to-tensor conversion eliminates this bottleneck.
```kotlin
import ai.onnxruntime.OnnxTensor
import ai.onnxruntime.OrtEnvironment
import android.media.Image
import java.nio.ByteBuffer
import java.nio.ByteOrder
import java.nio.FloatBuffer

class FrameTensorConverter {
    private val mean = floatArrayOf(0.485f, 0.456f, 0.406f)
    private val std = floatArrayOf(0.229f, 0.224f, 0.225f)
    private val targetSize = 768

    // One direct buffer (4 bytes per float), allocated once and reused for every frame.
    private val floatBuffer: FloatBuffer = ByteBuffer
        .allocateDirect(3 * targetSize * targetSize * 4)
        .order(ByteOrder.nativeOrder())
        .asFloatBuffer()

    // The returned tensor shares memory with floatBuffer, so run inference on it
    // before converting the next frame.
    fun convertToInputTensor(imageProxy: Image): OnnxTensor {
        val yPlane = imageProxy.planes[0].buffer
        val uvPlane = imageProxy.planes[1].buffer
        val width = imageProxy.width
        val height = imageProxy.height

        floatBuffer.rewind()
        // NativeYuvConverter is the app's own JNI helper (not shown): it resizes,
        // converts YUV to RGB, and normalizes into floatBuffer in a single pass.
        NativeYuvConverter.convertAndNormalize(
            yPlane, uvPlane, width, height,
            floatBuffer, targetSize, targetSize,
            mean, std
        )
        floatBuffer.rewind()

        return OnnxTensor.createTensor(
            OrtEnvironment.getEnvironment(),
            floatBuffer,
            longArrayOf(1, 3, targetSize.toLong(), targetSize.toLong())
        )
    }
}
```
**Rationale**: A single native call handles resizing, color space conversion, and normalization in ~3ms. Skipping `Bitmap` creation removes ~15ms of allocation overhead and eliminates GC events that cause tail latency spikes during camera streaming.
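The per-pixel arithmetic that such a native routine performs can be sketched as follows. This is an illustration under stated assumptions: the BT.601 full-range coefficients are an assumption (the source does not say which color matrix the native helper uses), `yuv_to_normalized_rgb` is a hypothetical name, and resizing is omitted. The mean and standard deviation values are the ones used in the converter above.

```python
from typing import Tuple

MEAN = (0.485, 0.456, 0.406)
STD = (0.229, 0.224, 0.225)


def _clamp(value: float) -> float:
    """Keep a channel value inside the valid 8-bit range."""
    return max(0.0, min(255.0, value))


def yuv_to_normalized_rgb(y: int, u: int, v: int) -> Tuple[float, float, float]:
    """Convert one YUV pixel to normalized RGB (assumes BT.601 full-range)."""
    r = _clamp(y + 1.402 * (v - 128))
    g = _clamp(y - 0.344136 * (u - 128) - 0.714136 * (v - 128))
    b = _clamp(y + 1.772 * (u - 128))
    # Scale to [0, 1], then apply per-channel mean/std normalization.
    normalized = [(c / 255.0 - m) / s for c, m, s in zip((r, g, b), MEAN, STD)]
    return normalized[0], normalized[1], normalized[2]


# Neutral chroma (u = v = 128) yields a gray pixel with equal R, G, and B before normalization.
gray = yuv_to_normalized_rgb(128, 128, 128)
white = yuv_to_normalized_rgb(255, 128, 128)
```

Doing the color conversion and normalization in one pass is what lets the pipeline write straight into the tensor buffer without an intermediate RGB image.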
### 5. Deterministic KV Cache Management
Autoregressive decoding requires storing key-value pairs for previous tokens. Dynamic allocation during generation triggers memory fragmentation and garbage collection. Pre-allocating a fixed buffer with zero-copy slicing ensures deterministic latency.
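```kotlin
import ai.onnxruntime.OnnxTensor
import ai.onnxruntime.OrtEnvironment
import java.nio.ByteBuffer
import java.nio.ByteOrder
import java.nio.FloatBuffer

class SequenceBufferPool(
    private val maxTokens: Int,
    private val layers: Int,
    private val hiddenDim: Int
) {
    private val env = OrtEnvironment.getEnvironment()

    // Floats per token across all layers, counted for keys or values alone.
    private val floatsPerToken = layers * hiddenDim
    private val floatsPerHalf = maxTokens * floatsPerToken

    // Keys fill the first half, values the second: layers * 2 * maxTokens * hiddenDim * 4 bytes.
    private val cacheSize = 2 * floatsPerHalf * 4
    private val buffer = ByteBuffer.allocateDirect(cacheSize).order(ByteOrder.nativeOrder())

    fun acquireView(tokenOffset: Int, tokenLength: Int): Map<String, OnnxTensor> {
        require(tokenOffset >= 0 && tokenLength > 0 && tokenOffset + tokenLength <= maxTokens) {
            "Requested tokens fall outside the pre-allocated cache"
        }
        val shape = longArrayOf(1, layers.toLong(), tokenLength.toLong(), hiddenDim.toLong())
        return mapOf(
            "past_key" to OnnxTensor.createTensor(
                env, floatView(0, tokenOffset, tokenLength), shape
            ),
            "past_value" to OnnxTensor.createTensor(
                env, floatView(floatsPerHalf, tokenOffset, tokenLength), shape
            )
        )
    }

    // Zero-copy window over the shared buffer; duplicate() leaves the pool's own position untouched.
    private fun floatView(halfStart: Int, tokenOffset: Int, tokenLength: Int): FloatBuffer {
        val view = buffer.duplicate().order(ByteOrder.nativeOrder())
        val startBytes = (halfStart + tokenOffset * floatsPerToken) * 4
        view.position(startBytes)
        view.limit(startBytes + tokenLength * floatsPerToken * 4)
        return view.slice().order(ByteOrder.nativeOrder()).asFloatBuffer()
    }

    fun release() {
        buffer.clear()
    }
}
```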
**Rationale**: Pre-allocating 80MB for a 256-token sequence eliminates runtime allocation. Zero-copy slicing reduces p99 latency by ~40% by preventing GC pauses during the autoregressive loop. This is the single most impactful change for production stability.
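The buffer size follows directly from the `cacheSize` formula (`layers * 2 * maxTokens * hiddenDim * 4`). A small helper makes the arithmetic easy to rerun for a given configuration. The function name is hypothetical, and the layer count in the example is an illustrative value, not the model's actual configuration.

```python
def kv_cache_bytes(layers: int, max_tokens: int, hidden_dim: int, bytes_per_value: int = 4) -> int:
    """Bytes needed to hold keys and values for every layer and token (float32 by default)."""
    keys_and_values = 2
    return layers * keys_and_values * max_tokens * hidden_dim * bytes_per_value


# Illustrative configuration: 6 layers is an assumed value chosen only for the example.
example_bytes = kv_cache_bytes(layers=6, max_tokens=256, hidden_dim=768)
example_mib = example_bytes / (1024 * 1024)
```

Since the size grows linearly in each factor, halving the maximum sequence length halves the pre-allocated buffer.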
## Pitfall Guide
| Pitfall | Explanation | Fix |
|---|---|---|
| Monolithic Graph Export | Running encoder and decoder together forces recomputation of vision embeddings per token, increasing latency and memory pressure. | Split ONNX export into separate encoder/decoder graphs. Cache vision features and pass them as static inputs to the decoder. |
| Generic Calibration Data | Using random internet images for INT8 calibration creates activation distribution mismatches, causing >2% accuracy loss. | Curate 200–500 images matching the target domain (e.g., document scans, retail shelves, UI screenshots). |
| Bitmap Allocation in Camera Loop | `Bitmap.createBitmap()` triggers native memory allocation and GC events, causing 15–30ms latency spikes. | Convert YUV `ImageProxy` directly to ONNX tensor buffers using native preprocessing. |
| Unbounded KV Cache Growth | Dynamically allocating key-value tensors during decoding fragments memory and triggers GC pauses. | Pre-allocate a fixed `ByteBuffer` sized for maximum sequence length. Use zero-copy slicing for each step. |
| Concurrent Decoder Execution | Running multiple autoregressive loops in parallel exceeds memory limits and causes NNAPI delegate contention. | Serialize decoder generation on a single-threaded dispatcher. Keep encoder inference parallelized. |
| Ignoring NNAPI Fallback | Default delegate configuration may silently fall back to CPU for unsupported ops, doubling latency. | Explicitly set `NNAPI_FLAG_CPU_DISABLED=1` and monitor session logs for fallback warnings. |
| Over-Provisioning Intra-Op Threads | Setting thread count to device core count (e.g., 8+) causes CPU contention and thermal throttling. | Limit to 4 threads. Match physical performance cores, not logical threads. |
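The "Concurrent Decoder Execution" fix can be illustrated in a language-neutral way: encoder work goes to a multi-worker pool, while decoder work goes to a pool with exactly one worker. The sketch below uses Python's standard library purely to show the scheduling pattern; `encode` and `decode` are hypothetical stand-ins, and on Android the equivalent would be a single-parallelism coroutine dispatcher.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

encoder_pool = ThreadPoolExecutor(max_workers=4)  # encoder calls may overlap
decoder_pool = ThreadPoolExecutor(max_workers=1)  # decoder calls run one at a time

_lock = threading.Lock()
_active_decoders = 0
max_active_decoders = 0


def encode(frame_id: int) -> str:
    return f"features-{frame_id}"


def decode(features: str) -> str:
    """Stand-in decoder that records how many decoders run at the same time."""
    global _active_decoders, max_active_decoders
    with _lock:
        _active_decoders += 1
        max_active_decoders = max(max_active_decoders, _active_decoders)
    time.sleep(0.01)  # simulate an autoregressive loop
    with _lock:
        _active_decoders -= 1
    return f"caption for {features}"


feature_futures = [encoder_pool.submit(encode, i) for i in range(4)]
caption_futures = [decoder_pool.submit(decode, f.result()) for f in feature_futures]
captions = [f.result() for f in caption_futures]

encoder_pool.shutdown()
decoder_pool.shutdown()
```

Because the decoder pool has a single worker, at most one KV cache is live at any moment, which keeps peak memory bounded regardless of how many frames are queued.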
## Production Bundle
### Action Checklist
### Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| Real-time camera streaming | INT8 Static + Pre-allocated KV Cache + NNAPI NPU routing | Minimizes GC pauses and thermal throttling; maintains 10–12 tok/sec | Low (calibration dataset curation) |
| Batch document processing | FP16 + CPU fallback + Dynamic batch sizing | Higher accuracy tolerance; no real-time latency requirement | Medium (higher RAM usage) |
| Low-end Android devices (3GB RAM) | INT8 Dynamic + Reduced sequence length (128 tokens) | Prioritizes memory safety over peak accuracy | Low (accuracy drop ~1.5%) |
| Multi-modal UI analysis | Encoder parallelization + Serialized decoder | Maximizes throughput while respecting memory boundaries | Low (thread management overhead) |
### Configuration Template
```kotlin
import ai.onnxruntime.OrtLoggingLevel
import ai.onnxruntime.OrtSession
import ai.onnxruntime.providers.NNAPIFlags
import java.util.EnumSet

// ONNX Runtime Mobile Session Configuration
val sessionConfig = OrtSession.SessionOptions().apply {
    // CPU_DISABLED maps to NNAPI_FLAG_CPU_DISABLED; USE_FP16 is left out, so FP16 stays off.
    addNnapi(EnumSet.of(NNAPIFlags.CPU_DISABLED))
    setIntraOpNumThreads(4)
    setOptimizationLevel(OrtSession.SessionOptions.OptLevel.ALL_OPT)
    setSessionLogLevel(OrtLoggingLevel.ORT_LOGGING_LEVEL_WARNING)
}

// Memory Budget Allocation (values in MB)
val memoryAllocation = mapOf(
    "runtime_overhead" to 45,
    "encoder_model" to 120,
    "decoder_model" to 110,
    "kv_cache_256" to 80,
    "preproc_buffer" to 14,
    "tokenizer_overhead" to 20
)
// Total: 389 MB (111 MB headroom under a 500 MB limit)
```
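A budget like this is easy to let drift as components change, so it can help to check the arithmetic mechanically. The following sketch sums the entries from the template against the 500 MB limit; the variable names are illustrative.

```python
# Budget entries in MB, copied from the configuration template.
MEMORY_BUDGET_MB = {
    "runtime_overhead": 45,
    "encoder_model": 120,
    "decoder_model": 110,
    "kv_cache_256": 80,
    "preproc_buffer": 14,
    "tokenizer_overhead": 20,
}
LIMIT_MB = 500

total_mb = sum(MEMORY_BUDGET_MB.values())
headroom_mb = LIMIT_MB - total_mb

if headroom_mb < 0:
    raise ValueError(f"Budget exceeds limit by {-headroom_mb} MB")
```

The entries sum to 389 MB, which leaves 111 MB of headroom under the 500 MB limit.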
### Quick Start Guide
- **Export Split Graphs:** Run the Python ONNX export script to generate separate encoder and decoder `.onnx` files with dynamic axes enabled.
- **Calibrate & Quantize:** Prepare a domain-specific image set, run static INT8 quantization with `EnableSubgraph=True`, and verify that the accuracy drop stays below 1.5%.
- **Integrate Android Dependencies:** Add `com.microsoft.onnxruntime:onnxruntime-mobile:1.18.0` to `build.gradle`, place the quantized models in `assets/`, and initialize `ModelSessionProvider` in `Application.onCreate()`.
- **Wire Camera Pipeline:** Replace the `ImageAnalysis.Analyzer` bitmap conversion with `FrameTensorConverter`, route output through `SequenceBufferPool`, and dispatch decoder generation on `Dispatchers.Default.limitedParallelism(1)`.
- **Validate Memory & Latency:** Run the Android Studio Memory Profiler during camera streaming. Confirm peak RAM <400MB, p99 latency <85ms, and NNAPI delegate utilization >80%.