routing, and backpressure management.
Architecture Rationale
We avoid synchronous metric polling because reading Runtime.getRuntime() or querying thermal status on the main thread introduces jitter. Instead, we use Kotlin's SharedFlow for thermal events (a hot stream that broadcasts to multiple subscribers) and StateFlow for strategy state (also hot, but it always holds the latest value and replays it to new collectors). The executor remains completely unaware of thermal logic; it simply receives a ComputeBackend resolved from the current ExecutionPolicy and runs on it. This separation ensures that thermal monitoring can be swapped, mocked, or extended without touching the inference engine.
Implementation
import android.content.Context
import android.os.PowerManager
import kotlinx.coroutines.*
import kotlinx.coroutines.flow.*
// Hardware routing and execution policy definitions
enum class ComputeBackend { NPU, GPU, CPU }
enum class ExecutionPolicy { FULL_THROTTLE, THERMAL_CAPPED, EMERGENCY_PAUSE }
data class InferenceSessionMetrics(
val ttftMillis: Long,
val avgTps: Double,
val memoryDeltaMb: Long,
val backendUsed: ComputeBackend
)
/**
* Central controller for thermal-aware LLM inference.
* Decouples hardware monitoring from execution strategy.
*/
class OnDeviceInferenceController(
private val context: Context,
private val scope: CoroutineScope
) {
private val powerManager = context.getSystemService(Context.POWER_SERVICE) as PowerManager
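// Hot stream: broadcasts thermal changes to all subscribers.
// DROP_OLDEST keeps tryEmit from rejecting the newest status when the buffer is full.
private val _thermalEvents = MutableSharedFlow<Int>(
extraBufferCapacity = 4,
onBufferOverflow = kotlinx.coroutines.channels.BufferOverflow.DROP_OLDEST
)
val thermalEvents: SharedFlow<Int> = _thermalEvents.asSharedFlow()
// State holder (hot): always exposes the latest execution policy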
private val _currentPolicy = MutableStateFlow(ExecutionPolicy.FULL_THROTTLE)
val currentPolicy: StateFlow<ExecutionPolicy> = _currentPolicy.asStateFlow()
init {
// Bind Android's callback API to a Kotlin Flow (thermal status listeners require API 29+)
val thermalListener = PowerManager.OnThermalStatusChangedListener { status ->
_thermalEvents.tryEmit(status)
}
powerManager.addThermalStatusListener(thermalListener)
// Map thermal thresholds to execution policies
thermalEvents
.map { status ->
when {
status >= PowerManager.THERMAL_STATUS_CRITICAL -> ExecutionPolicy.EMERGENCY_PAUSE
status >= PowerManager.THERMAL_STATUS_SEVERE -> ExecutionPolicy.THERMAL_CAPPED
else -> ExecutionPolicy.FULL_THROTTLE
}
}
.onEach { policy -> _currentPolicy.value = policy }
.launchIn(scope)
}
/**
* Executes an inference task with automatic metric collection and policy enforcement.
*/
suspend fun runInference(
prompt: String,
inferenceEngine: suspend (String, ComputeBackend) -> Flow<String>
): Result<InferenceSessionMetrics> = withContext(Dispatchers.Default) {
val policy = _currentPolicy.value
if (policy == ExecutionPolicy.EMERGENCY_PAUSE) {
return@withContext Result.failure(IllegalStateException("Thermal threshold exceeded. Inference suspended."))
}
val backend = when (policy) {
ExecutionPolicy.FULL_THROTTLE -> ComputeBackend.NPU
ExecutionPolicy.THERMAL_CAPPED -> ComputeBackend.CPU
else -> ComputeBackend.CPU
}
val memBefore = getUsedMemoryMb()
val startTime = System.nanoTime()
var tokenCount = 0
var firstTokenTime = 0L
try {
inferenceEngine(prompt, backend)
.collect { token ->
if (tokenCount == 0) firstTokenTime = System.nanoTime() - startTime
tokenCount++
}
val totalTimeSec = (System.nanoTime() - startTime) / 1_000_000_000.0
val memAfter = getUsedMemoryMb()
Result.success(
InferenceSessionMetrics(
ttftMillis = firstTokenTime / 1_000_000,
avgTps = tokenCount / totalTimeSec,
memoryDeltaMb = memAfter - memBefore,
backendUsed = backend
)
)
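} catch (e: CancellationException) {
// Rethrow so coroutine cancellation is not reported as an inference failure
throw e
} catch (e: Exception) {
Result.failure(e)
}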
}
private fun getUsedMemoryMb(): Long {
val rt = Runtime.getRuntime()
return (rt.totalMemory() - rt.freeMemory()) / (1024 * 1024)
}
}
Why This Architecture Works
- Non-blocking thermal listening: tryEmit on a SharedFlow with extraBufferCapacity never suspends, so a slow collector cannot stall the Android system callback. Thermal events are buffered and processed asynchronously.
- Policy-driven execution: The inference engine never queries thermal status directly. Each run reads the current ExecutionPolicy once at the start and resolves a backend from it, so a thermal change mid-run cannot switch backends partway through generation; the new policy applies to the next call.
- Hardware fallback chain: When THERMAL_CAPPED activates, we route to the CPU instead of the NPU. CPUs scale down more gracefully under DVFS and generate less concentrated heat than NPUs/GPUs during sustained matrix operations.
- Metric isolation: TPS and the memory delta are computed after collection finishes; only the first-token timestamp is captured inside the loop. This avoids the observer effect where frequent memory sampling degrades inference latency. Note that getUsedMemoryMb() reports Java heap usage, not process RSS, so native model allocations are not included.
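The post-collection metric calculation can be pulled into a pure function, which makes the arithmetic easy to unit test. The sketch below is a minimal illustration rather than part of the controller above: SessionTimings, DerivedMetrics, and deriveMetrics are hypothetical names, and returning a null TTFT for an empty stream is an assumption about how you want to report that case.

```kotlin
// Hypothetical helper (not part of the controller above): derives session
// metrics from raw nanosecond timestamps captured during collection.
data class SessionTimings(
    val startNanos: Long,
    val firstTokenNanos: Long?, // null when the engine produced no tokens
    val endNanos: Long,
    val tokenCount: Int
)

data class DerivedMetrics(val ttftMillis: Long?, val avgTps: Double)

fun deriveMetrics(t: SessionTimings): DerivedMetrics {
    val totalSec = (t.endNanos - t.startNanos) / 1_000_000_000.0
    // Guard against a zero-length session so TPS never becomes NaN or Infinity.
    val tps = if (totalSec > 0.0) t.tokenCount / totalSec else 0.0
    // TTFT is measured from session start to the first emitted token.
    val ttft = t.firstTokenNanos?.let { (it - t.startNanos) / 1_000_000 }
    return DerivedMetrics(ttftMillis = ttft, avgTps = tps)
}
```

Keeping the timestamps raw until the end means the token loop does nothing beyond reading a clock and incrementing a counter.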
Pitfall Guide
1. Cold-Start Benchmarking Fallacy
Explanation: Measuring performance immediately after device boot or app launch captures peak hardware capability before thermal accumulation occurs. Shipping based on these numbers guarantees user-facing degradation after 2–3 minutes of usage.
Fix: Implement sustained load testing. Run inference loops for 10+ minutes while logging TPS decay curves. Optimize for the 5-minute steady-state metric, not the initial spike.
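One way to turn a logged TPS decay curve into a single number is sketched below. It assumes you already record one average TPS value per fixed time window during the sustained run; treating the last half of the run as steady state is an arbitrary choice for illustration, and DecayReport and analyzeTpsDecay are hypothetical names.

```kotlin
// Sketch only: the window length and the "last half of the run" steady-state
// rule are assumptions, not measured recommendations.
data class DecayReport(
    val peakTps: Double,
    val steadyStateTps: Double,
    val retainedFraction: Double // steady-state TPS divided by peak TPS
)

fun analyzeTpsDecay(tpsPerWindow: List<Double>): DecayReport {
    require(tpsPerWindow.isNotEmpty()) { "need at least one TPS sample" }
    val peak = tpsPerWindow.maxOf { it }
    // Early windows capture the cold-start spike, so only the tail is averaged.
    val tail = tpsPerWindow.drop(tpsPerWindow.size / 2)
    val steady = tail.average()
    val retained = if (peak > 0.0) steady / peak else 0.0
    return DecayReport(peak, steady, retained)
}
```

A retained fraction well below 1.0 indicates that cold-start numbers overstate what users will see after the device warms up.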
2. Synchronous Memory Sampling
Explanation: Calling Runtime.getRuntime().freeMemory() inside the inference loop blocks the coroutine and introduces micro-stutters. The JVM/ART garbage collector may trigger during sampling, skewing results.
Fix: Sample memory only at session boundaries (start/end). For process-level figures that include native allocations, use Android's Debug.getMemoryInfo() or ActivityManager.getProcessMemoryInfo(), which report PSS, and call them off the inference thread.
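Boundary-only sampling can be enforced structurally by wrapping the session in a helper that takes exactly two readings. This is a minimal sketch: MemorySampler, heapSampler, and withMemoryDelta are hypothetical names, and the default sampler reads JVM/ART heap usage only, so on Android you would likely substitute a PSS-based sampler.

```kotlin
// Hypothetical abstraction so the sampling source can be swapped or faked in tests.
fun interface MemorySampler {
    fun usedMb(): Long
}

// Default sampler: JVM/ART heap usage in MiB (does not include native allocations).
val heapSampler = MemorySampler {
    val rt = Runtime.getRuntime()
    (rt.totalMemory() - rt.freeMemory()) / (1024 * 1024)
}

// Takes one sample before and one after the block, never inside it.
inline fun <T> withMemoryDelta(sampler: MemorySampler, block: () -> T): Pair<T, Long> {
    val before = sampler.usedMb()
    val result = block()
    val after = sampler.usedMb()
    return result to (after - before)
}
```

Because the token loop lives inside the block, there is no place for a per-token memory read to sneak in.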
3. Ignoring Android's LMKD Thresholds
Explanation: On-device LLMs easily exceed 1.5GB of working memory. If your app's RSS crosses the LMKD threshold for FOREGROUND_APP or VISIBLE_APP, Android will aggressively kill background services, causing music players, navigation, or system UI to restart.
Fix: Implement memory budgeting. Cap context window size dynamically based on available memory. Use ActivityManager.getMemoryInfo() (availMem, threshold, lowMemory) to trigger context truncation or streaming pauses before hitting LMKD limits.
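A memory budget can be reduced to a small calculation that converts free headroom into a context-token cap. The sketch below is illustrative only: MemoryBudget and maxContextTokens are hypothetical names, and the per-token KV-cache cost and safety margin are placeholders you would need to measure for your own model.

```kotlin
// All sizes are in bytes. The field comments show where the values could come
// from on Android; the numbers themselves are assumptions to be measured.
data class MemoryBudget(
    val availableBytes: Long,     // e.g. ActivityManager.MemoryInfo.availMem
    val lowMemoryThreshold: Long, // e.g. ActivityManager.MemoryInfo.threshold
    val safetyMarginBytes: Long   // headroom kept free for the rest of the system
)

fun maxContextTokens(budget: MemoryBudget, kvBytesPerToken: Long, hardCap: Int): Int {
    require(kvBytesPerToken > 0) { "kvBytesPerToken must be positive" }
    val usable = budget.availableBytes - budget.lowMemoryThreshold - budget.safetyMarginBytes
    if (usable <= 0) return 0 // no headroom: caller should pause or truncate
    // Never exceed the model's configured maximum, even with plenty of memory.
    return minOf(usable / kvBytesPerToken, hardCap.toLong()).toInt()
}
```

A result of zero is the signal to pause streaming or truncate context rather than start a new generation.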
4. Static Hardware Routing
Explanation: Hardcoding useNPU = true assumes the accelerator is always available and thermally viable. When the NPU throttles, it may actually perform worse than a lightly loaded CPU core due to voltage scaling penalties.
Fix: Implement dynamic backend selection. Monitor thermal status (PowerManager) and BatteryManager state. Route to CPU when thermal status exceeds MODERATE, and fall back to CPU if NPU driver initialization fails.
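The routing rule in this fix fits in a single pure function. The sketch below assumes an npuAvailable flag produced by some driver-initialization probe that is not shown here; selectBackend is a hypothetical name, and the enum is repeated only so the snippet compiles on its own.

```kotlin
// Mirrors the enum from the Implementation section so this sketch is self-contained.
enum class ComputeBackend { NPU, GPU, CPU }

// Same value as PowerManager.THERMAL_STATUS_MODERATE on Android.
val THERMAL_STATUS_MODERATE = 2

// npuAvailable is assumed to come from a driver-initialization probe (not shown).
fun selectBackend(thermalStatus: Int, npuAvailable: Boolean): ComputeBackend = when {
    thermalStatus > THERMAL_STATUS_MODERATE -> ComputeBackend.CPU // thermally constrained
    !npuAvailable -> ComputeBackend.CPU                           // NPU init failed
    else -> ComputeBackend.NPU
}
```

Battery state could be added as a third input in the same way; it is left out here to keep the rule readable.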
5. Unbounded Token Streaming
Explanation: Collecting tokens into a List<String> or StringBuilder without backpressure causes memory allocation spikes. The UI thread may also block if rendering outpaces generation.
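Fix: Use Kotlin Flow with a bounded buffer (for example buffer(64)) so a slow collector suspends the producer instead of letting memory grow; buffer(Channel.UNLIMITED) reintroduces the unbounded growth this pitfall describes. Use conflate() only when each emission carries the full text so far, because it drops intermediate values. Implement chunked rendering (e.g., update UI every 3–5 tokens) to reduce layout passes and maintain 60fps scrolling.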
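Chunked rendering can be sketched without any Flow machinery as a small accumulator that forwards text to the UI once per N tokens. TokenChunker is a hypothetical helper, and the render callback stands in for whatever updates your view; in a Flow pipeline you would call onToken from collect and flush when collection completes.

```kotlin
// Hypothetical helper: batches streamed tokens so the UI updates once per chunk.
class TokenChunker(private val chunkSize: Int, private val render: (String) -> Unit) {
    private val pending = StringBuilder()
    private var count = 0

    init {
        require(chunkSize > 0) { "chunkSize must be positive" }
    }

    fun onToken(token: String) {
        pending.append(token)
        if (++count >= chunkSize) flush()
    }

    // Call once when the stream completes so a trailing partial chunk is not lost.
    fun flush() {
        if (pending.isNotEmpty()) render(pending.toString())
        pending.setLength(0)
        count = 0
    }
}
```

The accumulator holds at most one chunk of text at a time, so its own memory use stays bounded regardless of response length.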
6. Thermal State Polling
Explanation: Using delay(1000) loops to check PowerManager.currentThermalStatus wastes CPU cycles and introduces latency between thermal events and policy updates.
Fix: Prefer event-driven listeners (OnThermalStatusChangedListener, API 29+). Convert callbacks to SharedFlow as shown in the core solution. Policy updates then follow the system callback directly instead of waiting up to a full polling interval.
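The event-driven pattern, including the unregistration step that is easy to forget, can be illustrated with a plain-Kotlin tracker. Everything here is a sketch under assumptions: ThermalSource is a hypothetical abstraction over PowerManager's add/removeThermalStatusListener pair, InMemoryThermalSource is a test double, and the enum is repeated only so the snippet compiles on its own.

```kotlin
// Mirrors the enum from the Implementation section so this sketch is self-contained.
enum class ExecutionPolicy { FULL_THROTTLE, THERMAL_CAPPED, EMERGENCY_PAUSE }

// Same values as PowerManager.THERMAL_STATUS_SEVERE and THERMAL_STATUS_CRITICAL.
object ThermalLevels {
    const val SEVERE = 3
    const val CRITICAL = 4
}

// Hypothetical abstraction over the platform's listener registration calls.
interface ThermalSource {
    fun addListener(listener: (Int) -> Unit)
    fun removeListener(listener: (Int) -> Unit)
}

class ThermalPolicyTracker(private val source: ThermalSource) : AutoCloseable {
    @Volatile
    var policy: ExecutionPolicy = ExecutionPolicy.FULL_THROTTLE
        private set

    private val listener: (Int) -> Unit = { status ->
        policy = when {
            status >= ThermalLevels.CRITICAL -> ExecutionPolicy.EMERGENCY_PAUSE
            status >= ThermalLevels.SEVERE -> ExecutionPolicy.THERMAL_CAPPED
            else -> ExecutionPolicy.FULL_THROTTLE
        }
    }

    init {
        source.addListener(listener)
    }

    // Unregister so the source does not keep a reference to a discarded tracker.
    override fun close() = source.removeListener(listener)
}

// Test double: lets unit tests push thermal statuses without a device.
class InMemoryThermalSource : ThermalSource {
    val listeners = mutableListOf<(Int) -> Unit>()
    override fun addListener(listener: (Int) -> Unit) { listeners.add(listener) }
    override fun removeListener(listener: (Int) -> Unit) { listeners.remove(listener) }
    fun emit(status: Int) = listeners.toList().forEach { it(status) }
}
```

No loop or delay appears anywhere: the policy changes only when the source pushes a new status, and close() gives the owner an explicit place to detach.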
7. Over-Quantization Without Accuracy Validation
Explanation: Developers aggressively quantize models (INT8, INT4) to reduce RSS and heat, but skip validation on domain-specific prompts. The model may appear fast and cool but produce hallucinated or nonsensical outputs.
Fix: Maintain a golden dataset of 50–100 representative prompts. Run automated BLEU/ROUGE or LLM-as-a-judge evaluations after quantization. Only deploy quantized weights if accuracy degradation stays within acceptable bounds (<5% for general chat, <2% for code/math).
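The deployment gate described here comes down to comparing mean scores before and after quantization. The sketch below assumes each golden prompt already has a baseline and a quantized score from whichever metric you use (higher is better); EvalResult, relativeDegradation, and passesAccuracyGate are hypothetical names.

```kotlin
// One row per golden prompt; scores come from your chosen evaluation metric.
data class EvalResult(
    val promptId: String,
    val baselineScore: Double,
    val quantizedScore: Double
)

// Relative drop in mean score: 0.03 means the quantized model scores 3% lower.
fun relativeDegradation(results: List<EvalResult>): Double {
    require(results.isNotEmpty()) { "golden dataset must not be empty" }
    val baseline = results.sumOf { it.baselineScore } / results.size
    val quantized = results.sumOf { it.quantizedScore } / results.size
    require(baseline > 0.0) { "baseline mean score must be positive" }
    return (baseline - quantized) / baseline
}

// maxDegradation would be 0.05 for general chat or 0.02 for code/math per the bounds above.
fun passesAccuracyGate(results: List<EvalResult>, maxDegradation: Double): Boolean =
    relativeDegradation(results) <= maxDegradation
```

Running this in CI after each quantization change gives a pass/fail signal instead of relying on spot checks.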
Production Bundle
Action Checklist
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| Real-time conversational chat | Thermal-adaptive routing + chunked streaming | Maintains >5 TPS readability threshold while preventing UI jank | Low (minor latency trade-off for stability) |
| Batch document summarization | CPU-only fallback + context chunking | NPUs throttle quickly on long prompts; CPU scales linearly with less heat spike | Medium (slower completion, but avoids thermal shutdown) |
| Offline translation assistant | Pre-quantized INT8 model + NPU burst | Translation requires low TTFT; INT8 preserves accuracy while reducing RSS by ~40% | Low (one-time quantization validation effort) |
| Multi-modal AI (text + image) | Dynamic memory budgeting + LMKD awareness | Image embeddings + LLM context easily exceed 3GB; proactive truncation prevents OOM kills | High (requires complex context management logic) |
Configuration Template
// build.gradle.kts (Module: app)
dependencies {
implementation("androidx.ai:ai-core:1.0.0")
implementation("org.jetbrains.kotlinx:kotlinx-coroutines-android:1.8.1")
}
// AiCoreConfig.kt
import androidx.ai.core.AiCore
import androidx.ai.core.ModelManager
import androidx.ai.core.SessionConfig
object AiCoreConfig {
fun createSessionConfig(): SessionConfig {
return SessionConfig.Builder()
.setModelName("gemini-nano")
.setQuantization("int8")
.setMaxContextTokens(2048)
.setHardwareAcceleration(true)
.setThermalAware(true)
.build()
}
suspend fun initializeModel(): Result<Unit> {
return try {
val manager = AiCore.getModelManager()
manager.downloadModel("gemini-nano")
manager.loadModel("gemini-nano", createSessionConfig())
Result.success(Unit)
} catch (e: Exception) {
Result.failure(e)
}
}
}
Quick Start Guide
- Add AICore Dependency: Include androidx.ai:ai-core in your build.gradle.kts and sync the project.
- Initialize the Controller: Instantiate OnDeviceInferenceController in your ViewModel or Application class, passing a CoroutineScope tied to the lifecycle.
- Bind Thermal Events: Observe controller.currentPolicy in your UI layer. Disable the "Generate" button when EMERGENCY_PAUSE is active.
- Execute Inference: Call controller.runInference(prompt) { p, backend -> yourStreamingEngine(p, backend) }. Handle the Result to update UI with metrics or error states.
- Validate Thermal Behavior: Run the app on a physical device under sustained load. Verify that TPS stabilizes and the policy transitions from FULL_THROTTLE to THERMAL_CAPPED without crashing or evicting background apps.