ger MCP joint (landmark 9) defines the hand's scale. Dividing all coordinates by the Euclidean distance between these two points ensures that the same gesture produces identical feature vectors regardless of camera distance or hand size.
Dart Implementation (Flutter Inference Layer):
import 'dart:math' as math;

/// Converts raw hand landmarks into wrist-relative, scale-normalized features.
class HandVectorNormalizer {
  static const int _landmarksPerHand = 21;
  static const int _dimensions = 3;

  /// Each entry of [rawLandmarks] is one hand as a flat list of
  /// 63 values (x, y, z for each of the 21 landmarks).
  List<double> normalize(List<List<double>> rawLandmarks) {
    if (rawLandmarks.isEmpty) return [];
    final List<double> normalized = [];
    for (final hand in rawLandmarks) {
      // Skip incomplete detections instead of reading out of range.
      if (hand.length < _landmarksPerHand * _dimensions) continue;
      final wrist = _extractPoint(hand, 0);
      final midMcp = _extractPoint(hand, 9);
      final double span = _euclideanDistance(wrist, midMcp);
      if (span < 0.001) continue; // Prevent division by (near) zero
      for (int i = 0; i < _landmarksPerHand; i++) {
        final point = _extractPoint(hand, i);
        // Shift to the wrist origin, then scale by the hand span.
        normalized.add((point.x - wrist.x) / span);
        normalized.add((point.y - wrist.y) / span);
        normalized.add((point.z - wrist.z) / span);
      }
    }
    return normalized;
  }

  double _euclideanDistance(Point3D a, Point3D b) {
    final double dx = a.x - b.x;
    final double dy = a.y - b.y;
    final double dz = a.z - b.z;
    return math.sqrt(dx * dx + dy * dy + dz * dz);
  }

  /// Reads landmark [index] from a flat [x0, y0, z0, x1, ...] list.
  Point3D _extractPoint(List<double> flatLandmarks, int index) {
    final int offset = index * _dimensions;
    return Point3D(
      x: flatLandmarks[offset],
      y: flatLandmarks[offset + 1],
      z: flatLandmarks[offset + 2],
    );
  }
}

class Point3D {
  final double x, y, z;
  const Point3D({required this.x, required this.y, required this.z});
}
Why this structure: The normalizer is isolated as a pure function. It accepts flat coordinate arrays, applies wrist-relative shifting, and divides by the Euclidean span. The explicit sqrt calculation is mandatory; using squared distance breaks scale invariance and causes classifier drift.
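For checking parity between training and inference, the same normalization can be mirrored in Python. The following is a minimal sketch, not the article's training code: the function name normalize_hand and the choice to return None for a degenerate hand are assumptions made for illustration, while the landmark indices (0 and 9) and the 0.001 guard follow the Dart class above.

```python
import math

LANDMARKS_PER_HAND = 21
DIMENSIONS = 3


def normalize_hand(flat):
    """Return wrist-relative, span-scaled coordinates for one hand.

    `flat` is a list of 63 floats (x, y, z for 21 landmarks).
    Returns None when the wrist and middle-finger MCP nearly coincide
    (hypothetical convention; the Dart version skips the hand instead).
    """
    expected = LANDMARKS_PER_HAND * DIMENSIONS
    if len(flat) != expected:
        raise ValueError(f"expected {expected} values, got {len(flat)}")

    # Regroup the flat list into (x, y, z) tuples.
    points = [tuple(flat[i:i + DIMENSIONS]) for i in range(0, len(flat), DIMENSIONS)]
    wrist, mid_mcp = points[0], points[9]

    span = math.dist(wrist, mid_mcp)  # true Euclidean distance, not squared
    if span < 0.001:
        return None

    # Same output order as the Dart loop: x, y, z per landmark.
    return [(c - w) / span for p in points for c, w in zip(p, wrist)]
```

Feeding the same 63 raw values to both implementations and comparing the outputs within a small tolerance is one way to confirm that the two layers agree.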
Stage 2: Gesture Classification Pipeline
The normalized vector feeds into a lightweight Multi-Layer Perceptron (MLP) trained on collected landmark data. The model outputs a probability distribution over a custom vocabulary.
Python Training Pipeline (Incremental Collection):
import csv

import cv2
import mediapipe as mp

MAX_HANDS = 2
VALUES_PER_HAND = 21 * 3  # 21 landmarks, (x, y, z) each


class LandmarkCollector:
    def __init__(self, vocabulary: list, output_path: str = "gesture_dataset.csv"):
        self.vocab = vocabulary
        self.output_path = output_path
        self.mp_hands = mp.solutions.hands.Hands(
            max_num_hands=MAX_HANDS, min_detection_confidence=0.7)
        self.frame_buffer = []

    def capture_samples(self, sign_index: int, samples_per_sign: int = 10):
        cap = cv2.VideoCapture(0)
        current_sign = self.vocab[sign_index]
        try:
            while len(self.frame_buffer) < samples_per_sign:
                ret, frame = cap.read()
                if not ret:
                    break
                # MediaPipe expects RGB; OpenCV delivers BGR.
                results = self.mp_hands.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
                if results.multi_hand_landmarks:
                    flat_coords = self._flatten_landmarks(results.multi_hand_landmarks)
                    self.frame_buffer.append(flat_coords)
                cv2.putText(frame, f"Capturing: {current_sign} ({len(self.frame_buffer)}/{samples_per_sign})",
                            (10, 30), cv2.FONT_HERSHEY_SIMPLEX, 0.7, (0, 255, 0), 2)
                cv2.imshow("Collection", frame)
                if cv2.waitKey(1) & 0xFF == ord('q'):
                    break
        finally:
            # Release the camera even if capture fails midway.
            cap.release()
            cv2.destroyAllWindows()
        self._save_to_csv(current_sign)

    def _flatten_landmarks(self, hand_list):
        coords = []
        for hand in hand_list:
            for lm in hand.landmark:
                coords.extend([lm.x, lm.y, lm.z])
        # Zero-pad single-hand frames so every CSV row has the same width (126 values).
        coords.extend([0.0] * (MAX_HANDS * VALUES_PER_HAND - len(coords)))
        return coords

    def _save_to_csv(self, label):
        # Append mode keeps previously collected signs intact.
        with open(self.output_path, 'a', newline='') as f:
            writer = csv.writer(f)
            for frame in self.frame_buffer:
                writer.writerow(frame + [label])
        self.frame_buffer.clear()
Architecture Rationale: The collector writes raw landmark coordinates straight to CSV, avoiding raw video storage; the training script must then apply the same wrist-relative normalization that the inference layer uses. Incremental addition works by appending rows and keeping existing label indices unchanged. Retraining is still required because the MLP's output layer grows with the vocabulary, but previously collected data remains valid.
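The label-index preservation described above can be sketched as a small loader. This is an illustration under assumptions, not the article's training script: load_dataset and its return shape are hypothetical, and it assumes each CSV row holds the coordinate values followed by the sign label in the last column, as the collector writes them.

```python
import csv


def load_dataset(csv_path, vocabulary):
    """Read collected rows and map labels to indices by vocabulary position.

    Because an index is simply the label's position in `vocabulary`,
    appending new signs to the end of the list never renumbers old ones.
    """
    label_to_index = {label: i for i, label in enumerate(vocabulary)}
    features, targets = [], []

    with open(csv_path, newline="") as f:
        for row in csv.reader(f):
            if not row:
                continue  # tolerate blank lines
            *coords, label = row
            if label not in label_to_index:
                raise ValueError(f"label {label!r} is missing from the vocabulary")
            features.append([float(v) for v in coords])
            targets.append(label_to_index[label])

    # The class count drives the size of the MLP output layer.
    return features, targets, len(vocabulary)
```

With a vocabulary of ["HELLO", "YES", "WATER"], rows labeled WATER map to index 2 and keep that index after a fourth sign is appended.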
Stage 3: Contextual Translation via AICore
Raw gesture tokens are buffered and passed to Gemini Nano through Android's AICore runtime. The LLM reconstructs ASL syntax into fluent English.
Dart AICore Bridge:
import 'package:flutter/services.dart';

class AICoreTranslator {
  static const MethodChannel _channel = MethodChannel('com.app.aicore/translate');

  Future<String> interpretSignSequence(List<String> tokens) async {
    if (tokens.isEmpty) return "";
    final prompt = _buildInterpreterPrompt(tokens);
    try {
      final result = await _channel.invokeMethod<String>('generate', {'prompt': prompt});
      return result?.trim() ?? "";
    } catch (e) {
      // AICore unavailable or model not ready: degrade to plain assembly.
      return _fallbackAssemble(tokens);
    }
  }

  String _buildInterpreterPrompt(List<String> tokens) {
    final signString = tokens.join(' ');
    return """You are a sign language interpreter. Convert ASL sign tokens into natural fluent English sentences.
WATER NEED → I need some water please.
NAME MY NOOR → My name is Noor.
HELP ME PLEASE → Could you please help me?
Input: $signString
Output ONLY the final sentence. No explanation. Under 15 words.""";
  }

  String _fallbackAssemble(List<String> tokens) {
    // Dart's String has no capitalize(); build the capitalized form by hand.
    final words = tokens.map((t) =>
        t.isEmpty ? t : t[0].toUpperCase() + t.substring(1).toLowerCase());
    return '${words.join(' ')}.';
  }
}
Architecture Rationale: The prompt uses few-shot examples to anchor the LLM's behavior. The fallback mechanism ensures graceful degradation if AICore is unavailable or the model hasn't finished downloading. Token buffering (typically 3–5 signs) provides sufficient context for the LLM to resolve tense and subject-object relationships.
Pitfall Guide
1. Squared Distance Normalization
Explanation: Developers sometimes replace sqrt(dx² + dy² + dz²) with dx² + dy² + dz² for performance, assuming it preserves relative scale. It does not: after dividing by the squared distance, features still scale inversely with apparent hand size, so the same gesture yields different vectors at different camera distances. Any mismatch between the formula used in training and the one used in inference compounds the problem.
Fix: Always use Euclidean distance, with the identical formula in training and inference. The cost of one sqrt per hand per frame is negligible on modern mobile CPUs and is required for feature vector consistency.
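A small numeric check makes the difference concrete. The sketch below uses a toy three-point "hand" rather than real landmarks, and the helper names are made up for the illustration; the second point stands in for the middle-finger MCP reference joint.

```python
import math


def scale_features(points, divisor_fn):
    """Shift points to the first point, then divide by divisor_fn(p0, p1)."""
    origin, reference = points[0], points[1]
    d = divisor_fn(origin, reference)
    return [tuple((c - o) / d for c, o in zip(p, origin)) for p in points]


def euclidean(a, b):
    return math.dist(a, b)


def squared(a, b):
    return math.dist(a, b) ** 2


# The same pose seen close to the camera and at half the apparent size.
near = [(0.0, 0.0, 0.0), (0.0, 0.2, 0.0), (0.1, 0.3, 0.0)]
far = [tuple(c * 0.5 for c in p) for p in near]

print(scale_features(near, euclidean)[2], scale_features(far, euclidean)[2])
print(scale_features(near, squared)[2], scale_features(far, squared)[2])
```

With the Euclidean divisor the two printed tuples agree up to floating-point rounding. With the squared divisor the far hand's features come out twice as large as the near hand's, which is exactly the scale dependence the normalization was meant to remove.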
2. Ignoring Temporal Smoothing
Explanation: Raw frame-by-frame classification produces jittery token streams. A single held sign may oscillate between HELLO, NONE, and HELLO due to minor hand tremors or lighting changes.
Fix: Implement a sliding window with majority voting or exponential moving average (EMA) over the last 5–7 frames. Only emit a token when confidence exceeds a threshold (e.g., 0.85) for consecutive frames.
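One way to realize the majority-vote variant is sketched below. The class name TokenSmoother and the min_votes parameter are assumptions for illustration; the window size and confidence threshold mirror the values suggested above. Low-confidence frames are counted as NONE, and a sign is emitted once per hold rather than on every frame.

```python
from collections import Counter, deque


class TokenSmoother:
    """Majority vote over the most recent frames (hypothetical helper)."""

    def __init__(self, window=5, threshold=0.85, min_votes=3):
        self.window = deque(maxlen=window)
        self.threshold = threshold
        self.min_votes = min_votes
        self.last_emitted = None

    def update(self, label, confidence):
        """Feed one frame's prediction; return a token or None."""
        # Treat uncertain predictions as "no sign".
        self.window.append(label if confidence >= self.threshold else "NONE")

        winner, votes = Counter(self.window).most_common(1)[0]
        if votes < self.min_votes:
            return None
        if winner == "NONE":
            # A stable pause re-arms the smoother so the same sign can repeat.
            self.last_emitted = None
            return None
        if winner == self.last_emitted:
            return None  # still holding the sign that was already emitted
        self.last_emitted = winner
        return winner
```

A held sign that flickers as HELLO, HELLO, NONE, HELLO, HELLO then produces a single HELLO token instead of two tokens separated by a spurious gap.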
3. AICore Model Download Blocking
Explanation: Gemini Nano requires a silent background download (10–15 minutes) on Wi-Fi while charging. Apps that attempt inference immediately after installation will fail silently or crash.
Fix: Gate inference on a readiness check. During development, adb shell cmd aicore status can be used to confirm the model state; inside the app, rely on an AICore API status callback. Display a provisioning UI that guides users to enable Developer Options and wait for the model download to complete.
4. Class Imbalance in Custom Vocabulary
Explanation: Users naturally collect more samples for frequent signs (e.g., YES, NO) than rare ones. The MLP becomes biased toward high-frequency classes, misclassifying rare signs as common ones.
Fix: Enforce equal sample counts per class during collection. Apply class weighting in the training loss function or use oversampling techniques for underrepresented gestures.
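Class weighting can be computed directly from the collected labels. The sketch below uses the common "balanced" heuristic, weight = total / (classes × count); the function name is hypothetical, and how the weights are passed to the loss depends on the training framework, which the article does not specify.

```python
from collections import Counter


def class_weights(labels):
    """Inverse-frequency weights so rare signs count more in the loss."""
    counts = Counter(labels)
    total = len(labels)
    n_classes = len(counts)
    return {label: total / (n_classes * count) for label, count in counts.items()}


# Example: an imbalanced collection session.
labels = ["YES"] * 6 + ["NO"] * 3 + ["WATER"] * 1
weights = class_weights(labels)
```

In this example WATER, with a single sample, receives six times the weight of YES, and a perfectly balanced dataset would give every class a weight of 1.0.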
5. Token Stream Fragmentation
Explanation: Passing individual tokens to the LLM without context forces it to guess tense and pronouns, resulting in unnatural output.
Fix: Buffer tokens until a pause is detected (e.g., 1.5 seconds of NONE predictions) or a maximum buffer size (5–7 tokens) is reached. Flush the buffer to the LLM, then reset.
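The flush logic can be sketched as a small buffer that takes timestamps as arguments, which keeps it easy to test. The class and method names are hypothetical; the pause length and maximum size follow the values suggested above.

```python
class TokenBuffer:
    """Collects sign tokens and flushes on a pause or when full."""

    def __init__(self, max_tokens=7, pause_seconds=1.5):
        self.max_tokens = max_tokens
        self.pause_seconds = pause_seconds
        self.tokens = []
        self.last_token_time = None

    def add(self, token, now):
        """Record a token seen at time `now`; return a batch if the buffer is full."""
        self.tokens.append(token)
        self.last_token_time = now
        if len(self.tokens) >= self.max_tokens:
            return self._flush()
        return None

    def tick(self, now):
        """Call on frames with no new token; return a batch after a long pause."""
        if self.tokens and now - self.last_token_time >= self.pause_seconds:
            return self._flush()
        return None

    def _flush(self):
        flushed, self.tokens = self.tokens, []
        self.last_token_time = None
        return flushed
```

The caller passes any returned batch to the translator and ignores None, so the LLM always receives a complete phrase rather than isolated signs.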
6. Hardcoding Hand Laterality
Explanation: MediaPipe returns hands in arbitrary order. Assuming the first hand is always the dominant hand breaks two-handed signs like WELCOME or THANK YOU.
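Fix: Extract landmarks from both detected hands and concatenate them into a single 126-value vector (2 hands × 21 landmarks × 3 coordinates), zero-padding when only one hand is visible so the input width stays constant. Train the classifier on bilateral data. Do not filter by hand label.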
7. Ignoring ASL Grammatical Structure
Explanation: Treating ASL as a direct 1:1 mapping to English words ignores topic-comment structure, facial grammar, and spatial referencing.
Fix: Rely on the LLM's few-shot prompt to handle syntactic transformation. Do not attempt to hardcode ASL-to-English grammar rules; the language model generalizes this mapping more reliably than rule engines.
Production Bundle
Action Checklist
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| High-traffic public deployment | On-Device LLM (Gemini Nano) | Zero server costs, guaranteed privacy, consistent latency | $0 infra, requires supported hardware |
| Legacy Android devices (< Android 14) | Rule-Based On-Device | AICore unavailable; fallback ensures functionality | $0 infra, lower grammatical quality |
| Multi-language support required | Cloud Translation API | Cloud-hosted models handle cross-lingual grammar better than current on-device models | High egress costs, privacy trade-off |
| Custom vocabulary expansion | Incremental CSV + MLP Retraining | Preserves existing data, only requires new sign collection | Minimal compute cost, fast iteration |
Configuration Template
# aicore_config.yaml
aicore:
  enabled: true
  model_version: "gemini-nano-v2"
  fallback_strategy: "rule_assembler"
  prompt_template: |
    You are a sign language interpreter. Convert ASL sign tokens into natural fluent English sentences.
    WATER NEED → I need some water please.
    NAME MY NOOR → My name is Noor.
    HELP ME PLEASE → Could you please help me?
    Input: {tokens}
    Output ONLY the final sentence. No explanation. Under 15 words.
gesture_pipeline:
  normalization: euclidean_wrist_relative
  max_hands: 2
  smoothing_window: 5
  confidence_threshold: 0.85
  token_buffer_size: 7
  pause_threshold_ms: 1500
training:
  samples_per_sign: 10
  output_format: csv
  incremental_mode: true
  label_index_preservation: true
Quick Start Guide
- Provision AICore: Navigate to Settings → About Phone → tap Build Number 7 times. Enable Developer Options → toggle "Gemini Nano" and "On-Device Model". Wait for background download (10–15 mins, Wi-Fi + charging required).
- Collect Baseline Data: Run the Python collector script. Position hands in frame, press SPACE for 5-second countdown, capture 10 frames per sign. Append new signs to the end of the vocabulary list to preserve label indices.
- Train & Export: Execute the training script. It reads the CSV, detects class count, trains the MLP, and exports the model file. No manual label mapping required.
- Integrate & Test: Load the model into the Flutter app. Verify normalization parity with Python. Run the AICore status check. Test with bilateral signs and validate token buffering behavior. Deploy to supported hardware.