Vectorizing the Crust: Operationalizing Satellite Embeddings for Regional Mineral Targeting

Current Situation Analysis

Remote sensing has long been the backbone of greenfield mineral exploration. Traditional workflows rely on spectral band math, atmospheric correction, and manual thresholding to isolate alteration halos, structural lineaments, and lithological boundaries. While theoretically sound, this approach has become a bottleneck in modern exploration programs.

The industry pain point is not a lack of data; it is the computational and interpretive overhead required to make that data analysis-ready. Processing raw Sentinel-2 or Landsat scenes demands rigorous atmospheric correction, cloud/shadow masking, topographic normalization, and seasonal compositing. Even after preprocessing, band ratios (e.g., SWIR combinations for clay minerals or iron oxides) are highly sensitive to local illumination, moisture content, and vegetation cover. A ratio calibrated for an arid porphyry system in Chile frequently fails when applied to a semi-arid epithermal district in Nevada, forcing teams to rebuild preprocessing pipelines for every new permit.

This problem is often overlooked because exploration teams treat remote sensing as a cartographic exercise rather than a pattern recognition problem. The focus remains on extracting physically interpretable indices, ignoring that modern self-supervised learning can compress multi-modal observations into stable, transferable representations. The result is a workflow that consumes 60–70% of project time on data wrangling, leaving minimal bandwidth for actual geological interpretation.

Google’s AlphaEarth initiative addresses this by shifting from explicit band math to learned vector representations. Instead of manipulating reflectance values, teams can now query a 64-dimensional embedding space that inherently fuses optical, radar, topographic, and climatic signals. The dataset (GOOGLE/SATELLITE_EMBEDDING/V1/ANNUAL) is pre-processed, cloud-mosaicked, and globally consistent, effectively decoupling data preparation from geological analysis.

WOW Moment: Key Findings

The transition from spectral ratios to learned embeddings fundamentally changes how exploration teams scale their targeting efforts. The following comparison highlights the operational shift:

Approach	Preprocessing Overhead	Cross-Region Transferability	Interpretability	Temporal Granularity	Computational Cost
Traditional Spectral Ratios	High (atmospheric correction, masking, compositing)	Low (calibration drift across climates)	High (physical band relationships)	Multi-temporal (revisit every few days to weeks)	Client-side heavy (local processing)
AlphaEarth Embeddings	Near-zero (analysis-ready composites)	High (learned invariance across zones)	Low (statistical compression, no direct physical mapping)	Annual (2017–2025)	Server-side optimized (GEE execution)

This finding matters because it enables similarity-driven exploration at continental scales. Instead of deriving new indices for each target type, teams can define a reference signature from a known deposit or field sample and propagate that vector across millions of hectares. The embedding space normalizes for phenology, illumination, and atmospheric noise, allowing direct comparison between geographically isolated regions. For exploration programs managing multiple greenfield licenses, this reduces the targeting phase from months to days while maintaining consistent analytical baselines.

Core Solution

Implementing satellite embeddings for mineral targeting requires a shift from pixel-wise arithmetic to vector-space operations. The workflow leverages Google Earth Engine’s server-side execution to avoid client-side memory constraints and ensure reproducible results.

Architecture Decisions & Rationale

Server-Side Vector Operations: All similarity calculations must run within GEE. Transferring 64-band rasters to local memory for Python-based distance calculations will trigger memory limits and network timeouts.
Cosine Similarity Over Euclidean Distance: The dataset documentation describes each pixel's embedding as a unit-length vector with band values between -1 and 1, so the meaningful signal is direction in the 64-dimensional space. Cosine similarity measures that angular alignment directly. A reference vector averaged over a region is generally shorter than unit length, and Euclidean distance against it would mix this magnitude difference into the score, so the pipeline re-normalizes both sides and takes the dot product.
Annual Composites for Stability: Phenological cycles and cloud cover introduce high-frequency noise. Annual mosaics smooth out seasonal vegetation changes and atmospheric artifacts, aligning with the geological timescale of alteration halos.
Threshold Calibration via Known Endmembers: Instead of arbitrary cutoffs, thresholds should be derived from statistical distributions of similarity scores within validated training zones.
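The threshold-calibration decision above can be sketched in plain Python, outside Earth Engine, once similarity scores have been sampled from pixels inside validated training zones. This is a minimal sketch: the function name `thresholds_from_training_scores`, the sample scores, and the choice of the 10th and 50th percentiles are all illustrative assumptions, not values taken from the dataset or from any calibration study.

```python
import statistics

def thresholds_from_training_scores(scores, lower_pct=10, upper_pct=50):
    """Derive lower/upper similarity cutoffs from scores sampled in validated zones.

    The percentile choices are illustrative assumptions and should be tuned locally.
    """
    if len(scores) < 2:
        raise ValueError("need at least two training scores")
    if not (1 <= lower_pct < upper_pct <= 99):
        raise ValueError("percentiles must satisfy 1 <= lower < upper <= 99")
    # With n=100, quantiles() returns the 1st..99th percentile cut points
    cuts = statistics.quantiles(scores, n=100, method="inclusive")
    return cuts[lower_pct - 1], cuts[upper_pct - 1]

# Hypothetical similarity scores sampled inside a validated deposit footprint
training_scores = [0.81, 0.86, 0.88, 0.90, 0.91, 0.93, 0.94, 0.95, 0.96, 0.97, 0.98]
lower, upper = thresholds_from_training_scores(training_scores)
print(f"medium priority >= {lower:.2f}, high priority >= {upper:.2f}")
```

Deriving the cutoffs from the observed distribution keeps them tied to the training zone rather than to a fixed number carried over from another district.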

Implementation Workflow

The following implementation demonstrates a production-ready pipeline for regional similarity search. It uses a modular structure, server-side execution, and explicit threshold calibration.

import ee
import geemap

# Initialize Earth Engine
ee.Initialize()

class GeoEmbeddingTargeter:
    def __init__(self, collection_id="GOOGLE/SATELLITE_EMBEDDING/V1/ANNUAL"):
        self.collection = ee.ImageCollection(collection_id)
        self.band_prefix = "A"
        self.band_count = 64
        self.band_names = [f"{self.band_prefix}{i:02d}" for i in range(self.band_count)]

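    def load_annual_composite(self, year=2023, region=None):
        """Mosaics the annual embedding tiles for a year, optionally limited to a region."""
        # filterDate treats the end date as exclusive, so end on Jan 1 of the next year
        images = self.collection.filterDate(f"{year}-01-01", f"{year + 1}-01-01")
        if region is not None:
            images = images.filterBounds(region)
        # The collection is stored as tiles; .first() would return one arbitrary tile
        return images.mosaic()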

    def extract_reference_signature(self, image, region, reducer=None):
        """Computes the mean embedding vector from a known target zone."""
        if reducer is None:
            reducer = ee.Reducer.mean()
        stats = image.select(self.band_names).reduceRegion(
            reducer=reducer, geometry=region, scale=10, maxPixels=1e9
        )
        # Turn the {band: value} dictionary into a constant 64-band image for vector math
        return stats.toImage(self.band_names)

    def compute_cosine_similarity(self, embedding_image, reference_vector):
        """Calculates cosine similarity server-side across the entire region."""
        # Normalize both vectors to unit length
        def normalize(img):
            magnitude = img.pow(2).reduce(ee.Reducer.sum()).sqrt()
            return img.divide(magnitude)

        norm_embedding = normalize(embedding_image)
        norm_reference = normalize(reference_vector)

        # Dot product of normalized vectors equals cosine similarity
        similarity = norm_embedding.multiply(norm_reference).reduce(ee.Reducer.sum())
        return similarity.rename("cosine_sim")

    def generate_favorability_map(self, similarity_image, lower_threshold=0.85, upper_threshold=0.95):
        """Classifies similarity scores into exploration priority zones."""
        high_priority = similarity_image.gte(upper_threshold).selfMask()
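        # Combine both conditions on the raw scores; chaining .lt() onto the 0/1 mask
        # returned by .gte() would invert the class
        medium_priority = similarity_image.gte(lower_threshold).And(
            similarity_image.lt(upper_threshold)
        ).selfMask()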
        
        return {
            "high": high_priority,
            "medium": medium_priority,
            "raw_similarity": similarity_image
        }

# Usage Example
targeter = GeoEmbeddingTargeter()
annual_img = targeter.load_annual_composite(2023)

# Define known deposit geometry (replace with actual ROI)
known_deposit_roi = ee.Geometry.Point([-70.5, -25.0]).buffer(500)

# Extract reference vector
ref_vector = targeter.extract_reference_signature(annual_img, known_deposit_roi)

# Compute similarity across broader exploration area
similarity_map = targeter.compute_cosine_similarity(annual_img, ref_vector)

# Generate classified favorability zones
favorability = targeter.generate_favorability_map(similarity_map)

# Export or visualize
# geemap.Map().addLayer(favorability['high'], {'palette': ['red']}, 'High Priority')

Why This Structure Works

Modular Class Design: Encapsulates band naming, vector extraction, and similarity logic, making it reusable across different target types (e.g., porphyry Cu vs. orogenic Au).
Server-Side Normalization: The normalize function computes magnitude and division entirely within GEE’s execution graph, preventing client-side data transfer.
Explicit Thresholding: Separates raw similarity scores from operational decision layers, allowing teams to adjust sensitivity without reprocessing the entire dataset.
Scalable Export: Each value in the output dictionary is an ee.Image, so it can be passed directly to ee.batch.Export.image.toDrive (or toAsset) or to geemap visualization tools without intermediate file generation.
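Because the server-side graph is hard to step through, it can help to keep a tiny client-side mirror of the same logic for unit tests. The following is a minimal sketch under that assumption: it reproduces the normalize, dot-product, and classification steps on plain Python lists, using toy 3-dimensional vectors in place of the 64-band embeddings. The function names and the `"background"` label are hypothetical and do not come from the pipeline above.

```python
import math

def normalize(vec):
    """Scale a vector to unit length; the zero vector has no direction."""
    magnitude = math.sqrt(sum(v * v for v in vec))
    if magnitude == 0:
        raise ValueError("cannot normalize a zero vector")
    return [v / magnitude for v in vec]

def cosine_similarity(a, b):
    """Dot product of the two unit-normalized vectors."""
    return sum(x * y for x, y in zip(normalize(a), normalize(b)))

def classify(score, lower=0.85, upper=0.95):
    """Map a raw similarity score to a priority class (half-open intervals)."""
    if score >= upper:
        return "high"
    if score >= lower:
        return "medium"
    return "background"

reference = [1.0, 0.0, 0.0]  # toy stand-in for a 64D reference signature
pixels = {
    "same direction, larger magnitude": [2.0, 0.0, 0.0],
    "slightly rotated": [1.0, 0.5, 0.0],
    "orthogonal": [0.0, 1.0, 0.0],
}
for name, vec in pixels.items():
    score = cosine_similarity(reference, vec)
    print(f"{name}: {score:.3f} -> {classify(score)}")
```

Keeping the raw score and the class assignment in separate functions mirrors the separation between the similarity image and the thresholded layers, so thresholds can be changed without recomputing similarity.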

Pitfall Guide

1. Treating Embedding Bands as Physical Features

Explanation: The 64 dimensions (A00–A63) are statistical compressions, not spectral bands. Attempting to interpret A12 as "clay content" or A05 as "iron oxide" will lead to incorrect geological conclusions. Fix: Validate embeddings against ground truth or traditional indices. Use them as similarity anchors, not direct mineralogical proxies.

2. Ignoring Canopy Penetration Limits

Explanation: Sentinel-1 SAR data is integrated into the embeddings, but Sentinel-1 operates at C-band, which scatters mostly in the upper canopy and penetrates vegetation far less than longer wavelengths such as L-band. Dense tropical forests mask underlying lithology, causing the model to prioritize canopy structure over bedrock signatures. Fix: Restrict embedding-based targeting to arid, semi-arid, or sparsely vegetated regions. In forested zones, combine with airborne geophysics or LiDAR-derived terrain models.

3. Using Euclidean Distance on Normalized Vectors

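Explanation: Euclidean distance mixes direction and magnitude. Per-pixel embeddings are described as unit-length vectors, but the mean of several unit vectors is shorter than unit length unless they all point the same way, so a reference signature averaged over a mixed zone has a magnitude that reflects how heterogeneous the zone is rather than any lithological property. Euclidean distance against such a vector penalizes every pixel for that magnitude gap. Fix: Use cosine similarity or angular distance, and re-normalize any averaged or aggregated vector before comparing. These measure directional alignment in the 64D space, which is the quantity the embeddings are designed to carry.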
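The relationship between the two metrics can be checked numerically. This is a small sketch using toy 2-dimensional vectors (an assumption made for readability; the real embeddings have 64 dimensions): for unit vectors the squared Euclidean distance equals 2 - 2·cos, so both metrics rank pixels identically, while rescaling one vector changes the Euclidean distance and leaves the cosine untouched.

```python
import math

def cosine(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def euclidean(a, b):
    """Straight-line distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

a = [0.6, 0.8]  # unit length
b = [0.8, 0.6]  # unit length
b_scaled = [3 * v for v in b]  # same direction as b, different magnitude

# For unit vectors: ||a - b||^2 == 2 - 2 * cos(a, b)
print(euclidean(a, b) ** 2, 2 - 2 * cosine(a, b))

# Rescaling changes Euclidean distance but not cosine similarity
print(cosine(a, b), cosine(a, b_scaled))
print(euclidean(a, b), euclidean(a, b_scaled))
```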

4. Overlooking GEE Execution Quotas

Explanation: Processing global or continental-scale similarity maps can trigger compute limits or task timeouts. Running unoptimized client-side loops will fail silently or incur unexpected costs. Fix: Use reduceRegion with maxPixels limits, chunk exports into tiles, and monitor the GEE Code Editor task queue. Implement exponential backoff for export retries.
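The tiling and retry advice can be sketched without touching Earth Engine at all. The helpers below are hypothetical (`tile_bounds` and `backoff_delays` are names introduced here for illustration): the first splits a longitude/latitude bounding box into tiles of a maximum size, and the second yields capped exponential wait times for retrying a failed export. The tile size, base delay, and cap are assumptions to be tuned against actual quotas.

```python
def tile_bounds(west, south, east, north, tile_deg):
    """Split a lon/lat bounding box into tiles at most tile_deg on a side.

    Returns (west, south, east, north) tuples, row by row from the south-west corner.
    """
    if tile_deg <= 0:
        raise ValueError("tile_deg must be positive")
    if east <= west or north <= south:
        raise ValueError("bounding box must have positive width and height")
    tiles = []
    y = south
    while y < north:
        y_top = min(y + tile_deg, north)
        x = west
        while x < east:
            x_right = min(x + tile_deg, east)
            tiles.append((x, y, x_right, y_top))
            x = x_right
        y = y_top
    return tiles

def backoff_delays(max_retries, base_seconds=2.0, cap_seconds=300.0):
    """Yield capped exponential delays: base, 2*base, 4*base, ... up to cap."""
    for attempt in range(max_retries):
        yield min(cap_seconds, base_seconds * 2 ** attempt)

# Hypothetical 1 x 1 degree exploration block split into 0.5 degree tiles
tiles = tile_bounds(-71.0, -26.0, -70.0, -25.0, 0.5)
delays = list(backoff_delays(5))
print(len(tiles), "tiles; retry delays (s):", delays)
```

Each tuple can then be turned into an `ee.Geometry.Rectangle([west, south, east, north])` and passed as the `region` of one export task, so that a failure only requires re-running that tile.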

5. Skipping Ground-Truth Calibration

Explanation: Similarity scores are relative. A 0.92 cosine score in one region may represent a different geological context than 0.92 in another due to training bias or local environmental factors. Fix: Always calibrate thresholds using known deposits within the target region. Generate ROC curves comparing similarity scores against validated mineral occurrences to establish local cutoffs.
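One way to turn the ROC idea into a local cutoff is Youden's J statistic, which picks the threshold that maximizes the true positive rate minus the false positive rate. The sketch below assumes similarity scores have already been sampled at validated occurrences (label 1) and at confirmed barren sites (label 0); the scores and labels shown are made up for illustration, and Youden's J is only one of several reasonable selection criteria.

```python
def youden_threshold(scores, labels):
    """Return (threshold, tpr, fpr) maximizing TPR - FPR over observed scores.

    A pixel is predicted positive when its score is >= threshold.
    On ties, the lowest threshold is kept.
    """
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    positives = sum(1 for y in labels if y == 1)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one positive and one negative sample")
    best = None
    for t in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        tpr, fpr = tp / positives, fp / negatives
        j = tpr - fpr
        if best is None or j > best[0]:
            best = (j, t, tpr, fpr)
    return best[1], best[2], best[3]

# Hypothetical calibration samples: 1 = validated occurrence, 0 = barren
scores = [0.70, 0.75, 0.80, 0.84, 0.88, 0.90, 0.93, 0.96]
labels = [0, 0, 0, 1, 0, 1, 1, 1]
threshold, tpr, fpr = youden_threshold(scores, labels)
print(f"local cutoff {threshold:.2f} (TPR {tpr:.2f}, FPR {fpr:.2f})")
```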

6. Assuming Temporal Stability for Active Mining

Explanation: The dataset provides annual composites. It is unsuitable for monitoring active pit walls, waste rock dumps, or seasonal hydrological changes. Fix: Reserve embeddings for greenfield targeting and regional screening. Use high-frequency Sentinel-2/Landsat time series for operational mine monitoring and environmental compliance.

7. Cross-Climate Transfer Without Adjustment

Explanation: The model was likely trained with higher representation in arid zones. Applying the same similarity thresholds to humid tropical regions will yield false positives due to vegetation moisture masking bedrock signals. Fix: Implement region-specific threshold calibration. Use climate stratification (e.g., Köppen-Geiger zones) to adjust similarity cutoffs before field deployment.
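Climate stratification can be as simple as a lookup keyed by Köppen-Geiger code, falling back from the full code to the main climate group and then to a default. The sketch below assumes thresholds have already been calibrated per zone; every number in `calibrated` is a placeholder, not a measured value, and `thresholds_for_zone` is a hypothetical helper.

```python
DEFAULT_THRESHOLDS = (0.85, 0.95)  # (lower, upper), matching the pipeline defaults

def thresholds_for_zone(koppen_code, calibrated, default=DEFAULT_THRESHOLDS):
    """Look up cutoffs by full Köppen-Geiger code, then main group letter, then default."""
    if koppen_code in calibrated:
        return calibrated[koppen_code]
    group = koppen_code[:1]
    if group in calibrated:
        return calibrated[group]
    return default

# Placeholder values only; replace with locally calibrated cutoffs
calibrated = {
    "BWk": (0.88, 0.96),  # cold desert, specific calibration
    "A": (0.92, 0.98),    # any tropical zone, stricter cutoffs
}

for code in ("BWk", "Af", "Csa"):
    print(code, thresholds_for_zone(code, calibrated))
```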

Production Bundle

Action Checklist

Initialize GEE environment and verify project quotas before scaling to continental extents
Extract reference vectors from validated deposits or field-confirmed outcrops, not arbitrary pixels
Implement server-side cosine similarity to avoid client-side memory bottlenecks
Calibrate similarity thresholds using local ground truth; do not apply global cutoffs blindly
Mask dense forest and permanent snow zones prior to similarity calculation
Export results as tiled GeoTIFFs to comply with GEE export limits and ensure QGIS compatibility
Cross-validate high-similarity zones with traditional spectral indices (e.g., ASTER 6/8, S2 AIT)
Document threshold decisions and training data provenance for audit and reproducibility

Decision Matrix

Scenario	Recommended Approach	Why	Cost Impact
Greenfield targeting across multiple permits	AlphaEarth Embeddings + Cosine Similarity	Rapid screening, cross-region consistency, minimal preprocessing	Low (GEE free tier covers research/NGO; compute costs scale linearly)
Precise clay/iron oxide mapping for metallurgy	Traditional Spectral Ratios (ASTER/S2)	Physically interpretable, validated for mineral quantification	Medium (requires atmospheric correction, local calibration)
Active mine monitoring & waste rock tracking	High-Frequency Time Series (Sentinel-2/Landsat)	Revisit intervals of days to weeks capture operational changes	High (requires cloud masking, frequent processing, storage)
Forested tropical exploration	Airborne Geophysics + LiDAR	Embeddings cannot penetrate dense canopy; SAR limited to surface structure	High (acquisition costs, but necessary for reliable targeting)
Regulatory reporting & publishable methods	Explicit Band Math + Statistical Validation	Transparent, reproducible, meets compliance standards	Low-Medium (higher analyst time, lower computational overhead)

Configuration Template

# production_config.py
import ee

# Earth Engine Configuration
PROJECT_ID = "your-gcp-project-id"
GEE_SERVICE_ACCOUNT = "your-service@project.iam.gserviceaccount.com"
KEY_FILE_PATH = "/path/to/service-account-key.json"

# Embedding Pipeline Parameters
EMBEDDING_COLLECTION = "GOOGLE/SATELLITE_EMBEDDING/V1/ANNUAL"
TARGET_YEAR = 2023
SIMILARITY_LOWER_THRESHOLD = 0.85
SIMILARITY_UPPER_THRESHOLD = 0.95
EXPORT_SCALE = 10
MAX_PIXELS = 1e9
REGION_MASK_TYPES = ["dense_forest", "permanent_snow", "ocean"]

# Validation & Calibration
# Plain coordinates only: ee.Geometry objects should be built after ee.Initialize() has run
KNOWN_DEPOSIT_COORDS = [
    [-70.6, -25.1], [-70.4, -25.1], [-70.4, -24.9], [-70.6, -24.9]
]
CALIBRATION_METRIC = "cosine_similarity"
THRESHOLD_TUNING_METHOD = "local_roc_optimization"

def initialize_ee():
    """Initializes GEE with service account credentials."""
    credentials = ee.ServiceAccountCredentials(GEE_SERVICE_ACCOUNT, KEY_FILE_PATH)
    ee.Initialize(credentials, project=PROJECT_ID)
    print("GEE initialized successfully.")

def known_deposit_geometry():
    """Builds the calibration polygon; call only after initialize_ee()."""
    return ee.Geometry.Polygon([KNOWN_DEPOSIT_COORDS])

# Run initialization
initialize_ee()

Quick Start Guide

Authenticate & Initialize: Set up a GCP service account with Earth Engine access. Run the initialization block to authenticate and verify quota limits.
Load Annual Composite: Query GOOGLE/SATELLITE_EMBEDDING/V1/ANNUAL for your target year, filter to your region of interest, and mosaic the tiles that cover it. Taking only the first image would return a single tile rather than the full composite.
Define Reference Zone: Digitize a polygon around a known deposit or field-validated outcrop. Extract the mean 64D vector using reduceRegion.
Compute Similarity: Apply server-side cosine similarity across your exploration ROI. Classify results using locally calibrated thresholds (0.85–0.95 range).
Export & Validate: Export high-priority zones as tiled GeoTIFFs. Overlay with regional geology maps and traditional spectral indices before committing to field campaigns.

Satellite embeddings do not replace geological expertise; they compress data preparation into a single query. When integrated into a hybrid workflow—screening with vectors, validating with indices, confirming with field work—they transform regional targeting from a months-long bottleneck into a repeatable, scalable operation.
