Code Example: GGUF Conversion Script
import os
import subprocess
import sys


def export_to_gguf(
    hf_model_path: str,
    output_gguf: str,
    quant_type: str = "Q4_K_M",
    vocab_only: bool = False
):
    """
    Exports a Hugging Face model to GGUF format and quantizes it.

    Args:
        hf_model_path: Path to HF model directory.
        output_gguf: Output GGUF file path.
        quant_type: Quantization type (e.g., Q4_K_M, Q8_0, F16).
        vocab_only: If True, only exports tokenizer/vocab.
    """
    llama_cpp_dir = "/path/to/llama.cpp"
    # Names used by current llama.cpp checkouts; older checkouts ship
    # convert-hf-to-gguf.py and build/bin/quantize instead.
    convert_script = os.path.join(llama_cpp_dir, "convert_hf_to_gguf.py")
    quantize_bin = os.path.join(llama_cpp_dir, "build/bin/llama-quantize")

    # Make sure the output directory exists before the converter writes to it
    os.makedirs(os.path.dirname(output_gguf) or ".", exist_ok=True)

    # Step 1: Convert to FP16 GGUF (tokenizer/vocab only when requested)
    convert_cmd = [
        sys.executable, convert_script,
        hf_model_path,
        "--outfile", output_gguf,
        "--outtype", "f16"
    ]
    if vocab_only:
        convert_cmd.append("--vocab-only")
    print(f"Converting {hf_model_path} to FP16 GGUF...")
    subprocess.run(convert_cmd, check=True)
    if vocab_only:
        return

    # Step 2: Quantize
    if quant_type.upper() != "F16":
        print(f"Quantizing to {quant_type}...")
        root, ext = os.path.splitext(output_gguf)
        quantized_output = f"{root}-{quant_type}{ext}"
        subprocess.run([
            quantize_bin,
            output_gguf,
            quantized_output,
            quant_type
        ], check=True)
        print(f"Quantized model saved to {quantized_output}")
    else:
        print("FP16 export complete.")


# Usage
export_to_gguf(
    hf_model_path="./models/llama-3-8b-instruct",
    output_gguf="./exports/llama-3-8b-f16.gguf",
    quant_type="Q4_K_M"
)
2. Export to ONNX for Cross-Platform Inference
ONNX (Open Neural Network Exchange) is ideal for environments requiring hardware abstraction, such as web-based inference via onnxruntime-web or enterprise deployments on onnxruntime with diverse providers (CPU, CUDA, DirectML).
Implementation Steps:
- Use optimum: Hugging Face's optimum library provides CLI and API tools for ONNX export.
- Select Opset: Choose an ONNX opset version compatible with your runtime.
- Quantize: Apply dynamic or static quantization using optimum.onnxruntime.
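The first two steps can be expressed as a single optimum-cli call. The following is a minimal sketch that only builds the command line, assuming the `optimum-cli export onnx` subcommand with its `--model`, `--task` and `--opset` options; the helper name `build_onnx_export_cmd`, the default opset, and the output path are illustrative choices, not part of the original pipeline.

```python
import shlex


def build_onnx_export_cmd(
    model_id: str,
    output_dir: str,
    opset: int = 17,
    task: str = "text-generation",
) -> list[str]:
    """Build (but do not run) an optimum-cli ONNX export command."""
    if opset < 1:
        raise ValueError("opset must be a positive integer")
    return [
        "optimum-cli", "export", "onnx",
        "--model", model_id,
        "--task", task,
        "--opset", str(opset),
        output_dir,  # positional argument: where the ONNX files are written
    ]


# Hypothetical usage: print the command so it can be reviewed before running
cmd = build_onnx_export_cmd("microsoft/Phi-3-mini-4k-instruct", "./exports/phi3-onnx")
print(shlex.join(cmd))
```

Passing the resulting list to `subprocess.run(cmd, check=True)` would execute the export; keeping the builder separate makes the chosen opset easy to log and test.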
Code Example: ONNX Export with Static Quantization
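from datasets import Dataset
from onnxruntime.quantization import QuantFormat, QuantizationMode, QuantType
from optimum.onnxruntime import ORTModelForCausalLM, ORTQuantizer
from optimum.onnxruntime.configuration import AutoCalibrationConfig, QuantizationConfig
from transformers import AutoTokenizer


def export_onnx_static_quant(
    model_name: str,
    output_dir: str,
    calibration_data: list[str]
):
    """
    Exports model to ONNX with static INT8 quantization.
    Requires calibration data for accuracy preservation.
    """
    # export=True converts the PyTorch checkpoint to ONNX. use_cache=False
    # keeps the export to a single ONNX graph for the quantizer to load.
    model = ORTModelForCausalLM.from_pretrained(model_name, export=True, use_cache=False)
    model.save_pretrained(output_dir)

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    if tokenizer.pad_token is None:
        # Llama tokenizers define no pad token; reuse EOS so padding works
        tokenizer.pad_token = tokenizer.eos_token

    # Prepare calibration dataset
    def preprocess_fn(examples):
        return tokenizer(
            examples["text"], padding="max_length", truncation=True, max_length=512
        )

    # Mock calibration data structure for demonstration
    # In production, use a representative dataset subset
    cal_dataset = Dataset.from_dict({"text": calibration_data})
    cal_dataset = cal_dataset.map(preprocess_fn, batched=True, remove_columns=["text"])
    # Assumption: the exported graph takes only input_ids and attention_mask.
    # If ONNX Runtime reports a missing input (e.g. position_ids), add that
    # column to the calibration dataset as well.

    # Quantization configuration: signed 8-bit activations and weights, QDQ format
    qconfig = QuantizationConfig(
        is_static=True,
        format=QuantFormat.QDQ,
        mode=QuantizationMode.QLinearOps,
        activations_dtype=QuantType.QInt8,
        weights_dtype=QuantType.QInt8,
    )

    print("Calibrating and quantizing ONNX model...")
    quantizer = ORTQuantizer.from_pretrained(output_dir)
    # Collect activation ranges from the calibration set (min/max method)
    ranges = quantizer.fit(
        dataset=cal_dataset,
        calibration_config=AutoCalibrationConfig.minmax(cal_dataset),
        operators_to_quantize=qconfig.operators_to_quantize,
        use_external_data_format=True,  # needed for models above 2 GB
    )
    quantizer.quantize(
        save_dir=output_dir,
        quantization_config=qconfig,
        calibration_tensors_range=ranges,
        use_external_data_format=True,
    )
    tokenizer.save_pretrained(output_dir)
    print(f"Quantized ONNX model saved to {output_dir}")


# Usage
calibration_prompts = [
    "Explain the concept of quantum entanglement.",
    "Write a Python function to sort a list.",
    # ... Add diverse prompts covering model distribution
]
export_onnx_static_quant(
    model_name="meta-llama/Llama-2-7b-chat-hf",
    output_dir="./exports/llama2-7b-onnx-int8",
    calibration_data=calibration_prompts
)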
3. Architecture Decisions
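- Quantization Granularity: Prefer block-wise quantization (GGUF) over per-tensor quantization (older ONNX methods) for better accuracy retention in large language models. Each block carries its own scale, so an outlier weight only degrades the precision of its own block. K-quant mixes such as Q4_K_M additionally keep selected tensors at higher precision (Q6_K instead of Q4_K), which protects weights that are critical for attention layers.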
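- Metadata Embedding: Ensure the export carries the tokenizer and model metadata (context length, BOS/EOS tokens) together with the weights. GGUF stores these inside the model file; ONNX exports keep the tokenizer files next to the .onnx file, so ship the directory as a unit. Tokenizer dependencies that are resolved separately at deploy time increase deployment complexity and failure modes.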
- Hardware Kernels: Validate that the target inference engine supports hardware-specific kernels for the exported format. For example, llama.cpp runs GGUF models through its Metal backend on Apple Silicon and its CUDA backend on NVIDIA GPUs, whereas generic ONNX models may fall back to slower CPU implementations if execution providers are not explicitly configured.
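The granularity argument can be illustrated with a toy experiment. This is a simplified sketch using symmetric round-to-nearest 4-bit quantization with one scale per group; it is not the actual GGUF K-quant layout, and the function names, block size, and weights are made up for the demonstration.

```python
def quantize_dequantize(values, bits=4):
    """Symmetric round-to-nearest quantization using one scale for all values."""
    qmax = 2 ** (bits - 1) - 1  # 7 for signed 4-bit
    max_abs = max(abs(v) for v in values)
    if max_abs == 0:
        return list(values)
    scale = max_abs / qmax
    return [round(v / scale) * scale for v in values]


def blockwise_quantize_dequantize(values, block_size, bits=4):
    """Apply quantize_dequantize independently to each block of weights."""
    restored = []
    for start in range(0, len(values), block_size):
        restored.extend(quantize_dequantize(values[start:start + block_size], bits))
    return restored


def mean_abs_error(original, restored):
    return sum(abs(a - b) for a, b in zip(original, restored)) / len(original)


# Toy weights: small values everywhere, plus a single outlier in the second block
weights = [0.01 * ((i % 7) - 3) for i in range(64)]
weights[60] = 2.0

per_tensor_err = mean_abs_error(weights, quantize_dequantize(weights))
block_err = mean_abs_error(weights, blockwise_quantize_dequantize(weights, block_size=32))
print(f"per-tensor error: {per_tensor_err:.5f}, block-wise error: {block_err:.5f}")
```

With a single scale, the outlier stretches the quantization grid so far that every small weight rounds to zero. With two blocks, only the block containing the outlier pays that cost.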
Pitfall Guide
Tokenizer Mismatch During Export
- Issue: Exporting weights without the matching tokenizer vocabulary, or with a different version of the tokenizer files, results in token ID mismatches. The model generates coherent logits, but they are mapped to the wrong tokens.
- Mitigation: Always bundle the tokenizer configuration (tokenizer.json, tokenizer_config.json) with the exported model. Verify that the token IDs for special tokens (BOS, EOS, PAD) match the training configuration.
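The special-token check can be automated as a small comparison. The sketch below assumes both configurations are already loaded as dictionaries with `bos_token_id`, `eos_token_id` and `pad_token_id` keys; the helper name and the ID values are illustrative only.

```python
def special_token_mismatches(
    training_cfg: dict,
    exported_cfg: dict,
    keys=("bos_token_id", "eos_token_id", "pad_token_id"),
) -> dict:
    """Return {key: (training_value, exported_value)} for every ID that differs."""
    return {
        key: (training_cfg.get(key), exported_cfg.get(key))
        for key in keys
        if training_cfg.get(key) != exported_cfg.get(key)
    }


# Illustrative values: the exported config carries a different EOS id
training = {"bos_token_id": 128000, "eos_token_id": 128009, "pad_token_id": None}
exported = {"bos_token_id": 128000, "eos_token_id": 128001, "pad_token_id": None}

mismatches = special_token_mismatches(training, exported)
print(mismatches)
```

Failing the export job whenever this dictionary is non-empty catches the problem before a model that never stops generating reaches deployment.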
Ignoring Quantization Calibration Data Distribution
- Issue: Static quantization relies on calibration data to determine activation ranges. If the calibration data is not representative of the inference domain, quantization error spikes, causing severe quality degradation.
- Mitigation: Use a calibration dataset that covers the expected inference distribution. For general-purpose models, use a mix of code, general text, and instruction-following samples. Validate perplexity on a hold-out set post-quantization.
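Why unrepresentative calibration hurts can be seen with a few numbers. This is a toy sketch of symmetric INT8 activation quantization with a min/max-style range; the sample values and function names are invented for illustration and do not come from a real model.

```python
def calibrate_scale(samples, bits=8):
    """Derive a symmetric scale from the largest magnitude seen in calibration."""
    qmax = 2 ** (bits - 1) - 1  # 127 for INT8
    max_abs = max(abs(x) for x in samples)
    if max_abs == 0:
        raise ValueError("calibration samples are all zero")
    return max_abs / qmax


def fake_quantize(x, scale, bits=8):
    """Quantize to the integer grid, clip to the representable range, dequantize."""
    qmax = 2 ** (bits - 1) - 1
    q = max(-qmax - 1, min(qmax, round(x / scale)))
    return q * scale


def mean_abs_error(values, scale):
    return sum(abs(v - fake_quantize(v, scale)) for v in values) / len(values)


# Activations observed at inference time span roughly [-6, 6]
inference_activations = [-5.0, -0.7, 0.3, 4.0, 5.8]

narrow_scale = calibrate_scale([-1.0, -0.5, 0.2, 0.9])          # unrepresentative
representative_scale = calibrate_scale([-6.0, -1.0, 0.5, 5.5])  # covers the range

narrow_err = mean_abs_error(inference_activations, narrow_scale)
representative_err = mean_abs_error(inference_activations, representative_scale)
print(f"narrow: {narrow_err:.3f}, representative: {representative_err:.3f}")
```

The narrow calibration set clips every activation beyond about 1.0, so the large values lose almost all of their magnitude, while the representative set keeps the error within half a quantization step.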
Context Window Truncation in Metadata
- Issue: Export tools may default to the model's original context length, ignoring extended context patches (e.g., YaRN, NTK scaling) applied during fine-tuning. This causes the inference engine to truncate inputs or fail on long contexts.
- Mitigation: Explicitly set max_position_embeddings and the RoPE scaling parameters during export. In GGUF, verify the architecture-prefixed context length key in the metadata (for example, llama.context_length). In ONNX, ensure dynamic axes are configured for the sequence length.
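A post-export check on the context length might look like the sketch below. It assumes the GGUF metadata has already been read into a plain dictionary by some other tool, and relies on GGUF storing the context length under an architecture-prefixed key (such as `llama.context_length`) alongside `general.architecture`; the function name and sample values are hypothetical.

```python
def check_context_length(metadata: dict, expected: int) -> int:
    """Raise if the exported context length differs from the expected one."""
    arch = metadata.get("general.architecture")
    if arch is None:
        raise KeyError("metadata has no general.architecture entry")
    key = f"{arch}.context_length"
    if key not in metadata:
        raise KeyError(f"metadata has no {key} entry")
    actual = metadata[key]
    if actual != expected:
        raise ValueError(f"{key} is {actual}, expected {expected}")
    return actual


# Hypothetical metadata for a model fine-tuned to a 32k context
exported_metadata = {
    "general.architecture": "llama",
    "llama.context_length": 8192,  # exporter fell back to the base model value
}

try:
    check_context_length(exported_metadata, expected=32768)
    result = "ok"
except ValueError as err:
    result = str(err)
print(result)
```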
Hardware Kernel Incompatibility
- Issue: Exporting to a format that the inference engine cannot accelerate. For example, exporting to ONNX INT8 but running on a GPU that lacks INT8 tensor core support results in slower performance than FP16 due to dequantization overhead.
- Mitigation: Profile the target hardware. On NVIDIA GPUs, prefer FP8 or FP16 with TensorRT (FP8 requires Ada- or Hopper-generation hardware or newer). On older GPUs or CPUs, INT8/INT4 quantization provides benefits. Use the onnxruntime provider API (for example, get_providers() on the InferenceSession) to verify the active execution providers.
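The provider check reduces to comparing what was requested with what the session actually activated. A minimal sketch, assuming the active list comes from `InferenceSession.get_providers()` in onnxruntime; here both lists are hard-coded so the logic can be read in isolation, and the helper name is illustrative.

```python
def missing_providers(requested: list[str], active: list[str]) -> list[str]:
    """Return the requested execution providers that the session did not activate."""
    return [provider for provider in requested if provider not in active]


def runs_on_cpu_only(active: list[str]) -> bool:
    """True when no accelerated provider is active."""
    return all(provider == "CPUExecutionProvider" for provider in active)


# Hypothetical situation: CUDA was requested but the session fell back to CPU
requested = ["CUDAExecutionProvider", "CPUExecutionProvider"]
active = ["CPUExecutionProvider"]

fallback = missing_providers(requested, active)
if fallback:
    print(f"Warning: providers not active: {fallback}")
```

Treating a non-empty result as a deployment error, rather than a warning, avoids serving an INT8 model that is silently running on the CPU.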
LoRA Merging Omission
- Issue: Deploying a base model and loading LoRA adapters at runtime adds overhead and complexity. Some inference engines do not support dynamic adapter loading efficiently.
- Mitigation: Merge the LoRA weights into the base model before export using the peft library's merge utilities (for example, merge_and_unload()). This creates a single monolithic model file optimized for inference speed, though it reduces flexibility for switching adapters.
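What merging does to a single weight matrix can be shown with the standard LoRA update, W' = W + (alpha / r) * B A. The sketch below is plain Python on tiny matrices and only illustrates the arithmetic; it is not how peft implements the merge, and the matrices and helper name are made up.

```python
def merge_lora(weight, lora_b, lora_a, alpha, rank):
    """Return W + (alpha / rank) * (B @ A) for list-of-lists matrices.

    weight: d_out x d_in, lora_b: d_out x rank, lora_a: rank x d_in
    """
    scale = alpha / rank
    merged = []
    for i, row in enumerate(weight):
        merged_row = []
        for j, w in enumerate(row):
            delta = sum(lora_b[i][k] * lora_a[k][j] for k in range(rank))
            merged_row.append(w + scale * delta)
        merged.append(merged_row)
    return merged


# Toy example with d_out = d_in = 2 and rank 1
base_weight = [[1.0, 0.0], [0.0, 1.0]]
lora_b = [[1.0], [2.0]]
lora_a = [[0.5, -0.5]]

merged = merge_lora(base_weight, lora_b, lora_a, alpha=2.0, rank=1)
print(merged)
```

After the merge the adapter matrices are no longer needed at inference time, which is why the exported file can be a single set of weights.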
Version Drift and Breaking Changes
- Issue: Export tools like llama.cpp frequently update the GGUF specification. Models exported with older versions may fail to load in newer inference binaries.
- Mitigation: Pin export tool versions in CI/CD pipelines. Implement automated regression tests that load exported models with the target inference binary version. Maintain a build matrix for critical deployments.
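A cheap first stage for such a regression test is to read the GGUF header before attempting a full load. The sketch below relies on a GGUF file starting with the 4-byte magic `GGUF` followed by a little-endian 32-bit version number; the set of supported versions is an assumption to replace with whatever the pinned inference binary accepts, and the synthetic file stands in for a real export.

```python
import os
import struct
import tempfile

GGUF_MAGIC = b"GGUF"


def read_gguf_version(path: str) -> int:
    """Return the format version stored in the GGUF header."""
    with open(path, "rb") as f:
        header = f.read(8)
    if len(header) < 8 or header[:4] != GGUF_MAGIC:
        raise ValueError(f"{path} is not a GGUF file")
    (version,) = struct.unpack("<I", header[4:8])
    return version


def assert_loadable(path: str, supported_versions=frozenset({2, 3})) -> int:
    """Fail fast when the file version is outside the assumed supported set."""
    version = read_gguf_version(path)
    if version not in supported_versions:
        raise RuntimeError(f"GGUF version {version} is not supported")
    return version


# Usage with a synthetic 8-byte header; a real check would point at the export
with tempfile.TemporaryDirectory() as tmp:
    sample = os.path.join(tmp, "sample.gguf")
    with open(sample, "wb") as f:
        f.write(GGUF_MAGIC + struct.pack("<I", 3))
    detected = assert_loadable(sample)
print(detected)
```

A header check does not replace loading the model with the target binary, but it fails in milliseconds and gives a clearer error than a crash deep inside the loader.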
Silent Failures in Attention Mechanisms
- Issue: Certain quantization schemes may not support grouped query attention (GQA) or multi-query attention (MQA) correctly, leading to shape mismatches or silent accuracy drops.
- Mitigation: Verify that the export tool explicitly supports the model's attention mechanism. Check export logs for warnings regarding unsupported operations. Run benchmark generations to detect attention-related degradation.
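Shape mismatches from GQA or MQA can be caught before export by computing the expected key/value projection shape from the model configuration. A minimal sketch, assuming the usual convention that the key/value projection has `n_kv_heads * head_dim` output rows; the function name is illustrative, and the example numbers are those of Llama 3 8B (hidden size 4096, 32 attention heads, 8 key/value heads).

```python
def expected_kv_proj_shape(hidden_size: int, n_heads: int, n_kv_heads: int) -> tuple[int, int]:
    """Return the (out_features, in_features) shape of a K or V projection."""
    if hidden_size % n_heads != 0:
        raise ValueError("hidden_size must be divisible by n_heads")
    if n_heads % n_kv_heads != 0:
        raise ValueError("n_heads must be divisible by n_kv_heads")
    head_dim = hidden_size // n_heads
    return (n_kv_heads * head_dim, hidden_size)


gqa_shape = expected_kv_proj_shape(hidden_size=4096, n_heads=32, n_kv_heads=8)
mqa_shape = expected_kv_proj_shape(hidden_size=4096, n_heads=32, n_kv_heads=1)
mha_shape = expected_kv_proj_shape(hidden_size=4096, n_heads=32, n_kv_heads=32)
print(gqa_shape, mqa_shape, mha_shape)
```

Comparing these shapes with the tensors in the exported file flags an exporter that assumed full multi-head attention.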
Production Bundle
Action Checklist
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| Consumer Edge (M1/M2/M3 Mac) | GGUF Q4_K_M | Native Metal support, excellent k-quantization, low memory footprint. | Low (Hardware reuse) |
| NVIDIA GPU Cluster | TensorRT FP8 / ONNX FP16 | Maximizes throughput via tensor cores, optimized kernels. | Medium (GPU compute cost) |
| Web/In-Browser Inference | ONNX FP16 / WASM-optimized GGUF | onnxruntime-web support, efficient memory usage in browser sandbox. | Low (Client-side compute) |
| High-Fidelity Research | FP16 GGUF / PyTorch | Zero quantization loss, preserves full model capabilities. | High (VRAM requirements) |
| IoT/Low-Resource Device | GGUF Q2_K / ONNX INT8 | Minimal memory usage, runs on CPU with low latency. | Very Low (Hardware cost) |
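The memory column of this matrix can be sanity-checked with simple arithmetic: weight storage is roughly parameter count times bits per weight. The sketch below uses only formats whose bits per weight follow directly from their block layout (F16 at 16 bits, Q8_0 at 34 bytes per 32 weights, Q4_0 at 18 bytes per 32 weights); K-quant mixes such as Q4_K_M vary per tensor and are left out. The 8-billion parameter count is a round illustrative figure, and the estimate ignores the KV cache and runtime overhead.

```python
# Bits per weight derived from block layouts: payload bytes * 8 / weights per block
BITS_PER_WEIGHT = {
    "F16": 16.0,
    "Q8_0": 34 * 8 / 32,  # 32 int8 values + one fp16 scale = 8.5 bits
    "Q4_0": 18 * 8 / 32,  # 32 4-bit values + one fp16 scale = 4.5 bits
}


def estimate_weight_bytes(n_params: int, quant_type: str) -> float:
    """Approximate size of the weights alone, in bytes."""
    return n_params * BITS_PER_WEIGHT[quant_type] / 8


n_params = 8_000_000_000
for quant_type in BITS_PER_WEIGHT:
    size_gb = estimate_weight_bytes(n_params, quant_type) / 1e9
    print(f"{quant_type}: {size_gb:.1f} GB")
```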
Configuration Template
GGUF Export Configuration (export_config.yaml)
# Configuration for GGUF Export Pipeline
model:
  source: "meta-llama/Meta-Llama-3-8B-Instruct"
  revision: "main"
export:
  format: "gguf"
  outtype: "f16"
  quantization:
    types: ["Q4_K_M", "Q8_0"]
    keep_f16: true
  metadata:
    context_length: 8192
    rope_scaling:
      type: "none"
      factor: 1.0
validation:
  perplexity_dataset: "wikitext-2-raw-v1"
  max_perplexity_drop: 0.05
  hardware_targets:
    - "apple_m3"
    - "nvidia_rtx_4090"
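The validation section implies a quality gate on perplexity. One possible reading is sketched below, assuming `max_perplexity_drop` means the maximum tolerated relative increase in perplexity after quantization; that interpretation, the function names, and the sample numbers are assumptions rather than something the configuration defines.

```python
import math


def perplexity(token_logprobs: list[float]) -> float:
    """Perplexity is exp of the negative mean natural-log probability per token."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def within_budget(baseline_ppl: float, quantized_ppl: float,
                  max_relative_increase: float = 0.05) -> bool:
    """True when the quantized model's perplexity rose by at most the budget."""
    return (quantized_ppl - baseline_ppl) / baseline_ppl <= max_relative_increase


# A model assigning probability 0.5 to every token has perplexity 2
uniform_ppl = perplexity([math.log(0.5)] * 4)

# Hypothetical measurements for an FP16 baseline and two quantized exports
accepted = within_budget(baseline_ppl=6.0, quantized_ppl=6.2)
rejected = within_budget(baseline_ppl=6.0, quantized_ppl=6.5)
print(uniform_ppl, accepted, rejected)
```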
ONNX Runtime Configuration (ort_config.json)
{
  "model": {
    "name": "microsoft/Phi-3-mini-4k-instruct",
    "task": "text-generation"
  },
  "export": {
    "opset": 17,
    "optimize": true,
    "disable_embed_layer_norm": false
  },
  "quantization": {
    "mode": "static",
    "format": "qdq",
    "calibration": {
      "method": "entropy",
      "num_samples": 300
    }
  },
  "inference": {
    "providers": ["CUDAExecutionProvider", "CPUExecutionProvider"],
    "io_binding": true
  }
}
Quick Start Guide
- Install Tools:
  pip install llama-cpp-python "optimum[onnxruntime]" transformers
  git clone https://github.com/ggerganov/llama.cpp
  pip install -r llama.cpp/requirements.txt
  cmake -S llama.cpp -B llama.cpp/build
  cmake --build llama.cpp/build --config Release
- Download Model:
  huggingface-cli download meta-llama/Meta-Llama-3-8B-Instruct --local-dir ./models/llama3-8b
- Export GGUF (current llama.cpp checkouts name the tools convert_hf_to_gguf.py, llama-quantize, and llama-cli; older checkouts use convert-hf-to-gguf.py, quantize, and main):
  python llama.cpp/convert_hf_to_gguf.py ./models/llama3-8b --outfile ./exports/llama3-8b-f16.gguf --outtype f16
  ./llama.cpp/build/bin/llama-quantize ./exports/llama3-8b-f16.gguf ./exports/llama3-8b-Q4_K_M.gguf Q4_K_M
- Run Inference:
  ./llama.cpp/build/bin/llama-cli -m ./exports/llama3-8b-Q4_K_M.gguf -p "Explain model export." -n 50
- Verify Output: Check token generation speed and memory usage. Switch to a higher-precision quantization type if quality suffers, or a lower-precision one if speed or memory is the bottleneck.