Edge AI · April 22, 2026 · 7 min read · OmniE2E Engineering

Deploying Vision Models at the Edge: A Deep Dive into Quantization and TensorRT Optimization

Practical strategies for deploying complex vision models on edge devices while maintaining accuracy. Covers INT8 quantization, TensorRT optimization, and real-world benchmarks on Jetson and Hailo platforms.



Running complex vision models on edge devices presents a fundamental engineering challenge: how do you maintain the accuracy of a 200MB floating-point model while running it at 30 FPS on a device with limited compute and memory? This post documents our journey deploying multi-person pose estimation and tracking models on NVIDIA Jetson and Hailo-8 platforms.

The Edge Deployment Challenge

Our production pipeline consists of three major components:

  1. Person Detection: YOLOv8-based detector fine-tuned for fisheye distortion
  2. Pose Estimation: HRNet-W32 for 17-keypoint skeleton estimation
  3. Multi-Object Tracking: ByteTrack with appearance features

Running these sequentially on a Jetson Orin Nano (40 TOPS INT8) with FP32 models yields approximately 3 FPS—unacceptable for real-time applications. Our target: 25+ FPS with minimal accuracy degradation.

Understanding Quantization Fundamentals

The Mathematics of INT8 Quantization

Quantization maps floating-point weights and activations to lower-precision integers:

Q(x) = \text{round}\left(\frac{x}{s}\right) + z

Where:

  • s is the scale factor
  • z is the zero-point offset
  • x is the original FP32 value

The inverse operation recovers an approximation:

\hat{x} = s \cdot (Q(x) - z)

The key challenge is determining optimal scale factors that minimize quantization error while preserving model accuracy.
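To make the mapping concrete, here is a minimal NumPy sketch (illustrative only, not our production code) that quantizes a tensor with a given scale and zero-point and measures the reconstruction error:

import numpy as np

def quantize(x, scale, zero_point, qmin=0, qmax=255):
    # Q(x) = round(x / s) + z, clipped to the integer range
    q = np.round(x / scale) + zero_point
    return np.clip(q, qmin, qmax).astype(np.uint8)

def dequantize(q, scale, zero_point):
    # x_hat = s * (Q(x) - z)
    return scale * (q.astype(np.float32) - zero_point)

x = np.random.randn(1000).astype(np.float32)
scale = (x.max() - x.min()) / 255
zero_point = int(round(-x.min() / scale))
x_hat = dequantize(quantize(x, scale, zero_point), scale, zero_point)
print("mean abs quantization error:", np.abs(x - x_hat).mean())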

Calibration Strategies

We evaluated three calibration approaches for determining scale factors:

1. MinMax Calibration

scale = (max_val - min_val) / (qmax - qmin)   # qmin/qmax: integer range, e.g. 0 and 255 for uint8
zero_point = qmin - round(min_val / scale)

Simple but sensitive to outliers. A single activation spike can dramatically reduce effective precision for the majority of values.
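A toy snippet (ours, purely illustrative) makes the outlier problem visible: a single large activation inflates the MinMax scale, so ordinary values collapse into far fewer of the 256 available levels.

import numpy as np

activations = np.random.randn(10_000).astype(np.float32)            # typical values in roughly [-4, 4]
spiked = np.concatenate([activations, np.array([120.0], np.float32)])  # one outlier

scale_clean = (activations.max() - activations.min()) / 255
scale_spiked = (spiked.max() - spiked.min()) / 255
print(scale_clean, scale_spiked)  # the spiked scale is much larger, so normal values
                                  # occupy only a handful of the 256 quantization bins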

2. Entropy Calibration (KL Divergence)

Minimizes the information loss between the original FP32 distribution and quantized distribution:

D_{KL}(P \| Q) = \sum_{i} P(i) \log\frac{P(i)}{Q(i)}

TensorRT's default calibrator uses this approach with 128 histogram bins.
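The idea can be sketched with a simplified threshold search (a toy version of what an entropy calibrator does internally, not TensorRT's actual implementation): histogram the observed activations, then pick the clipping threshold whose quantized distribution diverges least from the original.

import numpy as np

def kl_divergence(p, q, eps=1e-10):
    p = p / (p.sum() + eps)
    q = q / (q.sum() + eps)
    mask = p > 0
    return float(np.sum(p[mask] * np.log((p[mask] + eps) / (q[mask] + eps))))

def entropy_threshold(activations, num_bins=2048, num_quant_levels=128):
    hist, edges = np.histogram(np.abs(activations), bins=num_bins)
    best_kl, best_t = np.inf, edges[-1]
    # Only consider thresholds that cover a whole multiple of the quantized level count
    for i in range(num_quant_levels, num_bins + 1, num_quant_levels):
        p = hist[:i].astype(np.float64)
        p[-1] += hist[i:].sum()               # fold the clipped tail into the last bin
        group = i // num_quant_levels         # FP32 histogram bins represented by one INT8 level
        q = np.repeat(p.reshape(num_quant_levels, group).sum(axis=1) / group, group)
        kl = kl_divergence(p, q)
        if kl < best_kl:
            best_kl, best_t = kl, edges[i]
    return best_t                             # a symmetric INT8 scale would be best_t / 127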

3. Percentile Calibration

Clips outliers by using the 99.99th percentile instead of true min/max:

import numpy as np

def percentile_calibration(tensor, percentile=99.99):
    # Clip the activation range at the given percentile to suppress outliers
    lower = np.percentile(tensor, 100 - percentile)
    upper = np.percentile(tensor, percentile)
    scale = (upper - lower) / 255   # map the clipped range onto the 8-bit range
    return scale, -lower / scale    # (scale, zero_point)

This proved most effective for our pose estimation models, which exhibit long-tailed activation distributions.
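In practice the calibration statistics come from forward hooks run over the calibration set. A hedged sketch (model, final_layer, and calibration_loader are placeholders for the actual pose network and data pipeline):

import numpy as np
import torch

captured = []

def grab_activations(module, inputs, output):
    captured.append(output.detach().flatten().cpu().numpy())

# Hypothetical wiring: capture one layer's activations over the calibration images
handle = model.final_layer.register_forward_hook(grab_activations)
with torch.no_grad():
    for images, _ in calibration_loader:
        model(images.cuda())
handle.remove()

scale, zero_point = percentile_calibration(np.concatenate(captured))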

TensorRT Optimization Pipeline

Step 1: ONNX Export with Dynamic Axes

Export PyTorch models with explicit dynamic dimensions:

import torch
import torch.onnx

def export_pose_model(model, output_path):
    model.eval()
    dummy_input = torch.randn(1, 3, 384, 288).cuda()
    
    torch.onnx.export(
        model,
        dummy_input,
        output_path,
        input_names=['input'],
        output_names=['heatmaps', 'offsets'],
        dynamic_axes={
            'input': {0: 'batch_size'},
            'heatmaps': {0: 'batch_size'},
            'offsets': {0: 'batch_size'}
        },
        opset_version=17,
        do_constant_folding=True
    )
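Before building engines, it is worth sanity-checking the export against the PyTorch outputs. A minimal check with onnxruntime might look like this (output ordering and tolerance are our assumptions):

import numpy as np
import onnxruntime as ort
import torch

def verify_export(model, onnx_path, atol=1e-3):
    model.eval()
    x = torch.randn(1, 3, 384, 288)
    with torch.no_grad():
        ref_heatmaps = model(x.cuda())[0].cpu().numpy()   # assumes heatmaps are the first output

    sess = ort.InferenceSession(onnx_path, providers=['CPUExecutionProvider'])
    onnx_heatmaps = sess.run(['heatmaps'], {'input': x.numpy()})[0]
    assert np.allclose(ref_heatmaps, onnx_heatmaps, atol=atol), "ONNX export drifted from PyTorch"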

Step 2: TensorRT Engine Building

Build optimized engines with INT8 precision:

import tensorrt as trt

def build_engine(onnx_path, engine_path, calibrator):
    logger = trt.Logger(trt.Logger.WARNING)
    builder = trt.Builder(logger)
    network = builder.create_network(
        1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
    )
    parser = trt.OnnxParser(network, logger)
    
    with open(onnx_path, 'rb') as f:
        if not parser.parse(f.read()):
            # Surface parser errors instead of silently building a broken network
            for i in range(parser.num_errors):
                print(parser.get_error(i))
            raise RuntimeError(f'Failed to parse {onnx_path}')
    
    config = builder.create_builder_config()
    config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, 1 << 30)
    
    # Enable INT8 with calibration
    config.set_flag(trt.BuilderFlag.INT8)
    config.int8_calibrator = calibrator
    
    # Enable FP16 fallback for sensitive layers
    config.set_flag(trt.BuilderFlag.FP16)
    
    # Build and serialize
    serialized_engine = builder.build_serialized_network(network, config)
    with open(engine_path, 'wb') as f:
        f.write(serialized_engine)
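At runtime the serialized engine is deserialized through a trt.Runtime; a minimal loader (TensorRT 8.x-style API):

import tensorrt as trt

def load_engine(engine_path):
    logger = trt.Logger(trt.Logger.WARNING)
    runtime = trt.Runtime(logger)
    with open(engine_path, 'rb') as f:
        engine = runtime.deserialize_cuda_engine(f.read())
    return engine, engine.create_execution_context()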

Step 3: Custom Calibrator Implementation

import os
import pycuda.autoinit  # creates the CUDA context used by the allocations below
import pycuda.driver as cuda
import tensorrt as trt

class PoseCalibrator(trt.IInt8EntropyCalibrator2):
    def __init__(self, data_loader, cache_file):
        super().__init__()
        self.data_loader = iter(data_loader)
        self.cache_file = cache_file
        self.batch_size = 8
        self.current_index = 0

        # Device buffer for one calibration batch: N x 3 x 384 x 288 FP32 values
        self.device_input = cuda.mem_alloc(
            self.batch_size * 3 * 384 * 288 * 4
        )

    def get_batch_size(self):
        # Required by the calibrator interface
        return self.batch_size

    def get_batch(self, names):
        try:
            batch = next(self.data_loader)
            cuda.memcpy_htod(self.device_input, batch.numpy())
            return [int(self.device_input)]
        except StopIteration:
            return None
    
    def read_calibration_cache(self):
        if os.path.exists(self.cache_file):
            with open(self.cache_file, 'rb') as f:
                return f.read()
        return None
    
    def write_calibration_cache(self, cache):
        with open(self.cache_file, 'wb') as f:
            f.write(cache)
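Wiring it together might look like this, assuming calibration_dataset yields preprocessed 3x384x288 float tensors (dataset and file names are placeholders); drop_last=True keeps every batch at the fixed size the calibrator allocated for:

from torch.utils.data import DataLoader

calib_loader = DataLoader(calibration_dataset, batch_size=8, shuffle=True, drop_last=True)
calibrator = PoseCalibrator(calib_loader, cache_file='pose_int8.cache')
build_engine('pose_model.onnx', 'pose_model_int8.engine', calibrator)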

Layer-wise Precision Analysis

Not all layers quantize equally well. We developed a systematic approach to identify problematic layers:

Sensitivity Analysis Protocol

def analyze_layer_sensitivity(model, calibration_data, metric_fn):
    """
    Quantize one layer at a time, measure accuracy impact.
    """
    baseline = metric_fn(model, precision='fp32')
    sensitivities = {}
    
    for layer_name in model.get_quantizable_layers():
        # Quantize only this layer
        model.set_layer_precision(layer_name, 'int8')
        score = metric_fn(model, precision='mixed')
        sensitivities[layer_name] = baseline - score
        model.set_layer_precision(layer_name, 'fp32')
    
    return sorted(sensitivities.items(), key=lambda x: x[1], reverse=True)
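The output is a ranked list; in practice we keep the worst offenders in higher precision, for example everything above a chosen accuracy-drop threshold (the threshold and metric function here are illustrative):

sensitivities = analyze_layer_sensitivity(model, calibration_data, evaluate_coco_ap)
fp16_layers = [name for name, drop in sensitivities if drop > 0.02]
print(f"Keeping {len(fp16_layers)} layers in FP16:", fp16_layers)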

Results: Sensitive Layers in HRNet

Layer                        Sensitivity Score    Action
stage4.fuse_layers.3.3       0.082                Keep FP16
final_layer.conv             0.071                Keep FP16
stage3.fuse_layers.2.2       0.043                Keep FP16
stage2.branches.1.0.conv1    0.008                Quantize INT8
...                          ...                  ...

By keeping only 3 layers in FP16 (2% of total layers), we preserved 99.1% of FP32 accuracy while gaining most of the INT8 speedup.
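In TensorRT, pinning those layers is expressed on the network definition before the engine is built. A sketch using the precision-constraint API (requires TensorRT 8.2 or newer; the layer names come from the sensitivity analysis above):

import tensorrt as trt

def pin_layers_to_fp16(network, config, fp16_layer_names):
    # Ask the builder to respect per-layer precision requests
    config.set_flag(trt.BuilderFlag.OBEY_PRECISION_CONSTRAINTS)
    for i in range(network.num_layers):
        layer = network.get_layer(i)
        if layer.name in fp16_layer_names:
            layer.precision = trt.float16
            layer.set_output_type(0, trt.float16)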

Memory Optimization Techniques

1. Activation Checkpointing

For multi-stage networks, recompute intermediate activations instead of storing them:

class CheckpointedHRNet(nn.Module):
    def forward(self, x):
        # Stage 1-2: Normal forward
        x = self.stage1(x)
        x = self.stage2(x)
        
        # Stage 3-4: Checkpointed
        x = torch.utils.checkpoint.checkpoint(
            self.stage3, x, use_reentrant=False
        )
        x = torch.utils.checkpoint.checkpoint(
            self.stage4, x, use_reentrant=False
        )
        return x

Memory reduction: 40% with 15% compute overhead.

2. Multi-stream Inference

Overlap data transfer and computation using CUDA streams:

class PipelinedInference:
    def __init__(self, engine, num_streams=2):
        self.streams = [cuda.Stream() for _ in range(num_streams)]
        self.contexts = [engine.create_execution_context() 
                        for _ in range(num_streams)]
        self.buffers = [self._allocate_buffers() 
                       for _ in range(num_streams)]
    
    def infer_async(self, inputs):
        results = []
        for i, inp in enumerate(inputs):
            stream_idx = i % len(self.streams)
            stream = self.streams[stream_idx]
            ctx = self.contexts[stream_idx]
            bufs = self.buffers[stream_idx]
            
            # Async copy input
            cuda.memcpy_htod_async(bufs['input'], inp, stream)
            
            # Execute
            ctx.execute_async_v2(
                bindings=bufs['bindings'],
                stream_handle=stream.handle
            )
            
            # Async copy output
            cuda.memcpy_dtoh_async(bufs['output'], bufs['output_d'], stream)
            results.append((stream, bufs['output']))
        
        return results
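Note that infer_async only enqueues work; the caller must synchronize each stream before reading its host buffer. A usage sketch (engine, frames, and postprocess are assumed to exist):

pipeline = PipelinedInference(engine, num_streams=2)
pending = pipeline.infer_async(frames)
for stream, host_output in pending:
    stream.synchronize()       # wait for the device-to-host copy on this stream
    postprocess(host_output)   # e.g. decode heatmaps to keypoints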

3. Unified Memory for Large Batches

For batch sizes exceeding GPU memory:

# Enable unified memory
cuda.mem_alloc_managed(size, cuda.mem_attach_flags.GLOBAL)

Allows automatic page migration between CPU and GPU, enabling larger batch processing at the cost of some latency.

Benchmark Results

Jetson Orin Nano (40 TOPS)

Model                  FP32       FP16       INT8       INT8 + Optimizations
YOLOv8s (640x640)      8.2 FPS    22.1 FPS   35.4 FPS   41.2 FPS
HRNet-W32 (384x288)    4.1 FPS    11.3 FPS   24.7 FPS   28.9 FPS
ByteTrack              89.2 FPS   91.1 FPS   92.3 FPS   94.1 FPS
Full Pipeline          2.9 FPS    7.8 FPS    18.2 FPS   25.7 FPS

Accuracy Comparison (COCO val2017)

Model                FP32 AP   INT8 AP   Degradation
YOLOv8s Detection    44.9      44.2      -0.7
HRNet-W32 Pose       74.4      73.8      -0.6
Combined mAP         67.2      66.4      -0.8

Hailo-8 Deployment Notes

The Hailo-8 accelerator (26 TOPS) uses a different compilation flow:

# Compile ONNX to Hailo Executable Format (HEF)
hailo compiler pose_model.onnx \
    --hw-arch hailo8 \
    --calib-set calibration_data.npy \
    --output pose_model.hef

Key differences from TensorRT:

  • Uses proprietary quantization, less control over per-layer precision
  • Requires Hailo Dataflow Compiler for optimization
  • Better power efficiency (2.5W vs Jetson's 7-15W)

Benchmark: 31 FPS for the full pipeline at 2.5W power consumption.

Production Deployment Checklist

  1. Calibration Data Quality: Use 500-1000 representative images, covering edge cases
  2. Thermal Management: INT8 runs hotter; ensure adequate cooling
  3. Precision Fallback: Keep a FP16 engine for debugging discrepancies
  4. Version Pinning: Lock TensorRT, CUDA, and cuDNN versions
  5. Monitoring: Log inference times and detect thermal throttling (see the sketch below)
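On the monitoring point, a minimal sketch of per-inference latency and temperature logging (the sysfs thermal zone index and the warning threshold are board-specific assumptions, and run_once stands for one engine execution plus synchronization):

import time
from pathlib import Path

THERMAL_ZONE = Path('/sys/class/thermal/thermal_zone0/temp')  # zone index varies per board

def read_temp_c():
    # sysfs reports millidegrees Celsius
    return int(THERMAL_ZONE.read_text()) / 1000.0

def timed_inference(run_once, log, temp_warn_c=85.0):
    start = time.perf_counter()
    run_once()                                      # e.g. execute_async_v2 + stream.synchronize()
    latency_ms = (time.perf_counter() - start) * 1000.0
    temp = read_temp_c()
    log.append({'ts': time.time(), 'latency_ms': latency_ms, 'temp_c': temp})
    if temp > temp_warn_c:
        print(f"WARNING: SoC temperature {temp:.1f} C, expect thermal throttling")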

Conclusion

Edge deployment of complex vision models is achievable with systematic optimization. The key insights:

  • Percentile calibration outperforms entropy for long-tailed distributions
  • Selective FP16 layers (< 5%) preserve accuracy with minimal speed impact
  • Multi-stream inference provides 15-20% throughput improvement
  • Combined optimizations achieved 8.8x speedup over FP32 baseline

Our production systems now run reliably at 25+ FPS on Jetson Orin Nano, enabling real-time spatial intelligence in resource-constrained environments.