Computer Vision · April 15, 2026 · 9 min read · OmniE2E Engineering

# Real-time Multi-Object Tracking: From SORT to ByteTrack and Beyond

A comprehensive analysis of modern multi-object tracking algorithms, covering association strategies, ReID integration, and occlusion handling in crowded indoor environments.


Multi-object tracking (MOT) in indoor environments presents unique challenges: frequent occlusions, similar appearances, unpredictable movements, and the need for consistent identity preservation over extended periods. This post examines the evolution of tracking algorithms and our production-tested modifications for ceiling-mounted fisheye camera deployments.

## The Tracking Problem Formulation

Given a sequence of detections across frames, MOT aims to assign consistent identity labels. Formally:

Let $\mathcal{D}_t = \{d_1^t, d_2^t, \ldots, d_n^t\}$ be the detections at frame $t$, and $\mathcal{T}_{t-1} = \{T_1, T_2, \ldots, T_m\}$ the existing tracks. The association problem is to find the optimal assignment matrix $A$ where:

$$A^* = \arg\min_{A} \sum_{i,j} C_{ij} \cdot A_{ij}$$

subject to the constraint that each detection is assigned to at most one track and vice versa.
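
To make this concrete, here is a minimal sketch of solving such an assignment with SciPy's Hungarian-algorithm solver (the cost values below are invented for illustration):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# C[i, j]: cost of assigning track i to detection j (illustrative values)
C = np.array([
    [0.1, 0.9, 0.8],
    [0.7, 0.2, 0.9],
])

track_idx, det_idx = linear_sum_assignment(C)
# track 0 -> detection 0, track 1 -> detection 1; detection 2 stays unmatched
print(list(zip(track_idx, det_idx)))
```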

## SORT: The Foundation

Simple Online and Realtime Tracking (SORT) established the tracking-by-detection paradigm:

### Kalman Filter State Model

State vector: $\mathbf{x} = [u, v, s, r, \dot{u}, \dot{v}, \dot{s}]^T$

Where:

  • $(u, v)$: bounding box center
  • $s$: scale (area)
  • $r$: aspect ratio (assumed constant)
  • $(\dot{u}, \dot{v}, \dot{s})$: the respective velocities
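
In code, this parameterization maps to and from detector boxes as follows; a small sketch (the function names are ours, not from the SORT codebase):

```python
import numpy as np

def bbox_to_state(bbox):
    """[x1, y1, x2, y2] -> the observed part of the state, [u, v, s, r]."""
    x1, y1, x2, y2 = bbox
    w, h = x2 - x1, y2 - y1
    u, v = x1 + w / 2, y1 + h / 2           # box center
    return np.array([u, v, w * h, w / h])   # s = area, r = aspect ratio

def state_to_bbox(state):
    """Invert the mapping: since s = w*h and r = w/h, w = sqrt(s*r)."""
    u, v, s, r = state[:4]
    w = np.sqrt(s * r)
    h = s / w
    return np.array([u - w / 2, v - h / 2, u + w / 2, v + h / 2])
```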

State transition:

$$\mathbf{x}_t = F \mathbf{x}_{t-1} + \mathbf{w}$$

Where $F$ is the constant-velocity motion model:

$$F = \begin{bmatrix}
1 & 0 & 0 & 0 & 1 & 0 & 0 \\
0 & 1 & 0 & 0 & 0 & 1 & 0 \\
0 & 0 & 1 & 0 & 0 & 0 & 1 \\
0 & 0 & 0 & 1 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 1 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 1 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 1
\end{bmatrix}$$
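
With a time step of one frame ($\Delta t = 1$), each position component simply accumulates its velocity. In code, prediction is just this matrix applied to the state; a minimal numpy sketch (variable names are ours, with $Q$ the covariance of the process noise $\mathbf{w}$):

```python
import numpy as np

# Constant-velocity transition matrix for x = [u, v, s, r, u', v', s']^T
F = np.eye(7)
F[0, 4] = F[1, 5] = F[2, 6] = 1.0  # position += velocity each frame

def kf_predict(x, P, Q):
    """One Kalman prediction step: propagate the state mean and covariance."""
    x_pred = F @ x
    P_pred = F @ P @ F.T + Q
    return x_pred, P_pred
```
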
### Hungarian Algorithm for Assignment

SORT uses IoU (Intersection over Union) as the cost metric:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def iou_cost_matrix(tracks, detections):
    cost = np.zeros((len(tracks), len(detections)))
    for i, track in enumerate(tracks):
        pred_bbox = track.predict()  # Kalman-predicted box for this frame
        for j, det in enumerate(detections):
            cost[i, j] = 1 - iou(pred_bbox, det.bbox)
    return cost

# Hungarian algorithm (Kuhn-Munkres)
cost_matrix = iou_cost_matrix(tracks, detections)
row_indices, col_indices = linear_sum_assignment(cost_matrix)
```

### SORT Limitations

1. **Pure motion reliance**: With no appearance features, IDs switch during occlusions
2. **Low-confidence rejection**: Potentially valid detections below the threshold are discarded
3. **Single association stage**: No recovery mechanism for missed associations

## DeepSORT: Adding Appearance

DeepSORT addresses ID switches by integrating a ReID (re-identification) network:

### Combined Cost Function

$$C = \lambda \cdot C_{appearance} + (1 - \lambda) \cdot C_{motion}$$

Appearance cost using cosine distance:

$$C_{appearance}(i, j) = 1 - \frac{\mathbf{r}_i^T \mathbf{r}_j}{||\mathbf{r}_i|| \cdot ||\mathbf{r}_j||}$$

Where $\mathbf{r}$ is the ReID embedding vector.

### Cascade Matching

DeepSORT's matching cascade prioritizes recently seen tracks:

```python
def cascade_matching(tracks, detections, max_age=30):
    unmatched_detections = list(range(len(detections)))
    matches = []

    # Match tracks in order of increasing time since their last update
    for age in range(max_age):
        if not unmatched_detections:
            break
        tracks_at_age = [t for t in tracks if t.time_since_update == age]
        if not tracks_at_age:
            continue
        matched, _, unmatched_dets = min_cost_matching(
            tracks_at_age, [detections[i] for i in unmatched_detections])
        # min_cost_matching returns indices into the sub-lists passed to it;
        # map detection indices back into the full detection list
        matches.extend((trk, unmatched_detections[det]) for trk, det in matched)
        unmatched_detections = [unmatched_detections[i] for i in unmatched_dets]

    return matches, unmatched_detections
```

## ByteTrack: Every Detection Matters

ByteTrack's key insight: low-confidence detections often correspond to occluded objects and shouldn't be discarded.

### Two-Stage Association

```python
class ByteTrack:
    def update(self, detections, confidence_threshold=0.5):
        # Split detections by confidence score
        high_conf = [d for d in detections if d.score >= confidence_threshold]
        low_conf = [d for d in detections if d.score < confidence_threshold]

        # First association: high-confidence detections
        matched, unmatched_tracks, unmatched_high = \
            self.associate(self.tracks, high_conf)

        # Second association: remaining tracks against low-confidence detections
        remaining_tracks = [self.tracks[i] for i in unmatched_tracks]
        matched_low, still_unmatched, _ = \
            self.associate(remaining_tracks, low_conf)

        # Update tracks matched in the first stage (indices into high_conf)
        for track_idx, det_idx in matched:
            self.tracks[track_idx].update(high_conf[det_idx])

        # Second-stage indices refer to remaining_tracks and low_conf
        for track_idx, det_idx in matched_low:
            remaining_tracks[track_idx].update(low_conf[det_idx])

        # Handle unmatched tracks and detections
        self._handle_unmatched(still_unmatched, unmatched_high)
```

### IoU-based Association Only

ByteTrack deliberately avoids appearance features in the association step:

```python
def associate(self, tracks, detections, iou_threshold=0.3):
    if not tracks or not detections:
        return [], list(range(len(tracks))), list(range(len(detections)))

    # Predict track positions
    predicted = [track.predict() for track in tracks]

    # Compute the IoU matrix
    iou_matrix = np.zeros((len(tracks), len(detections)))
    for i, pred in enumerate(predicted):
        for j, det in enumerate(detections):
            iou_matrix[i, j] = iou(pred, det.bbox)

    # Gate pairs below the IoU threshold with a prohibitively large cost
    cost = 1 - iou_matrix
    cost[iou_matrix < iou_threshold] = 1e5

    # Hungarian assignment
    row_ind, col_ind = linear_sum_assignment(cost)
    matches = [(r, c) for r, c in zip(row_ind, col_ind) if cost[r, c] < 1e5]

    matched_rows = {r for r, _ in matches}
    matched_cols = {c for _, c in matches}
    unmatched_tracks = [i for i in range(len(tracks)) if i not in matched_rows]
    unmatched_detections = [j for j in range(len(detections))
                            if j not in matched_cols]
    return matches, unmatched_tracks, unmatched_detections
```
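
For orientation, here is a rough sketch of how this two-stage update sits in a per-frame loop. `Detection`, `detector`, and `draw_box` are illustrative stand-ins, not part of the ByteTrack API:

```python
from dataclasses import dataclass

@dataclass
class Detection:
    bbox: tuple   # (x1, y1, x2, y2)
    score: float  # detector confidence

tracker = ByteTrack()
for frame in video_frames:            # any iterable of frames
    raw = detector(frame)             # hypothetical detector: [(bbox, score), ...]
    detections = [Detection(bbox=b, score=s) for b, s in raw]
    tracker.update(detections)
    for track in tracker.tracks:      # draw or log the surviving tracks
        draw_box(frame, track)
```
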
## Our Modifications for Fisheye Cameras

Standard tracking algorithms assume perspective cameras. Fisheye cameras introduce:

1. **Severe radial distortion**: Objects near the image edges appear stretched
2. **Non-uniform scale**: The same person appears at different sizes at different image positions
3. **Top-down viewpoint**: Standard ReID models, trained on frontal views, fail

### Distortion-Aware IoU

We compute IoU in normalized coordinates after undistortion:

```python
import cv2
import numpy as np

class FisheyeIoU:
    def __init__(self, camera_matrix, dist_coeffs):
        self.K = camera_matrix
        self.D = dist_coeffs

    def undistort_bbox(self, bbox):
        """Convert a fisheye bbox to normalized (undistorted) coordinates."""
        x1, y1, x2, y2 = bbox
        corners = np.array([[x1, y1], [x2, y1], [x2, y2], [x1, y2]],
                           dtype=np.float32)

        # Undistort the four corner points
        undistorted = cv2.fisheye.undistortPoints(
            corners.reshape(-1, 1, 2), self.K, self.D
        ).reshape(-1, 2)

        # Return the axis-aligned bbox in normalized space
        return [
            undistorted[:, 0].min(), undistorted[:, 1].min(),
            undistorted[:, 0].max(), undistorted[:, 1].max()
        ]

    def compute(self, bbox1, bbox2):
        norm1 = self.undistort_bbox(bbox1)
        norm2 = self.undistort_bbox(bbox2)
        return standard_iou(norm1, norm2)  # ordinary IoU on the normalized boxes
```

### Position-Aware Kalman Filter

We adapt the process noise based on position in the fisheye image:

```python
import numpy as np

class FisheyeKalmanFilter:
    def __init__(self, camera_model):
        self.camera = camera_model
        self.base_process_noise = 0.01

    def get_process_noise(self, position):
        """Higher noise near the image edges, where distortion is strongest."""
        u, v = position

        # Normalized distance from the principal point
        center_u, center_v = self.camera.cx, self.camera.cy
        dist = np.sqrt((u - center_u)**2 + (v - center_v)**2)
        max_dist = np.sqrt(center_u**2 + center_v**2)

        # Quadratic scaling of noise with distance
        scale = 1 + 2 * (dist / max_dist) ** 2
        return self.base_process_noise * scale
```

### Top-Down ReID Features

We trained a custom ReID network on top-down person crops:

```python
import timm
import torch.nn as nn
import torch.nn.functional as F

class TopDownReID(nn.Module):
    def __init__(self, backbone='resnet50'):
        super().__init__()
        self.backbone = timm.create_model(backbone, pretrained=True)
        self.backbone.fc = nn.Identity()  # drop the classification head

        # Additional projection head for the top-down viewpoint
        self.head = nn.Sequential(
            nn.Linear(2048, 1024),
            nn.BatchNorm1d(1024),
            nn.ReLU(),
            nn.Linear(1024, 256),
            nn.BatchNorm1d(256)
        )

    def forward(self, x):
        features = self.backbone(x)
        embeddings = self.head(features)
        return F.normalize(embeddings, p=2, dim=1)  # L2-normalized embeddings
```

Training data: 50,000 top-down person crops from ceiling cameras, with identity labels maintained across multiple camera views.
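
As a quick usage sketch, embeddings from this network can be compared with a plain dot product, since they are L2-normalized (the crop size and the 0.7 threshold here are illustrative):

```python
import torch

model = TopDownReID().eval()
with torch.no_grad():
    crops = torch.randn(4, 3, 256, 128)  # stand-ins for resized person crops
    emb = model(crops)                   # (4, 256), L2-normalized
    sim = emb @ emb.T                    # pairwise cosine similarities
    same_person = sim > 0.7              # illustrative decision threshold
```
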
## Handling Long-Term Occlusions

Indoor environments feature prolonged occlusions (furniture, pillars). We implemented a track recovery system:

### Track State Machine

```python
from enum import Enum
from collections import deque

class TrackState(Enum):
    TENTATIVE = 1  # New track, needs confirmation
    CONFIRMED = 2  # Actively tracked
    OCCLUDED = 3   # Temporarily lost
    DELETED = 4    # Permanently removed

class EnhancedTrack:
    def __init__(self, detection, track_id):
        self.id = track_id
        self.state = TrackState.TENTATIVE
        self.hits = 1
        self.misses = 0
        self.max_occlusion_time = 90  # 3 seconds at 30 fps

        # Store appearance history for recovery
        self.appearance_gallery = deque(maxlen=30)
        self.last_position = detection.bbox

    def update_state(self, matched):
        if matched:
            self.hits += 1
            self.misses = 0
            if self.state == TrackState.TENTATIVE and self.hits >= 3:
                self.state = TrackState.CONFIRMED
            elif self.state == TrackState.OCCLUDED:
                self.state = TrackState.CONFIRMED
        else:
            self.misses += 1
            if self.state == TrackState.CONFIRMED and self.misses >= 3:
                self.state = TrackState.OCCLUDED
            elif self.state == TrackState.OCCLUDED and \
                    self.misses >= self.max_occlusion_time:
                self.state = TrackState.DELETED
```

### Appearance-Based Recovery

When an occluded track's predicted position doesn't match any detection, we search using appearance:

```python
import torch
import torch.nn.functional as F

def recover_occluded_tracks(occluded_tracks, new_detections, reid_model):
    """Attempt to recover occluded tracks using appearance matching."""
    recoveries = []
    for track in occluded_tracks:
        if not track.appearance_gallery:
            continue

        # Gallery embedding: mean of the last N stored embeddings
        gallery_emb = torch.stack(list(track.appearance_gallery)).mean(dim=0)

        best_match = None
        best_score = 0.7  # minimum similarity threshold
        for det in new_detections:
            det_emb = reid_model(det.crop)
            similarity = F.cosine_similarity(gallery_emb, det_emb, dim=0)
            if similarity > best_score:
                # Additional spatial plausibility check
                if is_spatially_plausible(track.last_position, det.bbox,
                                          track.misses):
                    best_score = similarity
                    best_match = det

        if best_match:
            recoveries.append((track, best_match))
    return recoveries
```

## Benchmark Results

### MOT17 Challenge (Standard Perspective Cameras)

| Method | MOTA | IDF1 | HOTA | FPS |
|--------|------|------|------|-----|
| SORT | 43.1 | 39.8 | 34.0 | 142 |
| DeepSORT | 61.4 | 62.2 | 45.6 | 17 |
| ByteTrack | 80.3 | 77.3 | 63.1 | 29 |
| **Ours (ByteTrack+)** | 78.9 | 79.1 | 64.2 | 27 |

### OmniE2E Indoor Dataset (Fisheye Cameras)

| Method | MOTA | IDF1 | ID Switches | Recovery Rate |
|--------|------|------|-------------|---------------|
| ByteTrack (vanilla) | 52.3 | 48.7 | 847 | - |
| ByteTrack + Fisheye IoU | 61.2 | 57.4 | 523 | - |
| + Position-aware KF | 64.8 | 61.2 | 412 | - |
| + Top-Down ReID | 68.3 | 71.8 | 298 | 67.3% |
| **Full System** | **71.2** | **74.6** | **231** | **78.9%** |

## Production Considerations

### Multi-Camera Track Handoff

For environments with multiple fisheye cameras:

```python
class MultiCameraTracker:
    def __init__(self, cameras, overlap_regions):
        self.trackers = {cam.id: FisheyeTracker(cam) for cam in cameras}
        self.overlaps = overlap_regions
        self.global_id_counter = 0
        self.local_to_global = {}  # (cam_id, local_id) -> global_id

    def handoff(self, cam_from, cam_to, track_from, track_to):
        """Transfer a global identity between cameras."""
        key_from = (cam_from, track_from.id)
        if key_from in self.local_to_global:
            global_id = self.local_to_global[key_from]
            self.local_to_global[(cam_to, track_to.id)] = global_id
        else:
            self.global_id_counter += 1
            self.local_to_global[key_from] = self.global_id_counter
            self.local_to_global[(cam_to, track_to.id)] = self.global_id_counter
```
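
The `overlaps` structure is what drives when `handoff` fires; below is a minimal sketch of one possible trigger, where `overlap_regions` maps a camera pair to its shared region in each view, and `in_region` and `ground_distance` are hypothetical geometry helpers:

```python
def check_handoffs(self, cam_from, cam_to):
    """(Method of MultiCameraTracker.) Pair tracks in the shared overlap region."""
    region_a, region_b = self.overlaps[(cam_from, cam_to)]
    candidates_a = [t for t in self.trackers[cam_from].tracks
                    if in_region(t.last_position, region_a)]
    candidates_b = [t for t in self.trackers[cam_to].tracks
                    if in_region(t.last_position, region_b)]

    # Greedy nearest-neighbor pairing in ground-plane coordinates
    for track_a in candidates_a:
        if not candidates_b:
            break
        track_b = min(candidates_b, key=lambda t: ground_distance(track_a, t))
        self.handoff(cam_from, cam_to, track_a, track_b)
        candidates_b.remove(track_b)
```
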
### Memory Management

For 24/7 operation, we implement track pruning:

```python
import time

def prune_tracks(self, max_tracks=1000, max_age_hours=24):
    """Remove old tracks to prevent unbounded memory growth."""
    current_time = time.time()
    cutoff = current_time - max_age_hours * 3600

    # Remove tracks not seen within the retention window
    self.tracks = [t for t in self.tracks if t.last_seen > cutoff]

    # If still over budget, keep the most frequently observed tracks
    if len(self.tracks) > max_tracks:
        self.tracks = sorted(self.tracks,
                             key=lambda t: t.total_hits,
                             reverse=True)[:max_tracks]
```

## Conclusion

Building a production-ready MOT system for fisheye cameras requires addressing:

1. **Geometric challenges**: Distortion-aware metrics and adaptive filtering
2. **Viewpoint shift**: Appearance models built specifically for top-down views
3. **Long-term occlusions**: State machines with appearance-based recovery
4. **Scale**: Multi-camera handoff and memory management

Our modifications to ByteTrack improved IDF1 from 48.7% to 74.6% on indoor fisheye data, with a 78.9% recovery rate for occluded tracks, which is critical for accurate occupancy analytics and behavior understanding.