Computer Vision · April 15, 2026 · 9 min read · OmniE2E Engineering

# Real-time Multi-Object Tracking: From SORT to ByteTrack and Beyond

A comprehensive analysis of modern multi-object tracking algorithms, covering association strategies, ReID integration, and occlusion handling in crowded indoor environments.


Multi-object tracking (MOT) in indoor environments presents unique challenges: frequent occlusions, similar appearances, unpredictable movements, and the need for consistent identity preservation over extended periods. This post examines the evolution of tracking algorithms and our production-tested modifications for ceiling-mounted fisheye camera deployments.

## The Tracking Problem Formulation

Given a sequence of detections across frames, MOT aims to assign consistent identity labels. Formally:

Let $\mathcal{D}_t = \{d_1^t, d_2^t, \ldots, d_n^t\}$ be the detections at frame $t$, and $\mathcal{T}_{t-1} = \{T_1, T_2, \ldots, T_m\}$ the existing tracks. The association problem is to find the optimal assignment matrix $A$ where:

$$A^* = \arg\min_{A} \sum_{i,j} C_{ij} \cdot A_{ij}$$

subject to the constraint that each detection is assigned to at most one track and vice versa.
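
To make this concrete, here is a minimal sketch of solving such an assignment with SciPy's Hungarian-algorithm solver (the cost values below are invented for illustration):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# C[i, j]: cost of assigning track i to detection j (illustrative values)
C = np.array([
    [0.1, 0.9, 0.8],
    [0.7, 0.2, 0.9],
])

track_idx, det_idx = linear_sum_assignment(C)
# track 0 -> detection 0, track 1 -> detection 1; detection 2 stays unmatched
print(list(zip(track_idx, det_idx)))
```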

## SORT: The Foundation

Simple Online and Realtime Tracking (SORT) established the tracking-by-detection paradigm:

### Kalman Filter State Model

State vector: $\mathbf{x} = [u, v, s, r, \dot{u}, \dot{v}, \dot{s}]^T$

Where:

  • $(u, v)$: bounding box center
  • $s$: scale (area)
  • $r$: aspect ratio (assumed constant)
  • $(\dot{u}, \dot{v}, \dot{s})$: the respective velocities
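
In code, this parameterization maps to and from detector boxes as follows; a small sketch (the function names are ours, not from the SORT codebase):

```python
import numpy as np

def bbox_to_state(bbox):
    """[x1, y1, x2, y2] -> the observed part of the state, [u, v, s, r]."""
    x1, y1, x2, y2 = bbox
    w, h = x2 - x1, y2 - y1
    u, v = x1 + w / 2, y1 + h / 2           # box center
    return np.array([u, v, w * h, w / h])   # s = area, r = aspect ratio

def state_to_bbox(state):
    """Invert the mapping: since s = w*h and r = w/h, w = sqrt(s*r)."""
    u, v, s, r = state[:4]
    w = np.sqrt(s * r)
    h = s / w
    return np.array([u - w / 2, v - h / 2, u + w / 2, v + h / 2])
```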

State transition:

$$\mathbf{x}_t = F \mathbf{x}_{t-1} + \mathbf{w}$$

Where $F$ is the constant-velocity motion model:

$$F = \begin{bmatrix}
1 & 0 & 0 & 0 & 1 & 0 & 0 \\
0 & 1 & 0 & 0 & 0 & 1 & 0 \\
0 & 0 & 1 & 0 & 0 & 0 & 1 \\
0 & 0 & 0 & 1 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 1 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 1 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 1
\end{bmatrix}$$
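
With a time step of one frame ($\Delta t = 1$), each position component simply accumulates its velocity. In code, prediction is just this matrix applied to the state; a minimal numpy sketch (variable names are ours, with $Q$ the covariance of the process noise $\mathbf{w}$):

```python
import numpy as np

# Constant-velocity transition matrix for x = [u, v, s, r, u', v', s']^T
F = np.eye(7)
F[0, 4] = F[1, 5] = F[2, 6] = 1.0  # position += velocity each frame

def kf_predict(x, P, Q):
    """One Kalman prediction step: propagate the state mean and covariance."""
    x_pred = F @ x
    P_pred = F @ P @ F.T + Q
    return x_pred, P_pred
```
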
### Hungarian Algorithm for Assignment

SORT uses IoU (Intersection over Union) as the cost metric:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def iou_cost_matrix(tracks, detections):
    cost = np.zeros((len(tracks), len(detections)))
    for i, track in enumerate(tracks):
        pred_bbox = track.predict()  # Kalman-predicted box for this frame
        for j, det in enumerate(detections):
            cost[i, j] = 1 - iou(pred_bbox, det.bbox)
    return cost

# Hungarian algorithm (Kuhn-Munkres)
cost_matrix = iou_cost_matrix(tracks, detections)
row_indices, col_indices = linear_sum_assignment(cost_matrix)
```

### SORT Limitations

1. **Pure motion reliance**: With no appearance features, IDs switch during occlusions
2. **Low-confidence rejection**: Potentially valid detections below the threshold are discarded
3. **Single association stage**: No recovery mechanism for missed associations

## DeepSORT: Adding Appearance

DeepSORT addresses ID switches by integrating a ReID (re-identification) network:

### Combined Cost Function

$$C = \lambda \cdot C_{appearance} + (1 - \lambda) \cdot C_{motion}$$

Appearance cost using cosine distance:

$$C_{appearance}(i, j) = 1 - \frac{\mathbf{r}_i^T \mathbf{r}_j}{||\mathbf{r}_i|| \cdot ||\mathbf{r}_j||}$$

Where $\mathbf{r}$ is the ReID embedding vector.

### Cascade Matching

DeepSORT's matching cascade prioritizes recently seen tracks:

```python
def cascade_matching(tracks, detections, max_age=30):
    unmatched_detections = list(range(len(detections)))
    matches = []

    # Match tracks in order of increasing time since their last update
    for age in range(max_age):
        if not unmatched_detections:
            break
        tracks_at_age = [t for t in tracks if t.time_since_update == age]
        if not tracks_at_age:
            continue
        matched, _, unmatched_dets = min_cost_matching(
            tracks_at_age, [detections[i] for i in unmatched_detections])
        # min_cost_matching returns indices into the sub-lists passed to it;
        # map detection indices back into the full detection list
        matches.extend((trk, unmatched_detections[det]) for trk, det in matched)
        unmatched_detections = [unmatched_detections[i] for i in unmatched_dets]

    return matches, unmatched_detections
```

## ByteTrack: Every Detection Matters

ByteTrack's key insight: low-confidence detections often correspond to occluded objects and shouldn't be discarded.

### Two-Stage Association

```python
class ByteTrack:
    def update(self, detections, confidence_threshold=0.5):
        # Split detections by confidence score
        high_conf = [d for d in detections if d.score >= confidence_threshold]
        low_conf = [d for d in detections if d.score < confidence_threshold]

        # First association: high-confidence detections
        matched, unmatched_tracks, unmatched_high = \
            self.associate(self.tracks, high_conf)

        # Second association: remaining tracks against low-confidence detections
        remaining_tracks = [self.tracks[i] for i in unmatched_tracks]
        matched_low, still_unmatched, _ = \
            self.associate(remaining_tracks, low_conf)

        # Update tracks matched in the first stage (indices into high_conf)
        for track_idx, det_idx in matched:
            self.tracks[track_idx].update(high_conf[det_idx])

        # Second-stage indices refer to remaining_tracks and low_conf
        for track_idx, det_idx in matched_low:
            remaining_tracks[track_idx].update(low_conf[det_idx])

        # Handle unmatched tracks and detections
        self._handle_unmatched(still_unmatched, unmatched_high)
```

### IoU-based Association Only

ByteTrack deliberately avoids appearance features in the association step:

```python
def associate(self, tracks, detections, iou_threshold=0.3):
    if not tracks or not detections:
        return [], list(range(len(tracks))), list(range(len(detections)))

    # Predict track positions
    predicted = [track.predict() for track in tracks]

    # Compute the IoU matrix
    iou_matrix = np.zeros((len(tracks), len(detections)))
    for i, pred in enumerate(predicted):
        for j, det in enumerate(detections):
            iou_matrix[i, j] = iou(pred, det.bbox)

    # Gate pairs below the IoU threshold with a prohibitively large cost
    cost = 1 - iou_matrix
    cost[iou_matrix < iou_threshold] = 1e5

    # Hungarian assignment
    row_ind, col_ind = linear_sum_assignment(cost)
    matches = [(r, c) for r, c in zip(row_ind, col_ind) if cost[r, c] < 1e5]

    matched_rows = {r for r, _ in matches}
    matched_cols = {c for _, c in matches}
    unmatched_tracks = [i for i in range(len(tracks)) if i not in matched_rows]
    unmatched_detections = [j for j in range(len(detections))
                            if j not in matched_cols]
    return matches, unmatched_tracks, unmatched_detections
```
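
For orientation, here is a rough sketch of how this two-stage update sits in a per-frame loop. `Detection`, `detector`, and `draw_box` are illustrative stand-ins, not part of the ByteTrack API:

```python
from dataclasses import dataclass

@dataclass
class Detection:
    bbox: tuple   # (x1, y1, x2, y2)
    score: float  # detector confidence

tracker = ByteTrack()
for frame in video_frames:            # any iterable of frames
    raw = detector(frame)             # hypothetical detector: [(bbox, score), ...]
    detections = [Detection(bbox=b, score=s) for b, s in raw]
    tracker.update(detections)
    for track in tracker.tracks:      # draw or log the surviving tracks
        draw_box(frame, track)
```
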
## Our Modifications for Fisheye Cameras

Standard tracking algorithms assume perspective cameras. Fisheye cameras introduce:

1. **Severe radial distortion**: Objects near the image edges appear stretched
2. **Non-uniform scale**: The same person appears at different sizes at different image positions
3. **Top-down viewpoint**: Standard ReID models, trained on frontal views, fail

### Distortion-Aware IoU

We compute IoU in normalized coordinates after undistortion:

```python
import cv2
import numpy as np

class FisheyeIoU:
    def __init__(self, camera_matrix, dist_coeffs):
        self.K = camera_matrix
        self.D = dist_coeffs

    def undistort_bbox(self, bbox):
        """Convert a fisheye bbox to normalized (undistorted) coordinates."""
        x1, y1, x2, y2 = bbox
        corners = np.array([[x1, y1], [x2, y1], [x2, y2], [x1, y2]],
                           dtype=np.float32)

        # Undistort the four corner points
        undistorted = cv2.fisheye.undistortPoints(
            corners.reshape(-1, 1, 2), self.K, self.D
        ).reshape(-1, 2)

        # Return the axis-aligned bbox in normalized space
        return [
            undistorted[:, 0].min(), undistorted[:, 1].min(),
            undistorted[:, 0].max(), undistorted[:, 1].max()
        ]

    def compute(self, bbox1, bbox2):
        norm1 = self.undistort_bbox(bbox1)
        norm2 = self.undistort_bbox(bbox2)
        return standard_iou(norm1, norm2)  # ordinary IoU on the normalized boxes
```

### Position-Aware Kalman Filter

We adapt the process noise based on position in the fisheye image:

```python
import numpy as np

class FisheyeKalmanFilter:
    def __init__(self, camera_model):
        self.camera = camera_model
        self.base_process_noise = 0.01

    def get_process_noise(self, position):
        """Higher noise near the image edges, where distortion is strongest."""
        u, v = position

        # Normalized distance from the principal point
        center_u, center_v = self.camera.cx, self.camera.cy
        dist = np.sqrt((u - center_u)**2 + (v - center_v)**2)
        max_dist = np.sqrt(center_u**2 + center_v**2)

        # Quadratic scaling of noise with distance
        scale = 1 + 2 * (dist / max_dist) ** 2
        return self.base_process_noise * scale
```

### Top-Down ReID Features

We trained a custom ReID network on top-down person crops:

```python
import timm
import torch.nn as nn
import torch.nn.functional as F

class TopDownReID(nn.Module):
    def __init__(self, backbone='resnet50'):
        super().__init__()
        self.backbone = timm.create_model(backbone, pretrained=True)
        self.backbone.fc = nn.Identity()  # drop the classification head

        # Additional projection head for the top-down viewpoint
        self.head = nn.Sequential(
            nn.Linear(2048, 1024),
            nn.BatchNorm1d(1024),
            nn.ReLU(),
            nn.Linear(1024, 256),
            nn.BatchNorm1d(256)
        )

    def forward(self, x):
        features = self.backbone(x)
        embeddings = self.head(features)
        return F.normalize(embeddings, p=2, dim=1)  # L2-normalized embeddings
```

Training data: 50,000 top-down person crops from ceiling cameras, with identity labels maintained across multiple camera views.
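
As a quick usage sketch, embeddings from this network can be compared with a plain dot product, since they are L2-normalized (the crop size and the 0.7 threshold here are illustrative):

```python
import torch

model = TopDownReID().eval()
with torch.no_grad():
    crops = torch.randn(4, 3, 256, 128)  # stand-ins for resized person crops
    emb = model(crops)                   # (4, 256), L2-normalized
    sim = emb @ emb.T                    # pairwise cosine similarities
    same_person = sim > 0.7              # illustrative decision threshold
```
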
## Handling Long-Term Occlusions

Indoor environments feature prolonged occlusions (furniture, pillars). We implemented a track recovery system:

### Track State Machine

```python
from enum import Enum
from collections import deque

class TrackState(Enum):
    TENTATIVE = 1  # New track, needs confirmation
    CONFIRMED = 2  # Actively tracked
    OCCLUDED = 3   # Temporarily lost
    DELETED = 4    # Permanently removed

class EnhancedTrack:
    def __init__(self, detection, track_id):
        self.id = track_id
        self.state = TrackState.TENTATIVE
        self.hits = 1
        self.misses = 0
        self.max_occlusion_time = 90  # 3 seconds at 30 fps

        # Store appearance history for recovery
        self.appearance_gallery = deque(maxlen=30)
        self.last_position = detection.bbox

    def update_state(self, matched):
        if matched:
            self.hits += 1
            self.misses = 0
            if self.state == TrackState.TENTATIVE and self.hits >= 3:
                self.state = TrackState.CONFIRMED
            elif self.state == TrackState.OCCLUDED:
                self.state = TrackState.CONFIRMED
        else:
            self.misses += 1
            if self.state == TrackState.CONFIRMED and self.misses >= 3:
                self.state = TrackState.OCCLUDED
            elif self.state == TrackState.OCCLUDED and \
                    self.misses >= self.max_occlusion_time:
                self.state = TrackState.DELETED
```

### Appearance-Based Recovery

When an occluded track's predicted position doesn't match any detection, we search using appearance:

```python
import torch
import torch.nn.functional as F

def recover_occluded_tracks(occluded_tracks, new_detections, reid_model):
    """Attempt to recover occluded tracks using appearance matching."""
    recoveries = []
    for track in occluded_tracks:
        if not track.appearance_gallery:
            continue

        # Gallery embedding: mean of the last N stored embeddings
        gallery_emb = torch.stack(list(track.appearance_gallery)).mean(dim=0)

        best_match = None
        best_score = 0.7  # minimum similarity threshold
        for det in new_detections:
            det_emb = reid_model(det.crop)
            similarity = F.cosine_similarity(gallery_emb, det_emb, dim=0)
            if similarity > best_score:
                # Additional spatial plausibility check
                if is_spatially_plausible(track.last_position, det.bbox,
                                          track.misses):
                    best_score = similarity
                    best_match = det

        if best_match:
            recoveries.append((track, best_match))
    return recoveries
```

## Benchmark Results

### MOT17 Challenge (Standard Perspective Cameras)

| Method | MOTA | IDF1 | HOTA | FPS |
|--------|------|------|------|-----|
| SORT | 43.1 | 39.8 | 34.0 | 142 |
| DeepSORT | 61.4 | 62.2 | 45.6 | 17 |
| ByteTrack | 80.3 | 77.3 | 63.1 | 29 |
| **Ours (ByteTrack+)** | 78.9 | 79.1 | 64.2 | 27 |

### OmniE2E Indoor Dataset (Fisheye Cameras)

| Method | MOTA | IDF1 | ID Switches | Recovery Rate |
|--------|------|------|-------------|---------------|
| ByteTrack (vanilla) | 52.3 | 48.7 | 847 | - |
| ByteTrack + Fisheye IoU | 61.2 | 57.4 | 523 | - |
| + Position-aware KF | 64.8 | 61.2 | 412 | - |
| + Top-Down ReID | 68.3 | 71.8 | 298 | 67.3% |
| **Full System** | **71.2** | **74.6** | **231** | **78.9%** |

## Production Considerations

### Multi-Camera Track Handoff

For environments with multiple fisheye cameras:

```python
class MultiCameraTracker:
    def __init__(self, cameras, overlap_regions):
        self.trackers = {cam.id: FisheyeTracker(cam) for cam in cameras}
        self.overlaps = overlap_regions
        self.global_id_counter = 0
        self.local_to_global = {}  # (cam_id, local_id) -> global_id

    def handoff(self, cam_from, cam_to, track_from, track_to):
        """Transfer a global identity between cameras."""
        key_from = (cam_from, track_from.id)
        if key_from in self.local_to_global:
            global_id = self.local_to_global[key_from]
            self.local_to_global[(cam_to, track_to.id)] = global_id
        else:
            self.global_id_counter += 1
            self.local_to_global[key_from] = self.global_id_counter
            self.local_to_global[(cam_to, track_to.id)] = self.global_id_counter
```
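
The `overlaps` structure is what drives when `handoff` fires; below is a minimal sketch of one possible trigger, where `overlap_regions` maps a camera pair to its shared region in each view, and `in_region` and `ground_distance` are hypothetical geometry helpers:

```python
def check_handoffs(self, cam_from, cam_to):
    """(Method of MultiCameraTracker.) Pair tracks in the shared overlap region."""
    region_a, region_b = self.overlaps[(cam_from, cam_to)]
    candidates_a = [t for t in self.trackers[cam_from].tracks
                    if in_region(t.last_position, region_a)]
    candidates_b = [t for t in self.trackers[cam_to].tracks
                    if in_region(t.last_position, region_b)]

    # Greedy nearest-neighbor pairing in ground-plane coordinates
    for track_a in candidates_a:
        if not candidates_b:
            break
        track_b = min(candidates_b, key=lambda t: ground_distance(track_a, t))
        self.handoff(cam_from, cam_to, track_a, track_b)
        candidates_b.remove(track_b)
```
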
### Memory Management

For 24/7 operation, we implement track pruning:

```python
import time

def prune_tracks(self, max_tracks=1000, max_age_hours=24):
    """Remove old tracks to prevent unbounded memory growth."""
    current_time = time.time()
    cutoff = current_time - max_age_hours * 3600

    # Remove tracks not seen within the retention window
    self.tracks = [t for t in self.tracks if t.last_seen > cutoff]

    # If still over budget, keep the most frequently observed tracks
    if len(self.tracks) > max_tracks:
        self.tracks = sorted(self.tracks,
                             key=lambda t: t.total_hits,
                             reverse=True)[:max_tracks]
```

## Conclusion

Building a production-ready MOT system for fisheye cameras requires addressing:

1. **Geometric challenges**: Distortion-aware metrics and adaptive filtering
2. **Viewpoint shift**: Appearance models built specifically for top-down views
3. **Long-term occlusions**: State machines with appearance-based recovery
4. **Scale**: Multi-camera handoff and memory management

Our modifications to ByteTrack improved IDF1 from 48.7% to 74.6% on indoor fisheye data, with a 78.9% recovery rate for occluded tracks, which is critical for accurate occupancy analytics and behavior understanding.