Computer Vision · April 15, 2026 · 12 min read · OmniE2E Engineering Team
# Real-Time Multi-Object Tracking: From SORT to ByteTrack and Beyond

A comprehensive analysis of modern multi-object tracking algorithms, covering association strategies, ReID integration, and occlusion handling in crowded indoor environments.
Multi-object tracking (MOT) in indoor environments faces a distinctive set of challenges: frequent occlusion, similar appearances, unpredictable motion, and the need to maintain consistent identities over long periods. This article traces the evolution of tracking algorithms and presents the production-proven improvements we made for our ceiling-mounted fisheye camera deployments.
## Formalizing the Tracking Problem
Given a sequence of detections across frames, MOT aims to assign consistent identity labels. Formally:

Let $D_t = \{d_t^1, \dots, d_t^N\}$ be the detections at frame $t$ and $T = \{\tau_1, \dots, \tau_M\}$ the set of existing tracks. The association problem is to find the optimal assignment matrix $A \in \{0, 1\}^{M \times N}$:

$$A^* = \arg\min_{A} \sum_{i=1}^{M} \sum_{j=1}^{N} C_{ij} A_{ij}$$

subject to the constraint that each detection maps to at most one track, and vice versa: $\sum_i A_{ij} \le 1$ and $\sum_j A_{ij} \le 1$.
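This is a linear assignment problem and can be solved optimally in polynomial time. A minimal sketch using SciPy's Hungarian-algorithm solver, with an illustrative (made-up) cost matrix:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Illustrative cost matrix: rows = tracks, columns = detections,
# entry (i, j) = cost of assigning detection j to track i (e.g., 1 - IoU)
cost = np.array([
    [0.1, 0.9, 0.8],
    [0.7, 0.2, 0.9],
])

# Hungarian algorithm: minimum-cost one-to-one assignment
row_ind, col_ind = linear_sum_assignment(cost)

assignments = list(zip(row_ind, col_ind))
# assignments -> [(0, 0), (1, 1)]: track 0 takes detection 0, track 1 takes detection 1;
# detection 2 is left unmatched because there are only two tracks
```

Unbalanced matrices (more detections than tracks, or the reverse) are handled naturally: the solver matches `min(M, N)` pairs and the leftovers become new-track candidates or missed tracks.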
## SORT: The Foundational Framework

Simple Online and Realtime Tracking (SORT) established the tracking-by-detection paradigm:

### Kalman Filter State Model
The state vector is:

$$\mathbf{x} = [u, v, s, r, \dot{u}, \dot{v}, \dot{s}]^T$$

where:

- $u, v$: bounding-box center
- $s$: scale (area)
- $r$: aspect ratio (assumed constant)
- $\dot{u}, \dot{v}, \dot{s}$: the corresponding velocities
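In code, this constant-velocity model reduces the predict step to a single matrix multiply per frame. A minimal numpy sketch (the noise covariances here are illustrative placeholders, not SORT's tuned values):

```python
import numpy as np

# Constant-velocity transition for x = [u, v, s, r, u_dot, v_dot, s_dot]
F = np.eye(7)
F[0, 4] = F[1, 5] = F[2, 6] = 1.0  # position/scale advance by their velocities

def predict(x, P, Q):
    """Standard Kalman predict: propagate state mean and covariance."""
    x_pred = F @ x
    P_pred = F @ P @ F.T + Q
    return x_pred, P_pred

# A box centered at (100, 50), area 400, moving +2 px/frame horizontally
x = np.array([100.0, 50.0, 400.0, 0.5, 2.0, 0.0, 0.0])
x_pred, _ = predict(x, np.eye(7), np.eye(7) * 0.01)
# x_pred[:2] -> [102., 50.]; aspect ratio r stays 0.5 (no velocity term)
```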
The state transition is:

$$\mathbf{x}_{t+1} = F \mathbf{x}_t$$

where $F$ is the constant-velocity motion model:

$$F = \begin{bmatrix} 1 & 0 & 0 & 0 & 1 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 & 1 & 0 \\ 0 & 0 & 1 & 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 & 1 \end{bmatrix}$$

### Hungarian Algorithm Assignment

SORT uses IoU (intersection over union) as the cost metric:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def iou_cost_matrix(tracks, detections):
    """Association cost from IoU between predicted boxes and detections."""
    cost = np.zeros((len(tracks), len(detections)))
    for i, track in enumerate(tracks):
        pred_bbox = track.predict()
        for j, det in enumerate(detections):
            cost[i, j] = 1 - iou(pred_bbox, det.bbox)
    return cost

# Hungarian algorithm (Kuhn-Munkres)
cost_matrix = iou_cost_matrix(tracks, detections)
row_indices, col_indices = linear_sum_assignment(cost_matrix)
```

### Limitations of SORT

1. **Motion-only matching**: with no appearance features, occlusions cause ID switches
2. **Low-confidence rejection**: detections below the threshold are discarded even when valid
3. **Single-stage association**: no recovery mechanism for unmatched tracks

## DeepSORT: Introducing Appearance Features

DeepSORT addresses ID switches by integrating a ReID (re-identification) network:

### Combined Cost Function

$$C = \lambda \cdot C_{appearance} + (1 - \lambda) \cdot C_{motion}$$

The appearance cost uses cosine distance:

$$C_{appearance}(i, j) = 1 - \frac{\mathbf{r}_i^T \mathbf{r}_j}{||\mathbf{r}_i|| \cdot ||\mathbf{r}_j||}$$

where $\mathbf{r}$ is the ReID embedding vector.

### Matching Cascade

DeepSORT's cascade prioritizes recently seen tracks:

```python
def cascade_matching(tracks, detections, max_age=30):
    """Match tracks in order of increasing time-since-update."""
    unmatched_detections = list(range(len(detections)))
    matches = []

    # Match by increasing age so recently updated tracks get priority
    for age in range(max_age):
        if not unmatched_detections:
            break
        tracks_at_age = [t for t in tracks if t.time_since_update == age]
        if tracks_at_age:
            matched, unmatched_trks, unmatched_dets = \
                min_cost_matching(tracks_at_age,
                                  [detections[i] for i in unmatched_detections])
            # Map detection indices from the subset back to the full list
            matches.extend((trk, unmatched_detections[det])
                           for trk, det in matched)
            unmatched_detections = [unmatched_detections[i]
                                    for i in unmatched_dets]

    return matches, unmatched_detections
```

## ByteTrack: Every Detection Matters

ByteTrack's key insight: low-confidence detections often correspond to occluded objects and should not be thrown away.

### Two-Stage Association

```python
class ByteTrack:
    def update(self, detections, confidence_threshold=0.5):
        # Split detections by confidence score
        high_conf = [d for d in detections if d.score >= confidence_threshold]
        low_conf = [d for d in detections if d.score < confidence_threshold]

        # First stage: associate tracks with high-confidence detections
        matched, unmatched_tracks, unmatched_high = \
            self.associate(self.tracks, high_conf)

        # Second stage: remaining tracks vs. low-confidence detections
        remaining_tracks = [self.tracks[i] for i in unmatched_tracks]
        matched_low, still_unmatched, _ = \
            self.associate(remaining_tracks, low_conf)

        # Update matched tracks (indices refer to each stage's own lists)
        for track_idx, det_idx in matched:
            self.tracks[track_idx].update(high_conf[det_idx])
        for track_idx, det_idx in matched_low:
            remaining_tracks[track_idx].update(low_conf[det_idx])

        # Handle tracks and detections that remain unmatched
        self._handle_unmatched(still_unmatched, unmatched_high)
```

### IoU-Only Association

ByteTrack deliberately avoids appearance features in the association step:

```python
def associate(self, tracks, detections, iou_threshold=0.3):
    if not tracks or not detections:
        return [], list(range(len(tracks))), list(range(len(detections)))

    # Predict track positions
    predicted = [track.predict() for track in tracks]

    # Compute the IoU matrix
    iou_matrix = np.zeros((len(tracks), len(detections)))
    for i, pred in enumerate(predicted):
        for j, det in enumerate(detections):
            iou_matrix[i, j] = iou(pred, det.bbox)

    # Mask out pairs below the IoU threshold
    cost = 1 - iou_matrix
    cost[iou_matrix < iou_threshold] = 1e5

    # Hungarian assignment
    row_ind, col_ind = linear_sum_assignment(cost)
    matches = [(r, c) for r, c in zip(row_ind, col_ind) if cost[r, c] < 1e5]

    matched_rows = {r for r, _ in matches}
    matched_cols = {c for _, c in matches}
    unmatched_tracks = [i for i in range(len(tracks)) if i not in matched_rows]
    unmatched_detections = [j for j in range(len(detections))
                            if j not in matched_cols]
    return matches, unmatched_tracks, unmatched_detections
```

## Adapting to Fisheye Cameras

The standard algorithms assume a perspective camera. Fisheye lenses introduce:

1. **Severe radial distortion**: objects near the image edge appear stretched
2. **Non-uniform scale**: the same person appears at different sizes at different positions
3.
**Top-down viewpoint**: standard ReID models break down

### Distortion-Aware IoU

We compute IoU in undistorted, normalized coordinates:

```python
import cv2
import numpy as np

class FisheyeIoU:
    def __init__(self, camera_matrix, dist_coeffs):
        self.K = camera_matrix
        self.D = dist_coeffs

    def undistort_bbox(self, bbox):
        """Convert a fisheye bounding box to normalized coordinates."""
        x1, y1, x2, y2 = bbox
        corners = np.array([[x1, y1], [x2, y1], [x2, y2], [x1, y2]],
                           dtype=np.float32)

        # Undistort the corner points
        undistorted = cv2.fisheye.undistortPoints(
            corners.reshape(-1, 1, 2), self.K, self.D
        ).reshape(-1, 2)

        # Return the axis-aligned box in normalized space
        return [
            undistorted[:, 0].min(), undistorted[:, 1].min(),
            undistorted[:, 0].max(), undistorted[:, 1].max()
        ]

    def compute(self, bbox1, bbox2):
        norm1 = self.undistort_bbox(bbox1)
        norm2 = self.undistort_bbox(bbox2)
        return standard_iou(norm1, norm2)
```

### Position-Aware Kalman Filtering

We scale the process noise according to position in the fisheye image:

```python
import numpy as np

class FisheyeKalmanFilter:
    def __init__(self, camera_model):
        self.camera = camera_model
        self.base_process_noise = 0.01

    def get_process_noise(self, position):
        """Higher noise near the edge, where distortion is worse."""
        u, v = position
        # Distance from the image center (normalized)
        center_u, center_v = self.camera.cx, self.camera.cy
        dist = np.sqrt((u - center_u)**2 + (v - center_v)**2)
        max_dist = np.sqrt(center_u**2 + center_v**2)

        # Noise grows quadratically with distance from center
        scale = 1 + 2 * (dist / max_dist) ** 2
        return self.base_process_noise * scale
```

### Top-Down ReID Features

We trained a custom ReID network on top-down person crops:

```python
import timm
import torch.nn as nn
import torch.nn.functional as F

class TopDownReID(nn.Module):
    def __init__(self, backbone='resnet50'):
        super().__init__()
        self.backbone = timm.create_model(backbone, pretrained=True)
        self.backbone.fc = nn.Identity()

        # Extra head for the top-down viewpoint
        self.head = nn.Sequential(
            nn.Linear(2048, 1024),
            nn.BatchNorm1d(1024),
            nn.ReLU(),
            nn.Linear(1024, 256),
            nn.BatchNorm1d(256)
        )

    def forward(self, x):
        features = self.backbone(x)
        embeddings = self.head(features)
        return F.normalize(embeddings, p=2, dim=1)
```

Training data: 50,000 top-down person crops from ceiling-mounted cameras, with identity labels spanning multiple camera views.

## Handling Long-Term Occlusion

Indoor environments produce prolonged occlusions (furniture, pillars). We implemented a track-recovery system:

### Track State Machine

```python
from collections import deque
from enum import Enum

class TrackState(Enum):
    TENTATIVE = 1   # New track, awaiting confirmation
    CONFIRMED = 2   # Actively tracked
    OCCLUDED = 3    # Temporarily lost
    DELETED = 4     # Permanently removed

class EnhancedTrack:
    def __init__(self, detection, track_id):
        self.id = track_id
        self.state = TrackState.TENTATIVE
        self.hits = 1
        self.misses = 0
        self.max_occlusion_time = 90  # 3 seconds at 30 fps

        # Store appearance history for recovery
        self.appearance_gallery = deque(maxlen=30)
        self.last_position = detection.bbox

    def update_state(self, matched):
        if matched:
            self.hits += 1
            self.misses = 0
            if self.state == TrackState.TENTATIVE and self.hits >= 3:
                self.state = TrackState.CONFIRMED
            elif self.state == TrackState.OCCLUDED:
                self.state = TrackState.CONFIRMED
        else:
            self.misses += 1
            if self.state == TrackState.CONFIRMED and self.misses >= 3:
                self.state = TrackState.OCCLUDED
            elif self.state == TrackState.OCCLUDED and \
                    self.misses >= self.max_occlusion_time:
                self.state = TrackState.DELETED
```

### Appearance-Based Recovery

When an occluded track's predicted position matches no detection, we fall back to searching by appearance:

```python
import torch
import torch.nn.functional as F

def recover_occluded_tracks(occluded_tracks, new_detections, reid_model):
    """Attempt to recover occluded tracks via appearance matching."""
    recoveries = []

    for track in occluded_tracks:
        if not track.appearance_gallery:
            continue

        # Gallery embedding: mean of the last N stored appearances
        gallery_emb = torch.stack(list(track.appearance_gallery)).mean(dim=0)

        best_match = None
        best_score = 0.7  # Minimum similarity threshold

        for det in new_detections:
            det_emb = reid_model(det.crop)
            similarity = F.cosine_similarity(gallery_emb, det_emb, dim=0)

            if similarity > best_score:
                # Additional spatial-plausibility check
                if is_spatially_plausible(track.last_position,
                                          det.bbox, track.misses):
                    best_score = similarity
                    best_match = det

        if best_match:
            recoveries.append((track, best_match))

    return recoveries
```

## Benchmark Results

### MOT17 Challenge (standard perspective cameras)

| Method | MOTA | IDF1 | HOTA | FPS |
|----|------|------|------|-----|
| SORT | 43.1 | 39.8 | 34.0 | 142 |
| DeepSORT | 61.4 | 62.2 | 45.6 | 17 |
| ByteTrack | 80.3 | 77.3 | 63.1 | 29 |
| **Ours (ByteTrack+)** | 78.9 | 79.1 | 64.2 | 27 |

### OmniE2E Indoor Dataset (fisheye cameras)

| Method | MOTA | IDF1 | ID Switches | Recovery Rate |
|----|------|------|-------|-------|
| ByteTrack (vanilla) | 52.3 | 48.7 | 847 | - |
| ByteTrack + fisheye IoU | 61.2 | 57.4 | 523 | - |
| + position-aware KF | 64.8 | 61.2 | 412 | - |
| + top-down ReID | 68.3 | 71.8 | 298 | 67.3% |
| **Full system** | **71.2** | **74.6** | **231** | **78.9%** |

## Production Considerations

### Multi-Camera Track Handoff

For spaces covered by multiple fisheye cameras:

```python
class MultiCameraTracker:
    def __init__(self, cameras, overlap_regions):
        self.trackers = {cam.id: FisheyeTracker(cam) for cam in cameras}
        self.overlaps = overlap_regions
        self.global_id_counter = 0
        self.local_to_global = {}  # (cam_id, local_id) -> global_id

    def handoff(self, cam_from, cam_to, track_from, track_to):
        """Transfer an identity between cameras."""
        key_from = (cam_from, track_from.id)
        if key_from in self.local_to_global:
            global_id = self.local_to_global[key_from]
            self.local_to_global[(cam_to, track_to.id)] = global_id
        else:
            self.global_id_counter += 1
            self.local_to_global[key_from] = self.global_id_counter
            self.local_to_global[(cam_to, track_to.id)] = self.global_id_counter
```

### Memory Management

For 24/7 operation, implement track pruning:

```python
import time

def prune_tracks(self, max_tracks=1000, max_age_hours=24):
    """Remove stale tracks to bound memory growth."""
    current_time = time.time()
    cutoff = current_time - max_age_hours * 3600

    # Drop by age
    self.tracks = [t for t in self.tracks if t.last_seen > cutoff]

    # Drop by count if still over budget, keeping the most-observed tracks
    if len(self.tracks) > max_tracks:
        self.tracks = sorted(self.tracks,
                             key=lambda t: t.total_hits,
                             reverse=True)[:max_tracks]
```

## Conclusion

Building a production-ready MOT system for fisheye cameras required addressing:

1. **Geometric challenges**: distortion-aware metrics and adaptive filtering
2. **Viewpoint shift**: top-down-specific appearance models
3. **Long-term occlusion**: a state machine with appearance-based recovery
4. **Scale**: multi-camera handoff and memory management

Our modifications to ByteTrack lifted IDF1 on indoor fisheye data from 48.7% to 74.6%, with a 78.9% recovery rate for occluded tracks, which is critical for accurate occupancy analytics and behavior understanding.