This document includes answers and information relating to common questions and topics regarding multi-object tracking. For more general Machine Learning questions, such as "How many training examples do I need?" or "How to monitor GPU usage during training?", see also the image classification FAQ.
- Data
- Training and Inference
- Evaluation
- State-of-the-Art (SoTA) Technology
  - Popular MOT datasets
  - What is the architecture of the FairMOT tracking algorithm?
  - What object detectors are used in tracking-by-detection trackers?
  - What feature extraction techniques are used in tracking-by-detection trackers?
  - What affinity and association techniques are used in tracking-by-detection trackers?
  - What is the difference between online and offline (batch) tracking algorithms?
- Popular Publications and Datasets
For training we use the exact same annotation format as for object detection (see this FAQ). This also means that we train from individual frames, without taking the temporal ordering of these frames into account.
For evaluation, we follow the py-motmetrics repository, which requires the ground-truth data to be in the MOT Challenge format. For ground-truth annotation, the last three columns can be set to -1 by default:
[frame number] [id number] [bbox left] [bbox top] [bbox width] [bbox height] [confidence score] [class] [visibility]
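For illustration, a ground-truth file in this format contains one comma-separated line per box; the values below are made up for this example:

```
1, 1, 794.3, 247.6, 71.2, 174.9, -1, -1, -1
1, 2, 164.1, 449.6, 63.2, 158.8, -1, -1, -1
2, 1, 797.8, 243.9, 71.2, 174.9, -1, -1, -1
```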
See below an example where we use VOTT to annotate the two cans in the image as `can_1` and `can_2`, where `can_1` refers to the white/yellow can and `can_2` refers to the red can. Before annotating, it is important to correctly set the frame extraction rate to match that of the video. After annotation, you can export the annotation results into several formats, such as PASCAL VOC or CSV. For the CSV format, VOTT returns the extracted frames as well as a csv file containing the bounding box and id info:
[image] [x_min] [y_min] [x_max] [y_max] [label]
Under the hood (not exposed to the user), the FairMOT repository uses the annotation format described in the Towards-Realtime-MOT repository for training, where each line describes a bounding box as follows:
[class] [identity] [x_center] [y_center] [width] [height]
The `class` field is set to 0 for all boxes, as only single-class multi-object tracking (e.g. cans) is currently supported by the FairMOT repo. The `identity` field is an integer from 0 to num_identities - 1 which maps each object identity (e.g. coke can, coffee can, etc.) to an integer. The values of [x_center] [y_center] [width] [height] are normalized by the width/height of the image and range from 0 to 1.
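For illustration only, a small helper along these lines could convert a VOTT-style CSV row into the FairMOT training format; the function name, column names, and identity lookup below are hypothetical, not part of either repository:

```python
# Hypothetical conversion from a VOTT-style CSV row to the FairMOT training format.
# Column names and the identity lookup are illustrative assumptions, not repo APIs.
def vott_row_to_fairmot(x_min, y_min, x_max, y_max, label, img_w, img_h, id_map):
    identity = id_map[label]                    # e.g. {"can_1": 0, "can_2": 1}
    x_center = (x_min + x_max) / 2.0 / img_w    # normalize to [0, 1]
    y_center = (y_min + y_max) / 2.0 / img_h
    width = (x_max - x_min) / img_w
    height = (y_max - y_min) / img_h
    # class is always 0 because only single-class tracking is supported
    return f"0 {identity} {x_center:.6f} {y_center:.6f} {width:.6f} {height:.6f}"

print(vott_row_to_fairmot(10, 20, 110, 220, "can_1", 640, 480, {"can_1": 0, "can_2": 1}))
```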
Losses generated by FairMOT include detection-specific losses (e.g. hm_loss, wh_loss, off_loss) and id-specific losses (id_loss). The overall loss (loss) is a weighted average of the detection-specific and id-specific losses, see the FairMOT paper.
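FairMOT balances the two branches with learnable task-uncertainty weights (see the paper). The snippet below is a rough sketch of that idea with illustrative variable names and initial values; it is not the repository's code:

```python
import torch

# Sketch of FairMOT-style loss balancing with learnable task uncertainties.
# w_det and w_id are log-variance-style parameters; values here are illustrative.
w_det = torch.nn.Parameter(torch.tensor(-1.85))
w_id = torch.nn.Parameter(torch.tensor(-1.05))

def total_loss(det_loss, id_loss):
    # det_loss itself aggregates hm_loss, wh_loss and off_loss
    return 0.5 * (torch.exp(-w_det) * det_loss +
                  torch.exp(-w_id) * id_loss +
                  w_det + w_id)
```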
The main training/inference parameters include:
- input_w and input_h: image resolution of the dataset video frames;
- conf_thres, nms_thres, min_box_area: thresholds used to filter out detections that do not meet the required confidence level, NMS overlap level, or minimum box size;
- track_buffer: if a lost track is not matched within the number of frames set by this threshold, it is deleted, i.e. the id is not reused.
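For example, a tracking run might be configured roughly as follows; the values below are illustrative only, not recommendations from the FairMOT authors:

```python
# Illustrative FairMOT-style parameter settings (values are examples only).
tracker_opts = {
    "input_w": 1088,        # frame width fed to the network
    "input_h": 608,         # frame height fed to the network
    "conf_thres": 0.4,      # discard detections below this confidence
    "nms_thres": 0.4,       # IOU threshold for non-maximum suppression
    "min_box_area": 100,    # discard boxes smaller than this (in pixels)
    "track_buffer": 30,     # frames to keep an unmatched track before deleting it
}
```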
The MOT Challenge website hosts the most common benchmarking datasets for pedestrian MOT. Different datasets exist: MOT15, MOT16/17, MOT19/20. These datasets contain many video sequences with different tracking difficulty levels and annotated ground-truth. Detections are also provided for optional use by the participating tracking algorithms.
As multi-object tracking is a complex CV task, many different metrics exist to evaluate tracking performance. Based on how they are computed, metrics can be event-based (CLEAR MOT metrics) or id-based. The main metrics used to gauge performance in the MOT benchmarking challenge include MOTA, IDF1, and ID-switch.
- MOTA (Multiple Object Tracking Accuracy) gauges overall accuracy performance using an event-based computation of how often mismatches occur between the tracking results and the ground-truth. MOTA combines the counts of FP (false positives), FN (false negatives), and id-switches (IDSW), normalized over the total number of ground-truth (GT) detections (see the formulas after this list).
- IDF1 measures overall performance with id-based computation of how long the tracker correctly identifies the target. It is the harmonic mean of identification precision (IDP) and recall (IDR).
- ID-switch measures how often the tracker incorrectly changes the ID of a trajectory. This is illustrated in the following figure: in the left box, person A and person B overlap and are not detected and tracked in frames 4-5. This results in an id-switch in frame 6, where person A is attributed ID_2, which was previously assigned to person B. In the example in the right box, the tracker loses track of person A (initially identified as ID_1) after frame 3 and eventually re-identifies that person with a new ID (ID_2) in frame n, showing another instance of id-switch.
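For reference, the standard definitions of MOTA and IDF1 (as computed by py-motmetrics) can be written as:

```latex
\mathrm{MOTA} = 1 - \frac{\sum_t \left(\mathrm{FN}_t + \mathrm{FP}_t + \mathrm{IDSW}_t\right)}{\sum_t \mathrm{GT}_t}
\qquad
\mathrm{IDF1} = \frac{2\,\mathrm{IDTP}}{2\,\mathrm{IDTP} + \mathrm{IDFP} + \mathrm{IDFN}}
```

where GT_t is the number of ground-truth objects in frame t, and IDTP, IDFP, IDFN are the identity-level true positive, false positive, and false negative counts.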
The FairMOT tracker consists of a single encoder-decoder neural network that extracts high-resolution feature maps of the image frame. As a one-shot tracker, it feeds these into two parallel heads for predicting bounding boxes and re-id features respectively, see source:
Source: Zhang, 2020
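As a schematic sketch only (not the actual FairMOT code), the one-shot design can be pictured as a shared backbone feeding a detection head and a re-id embedding head; layer sizes and names below are illustrative:

```python
import torch.nn as nn

# Schematic FairMOT-style one-shot tracker: a shared backbone feeds a detection
# head and a re-id embedding head. Layer sizes here are illustrative only.
class OneShotTracker(nn.Module):
    def __init__(self, backbone_channels=64, emb_dim=128, num_classes=1):
        super().__init__()
        # Placeholder backbone; FairMOT uses an encoder-decoder network.
        self.backbone = nn.Sequential(
            nn.Conv2d(3, backbone_channels, 3, padding=1), nn.ReLU(),
            nn.Conv2d(backbone_channels, backbone_channels, 3, padding=1), nn.ReLU(),
        )
        # Detection head: center heatmap, box size and center offset per location.
        self.heatmap = nn.Conv2d(backbone_channels, num_classes, 1)
        self.box_size = nn.Conv2d(backbone_channels, 2, 1)
        self.offset = nn.Conv2d(backbone_channels, 2, 1)
        # Re-id head: an embedding vector per location, used for association.
        self.embedding = nn.Conv2d(backbone_channels, emb_dim, 1)

    def forward(self, x):
        feats = self.backbone(x)
        return {
            "hm": self.heatmap(feats).sigmoid(),
            "wh": self.box_size(feats),
            "off": self.offset(feats),
            "id": self.embedding(feats),
        }
```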
The most popular object detectors used by SoTA tracking algorithms include Faster R-CNN, SSD and YOLOv3. Please see our object detection FAQ page for more details.
While older algorithms used local features, such as optical flow, or regional features (e.g. color histograms, gradient-based features or covariance matrices), newer algorithms rely on deep-learning-based feature representations. The most common deep-learning approaches, typically trained on re-id datasets such as the MARS dataset, use classical CNNs to extract visual features. The following figure is an example of a CNN used for MOT by the DeepSORT tracker:
Newer deep-learning approaches include Siamese CNN networks, LSTM networks, and CNNs combined with correlation filters. In a Siamese CNN network, a pair of identical CNNs is used to measure the similarity between two objects, and the CNNs are trained with loss functions that learn the features that best differentiate them.
Source: (Simon-Serra et al, 2015)
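As a minimal sketch of this idea (illustrative only, not the network from the cited paper), a Siamese embedder applies the same weights to both crops and measures the distance between their embeddings; such a network would typically be trained with a contrastive or triplet loss:

```python
import torch.nn as nn
import torch.nn.functional as F

# Minimal Siamese CNN sketch: the same embedding network is applied to both crops,
# and similarity is the distance between the two embeddings. Sizes are illustrative.
class SiameseEmbedder(nn.Module):
    def __init__(self, emb_dim=128):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, emb_dim),
        )

    def forward(self, crop_a, crop_b):
        emb_a = F.normalize(self.cnn(crop_a), dim=1)   # shared weights for both inputs
        emb_b = F.normalize(self.cnn(crop_b), dim=1)
        return (emb_a - emb_b).pow(2).sum(dim=1)       # squared Euclidean distance
```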
In an LSTM network, extracted features from different detections in different time frames are used as inputs. The network predicts the bounding box for the next frame based on the input history.
Source: Ciaparrone, 2019
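A minimal sketch of this setup (shapes and names are illustrative, not taken from the cited survey): an LSTM consumes a history of per-frame box features and regresses the box for the next frame:

```python
import torch
import torch.nn as nn

# Illustrative LSTM-based motion model: given a history of box features
# (e.g. [x, y, w, h] per frame), predict the box in the next frame.
class BoxLSTM(nn.Module):
    def __init__(self, feat_dim=4, hidden_dim=64):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, 4)    # next [x, y, w, h]

    def forward(self, box_history):             # shape: (batch, time, feat_dim)
        out, _ = self.lstm(box_history)
        return self.head(out[:, -1])            # prediction from the last time step

pred = BoxLSTM()(torch.rand(2, 10, 4))          # 2 tracks, 10-frame history each
```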
Correlation filters can also be convolved with feature maps from a CNN to generate a prediction of the target's location in the next time frame. This was done by Ma et al as follows:
Simple approaches use similarity/affinity scores, calculated from distance measures over features extracted by the CNN, to optimally match object detections/tracklets with established object tracks across successive frames. To do this matching, the Hungarian (Kuhn-Munkres) algorithm is often used for online data association, while K-partite graph global optimization techniques are used for offline data association.
In more complex deep-learning approaches, the affinity computation is often merged with feature extraction. For instance, Siamese CNNs and Siamese LSTMs directly output the affinity score.
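As an illustrative sketch of the simple case (not taken from any particular tracker), detections can be assigned to existing tracks by solving a linear assignment problem over an embedding-distance cost matrix; the threshold and random embeddings below are placeholders:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment  # Hungarian (Kuhn-Munkres) solver

# Illustrative association step: rows are existing tracks, columns are new detections.
# Entries are cosine distances between their L2-normalized re-id embeddings.
track_embs = np.random.rand(3, 128)
det_embs = np.random.rand(4, 128)
track_embs /= np.linalg.norm(track_embs, axis=1, keepdims=True)
det_embs /= np.linalg.norm(det_embs, axis=1, keepdims=True)

cost = 1.0 - track_embs @ det_embs.T              # cosine distance matrix
row_idx, col_idx = linear_sum_assignment(cost)    # optimal track-detection matching

max_dist = 0.4                                    # illustrative gating threshold
matches = [(t, d) for t, d in zip(row_idx, col_idx) if cost[t, d] < max_dist]
print(matches)                                    # unmatched detections start new tracks
```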
Online and offline algorithms differ in their data association step. In online tracking, the detections in a new frame are associated with tracks generated from previous frames; thus, existing tracks are extended or new tracks are created. In offline (batch) tracking, all observations in a batch of frames are considered globally (see figure below), i.e. they are linked together into tracks by obtaining a globally optimal solution. Offline tracking can perform better on tracking issues such as long-term occlusion or similar targets that are spatially close. However, offline tracking tends to be slower and hence is not suitable for tasks that require real-time processing, such as autonomous driving.
Popular MOT datasets are summarized below:

Name | Year | Duration | # tracks/ids | Scene | Object type |
---|---|---|---|---|---|
MOT15 | 2015 | 16 min | 1221 | Outdoor | Pedestrians |
MOT16/17 | 2016 | 9 min | 1276 | Outdoor & indoor | Pedestrians & vehicles |
CVPR19/MOT20 | 2019 | 26 min | 3833 | Crowded scenes | Pedestrians & vehicles |
PathTrack | 2017 | 172 min | 16287 | YouTube people scenes | Persons |
VisDrone | 2019 | - | - | Outdoor view from drone camera | Pedestrians & vehicles |
KITTI | 2012 | 32 min | - | Traffic scenes from car camera | Pedestrians & vehicles |
UA-DETRAC | 2015 | 10h | 8200 | Traffic scenes | Vehicles |
CamNeT | 2015 | 30 min | 30 | Outdoor & indoor | Persons |
Popular publications are summarized below:

Name | Year | MOT16 IDF1 | MOT16 MOTA | Inference Speed (fps) | Online/ Batch | Detector | Feature extraction/ motion model | Affinity & Association Approach |
---|---|---|---|---|---|---|---|---|
A Simple Baseline for Multi-object Tracking -FairMOT | 2020 | 70.4 | 68.7 | 25.8 | Online | One-shot tracker with detector head | One-shot tracker with re-id head & multi-layer feature aggregation, IOU, Kalman Filter | JV algorithm on IOU, embedding distance |
How to Train Your Deep Multi-Object Tracker -DeepMOT-Tracktor | 2020 | 53.4 | 54.8 | 1.6 | Online | Single object tracker: Faster-RCNN (Tracktor), GO-TURN, SiamRPN | Tracktor, CNN re-id module | Deep Hungarian Net using Bi-RNN |
Tracking without bells and whistles -Tracktor | 2019 | 54.9 | 56.2 | 1.6 | Online | Modified Faster-RCNN | Temporal bbox regression with bbox camera motion compensation, re-id embedding from Siamese CNN | Greedy heuristic to merge tracklets using re-id embedding distance |
Towards Real-Time Multi-Object Tracking -JDE | 2019 | 55.8 | 64.4 | 18.5 | Online | One-shot tracker - Faster R-CNN with FPN | One-shot - Faster R-CNN with FPN, Kalman Filter | Hungarian Algorithm |
Exploit the connectivity: Multi-object tracking with TrackletNet -TNT | 2019 | 56.1 | 49.2 | 0.7 | Batch | MOT challenge detections | CNN with bbox camera motion compensation, embedding feature similarity | CNN-based similarity measures between tracklet pairs; tracklet-based graph-cut optimization |
Extending IOU based Multi-Object Tracking by Visual Information -VIOU | 2018 | 56.1(VisDrone) | 40.2(VisDrone) | 20(VisDrone) | Batch | Mask R-CNN, CompACT | IOU | KCF to merge tracklets using greedy IOU heuristics |
Simple Online and Realtime Tracking with a Deep Association Metric -DeepSORT | 2017 | 62.2 | 61.4 | 17.4 | Online | Modified Faster R-CNN | CNN re-id module, IOU, Kalman Filter | Hungarian Algorithm, cascaded approach using Mahalanobis distance (motion), embedding distance |
Multiple people tracking by lifted multicut and person re-identification -LMP | 2017 | 51.3 | 48.8 | 0.5 | Batch | Public detections | StackNetPose CNN re-id module | Spatio-temporal relations, deep-matching, re-id confidence; detection-based graph lifted-multicut optimization |