This document includes answers and information relating to common questions and topics regarding multi-object tracking. For more general Machine Learning questions, such as "How many training examples do I need?" or "How to monitor GPU usage during training?", see also the image classification FAQ.
- Data
- Training and Inference
- Evaluation
- State-of-the-Art (SoTA) Technology
  - Popular MOT datasets
  - What is the architecture of the FairMOT tracking algorithm?
  - What object detectors are used in tracking-by-detection trackers?
  - What feature extraction techniques are used in tracking-by-detection trackers?
  - What affinity and association techniques are used in tracking-by-detection trackers?
  - What is the difference between online and offline (batch) tracking algorithms?
- Popular Publications and Datasets
For training we use the exact same annotation format as for object detection (see this FAQ). This also means that we train from individual frames, without taking the temporal ordering of these frames into account.
For evaluation, we follow the py-motmetrics repository, which requires the ground-truth data to be in the MOT Challenge format. For ground-truth annotation, the last three columns can be set to -1 by default:
[frame number] [id number] [bbox left] [bbox top] [bbox width] [bbox height] [confidence score] [class] [visibility]
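For illustration, a ground-truth file in this format contains one comma-separated line per box; the values below are made up for this example:

```
1, 1, 794.3, 247.6, 71.2, 174.9, -1, -1, -1
1, 2, 164.1, 449.6, 63.2, 158.8, -1, -1, -1
2, 1, 797.8, 243.9, 71.2, 174.9, -1, -1, -1
```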
See below an example where we use VOTT to annotate the two cans in the image as `can_1` and `can_2`, where `can_1` refers to the white/yellow can and `can_2` refers to the red can. Before annotating, it is important to correctly set the frame extraction rate to match that of the video. After annotation, you can export the annotation results into several formats, such as PASCAL VOC or CSV. For the CSV format, VOTT returns the extracted frames as well as a csv file containing the bounding box and id info:
[image] [x_min] [y_min] [x_max] [y_max] [label]
Under the hood (not exposed to the user), the FairMOT repository uses the annotation format described in the Towards-Realtime-MOT repository for training, where each line describes a bounding box as follows:
[class] [identity] [x_center] [y_center] [width] [height]
The `class` field is set to 0 for all boxes, as only single-class multi-object tracking (e.g. cans) is currently supported by the FairMOT repo. The `identity` field is an integer from 0 to num_identities - 1 which maps each object identity (e.g. coke can, coffee can, etc.) to an integer. The values of [x_center] [y_center] [width] [height] are normalized by the width/height of the image and range from 0 to 1.
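For illustration only, a small helper along these lines could convert a VOTT-style CSV row into the FairMOT training format; the function name, column names, and identity lookup below are hypothetical, not part of either repository:

```python
# Hypothetical conversion from a VOTT-style CSV row to the FairMOT training format.
# Column names and the identity lookup are illustrative assumptions, not repo APIs.
def vott_row_to_fairmot(x_min, y_min, x_max, y_max, label, img_w, img_h, id_map):
    identity = id_map[label]                    # e.g. {"can_1": 0, "can_2": 1}
    x_center = (x_min + x_max) / 2.0 / img_w    # normalize to [0, 1]
    y_center = (y_min + y_max) / 2.0 / img_h
    width = (x_max - x_min) / img_w
    height = (y_max - y_min) / img_h
    # class is always 0 because only single-class tracking is supported
    return f"0 {identity} {x_center:.6f} {y_center:.6f} {width:.6f} {height:.6f}"

print(vott_row_to_fairmot(10, 20, 110, 220, "can_1", 640, 480, {"can_1": 0, "can_2": 1}))
```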
Losses generated by FairMOT include detection-specific losses (e.g. hm_loss, wh_loss, off_loss) and id-specific losses (id_loss). The overall loss (loss) is a weighted average of the detection-specific and id-specific losses, see the FairMOT paper.
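FairMOT balances the two branches with learnable task-uncertainty weights (see the paper). The snippet below is a rough sketch of that idea with illustrative variable names and initial values; it is not the repository's code:

```python
import torch

# Sketch of FairMOT-style loss balancing with learnable task uncertainties.
# w_det and w_id are log-variance-style parameters; values here are illustrative.
w_det = torch.nn.Parameter(torch.tensor(-1.85))
w_id = torch.nn.Parameter(torch.tensor(-1.05))

def total_loss(det_loss, id_loss):
    # det_loss itself aggregates hm_loss, wh_loss and off_loss
    return 0.5 * (torch.exp(-w_det) * det_loss +
                  torch.exp(-w_id) * id_loss +
                  w_det + w_id)
```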
The main training/inference parameters include:
- input_w and input_h: image resolution of the dataset video frames;
- conf_thres, nms_thres, min_box_area: thresholds used to filter out detections that do not meet the required confidence level, NMS overlap level, or minimum box size;
- track_buffer: if a lost track is not matched within the number of frames set by this threshold, it is deleted, i.e. the id is not reused.
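For example, a tracking run might be configured roughly as follows; the values below are illustrative only, not recommendations from the FairMOT authors:

```python
# Illustrative FairMOT-style parameter settings (values are examples only).
tracker_opts = {
    "input_w": 1088,        # frame width fed to the network
    "input_h": 608,         # frame height fed to the network
    "conf_thres": 0.4,      # discard detections below this confidence
    "nms_thres": 0.4,       # IOU threshold for non-maximum suppression
    "min_box_area": 100,    # discard boxes smaller than this (in pixels)
    "track_buffer": 30,     # frames to keep an unmatched track before deleting it
}
```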
The MOT Challenge website hosts the most common benchmarking datasets for pedestrian MOT. Different datasets exist: MOT15, MOT16/17, MOT19/20. These datasets contain many video sequences with different tracking difficulty levels and annotated ground-truth. Detections are also provided for optional use by the participating tracking algorithms.
As multi-object tracking is a complex CV task, many different metrics exist to evaluate tracking performance. Based on how they are computed, metrics can be event-based (CLEAR MOT metrics) or id-based. The main metrics used to gauge performance in the MOT benchmarking challenge include MOTA, IDF1, and ID-switch.
- MOTA (Multiple Object Tracking Accuracy) gauges overall accuracy performance using an event-based computation of how often mismatches occur between the tracking results and the ground-truth. MOTA combines the counts of FP (false positives), FN (false negatives), and id-switches (IDSW), normalized over the total number of ground-truth (GT) detections (see the formulas after this list).
- IDF1 measures overall performance with id-based computation of how long the tracker correctly identifies the target. It is the harmonic mean of identification precision (IDP) and recall (IDR).
- ID-switch measures how often the tracker incorrectly changes the ID of a trajectory. This is illustrated in the following figure: in the left box, person A and person B overlap and are not detected and tracked in frames 4-5. This results in an id-switch in frame 6, where person A is attributed ID_2, which was previously assigned to person B. In the example in the right box, the tracker loses track of person A (initially identified as ID_1) after frame 3 and eventually re-identifies that person with a new ID (ID_2) in frame n, showing another instance of id-switch.
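For reference, the standard definitions of MOTA and IDF1 (as computed by py-motmetrics) can be written as:

```latex
\mathrm{MOTA} = 1 - \frac{\sum_t \left(\mathrm{FN}_t + \mathrm{FP}_t + \mathrm{IDSW}_t\right)}{\sum_t \mathrm{GT}_t}
\qquad
\mathrm{IDF1} = \frac{2\,\mathrm{IDTP}}{2\,\mathrm{IDTP} + \mathrm{IDFP} + \mathrm{IDFN}}
```

where GT_t is the number of ground-truth objects in frame t, and IDTP, IDFP, IDFN are the identity-level true positive, false positive, and false negative counts.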
The FairMOT tracker consists of a single encoder-decoder neural network that extracts high-resolution feature maps of the image frame. As a one-shot tracker, it feeds these into two parallel heads for predicting bounding boxes and re-id features respectively, see source:
Source: Zhang, 2020
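As a schematic sketch only (not the actual FairMOT code), the one-shot design can be pictured as a shared backbone feeding a detection head and a re-id embedding head; layer sizes and names below are illustrative:

```python
import torch.nn as nn

# Schematic FairMOT-style one-shot tracker: a shared backbone feeds a detection
# head and a re-id embedding head. Layer sizes here are illustrative only.
class OneShotTracker(nn.Module):
    def __init__(self, backbone_channels=64, emb_dim=128, num_classes=1):
        super().__init__()
        # Placeholder backbone; FairMOT uses an encoder-decoder network.
        self.backbone = nn.Sequential(
            nn.Conv2d(3, backbone_channels, 3, padding=1), nn.ReLU(),
            nn.Conv2d(backbone_channels, backbone_channels, 3, padding=1), nn.ReLU(),
        )
        # Detection head: center heatmap, box size and center offset per location.
        self.heatmap = nn.Conv2d(backbone_channels, num_classes, 1)
        self.box_size = nn.Conv2d(backbone_channels, 2, 1)
        self.offset = nn.Conv2d(backbone_channels, 2, 1)
        # Re-id head: an embedding vector per location, used for association.
        self.embedding = nn.Conv2d(backbone_channels, emb_dim, 1)

    def forward(self, x):
        feats = self.backbone(x)
        return {
            "hm": self.heatmap(feats).sigmoid(),
            "wh": self.box_size(feats),
            "off": self.offset(feats),
            "id": self.embedding(feats),
        }
```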
The most popular object detectors used by SoTA tracking algorithms include Faster R-CNN, SSD and YOLOv3. Please see our object detection FAQ page for more details.
While older algorithms used local features, such as optical flow, or regional features (e.g. color histograms, gradient-based features or covariance matrices), newer algorithms rely on deep-learning-based feature representations. The most common deep-learning approaches, typically trained on re-id datasets such as the MARS dataset, use classical CNNs to extract visual features. The following figure is an example of a CNN used for MOT by the DeepSORT tracker:
Newer deep-learning approaches include Siamese CNN networks, LSTM networks, and CNNs combined with correlation filters. In a Siamese CNN network, a pair of identical CNNs is used to measure the similarity between two objects, and the CNNs are trained with loss functions that learn the features that best differentiate them.
Source: (Simon-Serra et al, 2015)
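As a minimal sketch of this idea (illustrative only, not the network from the cited paper), a Siamese embedder applies the same weights to both crops and measures the distance between their embeddings; such a network would typically be trained with a contrastive or triplet loss:

```python
import torch.nn as nn
import torch.nn.functional as F

# Minimal Siamese CNN sketch: the same embedding network is applied to both crops,
# and similarity is the distance between the two embeddings. Sizes are illustrative.
class SiameseEmbedder(nn.Module):
    def __init__(self, emb_dim=128):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, emb_dim),
        )

    def forward(self, crop_a, crop_b):
        emb_a = F.normalize(self.cnn(crop_a), dim=1)   # shared weights for both inputs
        emb_b = F.normalize(self.cnn(crop_b), dim=1)
        return (emb_a - emb_b).pow(2).sum(dim=1)       # squared Euclidean distance
```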
In an LSTM network, extracted features from different detections in different time frames are used as inputs. The network predicts the bounding box for the next frame based on the input history.
Source: Ciaparrone, 2019
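A minimal sketch of this setup (shapes and names are illustrative, not taken from the cited survey): an LSTM consumes a history of per-frame box features and regresses the box for the next frame:

```python
import torch
import torch.nn as nn

# Illustrative LSTM-based motion model: given a history of box features
# (e.g. [x, y, w, h] per frame), predict the box in the next frame.
class BoxLSTM(nn.Module):
    def __init__(self, feat_dim=4, hidden_dim=64):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, 4)    # next [x, y, w, h]

    def forward(self, box_history):             # shape: (batch, time, feat_dim)
        out, _ = self.lstm(box_history)
        return self.head(out[:, -1])            # prediction from the last time step

pred = BoxLSTM()(torch.rand(2, 10, 4))          # 2 tracks, 10-frame history each
```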
Correlation filters can also be convolved with feature maps from a CNN to generate a prediction of the target's location in the next time frame. This was done by Ma et al as follows:
Simple approaches use similarity/affinity scores, calculated from distance measures over features extracted by the CNN, to optimally match object detections/tracklets with established object tracks across successive frames. To do this matching, the Hungarian (Kuhn-Munkres) algorithm is often used for online data association, while K-partite graph global optimization techniques are used for offline data association.
In more complex deep-learning approaches, the affinity computation is often merged with feature extraction. For instance, Siamese CNNs and Siamese LSTMs directly output the affinity score.
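As an illustrative sketch of the simple case (not taken from any particular tracker), detections can be assigned to existing tracks by solving a linear assignment problem over an embedding-distance cost matrix; the threshold and random embeddings below are placeholders:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment  # Hungarian (Kuhn-Munkres) solver

# Illustrative association step: rows are existing tracks, columns are new detections.
# Entries are cosine distances between their L2-normalized re-id embeddings.
track_embs = np.random.rand(3, 128)
det_embs = np.random.rand(4, 128)
track_embs /= np.linalg.norm(track_embs, axis=1, keepdims=True)
det_embs /= np.linalg.norm(det_embs, axis=1, keepdims=True)

cost = 1.0 - track_embs @ det_embs.T              # cosine distance matrix
row_idx, col_idx = linear_sum_assignment(cost)    # optimal track-detection matching

max_dist = 0.4                                    # illustrative gating threshold
matches = [(t, d) for t, d in zip(row_idx, col_idx) if cost[t, d] < max_dist]
print(matches)                                    # unmatched detections start new tracks
```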
Online and offline algorithms differ in their data association step. In online tracking, the detections in a new frame are associated with tracks generated from previous frames; thus, existing tracks are extended or new tracks are created. In offline (batch) tracking, all observations in a batch of frames are considered globally (see figure below), i.e. they are linked together into tracks by obtaining a globally optimal solution. Offline tracking can perform better on tracking issues such as long-term occlusion or similar targets that are spatially close. However, offline tracking tends to be slower and hence is not suitable for tasks that require real-time processing, such as autonomous driving.
Popular MOT datasets are summarized below:

Name | Year | Duration | # tracks/ids | Scene | Object type |
---|---|---|---|---|---|
MOT15 | 2015 | 16 min | 1221 | Outdoor | Pedestrians |
MOT16/17 | 2016 | 9 min | 1276 | Outdoor & indoor | Pedestrians & vehicles |
CVPR19/MOT20 | 2019 | 26 min | 3833 | Crowded scenes | Pedestrians & vehicles |
PathTrack | 2017 | 172 min | 16287 | YouTube people scenes | Persons |
VisDrone | 2019 | - | - | Outdoor view from drone camera | Pedestrians & vehicles |
KITTI | 2012 | 32 min | - | Traffic scenes from car camera | Pedestrians & vehicles |
UA-DETRAC | 2015 | 10h | 8200 | Traffic scenes | Vehicles |
CamNeT | 2015 | 30 min | 30 | Outdoor & indoor | Persons |
Popular publications are summarized below:

Name | Year | MOT16 IDF1 | MOT16 MOTA | Inference Speed (fps) | Online/ Batch | Detector | Feature extraction/ motion model | Affinity & Association Approach |
---|---|---|---|---|---|---|---|---|
A Simple Baseline for Multi-object Tracking -FairMOT | 2020 | 70.4 | 68.7 | 25.8 | Online | One-shot tracker with detector head | One-shot tracker with re-id head & multi-layer feature aggregation, IOU, Kalman Filter | JV algorithm on IOU, embedding distance |
How to Train Your Deep Multi-Object Tracker -DeepMOT-Tracktor | 2020 | 53.4 | 54.8 | 1.6 | Online | Single object tracker: Faster-RCNN (Tracktor), GO-TURN, SiamRPN | Tracktor, CNN re-id module | Deep Hungarian Net using Bi-RNN |
Tracking without bells and whistles -Tracktor | 2019 | 54.9 | 56.2 | 1.6 | Online | Modified Faster-RCNN | Temporal bbox regression with bbox camera motion compensation, re-id embedding from Siamese CNN | Greedy heuristic to merge tracklets using re-id embedding distance |
Towards Real-Time Multi-Object Tracking -JDE | 2019 | 55.8 | 64.4 | 18.5 | Online | One-shot tracker - Faster R-CNN with FPN | One-shot - Faster R-CNN with FPN, Kalman Filter | Hungarian Algorithm |
Exploit the connectivity: Multi-object tracking with TrackletNet -TNT | 2019 | 56.1 | 49.2 | 0.7 | Batch | MOT challenge detections | CNN with bbox camera motion compensation, embedding feature similarity | CNN-based similarity measures between tracklet pairs; tracklet-based graph-cut optimization |
Extending IOU based Multi-Object Tracking by Visual Information -VIOU | 2018 | 56.1(VisDrone) | 40.2(VisDrone) | 20(VisDrone) | Batch | Mask R-CNN, CompACT | IOU | KCF to merge tracklets using greedy IOU heuristics |
Simple Online and Realtime Tracking with a Deep Association Metric -DeepSORT | 2017 | 62.2 | 61.4 | 17.4 | Online | Modified Faster R-CNN | CNN re-id module, IOU, Kalman Filter | Hungarian Algorithm, cascaded approach using Mahalanobis distance (motion), embedding distance |
Multiple people tracking by lifted multicut and person re-identification -LMP | 2017 | 51.3 | 48.8 | 0.5 | Batch | Public detections | StackNetPose CNN re-id module | Spatio-temporal relations, deep-matching, re-id confidence; detection-based graph lifted-multicut optimization |