Exploring Lightweight, Real-Time Multi-Object Tracking Models

Undergraduate Honours thesis

Abstract

Multi-Object Tracking (MOT), which aims to predict the trajectories of multiple objects in an input video, is a computer vision task that is crucial for applications such as autonomous driving, smart robotics, and surveillance systems. Following the dominant tracking-by-detection paradigm, state-of-the-art MOT models generally comprise two subtasks. The first is object detection, which predicts the locations and bounding boxes of the objects in each input video frame. The second is re-identification, which generates an appearance embedding for each object to represent its identity and support data association. Both subtasks are computationally expensive, which makes building real-time MOT systems a major challenge.

There has been remarkable progress in building one-shot MOT models, which combine the detection and re-identification tasks into a single neural network to reduce computation cost. While these models claim near real-time performance, their experiments are conducted in ideal test environments with expensive GPUs that accelerate network inference. As a result, these models suffer severe performance degradation when deployed on low-cost computing devices such as personal laptops, which significantly affects their practicality and commercial value.

Our work aims to optimize the network architecture of state-of-the-art real-time MOT models to produce lightweight versions with low deployment costs. We investigate which methods effectively reduce the size of a cumbersome MOT model and accelerate its inference speed. In addition, we explore the feasibility of using soft attention-based feature selection and of adjusting the multi-task learning (MTL) strategy to enhance the model’s tracking performance. The three main contributions of this thesis are: (1) We establish an anchor-free baseline MOT model using MobileNet v2 and a Feature Pyramid Network for low-cost computing devices, and show that structured convolutional neural network pruning is an effective method for removing redundant parameters from the baseline MOT model and increasing its speed. Our lightweight MOT model achieves 73% of the performance of the original MOT model and runs three times faster. (2) We apply a soft attention-based feature selection mechanism to the MOT model that generates both shared and task-specific features for the MOT subtasks, and find that this mechanism improves tracking performance. With our modified network architecture, we observe a 12% increase in tracking performance. (3) We study the influence of different MTL architectures on the MOT model and show that different parameter sharing strategies affect the performance of the MOT subtasks. We also present the best MTL architecture we found for MOT models, which is the partial parameter sharing strategy.
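To make the idea of structured pruning in contribution (1) concrete, the following is a minimal sketch in plain Python, not the implementation used in this thesis. It assumes that filters are ranked by the L1 norm of their weights, which is one common criterion; the abstract does not state which criterion is actually used. The function names (`l1_norm`, `select_filters`, `prune_pair`) and the nested-list weight layout are hypothetical and chosen only for illustration.

```python
def l1_norm(weights):
    """Sum of absolute values over an arbitrarily nested list of weights."""
    if isinstance(weights, (int, float)):
        return abs(weights)
    return sum(l1_norm(w) for w in weights)


def select_filters(layer, keep_ratio):
    """Return the sorted indices of the filters to keep.

    Assumption: filters with a larger L1 norm are treated as more important.
    `layer` is laid out as [out_channels][in_channels][kh][kw].
    """
    n_keep = max(1, int(len(layer) * keep_ratio))
    ranked = sorted(range(len(layer)),
                    key=lambda i: l1_norm(layer[i]),
                    reverse=True)
    return sorted(ranked[:n_keep])


def prune_pair(layer, next_layer, keep_ratio):
    """Remove whole filters from `layer` and the matching input
    channels from `next_layer`, so the two layers stay compatible.

    Removing entire filters (rather than individual weights) is what
    makes the pruning "structured": the result is a smaller dense layer.
    """
    keep = select_filters(layer, keep_ratio)
    pruned = [layer[i] for i in keep]
    pruned_next = [[f[i] for i in keep] for f in next_layer]
    return pruned, pruned_next, keep


# Toy example: 4 filters with 1x1 kernels and a single input channel.
conv1 = [[[[0.5]]], [[[-2.0]]], [[[0.1]]], [[[1.0]]]]
# The following layer has 2 filters, each reading conv1's 4 output channels.
conv2 = [[[[1]], [[2]], [[3]], [[4]]],
         [[[5]], [[6]], [[7]], [[8]]]]

small1, small2, kept = prune_pair(conv1, conv2, keep_ratio=0.5)
print(kept)         # indices of the surviving filters
print(len(small1))  # number of filters left in conv1
```

Because entire filters are dropped, the pruned layers remain ordinary dense convolutions with fewer channels, which is why this style of pruning can translate into faster inference on commodity hardware without sparse-kernel support.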