Exploiting Multi-view Part-wise Correlation via an Efficient Transformer for Vehicle Re-Identification

Abstract

Image-based vehicle re-identification (ReID) has witnessed much progress in recent years thanks to advances in deep neural networks. However, most existing works struggle to extract robust and discriminative features from a single image in each forward pass to represent a vehicle instance. We argue that images taken from distinct viewpoints, e.g., front and back, exhibit significantly different appearances and patterns for recognition. In order to “memorize” each vehicle, existing models often have to capture consistent “ID codes” from images of totally different views, which makes learning difficult. Additionally, we claim that part-wise correspondences, i.e., among different vehicle parts within the same image and of the same part across different viewpoints, also contribute to instance-level feature learning. Motivated by these observations, we propose to extract comprehensive instance-specific representations of the same vehicle from multiple views by modelling part-wise correlations. To this end, we present an efficient transformer-based framework that exploits both inner- and inter-view correlations for vehicle ReID. Specifically, we first adopt a deep encoder to condense a series of patch embeddings from each view image. Then our efficient transformer, which employs a distillation token and a noise token in addition to the regular class token, enforces all patch embeddings to interact with one another regardless of whether they come from the same view or different views. At inference, a test image together with its augmented counterparts (pseudo views) is treated as the multi-view input and fed into our framework to capture its representation. We conduct extensive experiments on two widely used vehicle ReID benchmarks, and our approach achieves state-of-the-art performance, demonstrating the effectiveness of our method.
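To make the described token layout concrete, the following is a minimal PyTorch sketch of how patch embeddings from multiple views could be concatenated with class, distillation, and noise tokens and processed jointly by self-attention. All module names, the toy convolutional encoder, and the dimensions are illustrative assumptions made here for clarity; they do not reproduce the paper's actual implementation.

```python
# Minimal sketch (assumed, not the paper's code): multi-view patch tokens
# plus class / distillation / noise tokens fed through a transformer encoder.
import torch
import torch.nn as nn


class MultiViewPartTransformer(nn.Module):
    def __init__(self, embed_dim=256, num_heads=8, num_layers=4):
        super().__init__()
        # Stand-in "deep encoder" that condenses each view into patch embeddings
        # (a single patchify convolution here; the real encoder would be deeper).
        self.encoder = nn.Conv2d(3, embed_dim, kernel_size=16, stride=16)
        # Learnable class, distillation, and noise tokens.
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.dist_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.noise_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=num_heads, batch_first=True
        )
        self.transformer = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, views):
        # views: (B, V, 3, H, W) -- V real or pseudo (augmented) views per vehicle.
        b, v = views.shape[:2]
        patches = self.encoder(views.flatten(0, 1))           # (B*V, D, h, w)
        patches = patches.flatten(2).transpose(1, 2)          # (B*V, N, D)
        patches = patches.reshape(b, v * patches.size(1), -1)  # concat all views
        tokens = torch.cat(
            [
                self.cls_token.expand(b, -1, -1),
                self.dist_token.expand(b, -1, -1),
                self.noise_token.expand(b, -1, -1),
                patches,  # inner- and inter-view tokens attend to each other
            ],
            dim=1,
        )
        # Positional embeddings are omitted here for brevity.
        out = self.transformer(tokens)
        # The class-token output serves as the instance-level representation.
        return out[:, 0]


# Usage: one test image plus three augmented pseudo views per vehicle.
model = MultiViewPartTransformer()
feat = model(torch.randn(2, 4, 3, 128, 128))  # -> (2, 256)
```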