Surveillance video person re-identification under multi-modal information fusion
DOI:
CSTR:
Author:
Affiliation:

1. School of Electronic Engineering and Automation, Guilin University of Electronic Technology, Guilin 541004, China; 2. Ningbo Institute of Materials Technology and Engineering, Chinese Academy of Sciences, Ningbo 315201, China

CLC Number:

TP391.41; TH74

Fund Project:

    Abstract:

    To address the challenges of low resolution, severe occlusion, and large variations in person pose and shape, this paper proposes a new method for person re-identification (Re-ID) in surveillance videos based on multi-modal information fusion, using YOLOv9 as the backbone network combined with the multi-modal model CLIP (Contrastive Language-Image Pre-training). The method consists of two stages. In the first stage, a ReID-YOLO network is constructed to enhance person feature detection under challenging conditions. A receptive-field enhancement module and deformable convolution are introduced to improve feature extraction for persons with diverse poses and shapes, and a spatially enhanced attention mechanism is employed to model relationships among person features and recover occluded information. In addition, a loss function based on a normalized Gaussian distance is designed to increase sensitivity to low-resolution person features. Together, these strategies improve the accuracy and robustness of person detection in surveillance video affected by low resolution, pose variation, shape deformation, and occlusion. In the second stage, the multi-modal model CLIP is introduced to improve overall accuracy and cross-scene generalization. Leveraging CLIP's image-text alignment ability, person targets extracted in the first stage are predicted using the discriminative features provided by ReID-YOLO. This fusion strategy mitigates CLIP's over-reliance on global scene information while compensating for the limited scene awareness and target semantic parsing capability of YOLO-based networks. Experiments under challenging conditions such as low resolution, together with ablation studies and cross-identity scenarios, demonstrate that the proposed method achieves outstanding performance in video-based person re-identification, outperforming YOLO-series networks and seven other state-of-the-art video re-identification models and showing considerable promise for practical applications.
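The abstract does not give the exact form of the normalized Gaussian distance-based loss. A common formulation in the small-object detection literature models each bounding box as a 2-D Gaussian and compares boxes via a normalized Wasserstein distance, which stays informative even when boxes are tiny or barely overlap. The sketch below is an illustrative assumption of that formulation, not the paper's verified loss; the constant `c` and the box encoding `(cx, cy, w, h)` are conventional choices.

```python
import math

def nwd(box_a, box_b, c=12.8):
    """Normalized Gaussian Wasserstein distance between two boxes.

    Each box is (cx, cy, w, h) and is modeled as a 2-D Gaussian
    N([cx, cy], diag(w^2/4, h^2/4)). For two such Gaussians the squared
    2-Wasserstein distance reduces to a closed form over centers and
    half-extents; c is a dataset-dependent normalizing constant.
    Returns a similarity in (0, 1]; identical boxes give 1.0.
    """
    (cxa, cya, wa, ha) = box_a
    (cxb, cyb, wb, hb) = box_b
    w2 = ((cxa - cxb) ** 2 + (cya - cyb) ** 2
          + ((wa - wb) / 2.0) ** 2 + ((ha - hb) / 2.0) ** 2)
    return math.exp(-math.sqrt(w2) / c)

def nwd_loss(pred, target):
    """Loss variant: 0 for a perfect match, approaching 1 as boxes diverge."""
    return 1.0 - nwd(pred, target)
```

Unlike IoU-based losses, this similarity decays smoothly with center and size differences even for non-overlapping low-resolution boxes, which matches the stated goal of increasing sensitivity to small, low-resolution person targets.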

History
  • Received:
  • Revised:
  • Accepted:
  • Online: March 30, 2026
  • Published: