Abstract
Rapid advancements in artificial intelligence have made video surveillance increasingly pervasive across public and private domains. Multi-object detection (MOD) has emerged as a key research focus within video surveillance due to its critical role in security monitoring, crowd analysis, and anomaly detection. Traditional MOD systems rely on machine learning-based pipelines that typically follow a divide-and-conquer strategy for parameter optimization. However, these methods often exhibit limited performance due to constraints in model architecture and feature representation. To address these challenges, this paper proposes an enhanced You Only Look Once Version 5 (YOLOv5) framework, termed Attention-based YOLOv5 (AYOLOv5), specifically optimized for MOD in surveillance videos. The proposed system integrates attention mechanisms to refine feature extraction, improving both detection accuracy and computational efficiency. Initially, the surveillance video frames undergo pre-processing steps, including frame conversion and data augmentation, to enrich the dataset and improve model generalization. Subsequently, the AYOLOv5 model detects multiple objects, leveraging a fuzzy c-means (FCM) clustering approach to optimize anchor box generation. Experimental evaluations conducted on the MOT20 dataset demonstrate that the proposed framework achieves a superior detection accuracy of 98.90%, outperforming existing state-of-the-art models. These results highlight the model's effectiveness in handling complex scenarios involving occlusions, overlapping objects, and varying object scales, thereby significantly enhancing the reliability and practical utility of video surveillance systems for real-world applications.
Keywords: Multi-object Detection, Transfer Learning, Video Surveillance, You Only Look Once.