Abstract
The YOLO series has made substantial progress in real-time object detection, offering a favorable balance between speed and accuracy. Feature representation and bounding box regression nevertheless remain challenging, particularly for dense scenes, overlapping instances, and small objects. To address these limitations, this paper proposes YOLO-ACR, an improved architecture based on YOLOv11 with enhancements to feature fusion, structural design, and regression optimization. First, the CASAFF module adaptively fuses channel and spatial features to strengthen the model's multi-scale representation capability. Second, the C3k2-RV structure draws on the efficient design of RepVGG to balance a lightweight architecture against feature extraction capability. Third, for bounding box localization, we introduce the IS-MPDIoU loss function, which combines the spatial distance between bounding boxes with the overlap ratio of their internal regions and incorporates dynamic scaling and coordinate distribution modeling to improve regression precision and robustness. Experiments on the PASCAL VOC dataset show that YOLO-ACR outperforms YOLOv11 at every model scale tested, improving mAP50 by 1.7% for the large-scale model (approximately 59M parameters), 2.7% for the medium-scale model (approximately 38M parameters), and 0.5% for the small-scale model (approximately 9.5M parameters). The proposed method also improves performance on the COCO2017 dataset.
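To make the idea behind the regression loss concrete, the following is a minimal sketch of how a corner-distance penalty (in the style of MPDIoU) might be combined with an IoU computed on scaled "inner" auxiliary boxes (in the style of Inner-IoU). It is an illustration under stated assumptions, not the IS-MPDIoU formulation itself: the dynamic scaling and coordinate distribution modeling components are omitted, and the function names, the `ratio` default of 0.75, and the way the two terms are summed are all hypothetical choices made for this example.

```python
def iou(a, b):
    """Plain IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def scale_box(box, ratio):
    """Scale a box about its centre by `ratio` to get an auxiliary box."""
    cx, cy = (box[0] + box[2]) / 2, (box[1] + box[3]) / 2
    half_w = (box[2] - box[0]) * ratio / 2
    half_h = (box[3] - box[1]) * ratio / 2
    return (cx - half_w, cy - half_h, cx + half_w, cy + half_h)


def inner_mpdiou_loss(pred, target, img_w, img_h, ratio=0.75):
    """Hypothetical combination: 1 - inner IoU + normalised corner distances.

    `ratio` (assumed value) controls the size of the auxiliary boxes;
    `img_w` and `img_h` normalise the squared corner distances.
    """
    # Overlap term: IoU of the scaled auxiliary boxes.
    inner_iou = iou(scale_box(pred, ratio), scale_box(target, ratio))
    # Distance term: squared distances between matching corners.
    norm = img_w ** 2 + img_h ** 2
    d_top_left = (pred[0] - target[0]) ** 2 + (pred[1] - target[1]) ** 2
    d_bottom_right = (pred[2] - target[2]) ** 2 + (pred[3] - target[3]) ** 2
    return 1.0 - inner_iou + d_top_left / norm + d_bottom_right / norm
```

In this sketch the loss is zero when the predicted and target boxes coincide. For non-overlapping boxes the overlap term is constant, but the corner-distance term still grows with the offset, so the loss continues to distinguish nearer predictions from farther ones.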