Abstract: In autonomous driving perception, multi-modal fusion of camera and LiDAR features in a bird's-eye view (BEV) has become a mainstream paradigm for combining information from different modalities into a unified spatial representation. Although representative frameworks such as BEVFusion achieve high 3D object detection accuracy, they rely heavily on depth prediction during the perspective transformation from 2D image features to BEV space. This depth module is complex and parameter-intensive, leading to low inference efficiency and high memory consumption and posing challenges for deployment on edge devices and other resource-constrained platforms. To address these issues, we build on the BEVFusion framework and focus on improving the accuracy and efficiency of the perspective transformation. We propose a BEV visual feature optimization algorithm that fuses camera and LiDAR information by embedding LiDAR-measured depth into the image feature representation, replacing the original depth prediction module. In addition, the BEV space construction and pooling modules are restructured for computational efficiency. Experimental results show that, without compromising 3D detection accuracy, the proposed method reduces the inference time of the key modules to 16% of the original, improves end-to-end inference speed by 83%, and lowers peak memory usage by 27%. It also markedly reduces sensitivity to input image resolution, improving adaptability to varying compute budgets and deployment feasibility in real-world applications.
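To make the core idea concrete, the following is a minimal sketch, not the paper's actual implementation, of how LiDAR points could be projected into the camera view to supply measured depth for lifting 2D image features into 3D, in place of a learned depth-prediction branch. All tensor shapes and the names `lidar_depth_map`, `lift_features_with_measured_depth`, `lidar2img`, and `cam2ego_fn` are assumptions introduced here for illustration.

```python
import torch

def lidar_depth_map(points_lidar, lidar2img, feat_h, feat_w, stride):
    """Hypothetical helper: project LiDAR points into the image plane and
    build a sparse depth map at image-feature resolution.

    points_lidar: (N, 3) points in the LiDAR frame
    lidar2img:    (4, 4) combined extrinsic + intrinsic projection matrix
    feat_h/w:     spatial size of the 2D image feature map
    stride:       downsampling factor between raw image and feature map
    """
    n = points_lidar.shape[0]
    pts_h = torch.cat([points_lidar, points_lidar.new_ones(n, 1)], dim=1)  # homogeneous coords
    cam = pts_h @ lidar2img.T                        # project into the image frame
    depth = cam[:, 2]
    valid = depth > 0.1                              # keep points in front of the camera
    u = (cam[:, 0] / depth / stride).long()          # pixel column at feature resolution
    v = (cam[:, 1] / depth / stride).long()          # pixel row at feature resolution
    valid &= (u >= 0) & (u < feat_w) & (v >= 0) & (v < feat_h)

    # Sparse measured depth; pixels without a LiDAR return stay 0.
    # (A real pipeline would resolve collisions, e.g. keep the nearest point.)
    depth_map = points_lidar.new_zeros(feat_h, feat_w)
    depth_map[v[valid], u[valid]] = depth[valid]
    return depth_map


def lift_features_with_measured_depth(img_feat, depth_map, cam2ego_fn):
    """Lift 2D image features to 3D using measured rather than predicted depth.

    img_feat:   (C, H, W) image feature map
    depth_map:  (H, W) LiDAR depth at feature resolution, 0 where no return
    cam2ego_fn: assumed callable mapping pixel coords (u, v, depth) to
                ego-frame XYZ, provided by the surrounding pipeline
    Returns per-point features and 3D coordinates, ready for BEV pooling.
    """
    v, u = torch.nonzero(depth_map, as_tuple=True)   # pixels with a LiDAR return
    feats = img_feat[:, v, u].T                      # (M, C) features at those pixels
    xyz = cam2ego_fn(u.float(), v.float(), depth_map[v, u])  # (M, 3) in ego frame
    return feats, xyz
```

Under these assumptions, each lifted image feature carries a single measured depth instead of a predicted per-pixel depth distribution, which is what allows the dense depth-estimation module and its associated compute and memory cost to be removed before BEV pooling.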