Abstract: Voxel-based 3D object detection methods often suffer from poor real-time performance when processing large-scale LiDAR point clouds because they depend heavily on dense 2D backbone networks. In this paper, we propose VoxelFSD, a voxel-based, fully sparse 3D object detector that significantly improves the real-time capability of long-range detection. The model has three core components. First, parallel convolutional branches (PCB) expand the receptive field and extract object features more comprehensively, mitigating the impact of missing object-center features. Second, a sparse region proposal network (SRPN) head predicts objects sparsely, which avoids the redundant computation of dense prediction and improves efficiency on large-scale point clouds. Third, an ROI head with an attention fusion module (AFM-ROI) uses cross-attention in the second stage to fuse 3D backbone features with compressed bird's-eye-view (BEV) features, refining object representations for higher detection accuracy. By removing the dense 2D backbone of traditional voxel-based detectors and integrating PCB and SRPN, we first present VoxelFSD-S, a fully sparse, single-stage, lightweight detector that achieves a better balance between speed and accuracy than existing lightweight voxel-based models. Building on VoxelFSD-S, we introduce VoxelFSD-T, a two-stage detector enhanced with AFM-ROI, which boosts accuracy at minimal additional computational cost. On the KITTI test set, VoxelFSD-S and VoxelFSD-T achieve accuracies of 77.67% and 81.50%, respectively.
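
To make the efficiency argument behind sparse prediction concrete, the following is a minimal sketch and not the SRPN implementation from this paper. It assumes a toy linear scoring head applied only to occupied BEV cells; the names `sparse_head` and `dense_cell_count`, the 512 x 512 grid, the weights, and the feature values are all hypothetical and chosen purely for illustration.

```python
import math


def sigmoid(x):
    """Logistic function mapping a logit to a score in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def sparse_head(occupied, weight, bias, threshold=0.5):
    """Score only the occupied BEV cells (hypothetical toy head).

    occupied: dict mapping (row, col) -> feature vector (list of floats).
    Returns a dict mapping (row, col) -> score for cells at or above threshold.
    """
    proposals = {}
    for cell, feat in occupied.items():
        # A single linear layer stands in for the real prediction head.
        logit = sum(w * x for w, x in zip(weight, feat)) + bias
        score = sigmoid(logit)
        if score >= threshold:
            proposals[cell] = score
    return proposals


def dense_cell_count(grid_h, grid_w):
    """Number of cells a dense head evaluates, occupied or not."""
    return grid_h * grid_w


# Illustrative data: three occupied cells in an assumed 512 x 512 BEV grid.
occupied = {
    (10, 12): [2.0, 1.0],
    (10, 13): [-3.0, 0.5],
    (400, 7): [1.5, 1.5],
}
weight = [1.0, 1.0]  # assumed head weights
bias = -1.0          # assumed head bias

proposals = sparse_head(occupied, weight, bias)
evaluated_sparse = len(occupied)             # cells the sparse head scores
evaluated_dense = dense_cell_count(512, 512)  # cells a dense head would score
```

In this toy setting the sparse head scores 3 cells, whereas a dense head over the same grid would score 262,144. The work of the sparse head grows with the number of occupied cells rather than with the grid area, which is the property that matters when the detection range, and therefore the grid, becomes large.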