Abstract: Accurate and reliable extrinsic calibration of sensors is essential for achieving high-precision localization and navigation in camera-LiDAR fusion systems. However, existing end-to-end camera-LiDAR calibration methods suffer from various limitations, such as large model parameter counts and mismatched cross-modal feature correlation computation. To address these issues, this article proposes a novel joint calibration method based on stereo camera-estimated depth maps and initial LiDAR-projected depth maps. Specifically, the semi-global block matching (SGBM) algorithm is used to perform stereo matching and generate high-accuracy depth maps. These maps, along with the initial LiDAR depth projections, are fed into a lightweight deep neural network designed for multi-modal feature fusion, effectively mitigating modality inconsistency. A correlation matching layer is then used to compute feature-level correspondences, and two separate self-attention mechanisms are introduced to independently model the rotational and translational extrinsic parameters. Finally, an iterative refinement training strategy is adopted to further improve calibration accuracy. Experimental results on the KITTI Odometry dataset show that, compared with the state-of-the-art method LCCNet, the proposed method achieves an average translation error of 0.67 cm and an average rotation error of 0.09°, reductions of 59.64% and 72.73%, respectively, while requiring fewer model parameters. In addition, real-world vehicle tests further demonstrate the effectiveness of the proposed method: when the predicted extrinsics are used as the initial extrinsic calibration in the LVI-SAM system, the absolute trajectory root mean square error is reduced by 5.18% compared with LCCNet, validating the accuracy and practical applicability of the method.
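As a rough, illustrative sketch (not part of the paper), the stereo-estimated depth maps mentioned above can be produced with OpenCV's standard SGBM implementation; the focal length and baseline below are placeholder, KITTI-like values and do not reflect parameters reported by the authors.

```python
# Illustrative sketch only: SGBM stereo depth estimation with OpenCV.
# fx and baseline_m are placeholder values (roughly KITTI-like), not taken from the paper.
import cv2
import numpy as np

def sgbm_depth(left_gray, right_gray, fx=721.5, baseline_m=0.54):
    """Compute a depth map in meters from a rectified grayscale stereo pair."""
    block_size = 5
    sgbm = cv2.StereoSGBM_create(
        minDisparity=0,
        numDisparities=128,          # must be divisible by 16
        blockSize=block_size,
        P1=8 * block_size ** 2,      # smoothness penalty for small disparity changes
        P2=32 * block_size ** 2,     # smoothness penalty for large disparity changes
        uniquenessRatio=10,
        speckleWindowSize=100,
        speckleRange=2,
    )
    # OpenCV returns fixed-point disparities scaled by 16.
    disparity = sgbm.compute(left_gray, right_gray).astype(np.float32) / 16.0
    depth = np.zeros_like(disparity)
    valid = disparity > 0
    depth[valid] = fx * baseline_m / disparity[valid]  # depth = f * B / d
    return depth
```

In the pipeline described by the abstract, a depth map of this kind would be paired with the initial LiDAR-projected depth map and passed to the calibration network; the exact preprocessing used by the authors is not specified here.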