Abstract: To address the challenges of incomplete geometric feature information in two-dimensional visual images of roadside facilities and traffic participants, as well as the lack of scene semantic information, which together lead to inaccurate perception and understanding of complex traffic scenarios in autonomous driving, a semantic understanding model for complex autonomous driving scenarios based on element information completion is proposed. First, a dense connection network (DenseNet) is used to extract multi-scale 2D features from visual images. Then, a feature line-of-sight projection (FLoSP) module back-projects these 2D features into 3D voxel space along camera lines of sight, and a 3D UNet built from dimension decomposition residual (DDR) modules extracts 3D features of scene objects, enabling the transformation of single-frame 2D visual image features into 3D features. Additionally, a contextual residual prior (3D CRP) layer is introduced between the 3D UNet encoder and decoder, and atrous spatial pyramid pooling (ASPP) and Softmax layers output the scene semantic completion results, thereby enhancing the spatial semantic understanding capability of the model. Meanwhile, image caption generation technology is used to build a context-aware semantic-embedding scene understanding language description model based on an improved VGG-16 encoder and a long short-term memory (LSTM) decoder. The improved VGG-16 encoder fuses and concatenates traffic scene features at different scales and feeds them into the LSTM decoder through a projection matrix, establishing a semantic mapping between scene object images and predicate relations and automatically generating natural language descriptions of object detection results together with autonomous driving decision-making suggestions. Finally, the proposed complex scene semantic understanding algorithm is validated on the SemanticKITTI dataset and in real-vehicle experiments. The results show that, compared with the JS3C-Net algorithm, the proposed algorithm achieves a relative improvement of 11.27% in mean intersection over union (mIoU), realizes accurate perception and semantic understanding of complex autonomous driving scenarios through semantic completion, and provides a reliable basis for autonomous driving decision-making and planning.
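To make the 2D-to-3D lifting step concrete, the sketch below shows a minimal FLoSP-style back-projection in PyTorch: voxel centers are projected onto the image plane with a pinhole camera model and the 2D feature map is sampled at the projected locations. The function name `flosp_lift`, the tensor shapes, and the KITTI-like intrinsics in the usage example are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def flosp_lift(feat2d, voxel_centers, K, img_size):
    """Back-project 2D image features onto 3D voxels along lines of sight.

    feat2d:        (1, C, H, W) multi-scale 2D feature map (e.g. from DenseNet)
    voxel_centers: (N, 3) voxel centers in camera coordinates (z > 0 in front)
    K:             (3, 3) pinhole camera intrinsic matrix
    img_size:      (H_img, W_img) of the original image
    returns:       (N, C) per-voxel features (zeros for voxels outside the view)
    """
    # Project voxel centers to pixel coordinates: p = K @ x, then divide by depth
    uvw = voxel_centers @ K.T                      # (N, 3)
    uv = uvw[:, :2] / uvw[:, 2:].clamp(min=1e-6)   # (N, 2) pixel coordinates

    # Normalize pixel coordinates to [-1, 1] as required by grid_sample
    H_img, W_img = img_size
    grid = torch.stack(
        [uv[:, 0] / (W_img - 1) * 2 - 1,
         uv[:, 1] / (H_img - 1) * 2 - 1], dim=-1)  # (N, 2)

    # Bilinearly sample the 2D feature map at each voxel's projection
    sampled = F.grid_sample(
        feat2d, grid.view(1, 1, -1, 2),
        mode="bilinear", align_corners=True)       # (1, C, 1, N)
    feats = sampled.squeeze(0).squeeze(1).T        # (N, C)

    # Zero out voxels behind the camera or projecting outside the image
    valid = (voxel_centers[:, 2] > 0) & (grid.abs() <= 1).all(dim=-1)
    return feats * valid.unsqueeze(-1).to(feats.dtype)

# Toy usage with illustrative shapes and KITTI-like intrinsics
feat2d = torch.randn(1, 64, 94, 310)           # e.g. a 1/4-scale feature map
vox = torch.rand(60 * 36 * 60, 3) * 10         # toy voxel centers, z in [0, 10)
K = torch.tensor([[718.856, 0.0, 607.1928],
                  [0.0, 718.856, 185.2157],
                  [0.0, 0.0, 1.0]])
vox_feats = flosp_lift(feat2d, vox, K, (376, 1241))   # (N, 64)
```

In a multi-scale setting, the same sampling can be repeated on feature maps at several resolutions and the per-voxel features summed, so each voxel aggregates evidence from every scale along its line of sight.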