Abstract: To address the challenges of camera pose estimation and mobile robot localization, a camera pose estimation method based on a hybrid frequency-domain Transformer is proposed to predict the position and orientation of a camera from RGB images. First, a camera pose estimation dataset, RotIndoor, is constructed from indoor scenes; each sample contains an RGB image of the scene and the ground-truth camera pose obtained from a VICON system. Second, a pose regression network, CamPose, is introduced that integrates spatial-domain and frequency-domain information to strengthen the image feature representation and thereby improve the accuracy of camera pose estimation. Specifically, CamPose incorporates a feature enhancement module based on differential convolution networks to capture fine-grained image features. In addition, a frequency-domain encoding layer is designed that applies the Fourier transform to extract frequency characteristics and integrates a frequency-domain attention module, allowing the model to weight the different frequency components according to their importance. Finally, experiments are conducted on the public datasets 7Scenes and RotIndoor. The results show that the pose estimation error on 7Scenes is reduced to 0.17 m/7.85°, and that the positioning accuracy on RotIndoor is improved by 23% compared with other methods.
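
For readers who want to reproduce error figures of the form "0.17 m/7.85°", the sketch below shows one common way such position and orientation errors are computed in pose regression work: Euclidean distance for position, and the angle between unit quaternions for orientation. The abstract does not state the exact metric or rotation parameterization, so the quaternion convention and the function names `position_error` and `orientation_error_deg` are assumptions made for illustration only.

```python
import numpy as np


def position_error(t_pred, t_true):
    """Euclidean distance between predicted and ground-truth positions (metres)."""
    t_pred = np.asarray(t_pred, dtype=float)
    t_true = np.asarray(t_true, dtype=float)
    return float(np.linalg.norm(t_pred - t_true))


def orientation_error_deg(q_pred, q_true):
    """Angle (degrees) of the rotation separating two quaternions.

    Assumes both orientations are given as 4-element quaternions; they are
    normalized here, so the inputs need not be unit length.
    """
    q_pred = np.asarray(q_pred, dtype=float)
    q_true = np.asarray(q_true, dtype=float)
    q_pred = q_pred / np.linalg.norm(q_pred)
    q_true = q_true / np.linalg.norm(q_true)
    # abs(): q and -q encode the same rotation.
    # clip(): guards arccos against rounding slightly above 1.
    d = np.clip(abs(np.dot(q_pred, q_true)), 0.0, 1.0)
    return float(np.degrees(2.0 * np.arccos(d)))
```

Under these assumptions, a dataset-level figure would be obtained by aggregating the per-image errors (for example with a median or mean); which aggregate the reported numbers use is not specified in the abstract.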