Abstract: In rehabilitation exercise scenarios, motion input typically takes the form of video sequences. However, pseudo-3D solutions that combine mainstream 2D human pose estimation methods with depth cameras cannot accurately measure the distances between skeletal points in video, which degrades the final assessment performance. To address this issue, this paper proposes a sequence-to-sequence, frame-focused 3D pose estimation method tailored for rehabilitation evaluation. The goal is to extract more comprehensive and detailed 3D coordinate information directly from raw, noisy 2D input and to analyze motion sequences on that basis. The proposed method adopts a four-branch streaming transformer architecture that captures spatiotemporal interactions across long sequences by modeling the temporal and spatial dimensions of the raw 2D input independently. The four branches are fused through learnable proportional parameters, and an additional module combining a spatial encoder with an enhanced temporal decoder generates the final output. Our method outperforms state-of-the-art approaches on the Human3.6M dataset, achieving a mean per-joint position error (MPJPE) of only 14.4 mm, the lowest 3D pose coordinate error reported to date. This demonstrates that the proposed backbone architecture is effective for more complex rehabilitation motion video sequence tasks. Moreover, comparative experiments on real-world rehabilitation video sequences further validate the effectiveness of our approach. Building on this pose estimation method, we have developed a novel multi-dimensional intelligent rehabilitation exercise evaluation and analysis system capable of estimating motion metrics for 120 joint actions. The system has entered the clinical validation phase and has been tested on more than 2,000 patients, achieving an average accuracy of 93.2%.
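
To make the fusion scheme described above concrete, the following is a minimal PyTorch sketch of four parallel attention branches over a 2D keypoint sequence, mixed through learnable proportional parameters and followed by a spatial-encoder/temporal-decoder head that regresses per-frame 3D joint coordinates. All names (`AxisAttentionBranch`, `FourBranchFusion`, `branch_logits`), the choice of two temporal plus two spatial branches, and all dimensions are illustrative assumptions, not the paper's actual implementation.

```python
# Sketch only: the abstract specifies four branches fused by learnable
# proportional parameters plus a spatial encoder and temporal decoder;
# the concrete layer layout below is an assumption for illustration.
import torch
import torch.nn as nn


class AxisAttentionBranch(nn.Module):
    """Self-attention applied along one axis (time or joints) of the sequence."""

    def __init__(self, dim: int, axis: str, heads: int = 4):
        super().__init__()
        assert axis in ("temporal", "spatial")
        self.axis = axis
        self.block = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, dim_feedforward=2 * dim, batch_first=True
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, T, J, C)
        B, T, J, C = x.shape
        if self.axis == "temporal":           # attend across frames, per joint
            x = x.permute(0, 2, 1, 3).reshape(B * J, T, C)
            x = self.block(x)
            return x.reshape(B, J, T, C).permute(0, 2, 1, 3)
        else:                                 # attend across joints, per frame
            x = x.reshape(B * T, J, C)
            x = self.block(x)
            return x.reshape(B, T, J, C)


class FourBranchFusion(nn.Module):
    """Four axis-attention branches fused by learnable proportional weights."""

    def __init__(self, num_joints: int = 17, dim: int = 64):
        super().__init__()
        self.embed = nn.Linear(2, dim)        # lift 2D keypoints to features
        self.branches = nn.ModuleList(
            [AxisAttentionBranch(dim, a) for a in
             ("temporal", "spatial", "temporal", "spatial")]
        )
        # One learnable scalar per branch; softmax keeps the mixture proportional.
        self.branch_logits = nn.Parameter(torch.zeros(4))
        self.spatial_encoder = AxisAttentionBranch(dim, "spatial")
        self.temporal_decoder = AxisAttentionBranch(dim, "temporal")
        self.head = nn.Linear(dim, 3)         # per-joint 3D coordinates

    def forward(self, pose2d: torch.Tensor) -> torch.Tensor:
        # pose2d: (B, T, J, 2) -> (B, T, J, 3)
        x = self.embed(pose2d)
        w = torch.softmax(self.branch_logits, dim=0)
        fused = sum(w[i] * branch(x) for i, branch in enumerate(self.branches))
        fused = self.temporal_decoder(self.spatial_encoder(fused))
        return self.head(fused)


if __name__ == "__main__":
    model = FourBranchFusion()
    seq = torch.randn(2, 81, 17, 2)           # batch of 81-frame 2D sequences
    print(model(seq).shape)                   # torch.Size([2, 81, 17, 3])
```

Normalizing the four scalars with a softmax is one plausible reading of "learnable proportional parameters": it keeps the branch contributions on a comparable scale while still letting training shift weight toward the more informative streams.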