ReMP: Reusable Motion Prior for Multi-domain 3D Human Pose Estimation and Motion Inbetweening

WACV 2025
Seoul National University

A reusable motion prior extracted from a large-scale motion dataset is useful for various downstream tasks.

Abstract

We present Reusable Motion Prior (ReMP), an effective motion prior that can accurately track the temporal evolution of motion in various downstream tasks. Inspired by the success of foundation models, we argue that a robust spatio-temporal motion prior can encapsulate the underlying 3D dynamics shared across sensor modalities. We learn this rich motion prior from sequences of complete parametric models of posed human body shape. By employing a temporal attention mechanism, our prior can estimate poses from noisy measurements or with missing frames, even under significant occlusion. More interestingly, given incomplete and challenging input measurements, our prior guides the system to quickly extract the critical information needed to estimate the sequence of poses, significantly improving training efficiency for mesh sequence recovery. ReMP consistently outperforms baseline methods on diverse and practical 3D motion data, including depth point clouds, LiDAR scans, and IMU sensor data.

Method

The overall pipeline of our method consists of two parts: (a) training the motion prior and (b) reusing the pre-trained prior.

In the motion prior training phase, a sequence of pose parameters θ and root translation transitions Δx forms the motion parameter sequence M. A transformer encoder followed by MLP layers produces a Gaussian distribution from which we sample latent vectors. A transformer decoder maps the latent vectors back to motion parameters, which are then converted to SMPL parameters.

After training the prior, we freeze all networks used in the first phase. In the reusing phase, we encode the input sensor data and use a new transformer encoder to produce a distribution, from which we sample latent vectors for the frozen transformer decoder. An additional estimator predicts the shape parameter β. Finally, we combine all three parameters (θ, Δx, and β) in the SMPL layer to reconstruct the human motion.
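To make the two-phase pipeline concrete, below is a minimal PyTorch sketch. All class names (MotionPriorVAE, ReusedEstimator), layer counts, and dimensions (75-D motion parameters from 24 axis-angle joints plus a 3-D root delta, 64-frame windows, 256-D latent, mean-pooled encoder output) are our own illustrative assumptions, not the exact architecture from the paper.

import torch
import torch.nn as nn

# Illustrative sizes (assumptions): 24 SMPL joints in axis-angle (72-D) plus a
# 3-D root translation delta per frame, 64-frame windows, 256-D latent space.
T, D_MOTION, D_MODEL, D_LATENT = 64, 75, 256, 256

class MotionPriorVAE(nn.Module):
    # Phase (a): transformer VAE over motion parameters M = (theta, delta_x).
    def __init__(self):
        super().__init__()
        self.embed = nn.Linear(D_MOTION, D_MODEL)
        enc = nn.TransformerEncoderLayer(D_MODEL, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc, num_layers=4)
        self.to_mu = nn.Linear(D_MODEL, D_LATENT)        # MLP heads -> Gaussian
        self.to_logvar = nn.Linear(D_MODEL, D_LATENT)
        dec = nn.TransformerDecoderLayer(D_MODEL, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(dec, num_layers=4)
        self.query = nn.Parameter(torch.randn(T, D_MODEL))  # learned frame queries
        self.from_z = nn.Linear(D_LATENT, D_MODEL)
        self.head = nn.Linear(D_MODEL, D_MOTION)         # back to (theta, delta_x)

    def encode(self, motion):                 # motion: (B, T, D_MOTION)
        h = self.encoder(self.embed(motion)).mean(dim=1)  # temporal pooling (assumption)
        return self.to_mu(h), self.to_logvar(h)

    def decode(self, z):                      # z: (B, D_LATENT)
        memory = self.from_z(z).unsqueeze(1)  # (B, 1, D_MODEL)
        queries = self.query.unsqueeze(0).expand(z.size(0), -1, -1)
        return self.head(self.decoder(queries, memory))   # (B, T, D_MOTION)

    def forward(self, motion):
        mu, logvar = self.encode(motion)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterization
        kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(-1).mean()
        return self.decode(z), kl

class ReusedEstimator(nn.Module):
    # Phase (b): a new per-modality encoder (depth, LiDAR, or IMU features)
    # maps to the same latent space; the pre-trained decoder stays frozen.
    def __init__(self, prior, d_feat):
        super().__init__()
        self.prior = prior
        for p in self.prior.parameters():
            p.requires_grad_(False)           # freeze all phase-(a) networks
        self.embed = nn.Linear(d_feat, D_MODEL)
        enc = nn.TransformerEncoderLayer(D_MODEL, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc, num_layers=4)
        self.to_mu = nn.Linear(D_MODEL, D_LATENT)
        self.to_logvar = nn.Linear(D_MODEL, D_LATENT)
        self.beta_head = nn.Linear(D_MODEL, 10)  # SMPL shape estimator (10 betas)

    def forward(self, feats):                 # feats: (B, T, d_feat) per-frame features
        h = self.encoder(self.embed(feats)).mean(dim=1)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()
        motion = self.prior.decode(z)         # frozen decoder -> (theta, delta_x)
        beta = self.beta_head(h)              # shape parameters for the SMPL layer
        return motion, beta

Because the decoder and everything else from phase (a) stay frozen, only the lightweight modality encoder and the shape head are optimized in phase (b), which is consistent with the training-efficiency gain described in the abstract.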

Results (Depth, Synthetic)

(Video columns, left to right: Input, GT, ReMP, DMR [1], VoteHMR [2], Zuo et al. [3])

Results (Depth, Real)

(Video columns, left to right: Input, ReMP, DMR [1], VoteHMR [2], Zuo et al. [3])

Results (LiDAR, Real)

(Video columns, left to right: Input, GT, ReMP, DMR [1], VoteHMR [2], Zuo et al. [3])

Results (IMU, Real)

(Video columns, left to right: GT, ReMP, PIP [4], TransPose [5])

Results (Inbetweening)

References

  1. Hojun Jang, Minkwan Kim, Jinseok Bae, and Young Min Kim, Dynamic Mesh Recovery from Partial Point Cloud Sequence, in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2023.
  2. Guanze Liu, Yu Rong, and Lu Sheng, VoteHMR: Occlusion-aware Voting Network for Robust 3D Human Mesh Recovery from Partial Point Clouds, in Proceedings of the 29th ACM International Conference on Multimedia, 2021.
  3. Xinxin Zuo, Sen Wang, Qiang Sun, Minglun Gong, and Li Cheng, Self-supervised 3D Human Mesh Recovery from Noisy Point Clouds, arXiv preprint arXiv:2107.07539, 2021.
  4. Xinyu Yi, Yuxiao Zhou, Marc Habermann, Soshi Shimada, Vladislav Golyanik, Christian Theobalt, and Feng Xu, Physical Inertial Poser (PIP): Physics-aware Real-time Human Motion Tracking from Sparse Inertial Sensors, in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022.
  5. Xinyu Yi, Yuxiao Zhou, and Feng Xu, TransPose: Real-time 3D Human Translation and Pose Estimation with Six Inertial Sensors, ACM Transactions on Graphics, 40(4), 2021.

BibTeX

@InProceedings{Jang_2025_WACV,
  author    = {Jang, Hojun and Kim, Young Min},
  title     = {ReMP: Reusable Motion Prior for Multi-domain 3D Human Pose Estimation and Motion Inbetweening},
  booktitle = {Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)},
  year      = {2025}
}