Physics-based character control generates realistic motion dynamics by leveraging kinematic priors learned from large-scale data within a simulation engine. The simulated motion respects physical plausibility, while dynamic cues such as contacts and forces guide compelling human-scene interaction. However, audio cues, which can capture physical contacts in a cost-effective way, remain underexplored for animating human motion. In this work, we demonstrate that audio inputs can improve footstep prediction and capture human locomotion dynamics. Experiments validate that audio-aided control from sparse observations (e.g., an IMU sensor on a VR headset) improves the accuracy of contact-dynamics prediction and motion tracking, offering a practical auxiliary signal for robotics, gaming, and virtual environments.
We use a two-stage training pipeline. We first pretrain the motion imitation policy on a large-scale motion dataset. We then reuse the codebook and the decoder of the pretrained imitation policy to train our audio-aided high-level policy. The high-level policy takes footstep audio and head-mounted IMU data as observations and produces a suitable latent code for character control.
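The interface between the two stages can be sketched as follows. This is a minimal, hypothetical illustration with NumPy stand-ins (the class names, dimensions, and linear decoder are illustrative assumptions, not the paper's implementation): a frozen codebook and decoder from stage 1, and a stage-2 high-level policy that maps the audio feature and head-mounted IMU observation to a codebook entry.

```python
import numpy as np

class FrozenImitationPolicy:
    """Stage 1 stand-in: pretrained imitation policy with a latent codebook and decoder.

    Both are frozen after pretraining; only the high-level policy is trained in stage 2.
    Dimensions here are illustrative, not the paper's.
    """
    def __init__(self, num_codes=64, code_dim=32, action_dim=69, seed=0):
        rng = np.random.default_rng(seed)
        self.codebook = rng.normal(size=(num_codes, code_dim))  # frozen latent codes
        self.W_dec = rng.normal(size=(code_dim, action_dim))    # frozen decoder (linear stand-in)

    def decode(self, code):
        # Decode a latent code into a joint-space control action.
        return code @ self.W_dec

class AudioAidedHighLevelPolicy:
    """Stage 2 stand-in: maps (audio feature, head IMU) to a codebook entry."""
    def __init__(self, obs_dim, num_codes, seed=1):
        rng = np.random.default_rng(seed)
        self.W = rng.normal(size=(obs_dim, num_codes))  # trained with RL in the paper

    def select_code(self, audio_feat, imu_obs, codebook):
        obs = np.concatenate([audio_feat, imu_obs])
        logits = obs @ self.W
        return codebook[np.argmax(logits)]  # pick the highest-scoring latent code

# Usage: one control step from sparse observations.
low = FrozenImitationPolicy()
high = AudioAidedHighLevelPolicy(obs_dim=1 + 6, num_codes=64)
onset = np.array([0.8])   # e.g. onset strength of the footstep audio at this frame
imu = np.zeros(6)         # head-mounted IMU reading (acc + gyro)
action = low.decode(high.select_code(onset, imu, low.codebook))
print(action.shape)       # one action vector per control step
```

The key design point sketched here is that stage 2 never outputs raw actions: it only chooses among latent codes the pretrained imitation policy already knows how to decode, which keeps the resulting motion within the physically plausible repertoire learned in stage 1.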
Given the audio signal, root-mean-square energy (RMSE) is the most straightforward feature we can extract: it simply captures the amplitude of the signal. Onset strength focuses on detecting the start of individual events, which makes it well suited for beat and transient detection. Finally, Mel-frequency cepstral coefficients (MFCCs) are coefficients of the short-term power spectrum on the mel-frequency scale. MFCCs carry the largest amount of information, but footstep timings are not easily recognizable in them. Onset strength is more sensitive than RMSE in detecting local peaks, which may correspond to contact moments.
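The two simpler features above can be sketched with NumPy alone. This is a minimal illustration, not the paper's feature extractor: the frame/hop sizes are arbitrary choices, and half-wave-rectified spectral flux is used as a common proxy for onset strength (libraries such as librosa provide more refined versions of both features).

```python
import numpy as np

def frame_signal(y, frame_len=1024, hop=256):
    # Slice the signal into overlapping frames of shape (n_frames, frame_len).
    n = 1 + max(0, (len(y) - frame_len) // hop)
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n)[:, None]
    return y[idx]

def rms_energy(y, frame_len=1024, hop=256):
    # Per-frame amplitude envelope: sqrt of the mean squared sample value.
    frames = frame_signal(y, frame_len, hop)
    return np.sqrt(np.mean(frames ** 2, axis=1))

def onset_strength(y, frame_len=1024, hop=256):
    # Spectral-flux proxy: sum of positive frame-to-frame magnitude increases.
    frames = frame_signal(y, frame_len, hop) * np.hanning(frame_len)
    mag = np.abs(np.fft.rfft(frames, axis=1))
    flux = np.diff(mag, axis=0)
    return np.maximum(flux, 0.0).sum(axis=1)  # only energy *increases* mark onsets

# Synthetic "footstep": a short noise burst in the middle of silence.
sr = 16000
y = np.zeros(sr)
rng = np.random.default_rng(0)
y[8000:8400] = rng.normal(scale=0.5, size=400)

rms = rms_energy(y)
onset = onset_strength(y)
# Both features peak near the burst; onset strength reacts sharply at its start,
# while RMSE stays high for every frame that overlaps the burst.
print(int(np.argmax(rms)), int(np.argmax(onset)))
```

This also illustrates the sensitivity difference noted above: the rectified flux responds to the instant the spectrum changes, making it a sharper indicator of the contact moment than the smoother amplitude envelope.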
We show the results of our model using onset strength as the audio feature. The blue character is the controlled RL agent, and the white character shows the ground-truth motion. Our model matches footstep timings better than the variant that uses only IMU data without the auxiliary audio signal. Even when the agent fails to closely follow the position of the reference motion, it takes a footstep when an audio peak occurs.
| Audio-aided Character Control | Character Control without Audio |
| --- | --- |
```bibtex
@inproceedings{10.2312:egs.20251045,
  booktitle = {Eurographics 2025 - Short Papers},
  editor    = {Ceylan, Duygu and Li, Tzu-Mao},
  title     = {{Audio-aided Character Control for Inertial Measurement Tracking}},
  author    = {Jang, Hojun and Bae, Jinseok and Kim, Young Min},
  year      = {2025},
  publisher = {The Eurographics Association},
  ISSN      = {1017-4656},
  ISBN      = {978-3-03868-268-4},
  DOI       = {10.2312/egs.20251045}
}
```