Holistic-Motion2D: Scalable Whole-body Human Motion Generation in 2D Space

Yuan Wang1,5,6*, Zhao Wang2*, Junhao Gong3*, Di Huang4,5, Tong He5, Wanli Ouyang5, Jile Jiao1,6, Xuetao Feng1,6, Qi Dou2, Shixiang Tang2, Dan Xu7

1Tsinghua University  2The Chinese University of Hong Kong  3Shandong University  4The University of Sydney  5Shanghai AI Laboratory  6Alibaba Group  7HKUST

*Equal contribution. Corresponding author.


Abstract

In this paper, we introduce a novel path to general human motion generation by focusing on 2D space. Traditional methods have primarily generated human motions in 3D, which, while detailed and realistic, are often limited by the scope of available 3D motion data in terms of both size and diversity. To address these limitations, we exploit the extensive availability of 2D motion data. We present Holistic-Motion2D, the first comprehensive and large-scale benchmark for 2D whole-body motion generation, which includes over 1M in-the-wild motion sequences, each paired with high-quality whole-body/partial pose annotations and textual descriptions. Notably, Holistic-Motion2D is ten times larger than the previously largest 3D motion dataset. We also introduce a baseline method, featuring innovative whole-body part-aware attention and confidence-aware modeling techniques, tailored for 2D Text-drivEN whole-boDy motion genERation, namely Tender. Extensive experiments demonstrate the effectiveness of Holistic-Motion2D and Tender in generating expressive, diverse, and realistic human motions. We also highlight the utility of 2D motion for various downstream applications and its potential for lifting to 3D motion.

Dataset

Comparison between our proposed Holistic-Motion2D and existing text-motion datasets. Holistic-Motion2D contains 10x more videos than the previously largest 3D motion dataset, i.e., Motion-X.


Overview of the keypoint and pose-description annotation pipeline for 2D whole-body motions.
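
To make the annotation format concrete, below is a minimal sketch of what one annotated sequence could look like in code, assuming a COCO-WholeBody-style layout of 133 2D keypoints with per-keypoint confidence scores and a paired text description. The field names and part boundaries are illustrative, not the dataset's released schema.

from dataclasses import dataclass
import numpy as np

@dataclass
class MotionSequence2D:
    """One annotated in-the-wild clip (illustrative schema, not the released format)."""
    keypoints: np.ndarray    # (T, 133, 2) pixel coordinates, COCO-WholeBody-style layout assumed
    confidence: np.ndarray   # (T, 133) per-keypoint confidence in [0, 1]
    caption: str             # whole-body or partial pose description

def body_part_slices():
    """Hypothetical split of the 133 whole-body keypoints into parts
    (body, feet, face, left hand, right hand), following COCO-WholeBody ordering."""
    return {
        "body": slice(0, 17),
        "feet": slice(17, 23),
        "face": slice(23, 91),
        "left_hand": slice(91, 112),
        "right_hand": slice(112, 133),
    }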

Method

Overview of our Tender framework. (a) PA-VAE embeds whole-body part-aware spatio-temporal features into a latent space. (b) A diffusion model generates realistic whole-body motions conditioned on text. (c) Whole-body Part-Aware Attention models the spatial relations among different body parts via the CAG mechanism.
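
To illustrate the idea, here is a minimal PyTorch sketch of part-aware attention with confidence-aware gating, written from the caption above: per-frame part tokens are gated by their mean keypoint confidence before attending to each other. The layer sizes, the gating form, and the actual CAG design in the paper may differ.

import torch
import torch.nn as nn

class PartAwareAttention(nn.Module):
    """Sketch: each body part (body, feet, face, hands) contributes one token per frame;
    self-attention models spatial relations across parts, and per-part confidence
    down-weights unreliable (e.g., occluded) parts. Not the paper's exact CAG design."""
    def __init__(self, dim=256, num_parts=5, num_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.conf_gate = nn.Linear(1, dim)  # maps part confidence to a feature-wise gate

    def forward(self, part_tokens, part_conf):
        # part_tokens: (B, num_parts, dim) per-frame part embeddings
        # part_conf:   (B, num_parts) mean keypoint confidence of each part
        gate = torch.sigmoid(self.conf_gate(part_conf.unsqueeze(-1)))  # (B, num_parts, dim)
        x = part_tokens * gate                      # confidence-aware gating
        attn_out, _ = self.attn(x, x, x)            # spatial relations across parts
        return self.norm(part_tokens + attn_out)    # residual + norm

# usage: 5 parts (body, feet, face, left/right hand), 256-d tokens
tokens = torch.randn(2, 5, 256)
conf = torch.rand(2, 5)
out = PartAwareAttention()(tokens, conf)   # (2, 5, 256)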

Comparison with State-of-the-Art Methods

Qualitative results of our Tender compared with previous SOTA methods [1-3]. Our Tender generates noticeably more vivid human motions
while preserving fidelity and achieving superior temporal consistency.

Application 1: Pose-guided Human Video Generation

Panels (left to right): Reference Image | DensePose (mapped from 2D keypoints) | Generated Video

Visualization results of pose-guided human video generation using our proposed Tender model in multiple visual scenarios,
with a corresponding text prompt given below and a reference image given on the left.
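
As a rough illustration of how generated 2D keypoints can condition a video model, the sketch below rasterizes one frame's keypoints into confidence-weighted Gaussian heatmaps. The pipeline shown on this page instead maps keypoints to DensePose, so this is only a simplified stand-in for that conditioning step.

import numpy as np

def keypoints_to_heatmaps(keypoints, confidence, height, width, sigma=4.0):
    """Rasterize one frame's 2D keypoints into Gaussian heatmaps.
    keypoints: (K, 2) pixel coords, confidence: (K,). Returns (K, H, W).
    A simplified stand-in for the DensePose maps used on this page."""
    ys, xs = np.mgrid[0:height, 0:width]
    maps = np.zeros((len(keypoints), height, width), dtype=np.float32)
    for k, ((x, y), c) in enumerate(zip(keypoints, confidence)):
        if c < 0.3:          # skip low-confidence keypoints
            continue
        maps[k] = c * np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2 * sigma ** 2))
    return maps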

Application 2: 2D-to-3D Motion Lifting

Panels (left to right): 2D Human Motion | Lifted 3D Human Motion

Visualization results of 3D motion lifting using our proposed Tender model in multiple visual scenarios,
with a corresponding text prompt given below.
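
For reference, a common way to lift a 2D motion sequence to 3D is a temporal regression network over the 2D keypoint trajectories. The sketch below is such an illustrative baseline (a two-layer GRU plus a linear head), not the specific lifting module used with Tender.

import torch
import torch.nn as nn

class Lifter2Dto3D(nn.Module):
    """Sketch: a GRU over flattened per-frame 2D keypoints regresses per-frame 3D joints.
    Illustrative baseline, not the lifting model used on this page."""
    def __init__(self, num_joints=133, hidden=512):
        super().__init__()
        self.rnn = nn.GRU(num_joints * 2, hidden, num_layers=2, batch_first=True)
        self.head = nn.Linear(hidden, num_joints * 3)

    def forward(self, kpts_2d):
        # kpts_2d: (B, T, J, 2) -> (B, T, J, 3)
        B, T, J, _ = kpts_2d.shape
        h, _ = self.rnn(kpts_2d.reshape(B, T, J * 2))
        return self.head(h).reshape(B, T, J, 3)

# usage: lift a 60-frame whole-body 2D sequence
out = Lifter2Dto3D()(torch.randn(2, 60, 133, 2))   # (2, 60, 133, 3)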

More Generated 2D Human Motions

Visualization results of the generated 2D whole-body motions using our proposed Tender model in multiple visual scenarios,
with a corresponding text prompt given below.

References

[1] Jianrong Zhang, Yangsong Zhang, Xiaodong Cun, Shaoli Huang, Yong Zhang, Hongwei Zhao, Hongtao Lu, and Xi Shen. T2M-GPT: Generating human motion from textual descriptions with discrete representations. CVPR, 2023.

[2] Guy Tevet, Sigal Raab, Brian Gordon, Yonatan Shafir, Daniel Cohen-Or, and Amit H Bermano. Human motion diffusion model. ICLR, 2023.

[3] Xin Chen, Biao Jiang, Wen Liu, Zilong Huang, Bin Fu, Tao Chen, and Gang Yu. Executing your commands via motion diffusion in latent space. CVPR, 2023.

BibTeX


@article{wang2024holistic,
  title={Holistic-Motion2D: Scalable Whole-body Human Motion Generation in 2D Space},
  author={Wang, Yuan and Wang, Zhao and Gong, Junhao and Huang, Di and He, Tong and Ouyang, Wanli and Jiao, Jile and Feng, Xuetao and Dou, Qi and Tang, Shixiang and Xu, Dan},
  journal={arXiv preprint arXiv:},
  year={2024}
}