MUGL Main Image
Examples of single- and multi-person human pose action sequences generated by our model, MUGL. The user can select from a large number of action classes. In two-person actions, the changing length of the dotted line connecting the kinematic-tree root joints of the two individuals indicates a change in their relative global positions. Note the distinction from single-person actions, where individual frames correspond to timesteps, i.e. global displacements are absent. Also note the variable length of the generated sequences. Additional results appear below and in the supplementary material.


  • We introduce MUGL, a novel deep neural model for large-scale, diverse generation of single- and multi-person pose-based action sequences with locomotion.
  • Our controllable approach enables variable-length generations customizable by action category, across more than 100 categories.
  • To enable intra- and inter-category diversity, we model the latent generative space with a Conditional Gaussian Mixture Variational Autoencoder.
  • To enable realistic generation of actions involving locomotion, we decouple local pose and global trajectory components of the action sequence. We incorporate duration-aware feature representations to enable variable-length sequence generation.
  • We use a hybrid pose sequence representation with 3D pose sequences sourced from videos and 3D Kinect-based sequences of NTU-RGBD-120.
  • To enable principled comparison of generation quality, we employ suitably modified strong baselines during evaluation. Although smaller and simpler compared to baselines, MUGL outperforms the baselines across multiple generative model metrics.
  • Code will be made available.
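To make the Gaussian-mixture prior concrete, here is a minimal, hypothetical sketch of sampling a latent vector from a class-conditioned mixture component, as a Conditional Gaussian Mixture VAE would at generation time. The function name, shapes, and the one-component-per-class simplification are illustrative assumptions, not MUGL's actual implementation; in practice the component means and variances are learned.

```python
import numpy as np

def sample_prior(class_id, means, log_vars, rng):
    """Sample a latent vector from the mixture component selected by the
    action class (illustrative: one Gaussian component per class)."""
    mu = means[class_id]
    std = np.exp(0.5 * log_vars[class_id])
    # Reparameterized sample: z = mu + std * eps, eps ~ N(0, I)
    return mu + std * rng.standard_normal(mu.shape)

rng = np.random.default_rng(0)
num_classes, latent_dim = 120, 32          # e.g. NTU-RGBD-120 categories
means = rng.standard_normal((num_classes, latent_dim))   # learned in practice
log_vars = np.zeros((num_classes, latent_dim))           # learned in practice
z = sample_prior(7, means, log_vars, rng)  # latent for action class 7
print(z.shape)
```

The sampled `z` would then be decoded into a pose sequence; conditioning the mixture on the action category is what allows both controllability and per-category diversity.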


MUGL is a novel deep neural model for variable-length, pose-based action generation with locomotion. Instead of representing skeleton pose with 3D joint positions, we decouple local pose rotations from the global trajectory to better represent locomotion and multi-person interaction.
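The decoupling idea can be sketched as follows. For simplicity this toy example splits a joint-position sequence into a global root trajectory and root-relative local poses (MUGL itself operates on local pose rotations, which this position-based sketch only approximates); function names and shapes are hypothetical.

```python
import numpy as np

def decouple(sequence, root_joint=0):
    """Split a (T, J, 3) joint-position sequence into a global root
    trajectory (T, 3) and root-relative local poses (T, J, 3)."""
    trajectory = sequence[:, root_joint, :].copy()
    local = sequence - trajectory[:, None, :]  # express joints relative to root
    return local, trajectory

def recompose(local, trajectory):
    """Re-attach the global trajectory to the local poses."""
    return local + trajectory[:, None, :]

rng = np.random.default_rng(1)
seq = rng.standard_normal((16, 24, 3))     # 16 frames, 24 joints
local, traj = decouple(seq)
ok = np.allclose(recompose(local, traj), seq)
print(ok)
```

Modeling `local` and `traj` separately lets the generator learn body articulation and global displacement (locomotion, relative positioning in two-person actions) with distinct representations.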

Generated Samples


Quantitative Comparison

Existing approaches have not demonstrated results beyond a small number of categories and single-person actions. Our controllable method adopts an extension of the VAE known as a Gaussian Mixture VAE, which enables diverse generation of single- and multi-person actions at scale.

Qualitative Comparison

Comparison of conditionally generated single-person action sequences across models. Also note the variable sequence lengths of MUGL's examples.

Issues with feature-based generative quality measures

The figure above shows the typical pipeline for computing feature-representation-based generative quality scores. The sequence at the bottom right shows the effect of preprocessing: for example, even though rotation about the vertical axis might be a signature of the original action, preprocessing can distort or eliminate such signature components. Quality scores (e.g. FID) based on feature representations of such sequences fail to capture the key action dynamics. We empirically observed that these scores correlate poorly with the visual quality of category-conditioned action generations.
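For reference, the FID score mentioned above is the Fréchet distance between Gaussians fit to real and generated feature sets: FID = ||mu_r - mu_g||^2 + Tr(S_r + S_g - 2(S_r S_g)^{1/2}). A minimal sketch (feature dimensions and sample counts are arbitrary, not those used in the paper):

```python
import numpy as np
from scipy import linalg

def fid(feats_real, feats_gen):
    """Frechet Inception Distance between two feature sets of shape (N, D)."""
    mu1, mu2 = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    s1 = np.cov(feats_real, rowvar=False)
    s2 = np.cov(feats_gen, rowvar=False)
    covmean = linalg.sqrtm(s1 @ s2)        # matrix square root of S1 * S2
    if np.iscomplexobj(covmean):
        covmean = covmean.real             # discard tiny imaginary residue
    diff = mu1 - mu2
    return float(diff @ diff + np.trace(s1 + s2 - 2.0 * covmean))

rng = np.random.default_rng(0)
real = rng.standard_normal((500, 8))       # stand-in for extracted features
score_same = fid(real, real)               # near zero for identical sets
score_shift = fid(real, real + 2.0)        # large for shifted distributions
print(score_same < 1e-3, score_shift > 1.0)
```

The score is only as meaningful as the features it is computed on; if preprocessing strips action signatures such as vertical-axis rotation before feature extraction, even a low FID need not reflect faithful action dynamics.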


Please cite our paper if you use it in your own research.

                @InProceedings{Maheshwari_2022_WACV,
                author    = {Maheshwari, Shubh and Gupta, Debtanu and Sarvadevabhatla, Ravi Kiran},
                title     = {MUGL: Large Scale Multi Person Conditional Action Generation With Locomotion},
                booktitle = {Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)},
                month     = {January},
                year      = {2022},
                pages     = {257-265}
                }