DSAG: A Scalable Deep Framework for Action-Conditioned Multi-Actor Full Body Motion Synthesis



Actions shown: Walking towards, Thumb up, Drinking water, Jumping jack.

Renderings of action sequences generated by DSAG. Our model generates in-place and locomotory full-body sequences of variable duration for single or multiple actors. It also scales across datasets that differ widely in frame rate, joint count and number of actions. The dotted square shows a magnified view of the fingers. Additional results appear below and in the supplementary material.




Abstract

  • We introduce DSAG, a controllable deep neural framework for action-conditioned generation of full-body, multi-actor, variable-duration actions.
  • To compensate for the lack of detailed finger joints in existing large-scale datasets, we introduce full-body dataset variants with detailed finger joints.
  • To overcome shortcomings in existing generative approaches, we introduce dedicated representations for encoding finger joints.
  • We also introduce novel spatiotemporal transformation blocks with multi-head self-attention and specialized temporal processing.
  • These design choices enable generation across a large range of body joint counts (24 - 52), frame rates (13 - 50), global body movement (in-place, locomotion) and action categories (12 - 120), across multiple datasets (NTU-120, HumanAct12, UESTC, Human3.6M).
  • Our experimental results demonstrate DSAG’s significant improvements over the state of the art and its suitability for action-conditioned generation at scale.



Architecture

(a) Architecture of DSAG, showing dedicated action sequence components at the local level (hand: Xh and body: Xl), the global level (hand: Xg and body: Xw) and the action duration t[n]. The local components are encoded using a series of novel ST-blocks. A series of 1D convolutions with swish activation is used to encode the other components. The decoder components map the latent representation to the generated class-conditioned action sequence. X1 and X2 represent the actors. The blue and green dots at the torso of each actor indicate the shared origin of the action sequence’s local component. The red and purple squares represent the global 3D trajectories of the wrist joints.

(b) ST encoding block: residual convolution (top left, shaded green) processes the spatiotemporal information, and multi-head self-attention (bottom left, orange) incorporates the global temporal dependency.
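The two ingredients of the ST encoding block described in (b), a residual convolution followed by multi-head self-attention over time, can be illustrated with a small dependency-free sketch. This is not the authors' implementation: the function names, the single shared temporal kernel, the identity query/key/value projections and the residual connection around the attention step are all assumptions made to keep the example short. A real block would use learned weights and a tensor library.

```python
import math


def swish(v):
    """Swish activation: v * sigmoid(v)."""
    return v / (1.0 + math.exp(-v))


def residual_temporal_conv(x, kernel):
    """Residual 1D convolution along time.

    x is a list of T frames, each a list of C features. The same odd-length
    kernel is applied to every channel with zero padding (an assumption made
    for brevity), passed through swish, and added back to the input.
    """
    T, C = len(x), len(x[0])
    half = len(kernel) // 2
    out = []
    for t in range(T):
        frame = []
        for c in range(C):
            acc = 0.0
            for k, w in enumerate(kernel):
                s = t + k - half
                if 0 <= s < T:
                    acc += w * x[s][c]
            frame.append(x[t][c] + swish(acc))
        out.append(frame)
    return out


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def temporal_self_attention(x, num_heads):
    """Multi-head scaled dot-product self-attention over the time axis.

    Assumption: queries, keys and values are the input itself (identity
    projections); channels are split evenly across heads.
    """
    T, C = len(x), len(x[0])
    assert C % num_heads == 0, "channels must divide evenly across heads"
    d = C // num_heads
    out = [[0.0] * C for _ in range(T)]
    for h in range(num_heads):
        lo, hi = h * d, (h + 1) * d
        for t in range(T):
            q = x[t][lo:hi]
            scores = [
                sum(a * b for a, b in zip(q, x[s][lo:hi])) / math.sqrt(d)
                for s in range(T)
            ]
            weights = softmax(scores)
            # Every frame attends to every other frame: global temporal context.
            for s in range(T):
                for j in range(d):
                    out[t][lo + j] += weights[s] * x[s][lo + j]
    return out


def st_block(x, kernel, num_heads):
    """Hypothetical ST-style block: residual conv, then attention with a skip."""
    y = residual_temporal_conv(x, kernel)
    attended = temporal_self_attention(y, num_heads)
    return [[a + b for a, b in zip(fy, fa)] for fy, fa in zip(y, attended)]
```

In this sketch the convolution only sees a short temporal window, while the attention step lets each frame draw on the whole sequence, which mirrors the division of labour the caption describes. The output keeps the input shape (T frames by C features), so such blocks can be stacked in series.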


Quantitative Comparison

Comparison of generative quality scores for DSAG against five representative baselines: MUGL, ACTOR, SA-GCN, action2motion and VAE-LSTM.


Qualitative Comparison

Visual comparison of snapshot renderings of generated single-person action sequences, across models trained on the NTU-Xpose-Single-Person dataset. Note the varying duration of the DSAG sequences. Also note that the examples for ACTOR and MUGL exhibit less finger and body movement than the sequences from DSAG.


Citation

Please cite our paper if you end up using it for your own research.

            @InProceedings{DSAG,
                author    = {Gupta, Debtanu and Maheshwari, Shubh and Kalakonda, Sai Shashank and Manasvi and Sarvadevabhatla, Ravi Kiran},
                title     = {DSAG: A Scalable Deep Framework for Action-Conditioned Multi-Actor Full Body Motion Synthesis},
                booktitle = {Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)},
                month     = {January},
                year      = {2023}
            }