DSAG: A Scalable Deep Framework for Action-Conditioned Multi-Actor Full Body Motion Synthesis, Winter Conference on Applications of Computer Vision (WACV), 2023
Authors: Debtanu Gupta, Shubh Maheshwari, Sai Shashank Kalakonda, Manasvi, Ravi Kiran Sarvadevabhatla

Abstract: We introduce DSAG, a controllable deep neural framework for action-conditioned generation of full body multi-actor variable duration actions. To compensate for incompletely detailed finger joints in existing large-scale datasets, we introduce full body dataset variants with detailed finger joints. To overcome shortcomings in existing generative approaches, we introduce dedicated representations for encoding finger joints. We also introduce novel spatiotemporal transformation blocks with multi-head self attention and specialized temporal processing. The design choices enable generations for a large range in body joint counts (24 - 52), frame rates (13 - 50), global body movement (inplace, locomotion) and action categories (12 - 120), across multiple datasets (NTU-120, HumanAct12, UESTC, Human3.6M). Our experimental results demonstrate DSAG’s significant improvements over the state of the art and its suitability for action-conditioned generation at scale.
 author = {Gupta, Debtanu and Maheshwari, Shubh and Kalakonda, Sai Shashank and Manasvi and Sarvadevabhatla, Ravi Kiran}, 
 title = {DSAG: A Scalable Deep Framework for Action-Conditioned Multi-Actor Full Body Motion Synthesis}, 
 booktitle = {Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)},
 month = {January}, 
 year = {2023}}

PSUMNet: Unified Modality Part Streams are All You Need for Efficient Pose-based Action Recognition, 2022
Authors: Neel Trivedi, Ravi Kiran Sarvadevabhatla

Abstract: Pose-based action recognition is predominantly tackled by approaches which treat the input skeleton in a monolithic fashion, i.e. joints in the pose tree are processed as a whole. However, such approaches ignore the fact that action categories are often characterized by localized action dynamics involving only small subsets of part joint groups involving hands (e.g. Thumbs up) or legs (e.g. Kicking). Although part-grouping based approaches exist, each part group is not considered within the global pose frame, causing such methods to fall short. Further, conventional approaches employ independent modality streams (e.g. joint, bone, joint velocity, bone velocity) and train their network multiple times on these streams, which massively increases the number of training parameters. To address these issues, we introduce PSUMNet, a novel approach for scalable and efficient pose-based action recognition. At the representation level, we propose a global frame based part stream approach as opposed to conventional modality based streams. Within each part stream, the associated data from multiple modalities is unified and consumed by the processing pipeline. Experimentally, PSUMNet achieves state-of-the-art performance on the widely used NTURGB+D 60/120 dataset and dense joint skeleton dataset NTU 60-X/120-X. PSUMNet is highly efficient and outperforms competing methods which use 100%-400% more parameters. PSUMNet also generalizes to the SHREC hand gesture dataset with competitive performance. Overall, PSUMNet’s scalability, performance and efficiency make it an attractive choice for action recognition and for deployment on compute-restricted embedded and edge devices.
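As a rough illustration of the four modalities named in the abstract (joint, bone, joint velocity, bone velocity), the sketch below derives them from raw joint coordinates and concatenates them for one part group, leaving coordinates in the global frame. The toy skeleton, the `PARENTS` table, the part grouping and all function names are hypothetical and not taken from PSUMNet; this is a minimal sketch of the general idea, not the authors' implementation.

```python
# Hypothetical toy skeleton: 4 joints, PARENTS[j] is the parent of joint j
# (the root is its own parent). This is NOT the NTU skeleton layout.
PARENTS = [0, 0, 1, 2]


def sub(a, b):
    """Component-wise difference of two 3D points."""
    return tuple(x - y for x, y in zip(a, b))


def bones(frame):
    """Bone vector of each joint: joint position minus parent position."""
    return [sub(frame[j], frame[PARENTS[j]]) for j in range(len(frame))]


def velocity(seq):
    """Frame-to-frame difference; the last frame gets zero velocity."""
    out = []
    for t in range(len(seq)):
        nxt = seq[t + 1] if t + 1 < len(seq) else seq[t]
        out.append([sub(n, c) for n, c in zip(nxt, seq[t])])
    return out


def unified_part_stream(joint_seq, part_joints):
    """Concatenate all four modalities per joint for one part group."""
    bone_seq = [bones(f) for f in joint_seq]
    jvel, bvel = velocity(joint_seq), velocity(bone_seq)
    return [
        [joint_seq[t][j] + bone_seq[t][j] + jvel[t][j] + bvel[t][j]
         for j in part_joints]
        for t in range(len(joint_seq))
    ]


# Two frames of the toy skeleton; joint 3 moves by +1 along x.
seq = [
    [(0, 0, 0), (0, 1, 0), (0, 2, 0), (1, 2, 0)],
    [(0, 0, 0), (0, 1, 0), (0, 2, 0), (2, 2, 0)],
]
features = unified_part_stream(seq, part_joints=[2, 3])
```

Each joint in the selected part ends up with a single 12-dimensional feature (3 coordinates for each of the 4 modalities), so one network can consume all modalities at once instead of training one network per modality.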

MUGL: Large Scale Multi Person Conditional Action Generation with Locomotion, Winter Conference on Applications of Computer Vision (WACV), 2022
Authors: Debtanu Gupta*, Shubh Maheshwari*, Ravi Kiran Sarvadevabhatla

Abstract: We introduce MUGL, a novel deep neural model for large-scale, diverse generation of single and multi-person pose-based action sequences with locomotion. Our controllable approach enables variable-length generations customizable by action category, across more than 100 categories. To enable intra/inter-category diversity, we model the latent generative space as a Conditional Gaussian Mixture Variational Autoencoder. To enable realistic generation of actions involving locomotion, we decouple local pose and global trajectory components of the action sequence. We incorporate duration-aware feature representations to enable variable-length sequence generation. We use a hybrid pose sequence representation with 3D pose sequences sourced from videos and 3D Kinect-based sequences of NTU-RGBD-120. To enable principled comparison of generation quality, we employ suitably modified strong baselines during evaluation. Although smaller and simpler compared to baselines, MUGL outperforms the baselines across multiple generative model metrics.
 author    = {Maheshwari, Shubh and Gupta, Debtanu and Sarvadevabhatla, Ravi Kiran}, 
 title     = {MUGL: Large Scale Multi Person Conditional Action Generation With Locomotion}, 
 booktitle = {Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)}, 
 month     = {January}, 
 year      = {2022}, 
 pages     = {257-265} 

Syntactically Guided Generative Embeddings for Zero-Shot Skeleton Action Recognition, International Conference on Image Processing (ICIP), 2021
Authors: Pranay Gupta, Divyanshu Sharma, Ravi Kiran Sarvadevabhatla

Abstract: We introduce SynSE, a novel syntactically guided generative approach for Zero-Shot Learning (ZSL). Our end-to-end approach learns progressively refined generative embedding spaces constrained within and across the involved modalities (visual, language). The inter-modal constraints are defined between action sequence embedding and embeddings of Parts of Speech (PoS) tagged words in the corresponding action description. We deploy SynSE for the task of skeleton-based action sequence recognition. Our design choices enable SynSE to generalize compositionally, i.e., recognize sequences whose action descriptions contain words not encountered during training. We also extend our approach to the more challenging Generalized Zero-Shot Learning (GZSL) problem via a confidence-based gating mechanism. We are the first to present zero-shot skeleton action recognition results on the large-scale NTU-60 and NTU-120 skeleton action datasets with multiple splits. Our results demonstrate SynSE's state-of-the-art performance in both ZSL and GZSL settings compared to strong baselines on the NTU-60 and NTU-120 datasets.
 title={Syntactically Guided Generative Embeddings for Zero-Shot Skeleton Action Recognition}, 
 author={Pranay Gupta and Divyanshu Sharma and Ravi Kiran Sarvadevabhatla}, 

NTU-X: An Enhanced Large-scale Dataset for Improving Pose-based Recognition of Subtle Human Actions, ICVGIP, 2021
Authors: Anirudh Thatipelli, Neel Trivedi, Ravi Kiran Sarvadevabhatla

Abstract: The lack of fine-grained joints (facial joints, hand fingers) is a fundamental performance bottleneck for state-of-the-art skeleton action recognition models. Despite this bottleneck, the community's efforts seem to be invested only in coming up with novel architectures. To specifically address this bottleneck, we introduce two new pose-based human action datasets - NTU60-X and NTU120-X. Our datasets extend the largest existing action recognition dataset, NTU-RGBD. In addition to the 25 body joints for each skeleton as in NTU-RGBD, the NTU60-X and NTU120-X datasets include finger and facial joints, enabling a richer skeleton representation. We appropriately modify the state-of-the-art approaches to enable training using the introduced datasets. Our results demonstrate the effectiveness of these NTU-X datasets in overcoming the aforementioned bottleneck and improving state-of-the-art performance, overall and on previously worst performing action categories.
 title={NTU-X: An enhanced large-scale dataset for improving pose-based recognition of subtle human actions}, 
 author={Trivedi, Neel and Thatipelli, Anirudh and Sarvadevabhatla, Ravi Kiran}, 
 booktitle={Proceedings of the Twelfth Indian Conference on Computer Vision, Graphics and Image Processing}, 

Quo Vadis, Skeleton Action Recognition?, International Journal of Computer Vision (IJCV), 2021
Authors: Pranay Gupta, Anirudh Thatipelli, Aditya Aggarwal, Shubh Maheshwari, Neel Trivedi, Sourav Das, Ravi Kiran Sarvadevabhatla

Abstract: In this paper, we study current and upcoming frontiers across the landscape of skeleton-based human action recognition. To begin with, we benchmark state-of-the-art models on the NTU-120 dataset and provide a multi-layered assessment of the results. To examine skeleton action recognition 'in the wild', we introduce Skeletics-152, a curated and 3-D pose-annotated subset of RGB videos sourced from Kinetics-700, a large-scale action dataset. The results from benchmarking the top performers of NTU-120 on Skeletics-152 reveal the challenges and domain gap induced by actions 'in the wild'. We extend our study to include out-of-context actions by introducing Skeleton-Mimetics, a dataset derived from the recently introduced Mimetics dataset. Finally, as a new frontier for action recognition, we introduce Metaphorics, a dataset with caption-style annotated YouTube videos of the popular social game Dumb Charades and interpretative dance performances. Overall, our work characterizes the strengths and limitations of existing approaches and datasets. It also provides an assessment of top-performing approaches across a spectrum of activity settings and, via the introduced datasets, proposes new frontiers for human action recognition.
 author={Gupta, Pranay and Thatipelli, Anirudh and Aggarwal, Aditya and Maheshwari, Shubh and Trivedi, Neel and Das, Sourav and Sarvadevabhatla, Ravi Kiran}, 
 title={Quo Vadis, Skeleton Action Recognition?}, 
 journal={International Journal of Computer Vision},