ws_moma_2026_title

ws_moma_2026_call

 

Call is OPEN

 

Deadline: 1st of May 2026 

Notification of acceptance: 12th of May 2026 

Camera-ready submissions: 22nd of May 2026 

 

For this upcoming workshop, we invite original submissions presenting innovative ideas, creative approaches, and rigorous methodologies that advance the state of the art in motion analysis, in the form of: 

  • Extended abstracts of 1–2 pages (references do not count toward the page limit) in IEEE conference style. Authors may optionally include a supplementary video to accompany their submission.
  • Standalone videos with a maximum length of 3 minutes.

Selected extended abstracts and videos may be archived on this website and will be promoted to a broader audience through various media channels. 

Authors of accepted workshop papers will be required to give a spotlight teaser presentation, accompanied by a poster presentation during the interactive session at the coffee break. 

Please submit your paper/video via the following link

ws_moma_2026_outline

Outline and Objectives

The perception, reconstruction, and synthesis of human motion have long been central topics in computer vision. Over the past decade, remarkable progress in vision-based human motion understanding has been enabled by the availability of large-scale datasets and the rise of powerful foundation models trained on them. These developments have substantially advanced our ability to model human pose, dynamics, and interaction from visual input alone.

Yet, human motion is inherently multimodal. It is not only seen but also felt and heard, and can be measured with a variety of devices. Recent research has increasingly explored the integration of diverse sensing modalities, from wearable devices such as IMUs and insoles to non-visual signals like WiFi and sound. This multimodal shift opens new possibilities for building richer, all-round, context-aware representations of human behavior, while also posing open challenges in cross-sensor alignment, temporal reasoning, and data-efficient learning. Moreover, each sensing modality comes with its own limitations. Thus, there is a growing need to connect multimodal sensing and motion understanding within a unified framework.

The Workshop on Multimodal Human Motion Analysis (MOMA) aims to catalyze this integration. Bringing together researchers from robotics, multimodal learning, and perception, it provides a forum to discuss new methodologies, benchmarks, and frameworks for robust, generalizable, and ethically aligned motion understanding. The workshop focuses on two complementary areas: multimodal perception, covering unified representations, temporal reasoning, and data-efficient learning for action analysis, and embodied, human-centered intelligence, addressing foundation models, edge-efficient deployment, and responsible evaluation. Through invited talks and panel discussions, MOMA highlights emerging directions and fosters interdisciplinary dialogue toward real-world, human-centered motion understanding.

Topics of interest

The topics covered in the workshop include, but are not limited to:

  • Multimodal human action and behavior analysis from visual, depth, inertial, and physiological data.
  • Cross-sensor fusion and alignment for motion understanding.
  • Multimodality and robustness.
  • Temporal reasoning and long-term modeling of human activities and interactions.
  • Advances in human motion representations for multimodal human motion understanding.
  • Human-centric foundation and generative models (e.g., diffusion models, transformers, LLMs).
  • Self-supervised, weakly supervised, and unsupervised learning methods for data-efficient and cross-domain generalization.
  • Advances in edge-deployable and energy-efficient AI models for real-time human sensing.
  • Introducing robustness to occlusions, crowded scenes, and domain or subject variability.
  • Responsible and human-centered evaluation: fairness, bias mitigation, privacy, and transparency.
  • Applications in healthcare, rehabilitation, sports performance, workplace safety, Extended Reality (XR), and robotics.

ws_moma_2026_program

Workshop Schedule

 

Time  Description
08.30 – 08.45  Opening remarks
08.45 – 10.15  Session 1: Multimodal Perception and Robust Human Motion Understanding
  • Thomas Ploetz - Sensor-Based Human Activity Recognition as the Basis for Effective Health and Wellbeing Assessments
  • Ronald Poppe - Temporal Coordination in Fine-Grained Analysis of Parent-Child Interactions
  • Suining Henry He - Human-Mobility Interaction: A Multimodal Tale of Micromobility
10.15 – 10.30  Coffee break
10.30 – 11.30  Session 2: From Perception to Embodied Intelligence
  • Kristen Grauman - From Novice to Expert: Analyzing Skilled Human Activity in Video
  • Jianfei Yang - Multimodal Foundation Model for Language-Grounded Human Sensing and Reasoning
11.30 – 12.30  Panel discussion: Towards Unified and Responsible Human Motion Understanding
12.30 – 13.00  Poster session & closing remarks

 

ws_moma_2026_invited_speakers

Invited Speakers

Jianfei Yang

Talk Title

Multimodal Foundation Model for Language-Grounded Human Sensing and Reasoning

Bio

Jianfei Yang is an Assistant Professor at Nanyang Technological University (NTU), where he leads the Multimodal AI and Robotic Systems (MARS) Lab. His research focuses on Human-Centric Physical AI and Embodied AI, integrating multimodal sensing, foundation models, and robotics for real-world applications such as human sensing, activity understanding, and intelligent interaction.

Ronald Poppe

Talk Title

Temporal Coordination in Fine-Grained Analysis of Parent-Child Interactions

Bio

Ronald Poppe is an associate professor in the Information and Computing Sciences Department of Utrecht University. His research interests center on the analysis of human (interactive) behavior from videos and other sensors, with applications in media analysis and generation, and in the clinical domain. He received a Ph.D. from the University of Twente, The Netherlands (2009) and was a visiting researcher at the Delft University of Technology, Stanford University, and Lancaster University. He is a senior member of the IEEE.

Thomas Ploetz

Talk Title

Sensor-Based Human Activity Recognition as the Basis for Effective Health and Wellbeing Assessments

Bio

Thomas Ploetz is a Computer Scientist with decades of experience in pattern recognition and machine learning research (PhD from Bielefeld University, Germany). His core research lies in wearable and ubiquitous computing, with a particular focus on computational behavior analysis: the automated analysis of what people are doing and how this changes over time, based on multimodal time-series data captured by sensors that are either body-worn or integrated into the built environment. He is a Professor of Computing in the School of Interactive Computing at the Georgia Institute of Technology in Atlanta, USA, where he leads the Computational Behavior Analysis research lab (cba.gatech.edu).

Suining Henry He

Talk Title

Human-Mobility Interaction: A Multimodal Tale of Micromobility

Bio

Suining Henry He is an Associate Professor (with tenure) in the School of Computing at the University of Connecticut (UConn), where he leads UConn's Ubiquitous and Urban Computing Lab. He joined UConn as a tenure-track Assistant Professor in September 2019. Before joining UConn, he was a postdoctoral research fellow at the Real-Time Computing Lab (RTCL), University of Michigan. His research interests include human-centered AI, GeoAI, and AI of Things.

Kristen Grauman

Talk Title

From Novice to Expert: Analyzing Skilled Human Activity in Video

Bio

Kristen Grauman is a Professor in the Department of Computer Science at the University of Texas at Austin. Her research in computer vision and machine learning focuses on video understanding and embodied perception. Before joining UT-Austin in 2007, she received her Ph.D. at MIT. She is an AAAS Fellow, IEEE Fellow, AAAI Fellow, Sloan Fellow, and recipient of the 2026 Hill Prize in AI, the 2025 Huang Prize, and the Computers and Thought Award. She and her collaborators have been recognized with several Best Paper awards in computer vision, including a 2011 Marr Prize and a 2017 Helmholtz Prize (test of time award). She has served as Associate Editor-in-Chief for PAMI and Program Chair of CVPR 2015, NeurIPS 2018, and ICCV 2023.

ws_moma_2026_organizers

Organizers

Olivia Nocentini


Postdoctoral Researcher at the Italian Institute of Technology

e-mail: olivia.nocentini@iit.it

Rishabh Dabral

Research Group Leader at the Max Planck Institute for Informatics

e-mail: rdabral@mpi-inf.mpg.de
Niaz Ahmad


Postdoctoral Research Fellow in the CVIS Lab at Toronto Metropolitan University

e-mail: niazahmad89@torontomu.ca

Marta Lorenzini


Senior Technician at the Italian Institute of Technology

e-mail: marta.lorenzini@iit.it
Arash Ajoudani


Director of the Human-Robot Interfaces and Interaction Laboratory at the Italian Institute of Technology

e-mail: arash.ajoudani@iit.it

ws_moma_2026_aknowledgment

Acknowledgement

 

This work was supported by the Italian Workers’ Compensation Authority INAIL within the VIVA project.