HIERARCHICAL ACTIVITY RECOGNITION AND CAPTIONING FROM LONG-FORM AUDIO
Conference proceeding


Peng Zhang, Qingyu Luo, Philip J B Jackson and Wenwu Wang
2026 IEEE International Conference on Acoustics, Speech, and Signal Processing (Barcelona, Spain, 04/05/2026–08/05/2026)
17/01/2026

Abstract

Index Terms— Long-form audio, hierarchical activity recognition, audio captioning, sound events, MultiAct
Complex activities in real-world audio unfold over extended durations and exhibit hierarchical structure, yet most prior work focuses on short clips and isolated events. To bridge this gap, we introduce MultiAct, a new dataset and benchmark for multi-level structured understanding of human activities from long-form audio. MultiAct comprises long-duration kitchen recordings annotated at three semantic levels (activities, sub-activities and events) and paired with fine-grained captions and high-level summaries. We further propose a unified hierarchical model that jointly performs classification, detection, sequence prediction and multi-resolution captioning. Experiments on MultiAct establish strong baselines and reveal key challenges in modelling the hierarchical and compositional structure of long-form audio. A promising direction for future work is the exploration of methods better suited to capturing the complex, long-range relationships in long-form audio.
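To make the three-level annotation scheme concrete, the sketch below models an activity → sub-activity → event hierarchy with an attached caption, as described in the abstract. This is a minimal illustration only: the class names, fields, and example labels are assumptions for exposition, not the actual MultiAct schema.

```python
from dataclasses import dataclass, field

# Illustrative three-level annotation structure (hypothetical schema).
@dataclass
class Event:
    label: str    # a sound event, e.g. "water running"
    onset: float  # start time in seconds
    offset: float # end time in seconds

@dataclass
class SubActivity:
    label: str
    events: list = field(default_factory=list)

@dataclass
class Activity:
    label: str
    caption: str  # fine-grained caption for this activity
    sub_activities: list = field(default_factory=list)

# A toy kitchen-recording annotation (invented example values).
recording = Activity(
    label="making tea",
    caption="A person fills a kettle and boils water.",
    sub_activities=[
        SubActivity(
            label="boil water",
            events=[
                Event("water running", 0.0, 4.2),
                Event("kettle click", 5.0, 5.3),
            ],
        )
    ],
)

print(recording.sub_activities[0].events[0].label)  # → water running
```

A real loader would parse such annotations from the dataset's files; the nesting above simply mirrors the activity/sub-activity/event levels the abstract describes.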
PDF: Hierarchical Activity Recognition and Captioning from Long-form Audio (402.77 kB)
Author's Accepted Manuscript, CC BY 4.0. Embargoed access; embargo ends 04/05/2026.
