HIERARCHICAL ACTIVITY RECOGNITION AND CAPTIONING FROM LONG-FORM AUDIO

Peng Zhang; Qingyu Luo; Philip J B Jackson; Wenwu Wang

Conference proceeding

2026 IEEE International Conference on Acoustics, Speech, and Signal Processing (Barcelona, Spain, 04/05/2026–08/05/2026)

17/01/2026

Abstract

Index Terms: Long-form audio, hierarchical activity recognition, audio captioning, sound events, MultiAct

Complex activities in real-world audio unfold over extended durations and exhibit hierarchical structure, yet most prior work focuses on short clips and isolated events. To bridge this gap, we introduce MultiAct, a new dataset and benchmark for multi-level structured understanding of human activities from long-form audio. MultiAct comprises long-duration kitchen recordings annotated at three semantic levels (activities, sub-activities and events) and paired with fine-grained captions and high-level summaries. We further propose a unified hierarchical model that jointly performs classification, detection, sequence prediction and multi-resolution captioning. Experiments on MultiAct establish strong baselines and reveal key challenges in modelling the hierarchical and compositional structure of long-form audio. A promising direction for future work is to explore methods better suited to capturing the complex, long-range relationships in long-form audio.

Files and links (1)

pdf

Hierarchical Activity Recognition and Captioning from Long-form Audio (402.77 kB)

Author's Accepted Manuscript CC BY V4.0, Embargoed Access, Embargo ends: 04/05/2026

Details

Title: HIERARCHICAL ACTIVITY RECOGNITION AND CAPTIONING FROM LONG-FORM AUDIO
Creators: Peng Zhang - University of Surrey, School of Computer Science & Electronic Engineering
Qingyu Luo - University of Surrey, School of Computer Science & Electronic Engineering
Philip J B Jackson - University of Surrey, School of Computer Science & Electronic Engineering
Wenwu Wang - University of Surrey, School of Computer Science & Electronic Engineering
Publication Details: 2026 IEEE International Conference on Acoustics, Speech, and Signal Processing
Conference: 2026 IEEE International Conference on Acoustics, Speech, and Signal Processing (Barcelona, Spain, 04/05/2026–08/05/2026)
Publisher: IEEE
Date accepted for publication: 17/01/2026
Grant note: This research was supported by Bang & Olufsen A/S.
Identifiers: 991104995102346
Academic Unit: School of Computer Science & Electronic Engineering
Language: English
Resource Type: Conference proceeding
