Abstract
The robustness of audio pattern recognition systems under varying acoustic conditions and
hardware remains a critical challenge for real-world applications. We examine how room acoustics,
microphone characteristics, and overlapping events affect classification performance for domestic
events. We conducted experiments in four rooms at the University of Surrey—with reverberation
times (RT60: 0.27–0.78 s, 50 Hz–10 kHz) and clarity indices (C50: 11.6–18.5 dB; C80: 13.1–25.9
dB, 500 Hz–1 kHz)—using four microphones: USB Condenser, ICS-43432 stereo, AudioMoth, and
Earthworks M23 reference. For two CNN-14 architectures, baseline performance on the
original audio served as the reference against which each microphone/room configuration was
compared. Results, expressed as the percentage of audio frames correctly detected against
ground truth, show three main effects. First,
high RT60 degraded detection of impulsive events (e.g., door knocks) by approximately 50%, while
sustained events (e.g., speech, music) remained above 90%. Second, overlapping events produced
masking effects that reduced performance by about 20%. Third, although microphone choice
affected accuracy, low-cost devices matched the reference microphone's performance for the speech and music classes.
Both CNN-14 architectures exhibited similar degradation patterns across conditions. These results
underscore the need for improved acoustic characterization and hardware-aware processing. We
suggest that future work should integrate adaptive feature extraction and training strategies to
mitigate reverberation and overlap in complex environments.
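The frame-level metric used throughout, the percentage of audio frames whose detection matches the ground truth, can be sketched as follows. This is a minimal illustration, not the paper's evaluation code; the 0.5 detection threshold and the example frame values are assumptions for demonstration only.

```python
import numpy as np

def frame_accuracy(pred_probs, ground_truth, threshold=0.5):
    """Percentage of frames whose binary detection matches ground truth.

    pred_probs   : per-frame class probabilities from the classifier
    ground_truth : per-frame binary labels (1 = event active)
    threshold    : detection threshold (0.5 is an illustrative assumption)
    """
    detections = (np.asarray(pred_probs) >= threshold).astype(int)
    truth = np.asarray(ground_truth).astype(int)
    # Fraction of frames where detection and label agree, as a percentage
    return 100.0 * np.mean(detections == truth)

# Hypothetical example: 8 frames of a sustained event, one frame missed
probs = [0.9, 0.8, 0.7, 0.4, 0.9, 0.95, 0.85, 0.8]
labels = [1, 1, 1, 1, 1, 1, 1, 1]
print(frame_accuracy(probs, labels))  # 87.5
```

Under this metric, a reverberant room that smears an impulsive event across frames lowers the score directly, which is consistent with the degradation pattern reported above.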