Abstract
This paper explores the recognition of expressed emotion from speech and facial gestures for the speaker-dependent case. Experiments were performed on an English audio-visual emotional database consisting of 480 utterances from 4 English male actors in 7 emotions. A total of 106 audio and 240 visual features were extracted, and features were selected with the Plus l-Take Away r algorithm based on the Bhattacharyya distance criterion. Two linear transformation methods, principal component analysis (PCA) and linear discriminant analysis (LDA), were applied to the selected features, and Gaussian classifiers were used for classification. Performance was higher with LDA features than with PCA features. The visual features outperformed the audio features, and the combined audio-visual features improved overall performance further. For the 7 emotion classes, an average recognition rate of 56% was achieved with the audio features, 95% with the visual features and 98% with the audio-visual features selected by Bhattacharyya distance and transformed by LDA. When the emotions were grouped into 4 classes, an average recognition rate of 69% was achieved with the audio features, 98% with the visual features and 98% with the audio-visual features fused at the decision level. These results are comparable to the human recognition rate measured on the same multimodal data set.
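
As a brief illustration (not part of the paper itself), the following Python sketch shows the kind of pipeline the abstract outlines: a Bhattacharyya distance score for class separability under a Gaussian assumption, an LDA projection of the selected features, and a Gaussian classifier with one full-covariance Gaussian per emotion class. The scikit-learn usage, the array shapes, the synthetic data, and the helper name bhattacharyya_distance are assumptions for illustration only, not the authors' implementation.

    import numpy as np
    from sklearn.discriminant_analysis import (
        LinearDiscriminantAnalysis,
        QuadraticDiscriminantAnalysis,
    )
    from sklearn.pipeline import make_pipeline


    def bhattacharyya_distance(x1: np.ndarray, x2: np.ndarray) -> float:
        """Bhattacharyya distance between two classes of feature vectors,
        assuming each class is Gaussian (used as a separability criterion)."""
        m1, m2 = x1.mean(axis=0), x2.mean(axis=0)
        c1 = np.cov(x1, rowvar=False)
        c2 = np.cov(x2, rowvar=False)
        c = (c1 + c2) / 2.0
        diff = m1 - m2
        term1 = 0.125 * diff @ np.linalg.solve(c, diff)
        term2 = 0.5 * np.log(
            np.linalg.det(c) / np.sqrt(np.linalg.det(c1) * np.linalg.det(c2))
        )
        return float(term1 + term2)


    # LDA projects the selected features onto at most (n_classes - 1)
    # discriminant axes; a Gaussian classifier then labels each utterance.
    model = make_pipeline(
        LinearDiscriminantAnalysis(n_components=6),  # 7 emotions -> 6 axes
        QuadraticDiscriminantAnalysis(),
    )

    if __name__ == "__main__":
        rng = np.random.default_rng(0)
        # Synthetic stand-in for selected features: 480 utterances, 30 features.
        X = rng.normal(size=(480, 30))
        y = rng.integers(0, 7, size=480)  # 7 emotion labels
        model.fit(X, y)
        print("training accuracy:", model.score(X, y))
        print("Bhattacharyya distance (class 0 vs 1):",
              bhattacharyya_distance(X[y == 0], X[y == 1]))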