Abstract
This chapter gives an overview of multimodal information fusion from a machine-learning perspective. Humans communicate with each other through several modalities, including speech, gestures, and documents. It is therefore natural for human-computer interaction (HCI) to support the same multimodal form of communication. Capturing this information requires different types of sensors, e.g., microphones to capture the audio signal, cameras to capture live video images, and 3D sensors to capture surface information directly in real time. In each of these cases, commercial off-the-shelf (COTS) devices are already available and can be readily deployed for HCI applications. Examples of HCI applications include audio-visual speech recognition, gesture recognition, emotion recognition, and person recognition using biometrics. © 2010 Elsevier Ltd. All rights reserved.