Abstract
We present an analysis of linear feature extraction techniques for deriving a compact and meaningful representation of articulatory data. We used 14-channel EMA (ElectroMagnetic Articulograph) data from two speakers in the MOCHA database [A.A. Wrench. A new resource for production modelling in speech technology. In Proc. Inst. of Acoust., Stratford-upon-Avon, UK, 2001.]. As representations, we considered the registered articulator fleshpoint coordinates and features obtained from them by PCA (Principal Component Analysis) and LDA (Linear Discriminant Analysis). Various PCA schemes were considered, grouping coordinates according to correlations amongst the articulators. For each phone, critical dimensions were identified using the algorithm in [Veena D Singampalli and Philip JB Jackson. Statistical identification of critical, dependent and redundant articulators. In Proc. Interspeech, Antwerp, Belgium, pages 70-73, 2007.]: critical articulators with registered coordinates, and critical modes with PCA and LDA. The phone distributions in each representation were modelled as univariate Gaussians, and the average number of critical dimensions was controlled using a threshold on the 1-D Kullback-Leibler (KL) divergence (identification divergence). The 14-D KL divergence (evaluation divergence) was employed to measure the goodness of fit of the models to the estimated phone distributions. Phone recognition experiments were performed using coordinate, PCA and LDA features, for comparison. We found that, of all representations, the LDA space yielded the best fit between the model and phone pdfs. The full PCA representation (including all articulatory coordinates) gave the next best fit, closely followed by two other PCA representations that allowed for correlations across the tongue.
At the threshold where the average number of critical dimensions matched that obtained from IPA, the goodness of fit improved by 34% (22%/46% for male/female data) when LDA was used instead of the best PCA representation, and by 72% (77%/66%) relative to articulatory coordinates. For PCA and LDA, the compactness of the representation was investigated by discarding the least significant modes. No significant change in recognition performance was found as the dimensionality was reduced from 14 to 8 (t-test at the 95% confidence level), although accuracy deteriorated as further modes were discarded. Evaluation divergence also reflected this pattern. LDA features increased recognition accuracy by 2% on average over the best PCA representation. An articulatory interpretation of the PCA and LDA modes is discussed. Future work will focus on articulatory trajectory generation in feature spaces, guided by the findings of this study.
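As an illustration of the identification divergence, the following minimal sketch computes the closed-form 1-D KL divergence between two univariate Gaussians and flags a dimension as critical when the divergence exceeds a threshold. It is a simplification, not the algorithm of the cited Interspeech paper: it assumes each phone's distribution is compared, one dimension at a time, against a phone-independent (grand) distribution. The function names, the example statistics and the threshold value are all hypothetical.

```python
import math


def kl_gauss_1d(mu_p, var_p, mu_q, var_q):
    """KL(p || q) for univariate Gaussians p = N(mu_p, var_p), q = N(mu_q, var_q)."""
    if var_p <= 0 or var_q <= 0:
        raise ValueError("variances must be positive")
    return 0.5 * (math.log(var_q / var_p)
                  + (var_p + (mu_p - mu_q) ** 2) / var_q
                  - 1.0)


def critical_dimensions(phone_stats, grand_stats, threshold):
    """Indices of dimensions whose 1-D divergence exceeds the threshold.

    phone_stats and grand_stats are sequences of (mean, variance) pairs,
    one pair per dimension. Assumption: the reference is a grand
    (phone-independent) distribution; the cited algorithm may differ.
    """
    critical = []
    for i, ((mu_p, var_p), (mu_g, var_g)) in enumerate(zip(phone_stats, grand_stats)):
        if kl_gauss_1d(mu_p, var_p, mu_g, var_g) > threshold:
            critical.append(i)
    return critical


# Hypothetical two-dimensional example (values are made up, not MOCHA data).
phone = [(0.0, 1.0), (2.0, 0.5)]   # phone-specific (mean, variance) per dimension
grand = [(0.0, 1.0), (0.0, 1.0)]   # grand (mean, variance) per dimension
selected = critical_dimensions(phone, grand, threshold=0.3)
```

Raising the threshold in such a scheme reduces the number of dimensions flagged as critical, which is how a single threshold can control the average number of critical dimensions per phone.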