Comparing Visual Features for Lipreading

For automatic lipreading, there are many competing methods for feature extraction. Often, because of the complexity of the task, these methods are tested only on quite restricted datasets, such as the letters of the alphabet or digits, and on only a few speakers. Here, we compare some of the leading methods for lip feature extraction on the GRID dataset, which uses a constrained vocabulary over, in this case, 15 speakers. Previously the GRID dataset has received limited attention because of the requirement to track the face and lips accurately. We overcome this by using a novel linear predictor (LP) tracker to control an Active Appearance Model (AAM).
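To give a flavour of how an LP works, the sketch below shows the usual idea for a single landmark: a learned matrix maps local intensity differences to a corrective displacement. The sampling scheme, the least-squares training and all names are illustrative assumptions rather than the exact formulation of the tracker used here.

    import numpy as np

    def sample(image, centre, offsets):
        # Grey-level intensities at fixed integer offsets around a point (x, y)
        pts = np.round(centre + offsets).astype(int)
        return image[pts[:, 1], pts[:, 0]].astype(float)

    def train_lp(image, true_pos, offsets, n_perturb=200, max_disp=5.0, seed=0):
        # Fit the predictor matrix P by least squares from synthetic
        # displacements of the landmark away from its true position
        rng = np.random.default_rng(seed)
        ref = sample(image, true_pos, offsets)
        disps = rng.uniform(-max_disp, max_disp, size=(n_perturb, 2))
        diffs = np.stack([sample(image, true_pos + d, offsets) - ref for d in disps])
        W, *_ = np.linalg.lstsq(diffs, -disps, rcond=None)  # predict the move back
        return W.T, ref

    def lp_update(image, current_pos, offsets, P, ref):
        # One tracking step: move the landmark by the predicted displacement
        delta = P @ (sample(image, current_pos, offsets) - ref)
        return current_pos + delta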

Below are two examples of LP tracked landmarks on the GRID dataset.

LP tracked landmarks on the GRID dataset

Visual features, such as sieve1d, eigen_lip and 2DDCT, are extracted from lip sub-images. Moreover, using the landmarks on the lips, we can remove affine variation by warping the lip region to a reference shape, as in the sketch below the figure.

Left: lip sub-image with superimposed landmarks. Right: shape normalised image after warping the lip sub-image to a reference shape.
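A minimal sketch of this shape normalisation step, assuming an affine transform estimated from the tracked landmarks with scikit-image; the reference shape, output size and choice of transform are illustrative assumptions, not the exact settings used here.

    from skimage import transform as tf

    def shape_normalise(image, landmarks, ref_shape, out_shape=(32, 64)):
        # landmarks and ref_shape: (k, 2) arrays of (x, y) lip landmark coordinates
        # Estimate the affine map taking reference coordinates to image coordinates
        warp = tf.AffineTransform()
        warp.estimate(ref_shape, landmarks)
        # tf.warp expects the map from output (reference) coords to input coords
        return tf.warp(image, warp, output_shape=out_shape)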

A total of eight different visual features are tested. Four are AAM-derived: app and shape, the appearance and shape parameters of an AAM, and aam_pca and aam_cat, which combine the shape and appearance parameters by PCA and by concatenation respectively. We refer to sieve features computed on lip sub-images as sieve1d, and those computed on shape normalised appearance images as app_sieve. Lip sub-images are also used to compute eigen_lip features and 2DDCT features, the latter of which is supplied with the GRID dataset. Eigenlips are computed via a PCA of the intensities in the lip sub-image, retaining the components whose eigenvalues account for 95% of the variation. In all cases the features are augmented with delta and delta-delta coefficients (velocity and acceleration).
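As a rough illustration of the eigen_lip pipeline, the sketch below fits a PCA that keeps 95% of the variance to vectorised lip sub-images and appends simple first- and second-order temporal differences; the frame layout and the use of np.gradient for the deltas (rather than, say, the HTK regression formula) are assumptions made for brevity.

    import numpy as np
    from sklearn.decomposition import PCA

    def fit_eigenlips(train_frames, variance=0.95):
        # train_frames: (n_frames, height, width) array of lip sub-images;
        # a float n_components keeps components covering that fraction of variance
        return PCA(n_components=variance).fit(train_frames.reshape(len(train_frames), -1))

    def eigenlip_features(pca, frames):
        return pca.transform(frames.reshape(len(frames), -1))

    def add_deltas(feats):
        # Append velocity and acceleration coefficients, per utterance
        delta = np.gradient(feats, axis=0)
        delta2 = np.gradient(delta, axis=0)
        return np.hstack([feats, delta, delta2])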

To test the robustness of the features across speakers, we designed a set of speaker independent experiments using 15-fold cross-validation: in each fold, a different speaker is held out for testing and the classifier is trained on the data of the remaining speakers. Consequently, any features that involve a PCA must be recomputed in each fold; for example, a new AAM is trained using only the training speakers of that fold.
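The protocol can be summarised by the hypothetical sketch below, in which any PCA-based feature (eigenlips, AAM parameters, and so on) is re-estimated from the training speakers only in every fold. The data layout is an assumption and HMM training and scoring are omitted.

    import numpy as np
    from sklearn.decomposition import PCA

    def speaker_independent_folds(frames_by_speaker, variance=0.95):
        # frames_by_speaker: dict mapping speaker id -> (n_frames, n_dims) array
        speakers = sorted(frames_by_speaker)
        for held_out in speakers:                        # 15 folds, one speaker held out
            train = np.vstack([frames_by_speaker[s] for s in speakers if s != held_out])
            pca = PCA(n_components=variance).fit(train)  # refit on training speakers only
            yield held_out, pca.transform(train), pca.transform(frames_by_speaker[held_out])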

Results from the HMM classifier using the eight different visual features are plotted in the following figure.

Word accuracy rate, excluding silence
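For reference, word accuracy is conventionally computed from a Levenshtein alignment between the reference and recognised word strings as (N - D - S - I) / N, where N is the number of reference words and D, S and I are the deletion, substitution and insertion counts. The sketch below assumes this standard definition applies here.

    def word_accuracy(ref, hyp):
        # Word accuracy (N - D - S - I) / N via a Levenshtein alignment
        n, m = len(ref), len(hyp)
        cost = [[0] * (m + 1) for _ in range(n + 1)]
        for i in range(1, n + 1):
            cost[i][0] = i                               # deletions only
        for j in range(1, m + 1):
            cost[0][j] = j                               # insertions only
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                sub = cost[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
                cost[i][j] = min(sub, cost[i - 1][j] + 1, cost[i][j - 1] + 1)
        # Back-track to count deletions (D), substitutions (S) and insertions (I)
        i, j, D, S, I = n, m, 0, 0, 0
        while i > 0 or j > 0:
            if i and j and cost[i][j] == cost[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]):
                S += ref[i - 1] != hyp[j - 1]
                i, j = i - 1, j - 1
            elif i and cost[i][j] == cost[i - 1][j] + 1:
                D, i = D + 1, i - 1
            else:
                I, j = I + 1, j - 1
        return (n - D - S - I) / n

For example, word_accuracy("place blue at a one now".split(), "place blue by a one now".split()) gives 5/6.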

References

Yuxuan Lan, Richard Harvey, Barry-John Theobald, Eng-Jon Ong and Richard Bowden. Comparing Visual Features for Lipreading. In Proceedings of the International Conference on Auditory-Visual Speech Processing, Norwich, UK, September 2009.

See also: Previous work on speaker independence