Speaker Independent Lip Reading

The goal of this research is to design and develop a speaker independent automated visual speech recognition system. An important part of designing such a system is deriving a set of visual features that are independent of the speaker's identity.

Speaker variability

In audio speech recognition, the problem of speaker variability has been well studied. Common approaches to dealing with it include normalising for a speaker's vocal tract length and learning a linear transform that moves the speaker-independent models closer to a new speaker. In pure lip-reading (no audio) the problem has been less well studied. Results are often presented that are based on speaker-dependent (single speaker) or multi-speaker (speakers in the test-set are also in the training-set) data, situations that are of limited use in real applications.
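As an illustration of the second approach, an adaptation transform can be as simple as an affine map mu_hat = A @ mu + b applied to the Gaussian means of the speaker-independent acoustic models. The sketch below is a hypothetical, simplified numpy version (full MLLR also weights the estimate by state occupancy and covariances; the function names are illustrative, not from this work):

    import numpy as np

    def estimate_mllr_transform(mu_si, mu_target):
        """Least-squares estimate of an MLLR-style affine transform.

        mu_si:     (n_gaussians, d) speaker-independent Gaussian means
        mu_target: (n_gaussians, d) means re-estimated on the new speaker

        Returns W of shape (d, d+1), i.e. [A | b], such that
        A @ mu + b approximates the new speaker's means.
        """
        n, d = mu_si.shape
        xi = np.hstack([mu_si, np.ones((n, 1))])            # append bias term
        W, *_ = np.linalg.lstsq(xi, mu_target, rcond=None)  # xi @ W ~= mu_target
        return W.T

    def adapt_means(mu_si, W):
        """Move the speaker-independent means towards the new speaker."""
        n, _ = mu_si.shape
        xi = np.hstack([mu_si, np.ones((n, 1))])
        return xi @ W.T

In practice the transform would be estimated per regression class from a small amount of adaptation data; a single global transform is the simplest case.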

The most commonly applied visual features, such as active appearance models (AAMs), the 1D sieve, and the 2D DCT, are highly correlated with speaker identity. To demonstrate this problem, we design a set of experiments with three different configurations: single speaker, multi-speaker, and speaker independent (speakers in the test-set are not in the training-set). By careful choice of features, we show that it is possible for the performance of visual-only lip-reading to come very close to that of audio-only recognition in the single speaker and multi-speaker configurations. However, in the speaker independent configuration, the performance of the visual-only channel degrades dramatically.
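As a concrete example of one such feature, the sketch below extracts 2D DCT features from a grey-scale mouth region of interest. It is hypothetical: the ROI size and the number of retained coefficients are illustrative choices, not the settings used in these experiments.

    import numpy as np
    from scipy.fftpack import dct

    def dct2_features(mouth_roi, n_coeffs=36):
        """2D DCT visual features from a grey-scale mouth ROI.

        mouth_roi: 2D array, e.g. a 32x32 crop centred on the lips.
        n_coeffs:  number of low-frequency coefficients to keep
                   (here the top-left 6x6 block).
        """
        # Separable 2D DCT: transform the rows, then the columns
        coeffs = dct(dct(mouth_roi, axis=0, norm='ortho'), axis=1, norm='ortho')
        k = int(np.sqrt(n_coeffs))
        return coeffs[:k, :k].ravel()

Because such features encode raw appearance, they inevitably capture the shape and texture of a particular speaker's mouth as well as the speech itself.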

Top graph: HMM classification accuracy for the multi-speaker configuration. Bottom graph: HMM classification accuracy for the speaker independent configuration.

By applying multidimensional scaling (MDS) to both the audio and the visual features, we demonstrate that visual features, compared with the MFCCs commonly used for audio speech recognition, have inherently small variation across all the classes spoken by a single speaker. However, visual features are highly sensitive to the identity of the speaker, whereas audio features are relatively speaker-invariant.
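A minimal sketch of this kind of analysis, assuming per-frame feature matrices and speaker labels (the names and parameters are illustrative), using scikit-learn's metric MDS:

    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.manifold import MDS

    def plot_mds(features, speaker_labels, title):
        """Embed feature vectors in 2D with metric MDS, coloured by speaker.

        features:       (n_samples, d) array, e.g. MFCCs or visual features
        speaker_labels: length n_samples array of speaker ids
        """
        xy = MDS(n_components=2, dissimilarity='euclidean',
                 random_state=0).fit_transform(features)
        for spk in np.unique(speaker_labels):
            m = speaker_labels == spk
            plt.scatter(xy[m, 0], xy[m, 1], label=str(spk), s=10)
        plt.title(title)
        plt.legend()
        plt.show()

Plotted this way, visual features tend to form one cluster per speaker, while MFCCs from different speakers largely overlap.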

Below are Sammon maps of the audio features (top graph) and the visual features (bottom graph) for five speakers, A to E, speaking the same utterance.
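For reference, a Sammon map is a nonlinear variant of MDS that weights the error on each pairwise distance by the inverse of the original distance, so that local structure (for example, within-speaker clusters) dominates the map. A hypothetical helper that computes the Sammon stress of a given low-dimensional embedding:

    import numpy as np
    from scipy.spatial.distance import pdist

    def sammon_stress(X_high, X_low, eps=1e-12):
        """Sammon stress between original points X_high and embedding X_low.

        E = (1 / sum(d*)) * sum((d* - d)^2 / d*), where d* are pairwise
        distances in the original space and d are distances in the embedding.
        """
        d_star = np.maximum(pdist(X_high), eps)  # guard against zero distances
        d = pdist(X_low)
        return np.sum((d_star - d) ** 2 / d_star) / np.sum(d_star)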

Next: Comparing visual features for lipreading