Automatic Visual-Only Language Identification

Jacob Newman - University of East Anglia

Introduction

This page outlines a PhD project to develop a system for Automatic Visual-only Language Identification (VLID). The most important question raised by this task is whether visual speech information, such as the movement of the lips, is sufficient for a trained computer system to distinguish the identities of spoken languages. For a detailed literature review and project description see Newman & Cox (2009).

Audio Automatic Language Identification (LID) is a mature technology that can achieve a high identification accuracy from only a few seconds of representative speech. As visual speech processing has developed in the last few years, it is interesting to enquire whether language could be identified purely by visual means. This has practical applications in systems that use either audio-visual speech recognition or pure lip-reading in noisy environments, or in situations where the audio signal is not available.

Various approaches to audio language identification have been considered. Some rely on speech information that is not expressed visually and are therefore unlikely to contribute to the development of a visual-only language identification system; such methods include measuring the spectral similarity between languages, or analysing prosody, which encompasses the stress and pitch of speech. Other audio LID approaches exploit the phonetic and phonotactic characteristics of languages, which have been shown to be an effective discriminatory factor. A viseme is the visual appearance of a phoneme, and the two are therefore closely correlated, so LID techniques based on phonemic theory are likely to be applicable to visemic information.

Visual Language Identification Database

Our previous experiments in lip-reading have shown that the features we extracted for recognition were highly speaker-dependent. We therefore decided that, until features exhibiting greater speaker independence had been developed, our task would be to discriminate between two or more languages read by a single speaker. Hence we chose to record an audio-visual database of multilingual speakers. The resulting database contains 21 subjects, each fluent in at least two languages and some in three. Typically, these languages were their mother tongue and a language that they had spoken for several years in an immersive environment.

Single-Speaker VLID

In the system developed, the video data is tracked using an Active Appearance Model (AAM). The AAM vectors are first clustered using vector quantization (VQ), allowing the training data to be tokenized as VQ symbols and bigram language models to be built from the resulting VQ transcriptions. In testing, the AAM vectors are tokenized in the same way, each language model assigns a likelihood to the resulting symbol sequence, and the test utterance is classified as the language whose model gives the maximum likelihood. The figures below illustrate these processes.

Figures: VQ generation; bigram generation; classification.
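
The sketch below illustrates this tokenization and scoring pipeline under simple assumptions: k-means builds the VQ codebook, the bigram models are add-one smoothed, and the AAM feature arrays are stood in for by random placeholder data. The clustering method, smoothing and all names are illustrative rather than a description of the actual system.

import numpy as np
from scipy.cluster.vq import kmeans2, vq

# Placeholder AAM data: replace with real (frames x dims) feature arrays.
rng = np.random.default_rng(0)
aam_train = {
    "English": [rng.normal(0.0, 1.0, (200, 10)) for _ in range(5)],
    "French":  [rng.normal(0.5, 1.0, (200, 10)) for _ in range(5)],
}

CODEBOOK_SIZE = 64

def train_bigram(symbol_seqs, k, alpha=1.0):
    """Add-one-smoothed bigram model P(current | previous) over VQ symbols."""
    counts = np.full((k, k), alpha)
    for seq in symbol_seqs:
        for prev, cur in zip(seq[:-1], seq[1:]):
            counts[prev, cur] += 1
    return counts / counts.sum(axis=1, keepdims=True)

def avg_log_likelihood(seq, bigram):
    """Average bigram log-likelihood of a VQ symbol sequence."""
    return np.mean([np.log(bigram[p, c]) for p, c in zip(seq[:-1], seq[1:])])

# Build one VQ codebook from all training frames, then per-language bigrams.
all_frames = np.vstack([np.vstack(utts) for utts in aam_train.values()])
codebook, _ = kmeans2(all_frames, CODEBOOK_SIZE, minit="points")
models = {lang: train_bigram([vq(u, codebook)[0] for u in utts], CODEBOOK_SIZE)
          for lang, utts in aam_train.items()}

def classify(aam_utterance):
    """Tokenize a test utterance and pick the maximum-likelihood language."""
    seq = vq(aam_utterance, codebook)[0]
    scores = {lang: avg_log_likelihood(seq, m) for lang, m in models.items()}
    return max(scores, key=scores.get)

print(classify(rng.normal(0.5, 1.0, (150, 10))))   # most likely "French" here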

We were able to successfully discriminate between languages spoken by a single speaker. We also found that speaking rate, which is directly related to a speaker's confidence in a given language, is by itself sufficient to allow discrimination between recitals of the same text. Furthermore, we discovered that three recitals of the same text, spoken at the same speed, were classified with above-chance accuracy, suggesting that a non-language effect was biasing the recording sessions. We concluded that subtle recording differences, such as speaker pose, could be responsible for this bias, and that until genuine multi-speaker or speaker-independent experiments have been performed we cannot average out or normalise for these effects, or be certain that true language identification is being performed. Newman & Cox (2009) describes the results of these experiments in greater detail.

Speaker Independence

A continuing area of research in this project is improving the speaker-independent performance of our VLID system. This is an especially hard problem because we have shown the features we extract to be highly speaker-dependent. In speaker-independent testing, a single speaker's data is held out for testing and the remaining speakers are used for training. The framework developed for these experiments uses 5 bilingual French/English speakers.
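
As a concrete illustration, leave-one-speaker-out splits of this kind could be generated as follows; the data structure (a mapping from speaker to utterances) is assumed purely for the sketch.

from itertools import chain

def leave_one_speaker_out(data_by_speaker):
    """Yield (held-out speaker, training utterances, test utterances) splits.

    data_by_speaker: dict mapping a speaker id to that speaker's utterances.
    """
    for test_speaker, test_utts in data_by_speaker.items():
        train_utts = list(chain.from_iterable(
            utts for spk, utts in data_by_speaker.items() if spk != test_speaker))
        yield test_speaker, train_utts, test_utts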

Our initial results using AAM features showed chance-level classification accuracy, confirming the speaker-dependent nature of the features we currently use. We therefore attempted to use higher-level features such as the height, width and aspect ratio of the lips. Results from these were more promising, producing accuracy statistically significantly above chance, although this was not true for all speakers, and the features, which lack appearance information such as the teeth and tongue, were deemed too simple.
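
For concreteness, such geometric features could be computed from tracked lip landmarks roughly as follows; the landmark format (an array of (x, y) points on the outer lip contour) is an assumption made for this sketch.

import numpy as np

def lip_geometry(lip_points):
    """Height, width and aspect ratio of the lips from tracked landmarks.

    lip_points: (n_points x 2) array of (x, y) coordinates on the outer
    lip contour for one video frame.
    """
    width = lip_points[:, 0].max() - lip_points[:, 0].min()
    height = lip_points[:, 1].max() - lip_points[:, 1].min()
    return np.array([height, width, height / width])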

Given that phone recognition generally outperforms sub-phone tokenisation techniques in audio LID, and that preliminary work suggested that longer temporal information provided greater consistency across speakers, our current work focusses on speaker-independent viseme modelling. A disadvantage of this technique is that phonetic transcriptions of the training data are required. The modelling technique we use is tied-state, multiple-mixture triphones, in which a separate HMM is built for each phone in each context. For example, W-IH+N and B-IH+N are both models for the viseme "IH" in two separate contexts. There is rarely enough training data to train every context-dependent model reliably, so a clustering method is used to tie visually similar HMM states together.
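
The tying procedure itself is not detailed here (decision-tree clustering is the usual choice for tied-state triphones), but the following toy sketch conveys the idea: context-dependent states with visually similar output distributions are merged so that rare contexts share parameters. The state names, Gaussian parameters and distance threshold are all invented for illustration.

import numpy as np
from itertools import combinations

# Hypothetical centre-state Gaussians (mean, variance) for the viseme "IH"
# in several contexts, named left-context-IH+right-context as in the text.
states = {
    "W-IH+N": (np.array([0.20, 1.10]), np.array([0.50, 0.40])),
    "B-IH+N": (np.array([0.25, 1.00]), np.array([0.50, 0.45])),
    "S-IH+T": (np.array([1.40, 0.10]), np.array([0.60, 0.50])),
}

def sym_kl(g1, g2):
    """Symmetric KL divergence between two diagonal Gaussians."""
    (m1, v1), (m2, v2) = g1, g2
    kl12 = 0.5 * np.sum(np.log(v2 / v1) + (v1 + (m1 - m2) ** 2) / v2 - 1)
    kl21 = 0.5 * np.sum(np.log(v1 / v2) + (v2 + (m1 - m2) ** 2) / v1 - 1)
    return kl12 + kl21

# Tie context-dependent states whose distributions are sufficiently similar,
# so that rarely seen contexts share parameters with visually similar ones.
THRESHOLD = 0.5
tied = {name: name for name in states}            # every state starts untied
for a, b in combinations(states, 2):
    if sym_kl(states[a], states[b]) < THRESHOLD:
        tied[b] = tied[a]                          # b now shares a's tied state

print(tied)   # here W-IH+N and B-IH+N end up sharing a single tied state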

To address the speaker dependence of our features, we implemented two techniques. Firstly, we z-score normalised our AAM features, recalculating each feature value as the number of standard deviations from the mean of that dimension. Secondly, we weighted each feature dimension by its mutual information with the viseme classes; effectively, this is a measure of the dependency between an AAM dimension and the class labels (where each viseme is a class). In this way, we hope to give the greatest weight to the dimensions that are most speaker-independent, and the least weight to those that are least consistent across speakers.
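
A minimal sketch of these two steps is given below, assuming per-frame viseme labels are available for the training data. Scikit-learn's mutual_info_classif is used here as a stand-in MI estimator, and normalising the weights to sum to one is an assumption rather than part of the described system.

import numpy as np
from sklearn.feature_selection import mutual_info_classif

def normalise_and_weight(train_feats, train_visemes, test_feats):
    """Z-score normalise AAM features, then weight each dimension by its
    mutual information with the viseme classes.

    train_feats, test_feats: (frames x dims) AAM feature arrays.
    train_visemes: integer viseme class label for each training frame.
    """
    mean = train_feats.mean(axis=0)
    std = train_feats.std(axis=0)
    std[std == 0] = 1.0                              # guard constant dimensions
    z_train = (train_feats - mean) / std             # z-score normalisation
    z_test = (test_feats - mean) / std               # reuse training statistics

    # Mutual information between each normalised dimension and the classes.
    mi = mutual_info_classif(z_train, train_visemes)
    weights = mi / mi.sum()                          # emphasise informative dims

    return z_train * weights, z_test * weights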

The LID system developed is an adaptation of a technique called Parallel Phone Recognition followed by Language Modelling (PPRLM). It is essentially a number of single-PRLM systems running in parallel, with a back-end classifier that takes a stacked vector of the likelihoods produced by those systems and learns a decision boundary with which to classify test data.
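
The sketch below shows how the stacked likelihood vector might be assembled and passed to a back-end classifier. The recogniser and language-model objects are placeholders, and the choice of a linear SVM for the back end is an assumption made purely for illustration.

import numpy as np
from sklearn.svm import SVC

def stack_likelihoods(utterance, recognisers, language_models):
    """Assemble the back-end feature vector for one utterance.

    recognisers: list of callables, each tokenising an utterance into visemes
                 (one single-PRLM stream per recogniser).
    language_models: list (one entry per stream) of per-language models, each
                 assumed to expose a log_likelihood(viseme_sequence) method.
    """
    vector = []
    for recognise, lms in zip(recognisers, language_models):
        viseme_seq = recognise(utterance)                 # stream's tokenisation
        vector.extend(lm.log_likelihood(viseme_seq) for lm in lms)
    return np.array(vector)                               # stacked likelihoods

def train_backend(train_utts, train_langs, recognisers, language_models):
    """Fit a back-end classifier on stacked likelihood vectors (SVM assumed)."""
    X = np.vstack([stack_likelihoods(u, recognisers, language_models)
                   for u in train_utts])
    backend = SVC(kernel="linear")
    backend.fit(X, train_langs)
    return backend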

Results showed that we could achieve language identification accuracies above 80%, despite very low viseme recognition accuracies of around 30%. We ran a simulation to find out what level of viseme recognition accuracy is required for better language discrimination, and found that accuracy in the order of 40% is ideal for this two-class case.
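
One simple way to run such a simulation is to corrupt ground-truth viseme transcriptions with random substitutions until a target recognition accuracy is reached, then pass the corrupted sequences through the LID system. The sketch below implements only this corruption step (substitutions only, ignoring insertions and deletions), which is an assumption about how the simulation might be performed; the viseme inventory shown is illustrative.

import random

def simulate_recogniser(true_visemes, accuracy, viseme_inventory, rng=random):
    """Corrupt a ground-truth viseme sequence so that roughly `accuracy` of
    the symbols remain correct, substituting random visemes elsewhere."""
    corrupted = []
    for v in true_visemes:
        if rng.random() < accuracy:
            corrupted.append(v)                           # recognised correctly
        else:
            corrupted.append(rng.choice([u for u in viseme_inventory if u != v]))
    return corrupted

# Example: sweep simulated accuracies and feed the corrupted transcriptions
# into the LID system to see where language discrimination becomes reliable.
inventory = ["IH", "AO", "F", "P", "T"]
truth = ["P", "IH", "T", "AO", "F", "IH"]
for acc in (0.3, 0.4, 0.5):
    print(acc, simulate_recogniser(truth, acc, inventory))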

New Database

In order to test the findings of our previous work more thoroughly, we decided to record a much larger video dataset. This new dataset comprises 35 speakers in total: 25 English speakers and 10 Arabic speakers. Crucially, all speakers are native to the language of their recital, ensuring that only genuine cues of language are used to discriminate their speech. Since English and French share similarities in vocabulary and phonetic inventory, we chose two languages that are much further apart linguistically and may therefore be more easily discriminated. The new video data is recorded at a high frame rate of 60 frames per second and at full HD resolution.

References

Newman, J.L. & Cox, S.J. (2009). Automatic visual-only language identification: A preliminary study. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2009). IEEE Computer Society, Washington, DC, USA.