Lip Tracking on the XM2VTS database

This page describes results obtained by tracking the outer lip contour in image sequences from the Extended M2VTS database (XM2VTS). The approach is described in the PhD thesis of Ulises Ramos [Postscript] (approximate size: 11 MB).

In the speech shot of the XM2VTS database, every subject was asked to read three sentences at a normal pace, to pause briefly at the end of each sentence, and to read through the three sentences twice. The first recording in each session is referred to as shot 1, whilst the second is referred to as shot 2. The sessions took place at approximately monthly intervals. The three sentences, which remained the same in all four sessions, were:

  1. "zero one two three four five six seven eight nine"
  2. "five zero six nine two eight one three seven four"
  3. "Joe took fathers green shoe bench out"
Lip tracking experiments were carried out on the first two sentences of each shot for all persons in every session. For the subjects chosen to be clients under the Lausanne protocol, lip tracking was also performed on sentence 3 in the first two sessions (shot 1 only).

Below are some lip tracking results obtained on the XM2VTS database. Click on a sequence ID to view the full tracked sequence in MPEG format (approximately 1 to 2 MB each). Click on the image to see the initial position of the contour control points, obtained using colour analysis.

030[mpeg] 047[mpeg] 075[mpeg] 140[mpeg] 249[mpeg] 330[mpeg]

Tracking results

The result set is very large: there are 2760 (295 x 4 x 2 + 200 x 2) sequences in total, and the sequences are long (310 frames on average). Satisfactory tracking results were obtained for more than 98.5% of these sequences. All the tracking results, together with the accompanying examples, tools and documentation, are available as a single, very large (84 MB) tar file.
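
The sequence count quoted above can be checked with a few lines of Python. This is only a sketch: the page does not label the factors, so reading 295 x 4 x 2 as subjects x sessions x shots, and 200 x 2 as Lausanne-protocol clients x sessions in which sentence 3 was tracked, is an assumption, and all variable names are illustrative.

```python
# Sanity check of the sequence count quoted on this page.
# ASSUMPTION: the page gives the factors without labels; the
# interpretation below (subjects x sessions x shots, plus
# clients x sessions for sentence 3) is our reading of the text.
SUBJECTS = 295              # persons tracked in every session
SESSIONS = 4                # recording sessions per subject
SHOTS_PER_SESSION = 2       # shot 1 and shot 2
CLIENTS = 200               # assumed number of Lausanne-protocol clients
SENTENCE3_SESSIONS = 2      # sentence 3 tracked in the first two sessions
AVERAGE_FRAMES = 310        # average sequence length quoted on the page

digit_sequences = SUBJECTS * SESSIONS * SHOTS_PER_SESSION
sentence3_sequences = CLIENTS * SENTENCE3_SESSIONS
total_sequences = digit_sequences + sentence3_sequences

# Rough total number of tracked frames, using the quoted average.
approx_total_frames = total_sequences * AVERAGE_FRAMES

# "More than 98.5% satisfactory" means strictly fewer than 1.5%
# unsatisfactory. Integer arithmetic avoids floating-point rounding.
max_unsatisfactory = total_sequences * 15 // 1000

print(f"digit sequences:      {digit_sequences}")
print(f"sentence 3 sequences: {sentence3_sequences}")
print(f"total sequences:      {total_sequences}")
print(f"approx. total frames: {approx_total_frames}")
print(f"unsatisfactory (max): {max_unsatisfactory}")
```

Under these assumptions the total matches the 2760 sequences quoted above, which corresponds to roughly 855,600 tracked frames and at most 41 sequences with unsatisfactory tracking.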
