Object Tracking and Asynchrony in Audiovisual Speech Recognition

Mark Hasegawa-Johnson


This talk will describe two sets of experiments in audiovisual speech recognition: first, a set of experiments designed to learn better audio and video feature normalization, and second, a set of experiments designed to better model asynchrony between the lips, tongue, and glottis. The first set of experiments used a database of speech recorded under real-world conditions: in a moving car at 55 mph with the windows down. Under these conditions, an explicit head-pose-correction algorithm fails, but a more robust linear regression of lip features against head features provides significant word error rate (WER) improvements.

The second set of experiments compared different models of the apparent asynchrony between the audio and video evidence for any given phoneme. The audio and video evidence for a phoneme are often asynchronous: for example, the lips typically start preparing for the word ``one'' about 50-100 ms before voicing starts. The best existing technology for recognizing speech under these conditions is the coupled hidden Markov model (CHMM), a dynamic Bayesian network in which the audio speech production state and the video speech production state may be asynchronous. In a multi-university research workshop at Johns Hopkins this summer, we created an alternative model, an articulatory-feature dynamic Bayesian network (DBN), in which the asynchrony is between lips and glottis rather than between audio and video. Thus, for example, the lips might round for the word ``one'' while the glottis remains open for silence; the video observation depends on the lip position, while the audio observation depends on both lips and glottis. The articulatory-feature DBN turns out to have about the same accuracy as the CHMM, but with a different pattern of errors. The lowest word error rate is achieved by allowing the CHMM and the articulatory-feature DBN to vote on the words in the final system output.
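The regression-based compensation described above can be sketched as follows. This is a minimal illustration, not the actual system: it assumes per-frame lip and head-pose feature vectors, fits an ordinary least-squares regression of lip features on head features, and keeps the residual (lip motion not explained by head motion) as the normalized feature. The function name and feature layout are hypothetical.

```python
import numpy as np

def head_motion_compensate(lip_feats, head_feats):
    """Remove the component of lip features predictable from head pose.

    lip_feats:  (T, d_lip) array, one lip-feature vector per video frame.
    head_feats: (T, d_head) array, one head-pose vector per video frame.

    Fits a least-squares linear regression of lip features on head
    features (plus a bias column), then returns the residual, i.e. the
    lip motion not explained by head motion.  Illustrative only: the
    talk does not specify the exact feature sets or regression details.
    """
    # Augment head features with a constant column for the bias term.
    X = np.hstack([head_feats, np.ones((head_feats.shape[0], 1))])
    # Solve min_W ||X W - lip_feats||^2 by least squares.
    W, *_ = np.linalg.lstsq(X, lip_feats, rcond=None)
    # Residual = observed lip features minus the head-predictable part.
    return lip_feats - X @ W
```

In this formulation the regression acts as a robust alternative to explicit head-pose correction: even when the pose estimate is noisy, the least-squares fit only removes the linearly predictable component.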

Mark Hasegawa-Johnson received his Ph.D. from MIT in 1996; he was a
post-doctoral fellow at UCLA from 1996 to 1999, and has been on the
faculty at the University of Illinois since 1999.
Dr. Hasegawa-Johnson is author or co-author of 4 patents, 71
peer-reviewed journal and conference papers, and a chapter in the
Wiley Encyclopedia of Telecommunications.  In 2004, he ran a
multi-university research workshop team at Johns Hopkins University,
in which phonological-feature transformations were demonstrated for
the front end of a DBN automatic speech recognizer.  In the 2006
workshop, similar ideas were applied to the task of audiovisual speech
recognition.  Dr. Hasegawa-Johnson's group at the Beckman Institute is
the source of AVICAR, the world's largest (by far) freely available
database of audiovisual speech recorded under real-world noise
conditions.  Dr. Hasegawa-Johnson is a member of the Speech Technical
Committee of the IEEE Signal Processing Society, and a Senior Member
of the IEEE.
