Example of tongue model animations of the GIPSA-Lab articulatory talking head from ultrasound images, using the Integrated Cascaded Gaussian Mixture Regression algorithm for [ata] (top) and [uku] (bottom) sequences. (Credit: Thomas Hueber/GIPSA-Lab/CNRS/Université Grenoble Alpes / Grenoble INP)

A team of researchers has developed a system that can display the movements of our own tongues in real time. Captured using an ultrasound probe placed under the jaw, these movements are processed by a machine learning algorithm that controls an “articulatory talking head.” As well as the face and lips, this avatar shows the tongue, palate, and teeth, which are usually hidden inside the vocal tract.

This “visual biofeedback” system, which should be easier to understand and therefore help users correct their pronunciation more effectively, could be used for speech therapy and for learning foreign languages.

In this new work, the researchers propose to improve visual feedback by automatically animating an articulatory talking head in real time from ultrasound images. This virtual clone of a real speaker produces a contextualized — and therefore more natural — visualization of articulatory movements.

The strength of this new system lies in a machine learning algorithm that the researchers have been developing for several years. Within limits, this algorithm can process articulatory movements that users cannot yet achieve when they first start using the system, a property that is indispensable for the targeted therapeutic applications.
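The named method, Gaussian Mixture Regression, maps ultrasound features to articulatory parameters through the conditional expectation under a joint Gaussian mixture. The sketch below is an illustrative, generic GMR implementation in Python (numpy + scikit-learn), not the authors' Integrated Cascaded variant; the feature dimensions and synthetic data are assumptions for the demo.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def gmr_predict(gmm, x, dx):
    """Standard GMR: E[y | x] under a GMM fitted on joint [x, y] vectors.

    gmm : GaussianMixture fitted on np.hstack([x, y]), covariance_type='full'
    x   : (n, dx) input frames (e.g. ultrasound features)
    dx  : dimensionality of the input part of the joint vector
    """
    n, K = len(x), gmm.n_components
    dy = gmm.means_.shape[1] - dx
    log_resp = np.zeros((n, K))
    cond_mean = np.zeros((n, K, dy))
    for k in range(K):
        mu_x, mu_y = gmm.means_[k, :dx], gmm.means_[k, dx:]
        S = gmm.covariances_[k]
        Sxx, Syx = S[:dx, :dx], S[dx:, :dx]
        inv = np.linalg.inv(Sxx)
        diff = x - mu_x                                   # (n, dx)
        maha = np.einsum("ni,ij,nj->n", diff, inv, diff)  # Mahalanobis terms
        _, logdet = np.linalg.slogdet(Sxx)
        # Unnormalized log-responsibility of component k for each frame
        log_resp[:, k] = (np.log(gmm.weights_[k])
                          - 0.5 * (maha + logdet + dx * np.log(2 * np.pi)))
        # Conditional mean of y given x for component k
        cond_mean[:, k] = mu_y + diff @ (Syx @ inv).T
    log_resp -= log_resp.max(axis=1, keepdims=True)
    w = np.exp(log_resp)
    w /= w.sum(axis=1, keepdims=True)
    # Responsibility-weighted mixture of the per-component conditional means
    return np.einsum("nk,nkd->nd", w, cond_mean)

# Minimal synthetic demo: a 1-D input/output pair with a linear relation.
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, (500, 1))
y = 2 * x + 0.01 * rng.normal(size=(500, 1))
gmm = GaussianMixture(n_components=3, covariance_type="full",
                      random_state=0).fit(np.hstack([x, y]))
pred = gmr_predict(gmm, x, dx=1)
```

In the real system the input frames would be low-dimensional descriptors of each ultrasound image and the output the talking head's articulatory control parameters.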

The algorithm exploits a probabilistic model built from a large articulatory database acquired from an “expert” speaker capable of pronouncing all the sounds of one or more languages. This model is automatically adapted to the morphology of each new user during a short calibration phase, in which the user pronounces a few phrases.