Automatic Wearable Speech Supplement in Face-to-Face and Classroom Situations

Funding:
CITRIS Seed Funded Project

The need for language aids is pervasive in today's world. Millions of individuals with language and speech challenges require additional support for language understanding and learning. Currently, however, these needs are not being met because there are not enough skilled teachers, interpreters, and professionals to give them the one-on-one attention that they need. Lipreading (more accurately called speechreading, because it involves more than just the lips) allows deaf and hard-of-hearing individuals to perceive and understand oral language and even to speak. Speechreading seldom disambiguates all of the spoken input, however, and other techniques have been used to provide a richer input. The proposed activity will develop a real-time system to automatically detect robust characteristics of auditory speech and to transform these continuous acoustic features into continuous supplementary visible features. This information, combined with watching the speaker's face, provides enough information for a person with limited hearing to perceive and understand what is being said. This new technology will allow the design of a wearable computing device that displays these supplementary visible features on a pair of eyeglasses. The system does not require any learning on the part of the talker. It is perceptually and linguistically motivated because it is directly based on acoustic and phonetic properties of speech and gives continuous rather than only categorical information.

The technology we are proposing is designed for wearable computing, so that a user could have a face-to-face conversation while carrying a microphone and wearing a pair of simple glasses, which could also be fitted with the person's normal eyeglass prescription if necessary. The wearable product would process primitive characteristics of the speech signal such as voicing (the presence of energy at the fundamental frequency, as heard in vowel sounds); frication (high-frequency, noise-like energy characteristic of consonants such as s, z, and sh); and nasality (a distinctive resonance characteristic, as in m, n, and ng). These characteristics would be recognized in real time, and the output delivered simultaneously to a pair of eyeglasses to illuminate small colored spots on the sides of the glasses.
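As a rough illustration of the kind of frame-level processing described above, the following Python sketch estimates two continuous cue intensities from a short audio frame. It is a minimal sketch under stated assumptions, not the project's actual detector: the 16 kHz sample rate, the 25 ms frame length, the silence threshold, the pitch range, and all function names are hypothetical. Voicing is approximated by the peak of the normalized autocorrelation over plausible pitch lags, and frication by the zero-crossing rate, a common stand-in for high-frequency noise-like energy. Nasality is omitted because it needs a more detailed spectral analysis than fits in a short example.

```python
import math
import random

SAMPLE_RATE = 16000  # Hz; assumed for illustration
FRAME_LEN = 400      # 25 ms at 16 kHz; assumed for illustration


def energy(frame):
    """Mean squared amplitude of the frame."""
    return sum(x * x for x in frame) / len(frame)


def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(
        1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0)
    )
    return crossings / (len(frame) - 1)


def voicing_strength(frame, min_f0=75, max_f0=400):
    """Peak normalized autocorrelation over lags in an assumed pitch range."""
    e0 = sum(x * x for x in frame)
    if e0 == 0:
        return 0.0
    min_lag = SAMPLE_RATE // max_f0
    max_lag = min(SAMPLE_RATE // min_f0, len(frame) - 1)
    best = 0.0
    for lag in range(min_lag, max_lag + 1):
        r = sum(frame[i] * frame[i + lag] for i in range(len(frame) - lag))
        best = max(best, r / e0)
    return best


def cue_intensities(frame, silence_energy=1e-4):
    """Return continuous intensities in [0, 1] for each visual cue.

    The values are graded rather than thresholded, so a display could
    vary the brightness of each spot instead of switching it on or off.
    """
    if energy(frame) < silence_energy:
        return {"voicing": 0.0, "frication": 0.0}
    voicing = min(1.0, max(0.0, voicing_strength(frame)))
    # White noise has a zero-crossing rate near 0.5, so scale by 2.
    # Weighting by (1 - voicing) keeps strongly periodic frames from
    # also lighting the frication cue.
    frication = min(1.0, 2.0 * zero_crossing_rate(frame)) * (1.0 - voicing)
    return {"voicing": voicing, "frication": frication}


if __name__ == "__main__":
    # Synthetic test frames: a 150 Hz tone (vowel-like) and white noise
    # (fricative-like). Real speech would come from the microphone.
    tone = [0.5 * math.sin(2 * math.pi * 150 * n / SAMPLE_RATE)
            for n in range(FRAME_LEN)]
    rng = random.Random(0)
    noise = [rng.uniform(-0.5, 0.5) for _ in range(FRAME_LEN)]
    print("tone :", cue_intensities(tone))
    print("noise:", cue_intensities(noise))
```

In a wearable device, each intensity would drive the brightness of one colored spot on the glasses, updated once per frame. A real implementation would also need noise-robust features and a nasality estimate.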

The proposed research holds much promise because people naturally integrate auditory and visual information. In addition, the proposed system does not replace auditory information with the supplementary cues but rather supplements the auditory speech that is normally available to the listener. This strategy is particularly effective because auditory and visual speech are complementary. The properties of auditory speech that are robust in the signal and fairly easy to recognize automatically are exactly those that are not visible on the face. This fortunate coincidence makes it more likely that the system will succeed at automatically recognizing the robust acoustic characteristics and presenting them visually at the same time.

The proposed technology qualifies as a transparent information appliance that adds perceptual and cognitive resources to the listener. We have developed a requirements analysis, a conceptual design, and possible physical designs for this appliance. It consists of a very affordable, noninvasive device that integrates seamlessly with normal dress, adding a pair of glasses (which the user might wear regardless) and a handheld unit about the size of a mobile phone, iPod, or handheld computer. The augmented-reality device is also available for use 24/7 and requires very little maintenance.