A new speech-recognition tool could help many people who have difficulty speaking to communicate. (Pexels/cottonbro)
A new approach to automatic speech recognition based on the waveforms of spoken words could eventually become a game changer for people with severe dysarthria, a disorder that frequently follows strokes and results in slurred, difficult-to-understand speech.
A study published April 30 in IEEE Transactions on Neural Systems and Rehabilitation Engineering details the deep learning system, Speech Vision, which could someday help people with dysarthria more easily communicate with other people and with computers. But further development is required to make the technology practical, according to the study's author, Seyed Reza Shahamiri, a software engineer and researcher at the University of Auckland in New Zealand. "We need to think outside of the box and come up with new and innovative ways to recognize dysarthric speech, because normal speech recognition technology is not necessarily suitable for [people with dysarthria]," he told The Academic Times.
Dysarthria is a motor speech disorder that prevents clear articulation. People living with the condition are unable to control their tongue or larynx and often have difficulty being understood. The disorder is common among stroke survivors and others living with some degree of paralysis, for whom it represents a major impediment to communication. "In a typical dysarthric scenario, they know what they want to say," Shahamiri said. "Energy production is also not a problem, but because of muscle paralysis, the articulators aren't capable of shaping air energy into [speech]."
Scientists are increasingly applying machine-learning techniques to help people with speech disorders, with U.S. researchers recently inventing a more efficient method of diagnosing speech disorders through digital platforms based on a new algorithm. And speech recognition systems in general have greatly improved in recent years, Shahamiri said: "Today, we can almost say that speech recognition is a solved problem because we are already getting near-human performance in most real-life set-ups."
Existing technologies don't work as well for dysarthric speakers because they rely on identifying sequences of phonemes — the sounds that distinguish one word from another, and the smallest units of language that still affect meaning — and severe dysarthria makes it difficult to intelligibly utter even the smallest of sounds. In fact, the disorder is characterized by poor articulation of phonemes.
"Although we have a different tone, voice, and pronunciation — all these different speech features — at the end of the day, when we say 'Hello,' we all say the same sequence of phonemes," he said. "The whole concept of speech recognition is to identify these phonemes and send them to an AI model called a decoder that says, 'Hey, the probability of this sequence of phonemes being this word is [X].
"For dysarthric speakers, we're feeding the system incorrect phonemes and expecting it to recognize the sequence," he continued. "That's the reason most speech recognition systems don't work [for dysarthria]."
Another challenge in tailoring a speech recognition system for dysarthric speakers is that deep learning algorithms require a lot of data — in this case, sound samples of people speaking — to recognize variations in accents and pronunciation. Even getting enough data from any one person with dysarthria is difficult because the act of speaking is exhausting.
"When it comes to dysarthric speech, you have significantly higher variations in phonemes; between two individuals, they pronounce the phonemes in a different way," Shahamiri said. "At the same time, we don't even have a fraction of the data required to build a proper deep-learning model to understand these variations in sound."
Speech Vision takes a visual approach, employing a novel acoustic modeling feature that extracts "voicegrams" and uses machine learning to identify the shape of words pronounced by people with dysarthria. "A voicegram is the intensity of frequencies over time, and visualizing the intensity gives you something like a heat map," Shahamiri said. "When you look at the sequence of dysarthric phonemes, the shape of the heat map is pretty similar between speakers. The moment we have that similarity of shapes, we can design a visual algorithm to learn those shapes."
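As a rough illustration of that idea, the sketch below turns an utterance into a log-scaled time-frequency heat map and feeds it to a small image-style convolutional classifier that learns word "shapes." The feature settings, network layout and the librosa/PyTorch calls are assumptions made for the example; this is not the Speech Vision implementation.

```python
# Illustrative sketch only: a voicegram-like heat map (intensity of frequencies
# over time) classified with a small image-style CNN. All settings and the
# architecture are assumptions, not the actual Speech Vision system.

import librosa
import numpy as np
import torch
import torch.nn as nn

def voicegram(path, sr=16000, n_mels=64):
    """Load audio and return a log-scaled mel spectrogram (frequency x time heat map)."""
    y, sr = librosa.load(path, sr=sr)
    spec = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    return librosa.power_to_db(spec, ref=np.max)

class WordShapeCNN(nn.Module):
    """Tiny CNN that treats the voicegram as an image and predicts one of N words."""
    def __init__(self, num_words=250):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),   # collapse each feature map to one value
            nn.Flatten(),
            nn.Linear(32, num_words),  # one score per word in the vocabulary
        )

    def forward(self, x):              # x: (batch, 1, n_mels, time)
        return self.net(x)

# Usage sketch: compute the heat map, add batch and channel dimensions, classify.
# vg = torch.tensor(voicegram("utterance.wav")).unsqueeze(0).unsqueeze(0)
# logits = WordShapeCNN()(vg)
```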
Shahamiri has been working on the project since 2011. It took him three years to build the first version of the system, which delivered state-of-the-art results for a small, 25-word vocabulary covering basic commands such as "left," "right," "up" and "down." Speech Vision recognizes a vocabulary of about 250 words, but to be usable in day-to-day life, the technology must achieve near-human performance by accurately recognizing about 90% of words, Shahamiri said. Currently, the system recognizes dysarthric speech about 60% of the time in research settings — significantly more often than existing speech recognition systems, but still not enough to be incorporated into a mobile or web application.
Hoping to push the technology's accuracy past that threshold, Shahamiri has made the source code for Speech Vision freely available for other researchers to download and build upon. He expressed frustration that the research community hasn't placed a higher priority on developing a dysarthria-specific speech-recognition system. "The research in this area is incremental and because it's very difficult, we're not seeing many researchers showing interest in the field," he said.
But he still believes it's possible to improve the lives of people living with dysarthria — and related symptoms of isolation and depression — with technology that learns to see what they're saying. "It could be life-changing for them," he said.
The study, "Speech Vision: An end-to-end deep learning-based dysarthric automatic speech recognition system," published April 30 in IEEE Transactions on Neural Systems and Rehabilitation Engineering, was authored by Seyed Reza Shahamiri, University of Auckland.