Blog

 
January 16, 2018
How well does speech recognition cope with noise?
Speech is rapidly becoming a primary interface for the connected world. Apple’s Siri, Amazon’s Alexa services running on Echo devices, and a wealth of cloud-based speech recognition APIs are all bringing speech-based user interfaces into the mainstream. Moreover, the spectrum of speech-driven information services is likely to expand rapidly over the next couple of years, bringing speech up to the level of mobile touch-screen web browsing and phone apps.

The speech interface revolution is starting for good reason – typing on keyboards, clicking mice and touching glass are neither natural nor efficient for humans. Most importantly, they usually need our eyes engaged too, not just to get the responses, but even to make accurate inputs. Speaking is convenient and quite high-bandwidth: humans routinely talk at 140-160 words per minute, but few of us can type that fast, especially while driving (though we seem to keep trying). With the advent of good automatic speech recognition (ASR) based on deep neural networks, we can start to exploit speech interfaces in all the situations where eyes and hands aren’t available, and where natural language can plausibly express the questions, commands, responses and discussion at the heart of key interactions. Many of the worst impediments to using speech widely come from noise. Noise impacts the usability of every speech interface and every likely application type. Siri and Alexa-based services work remarkably well in a quiet room, but performance degrades significantly as noise goes up.

My team at Babblabs has just completed a basic assessment of the noise tolerance of three popular speech recognition cloud APIs. We are not using the data from this experiment to test, develop, train or distribute any product. We present the data simply as a way to highlight some of the interesting issues, not as a way to choose the “best” speech recognizer. In fact, all three recognizers are remarkably powerful tools for a huge range of language-centric applications. Nevertheless, teasing out small differences can trigger some useful discussion about the nuances in behavior and use. As a further caveat, we do not know for certain whether we used each API with optimal settings, so the performance of the recognizers with other settings, other speech data or other noise distributions might vary from these results. Finally, we do not know whether any of these systems used the UW PNNC corpus in their training – such a system might look a little better as a result.

Our methodology was to take short snippets of speech from the University of Washington’s PNNC spoken English corpus (https://depts.washington.edu/phonlab/resources/pnnc/pnnc2/) and add non-stationary noise samples of many types, scaled to various levels. The noise is taken from the MUSAN dataset (https://arxiv.org/pdf/1510.08484.pdf), which contains 929 files of assorted noises, ranging from technical noises, such as DTMF tones, dialtones and fax machine sounds, to ambient sounds, such as car idling, thunder, wind, footsteps, paper rustling, rain, animal noises and crowd noises.
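For concreteness, here is a minimal sketch of how a noise clip can be mixed into a clean snippet at a chosen SNR. The function and variable names (mix_at_snr, speech, noise, target_snr_db) are illustrative assumptions, not the exact pipeline we used:

```python
# Hedged sketch: scale a noise clip so the resulting speech-to-noise power
# ratio matches a target SNR (in dB), then add it to the clean speech.
# Names and structure are assumptions for illustration only.
import numpy as np

def mix_at_snr(speech, noise, target_snr_db):
    speech = np.asarray(speech, dtype=float)
    noise = np.asarray(noise, dtype=float)

    # Loop the noise clip if it is shorter than the speech snippet.
    if len(noise) < len(speech):
        reps = int(np.ceil(len(speech) / len(noise)))
        noise = np.tile(noise, reps)
    noise = noise[:len(speech)]

    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)

    # SNR_dB = 10 * log10(speech_power / (gain^2 * noise_power)),
    # so solve for the gain that yields the requested SNR.
    gain = np.sqrt(speech_power / (noise_power * 10 ** (target_snr_db / 10.0)))
    return speech + gain * noise
```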

The generated noisy versions ranged in signal-to-noise ratio (SNR) from about 30 dB (barely audible noise) down to almost -5 dB (noise overwhelms the original clean signal). We called the APIs with each sound snippet in both the original, clean version and the additive-noise version, and compared the returned text transcript against the corpus reference text. We used the scdiff utility from the NIST SCTK Scoring Toolkit to generate a word error rate (WER) for each clean or noisy snippet version. This is not necessarily the same error rate metric used in other publications, so we must be careful in comparing these numbers to other published WER figures. We ran the same 1000 snippets on each of three ASR APIs, labeled here as Recognizer A, Recognizer B and Recognizer C.
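As a rough illustration of the metric, the sketch below computes WER as word-level edit distance divided by reference length. It is a simplified stand-in for the SCTK scoring (whose text normalization rules differ), not a reimplementation of it:

```python
# Hedged sketch of a word error rate (WER) calculation: word-level
# Levenshtein distance between reference and hypothesis, normalized by
# the reference length. Simplified relative to NIST SCTK scoring.
def wer(reference: str, hypothesis: str) -> float:
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()

    # Dynamic-programming table of edit distances between word prefixes.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            d[i][j] = min(
                d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]),  # substitution / match
                d[i - 1][j] + 1,                               # deletion
                d[i][j - 1] + 1)                               # insertion
    return d[len(ref)][len(hyp)] / max(len(ref), 1)
```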

Here are the results for all 1000 noisy samples, showing our calculation of word error rate for each sample plotted against the injected noise level for that snippet. The graph also shows the average word error rate on the clean versions for that API. I have also included a linear curve fit, with the R² value of the fit.
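For readers who want to build a similar plot from their own results, here is a hedged sketch of the scatter, linear fit and R² calculation; the snr_db and wer arrays are assumed to hold one value per noisy snippet, and all names are illustrative:

```python
# Hedged sketch: per-snippet WER versus injected SNR, with a least-squares
# linear fit and its R^2. Assumes snr_db and wer are parallel sequences.
import numpy as np
import matplotlib.pyplot as plt

def plot_wer_vs_snr(snr_db, wer, label="Recognizer A"):
    snr_db = np.asarray(snr_db, dtype=float)
    wer = np.asarray(wer, dtype=float)

    # Linear fit of WER against SNR and the coefficient of determination.
    slope, intercept = np.polyfit(snr_db, wer, 1)
    fitted = slope * snr_db + intercept
    r2 = 1.0 - np.sum((wer - fitted) ** 2) / np.sum((wer - wer.mean()) ** 2)

    # Scatter of per-snippet results plus the fitted trend line.
    plt.scatter(snr_db, wer, s=8, alpha=0.5, label=label)
    xs = np.array([snr_db.min(), snr_db.max()])
    plt.plot(xs, slope * xs + intercept, label=f"linear fit, R^2 = {r2:.2f}")
    plt.xlabel("SNR (dB)")
    plt.ylabel("Word error rate")
    plt.legend()
    plt.show()
```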




This data permits some observations: 

1. All of the speech recognizers do quite well at low noise levels (high SNR). These word error rates may not match the figures cited by these APIs’ publishers, but those small differences are easily explained by differences in the test cases. 

2. Error rates rise significantly as noise increases. In many cases the recognizers are quite confused by the noisy speech and produce nothing or completely garbled versions. 

3. The performance at high noise levels has very high variance. For some samples, the transcripts remain quite accurate. In others the ASR gets quite lost. 

4. It looks like Recognizer A’s system is currently slightly more noise-resilient than Recognizer C, which, in turn, scores somewhat better than Recognizer B at high noise levels (SNR near 0dB). This suggests that Recognizer A may have used somewhat higher levels of noise and/or a greater variety of noise during its ASR training process. Nevertheless, none of the systems is spectacular when the noise level starts to rival or exceed the level of the clean voice signal.

We get a little more insight if we look at how each of the systems behaves in each range of noise. We can split the input SNR range of 20 dB to -5 dB into six equal bins and look at the noisy ASR results in each bucket.
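A sketch of that bucketing might look like the following; the bin edges match the 20 dB to -5 dB range described above, while the variable names and the mean/standard-deviation summary are assumptions for illustration rather than our exact analysis code:

```python
# Hedged sketch: split the SNR range into six equal-width bins and report
# the mean and standard deviation of WER in each bin for one recognizer.
import numpy as np

def wer_by_snr_bin(snr_db, wer, lo=-5.0, hi=20.0, n_bins=6):
    snr_db = np.asarray(snr_db, dtype=float)
    wer = np.asarray(wer, dtype=float)
    edges = np.linspace(lo, hi, n_bins + 1)

    summary = []
    for low, high in zip(edges[:-1], edges[1:]):
        mask = (snr_db >= low) & (snr_db < high)
        if mask.any():
            summary.append((low, high, wer[mask].mean(), wer[mask].std()))
        else:
            summary.append((low, high, float("nan"), float("nan")))
    return summary  # list of (bin_low, bin_high, mean_wer, std_wer) tuples
```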

This shows that Recognizer A does better across most of the range, except at the lower noise levels. At the same time, it suggests that Recognizer A also has the highest variance in results relative to its accuracy. This implies that Recognizer A achieves significantly lower error rates on some noisy speech snippets, but on others is no better, and perhaps worse, than the others. Interestingly, Recognizer A’s variance is also noticeably high at low noise levels, suggesting that its performance may be a bit bimodal at low noise – either near-perfect or only so-so – while the others fall more consistently in between on error rates. Low error rates are valuable, but low variance is probably quite useful too, as it helps users form more stable expectations of their voice-based systems.

Speech recognition is making steady progress, but it is far from ideal. Further progress on accuracy and consistency, especially under difficult noise conditions, is critically important. But the current enthusiasm over speech recognition may gloss over other big opportunities and important problems in front of us. The opportunities stem from the fact that speech recognition is just the tip of the iceberg of powerful new capabilities and applications for deep speech processing. I see a complex constellation of possibilities in speech analysis for language, accent and emotion, speech enhancement for noise reduction and reconstruction, speech generation, speaker identification and authentication, speech thread separation and speech search. Some of these are built on mapping back and forth between speech audio and speech text, but speech is important even outside of ASR. After all, we still use our phones for live calls – or try to, in the face of often miserable speech communication environments. We produce video and audio recordings for playback, and leave voice mail; we want to know who is speaking; we even want to know how they are feeling. There is a rich world above and below the water line!

November 29, 2017
Building a speech start-up

A startup is a source of excitement at any time, but we don't live at just "any time".  Right now, we are experiencing an era of particularly rapid evolution of computing technology, and a period of particularly dramatic evolution of new business models and real-world applications.   The blanket term, "AI", has captured the world's imagination, and it is tempting to dismiss much of the breathless enthusiasm - and doomsaying - as just so much hype.  While there is a large dose of hype circulating, we must not overlook the very real and very potent emergence of deep learning or neural network methods as a substantially fresh approach to computing.  The pace of improvement of algorithms, applications and computing platforms over just the past five years shows that this approach - more statistical, more parallel, and more suitable for complex, essentially ambiguous problems - really is a big deal.


Not surprisingly, a great deal of today's deep learning work is being directed at problems in computer vision - locating, classifying, segmenting, tagging and captioning images and videos.  Roughly half of all deep learning startup companies are focused on one sort of vision problem or another.  Deep learning is a great fit for these problems.  Other developers and researchers have fanned out across a broad range of other complex, data-intensive tasks in modeling financial markets, network security, recruiting, drug discovery, and transportation logistics.  One domain showing particular promise is speech processing.  Speech has all the key characteristics of big data and deep complexity that suggest potential for neural networks.  And researchers and industry have made tremendous progress on some major tasks like automatic speech recognition and text-to-speech synthesis.  We might even think that the interesting speech problems are all getting solved.


In fact, we have just scratched the surface on speech, especially on exploiting the combination of audio signal processing and neural network methods to extract more information, remove ambiguity and improve the quality of speech-based systems.  Deep networks can follow not just the sounds, but the essential semantics of the audio trace, providing powerful means to overcome conflicting voices, audio impairments and confusion of meaning.  The applications of improved speech understanding extend well beyond the applications we know today - smart speakers, cloud-based conversation bots and limited-vocabulary device command systems.  We expect to see semantic speech processing permeate cars, industrial control systems, smart home appliances, new kinds of telephony and a vast range of new personal gadgets.  We fully expect to see lots of new cloud APIs for automated speech processing services, new packaged software and new speech-centric devices.  Speech is so natural and so accessible for humans that I predict it will be a preferred interface mechanism for hundreds of billions of electronic systems over time.


Ironically, the entrepreneurial world is not pursuing speech opportunities the way it has been chasing vision applications.  In Cognite Ventures' list of the top 300 deep learning startups, for example, about 160 are focused on vision, but only 16 on speech (see www.cogniteventures.com).  While vision IS a fertile ground, speech is equally fertile.  Moreover, most of the innovations in new deep learning network structures, reinforcement learning methods, generative adversarial networks, accelerated neural network training and inference chips, model distillation and new intuitive training frameworks apply to speech as completely and easily as to vision tasks.

So this is the reason I am starting a deep learning applications company, focused on advanced speech processing.  The core technology concepts are here.  The neural network and signal processing expertise is here.  The computing power is here.  The market opportunities are here.  Babblabs is here!