Blog


March 24, 2018
Living in a speech-triggered world

Like many people in the developed world, I spend every day surrounded by technology – laptops, phones, car touch-screens, networked gadgets and smart TVs.  Of course I want to interact with all these things to get my work done and to enjoy their services and entertainment.  And I have learned to use them pretty well – I have learned to type on keyboards, move mice, finger touchpads, pinch, swipe and peck. The devices themselves have not learned much of anything! 

A wave of change is sweeping towards us, with the potential to dramatically shift the basic nature of interactions between people and our electronic environment.  Deep learning-based algorithms are transforming many aspects of computation, by allowing good solutions to bigger, more complex, more ambiguous problems than before.  The transformations are particularly striking across vision, language analysis, pattern finding in large data sets, and speech.  I have written a lot about vision and the potential for neural network vision algorithms to meet or exceed human capabilities in recognizing and tracking objects, summarizing scenes and sequences, and finding safe navigational channels in two and three dimensions.  These vision systems are becoming viable substitutes for human vision in tasks like driving, surveillance and inspection.

Deep learning advances in speech have a completely different character from vision – these advances are rarely about substituting for humans, but about paying more attention to humans.  Speech recognition, speaker identification, and speech enhancement are all functions that enhance human-machine interaction.   The net effect of better human-machine interaction is less about displacing human-to-human interaction and more about displacing the keyboards, mice, remote controls and touchpads we have learned, albeit painfully, to use.   In a very real sense, the old interfaces required us to retrain the neurological networks in our brains – new speech interfaces move that neural network effort over onto the computers!

Speech is coming into use in many forms – speech enhancement, generation, translation, and identification – but various flavors of speech recognition are getting the most attention.  Speech recognition seems to be used in at least three distinct ways. 

  1. Recognition is used for large-scale transcription, often for archival purposes, in medical records and in business and legal operations.   This is typically done in the cloud or on PCs. 
  2. We have seen the rapid rise, with Alexa, Siri, Google Voice and others, of recognition for browser-like information queries.  This is almost always cloud-based, not just because the large vocabulary recognizers require, for now, server-based compute resources, but also because the information being sought naturally lives in the cloud.
  3. Recognition drives local system controls, where the makers of phones, cars, air conditioners and TVs want the convenience of voice command.  In some cases, these may also need voice responses, but for simple commands, proper recognition has obvious natural responses. 

Voice has some key advantages as a means of user-interface control, due to the richness of information.  Not only do the words themselves carry meaning, but the tone and emotion, the speaker, and the sound environment may bring additional clues to the user’s intent and context.  But leveraging all this latent information is tricky.  Noise is particularly disruptive to understanding, as it masks the words and tone, sometimes irretrievably. We also want the speech for user interfaces to be concise and unambiguous, yet flexible.  This places a sophisticated set of demands on even these “simple” voice command UIs.

  • Command sets that comprehensively cover all relevant UI operations, sometimes including obscure ones.
  • Coverage of multiple phrases for each operation, for example: “On”, “Turn on”, “Turn on TV”, “Turn on the TV”, “Turn on television”, “Turn on the television” (and maybe even “Turn on the damn TV”).  These lists can get pretty long.
  • Tolerance of background noise, competing voices and, most especially, the noise of the device itself – particularly for music and video output devices.  In some cases, the device can know what interfering noise it is generating and attempt to subtract it out of the incoming speech, but complex room acoustics make this surprisingly tricky.
  • Ability to distinguish commands directed to a particular device from similar commands for other voice-triggered devices in the room.
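To make the phrase-coverage demand above concrete, a device’s command matcher can be sketched as a lookup from many phrase variants to one command.  This is a minimal illustration – the command names, phrase lists and filler-word handling are all hypothetical, not taken from any real product:

```python
from typing import Optional

# Hypothetical command table: one command id, many spoken phrase variants.
COMMAND_PHRASES = {
    "tv_power_on": [
        "on", "turn on", "turn on tv", "turn on the tv",
        "turn on television", "turn on the television",
    ],
    "tv_power_off": [
        "off", "turn off", "turn off the tv",
    ],
}

# Invert the table for fast lookup: normalized phrase -> command id.
PHRASE_TO_COMMAND = {
    phrase: cmd
    for cmd, phrases in COMMAND_PHRASES.items()
    for phrase in phrases
}

def match_command(utterance: str) -> Optional[str]:
    """Normalize an utterance and look it up in the command table."""
    words = utterance.lower().split()
    # Drop filler words so "turn on the damn TV" still matches.
    filtered = [w for w in words if w not in {"damn", "please"}]
    return PHRASE_TO_COMMAND.get(" ".join(filtered))
```

A real device would match the recognizer’s possibly errorful output fuzzily rather than by exact string, but the table-driven shape is the same.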

The dilemma of multiple voice-triggered devices in one room is particularly intriguing.  Devices often use a trigger phrase (“Hey Siri” or “Alexa”) to get the attention of the system, before a full set of commands or language can be recognized.  This may be particularly practical for cloud-based systems that don’t want or need to be always streaming audio to the cloud. Listening for a single wake-up phrase can take significantly less compute than always listening for a whole vocabulary of phrases.  However, it does increase the human speech effort to get even the simplest service.  Moreover, most device users don’t really want to endure the latencies – 1-2 seconds for the simplest commands – inherent in cloud-based recognition. 

As the number of devices in a space increases, it will be more and more difficult to remember all the trigger phrases and the available set of commands for each device.    In theory, a unified system could be created, in which commands are recognized via a single rational interface, and then routed to the specific gadget to be controlled.  While Amazon or Google might be delighted with this approach, it seems somewhat unlikely that all potential voice-triggered devices will be built or controlled by a single vendor.  Instead, we are likely to see some slightly-contained chaos, in which each device vendor dispenses with wake words for local devices and attempts to make their commands as disjoint as possible from the commands for other devices.  In cases of conflict – e.g. when both the music player and TV support a command “Turn volume down” – the devices will work from context, and may even signal to one another that a relevant command has been captured and acted on.  (And rather than relying on a network handshake for synchronization, the devices may signal audibly to indicate a successful command capture.  Both other devices and the user may like a discreet notification.)

One great challenge of ubiquitous speech UIs is endowing each device with sufficient smarts. Most devices need far less than the full continuous speech recognition needed for transcription, but even fairly simple devices will likely need a vocabulary of tens of phrases, often comprising hundreds of words.   Accurately spotting all those words and phrases, especially in a noisy environment with multiple speakers and the audio blaring, is tricky, given reasonable expectations for accuracy.  We simply don’t want to have to repeat ourselves.  This means running fairly sophisticated recognition neural networks, either to pick out words for matching against command patterns or to directly recognize entire command phrases.  Fortunately, neural network inference hardware is getting dramatically more efficient.  Creators of neural network processors and accelerators for embedded systems are now routinely touting efficiencies above one trillion operations per second per watt, often driven by the demands of computer vision.  Short command recognition is likely to take 3-5 orders of magnitude less compute than that.  This implies that as neural network-specific acceleration grows common in embedded platforms, we may well see rich speech interfaces implemented in microwatts of power.  In addition, we are likely to see hybrid systems that follow a 90/10 rule – 90% of commands are recognized and acted on locally, to minimize latency, computing cost and network bandwidth consumption, while the most obscure and difficult phrases are passed on to the cloud for interpretation.  A well-designed hybrid system will have the latency of local and the sophistication of cloud.  All this is profoundly good news for users, who hate to get up off the couch or look for the remote! 
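The 90/10 hybrid idea can be sketched as a simple confidence-gated fallback.  Everything here is illustrative – the threshold value and the two recognizer functions are hypothetical placeholders, not real APIs:

```python
# Sketch of a hybrid local/cloud recognizer: handle the common commands
# locally for low latency, and fall back to the cloud for hard phrases.

LOCAL_CONFIDENCE_THRESHOLD = 0.85  # illustrative tuning value

def recognize(audio, local_recognize, cloud_recognize):
    """Try the cheap local recognizer first; fall back to the cloud.

    local_recognize(audio) -> (command or None, confidence in [0, 1])
    cloud_recognize(audio) -> command
    """
    command, confidence = local_recognize(audio)
    if command is not None and confidence >= LOCAL_CONFIDENCE_THRESHOLD:
        return command, "local"             # the ~90% case: no network round trip
    return cloud_recognize(audio), "cloud"  # the ~10% case: harder phrases
```

The design choice is simply that the local path is tried unconditionally, so the common case never pays the cloud’s 1-2 second latency.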

In the last couple of years, handwringing over the potential downsides of AI has become a surefire attention-getter.  Some of the concerns are real – we should be worrying about implicit bias in training sets, about the fragility and incomprehensibility of systems in mission-critical situations, and about the potential economic disruption from rapid change in some classes of work.   I am less worried about the most apocalyptic “rise of the robot overlords” scenarios.  In fact, I see the end of our fifty years of forced learning to type, swipe, mouse and point.  Perhaps instead of computers forcing us to learn their preferred interface methods, we can teach them ours.  Perhaps we can now finally throw off our robot overlords!


 
January 16, 2018
How well does speech recognition cope with noise?

Speech is rapidly becoming a primary interface for the connected world. Apple’s Siri, Amazon’s Alexa services running on Echo devices, and a wealth of cloud-based speech recognition APIs are all bringing speech-based user interfaces into the mainstream. Moreover, the spectrum of information services is likely to expand rapidly over the next couple of years, bringing speech up to the level of mobile touch-screen web browsing and phone apps. 

The speech interface revolution is starting for good reason – typing on keyboards, clicking mice and touching glass are neither natural nor efficient for humans. Most importantly, they usually need our eyes engaged too, not just to get the responses, but even to make accurate inputs. Speaking is convenient and quite high-bandwidth. Humans can routinely talk at 140-160 words per minute, but few of us can type that fast, especially while driving (though we seem to keep trying). With the advent of good automatic speech recognition (ASR) based on deep neural networks, we can start to exploit speech interfaces in all the situations where eyes and hands aren’t available, and where natural language can plausibly express the questions, commands, responses and discussion at the heart of key interactions. Many of the worst impediments to using speech widely come from noise problems. Noise impacts the usability of all interfaces and all likely speech application types. Siri and Alexa-based services work remarkably well in a quiet room, but performance degrades significantly as noise goes up. 

My team at Babblabs has just completed a basic assessment of the noise tolerance of three popular speech recognition cloud APIs. We are not using the data from this experiment to test, develop, train or distribute any product. We present the data simply as a way to highlight some of the interesting issues. We are not proposing it as a way to choose the “best” speech recognizer. In fact, all three recognizers are remarkably powerful tools for a huge range of language-centric applications. Nevertheless, teasing out small differences can trigger some useful discussion about the nuances in behavior and use. As a further caveat, we do not know for certain if we have used each API with optimal settings, so the performance of the recognizers with other settings, or with other speech data or noise distributions, might vary from these results. Finally, we do not know if any of these systems used the UW PNNC corpus in their training – such a system might look a little better as a result. 

Our methodology was to take short snippets of speech from the University of Washington’s PNNC spoken English corpus (https://depts.washington.edu/phonlab/resources/pnnc/pnnc2/) and add non-stationary noise samples of many types, scaled to various levels. The noise is taken from the MUSAN dataset (https://arxiv.org/pdf/1510.08484.pdf). That corpus contains 929 files of assorted noises, ranging from technical noises, such as DTMF tones, dialtones and fax machine sounds, to ambient sounds, such as car idling, thunder, wind, footsteps, paper rustling, rain, animal noises and crowd noises. 
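For readers who want to reproduce this kind of noise injection, here is a minimal sketch of one common way to scale a noise sample to hit a target SNR before mixing.  It is illustrative only, not the exact code used in our experiment:

```python
import numpy as np

def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Scale `noise` so the mixture has the requested speech-to-noise
    power ratio, then add it to the speech.

    snr_db = 10 * log10(P_speech / P_scaled_noise)
    """
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    # Gain that brings the noise power to p_speech / 10**(snr_db / 10).
    gain = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10.0)))
    return speech + gain * noise
```

Note that by construction a negative `snr_db` makes the scaled noise more powerful than the speech, matching the -5dB end of the range described below.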

The generated noisy versions ranged in signal-to-noise ratio (SNR) from about 30dB (barely audible noise) up to almost -5dB (noise overwhelms the original clean signal). We called the APIs with each sound snippet for both the original, clean version and the additive noise case, and compared the returned text transcript against the reference text. We used the scdiff utility from the NIST SCTK Scoring Toolkit to generate a word error rate (WER) for each clean or noisy snippet version. This is not necessarily the same error rate metric used in other publications, so we must be careful in comparing these numbers to other published WER figures. We ran the same 1000 snippets on each of three ASR APIs, labeled here as Recognizer A, Recognizer B and Recognizer C.
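For illustration, the word error rate metric itself can be computed with an ordinary word-level Levenshtein distance.  The sketch below is a simplified stand-in for the NIST SCTK scoring, not a reimplementation of that tool (which also handles alignment and normalization details we skip here):

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + insertions + deletions) / reference word count,
    via the standard dynamic-programming edit distance over words."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
    return d[len(ref)][len(hyp)] / max(len(ref), 1)
```

Note that WER can exceed 1.0 when the recognizer inserts many extra words, which is one reason garbled high-noise transcripts produce such scattered scores.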

Here are the results for all 1000 noisy samples, showing our calculation of word error rate for each sample plotted against the injected noise level for that snippet. The graph also shows the average word error rate on the clean versions for that API. I have also included a linear curve fit, with the R2 value of the fit.

This data permits some observations: 

1. All of the speech recognizers do quite well at low noise levels (high SNR). These word error rates may not match the figures cited by these APIs’ publishers, but those small differences are easily explained by differences in the test cases. 

2. Error rates rise significantly as noise increases. In many cases the recognizers are quite confused by the noisy speech and produce nothing or completely garbled versions. 

3. The performance at high noise levels has very high variance. For some samples, the transcripts remain quite accurate. In others the ASR gets quite lost. 

4. It looks like Recognizer A’s system is currently slightly more noise-resilient than Recognizer C, which, in turn, scores somewhat better than Recognizer B at high noise levels (SNR near 0dB). This suggests that Recognizer A may have used somewhat higher levels of noise and/or more variety of noise during their ASR training process. Nevertheless, none of the systems is spectacular when the noise level starts to rival or exceed the level of the clean voice signal. 

We get a little more insight if we look at how each of the systems behaves in each range of noise. We can split the input SNR range of 20dB to -5 dB into six equal bins and look at the noisy ASR in each bucket. 
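The bucketing step can be sketched as follows – the boundaries simply divide the 20dB to -5dB range into six equal widths (about 4.2dB each).  The function below is an illustration of that binning, not our analysis code:

```python
def snr_bin(snr_db: float, high: float = 20.0, low: float = -5.0,
            n_bins: int = 6) -> int:
    """Assign an SNR value to one of n_bins equal-width bins on [low, high].

    Bin 0 is the noisiest end (lowest SNR); out-of-range values are clamped.
    """
    width = (high - low) / n_bins  # 25 dB / 6 bins ~ 4.17 dB per bin
    idx = int((snr_db - low) / width)
    return min(max(idx, 0), n_bins - 1)
```

Grouping samples this way trades the scatter plot’s per-sample detail for a per-bin view of both average error and variance.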

This shows that Recognizer A does better across most of the range, except at the lower noise levels. At the same time, it suggests that Recognizer A also has the highest variance in accuracy. This implies that Recognizer A achieves significantly lower error rates on some noisy speech snippets, but on others is no better than, and perhaps worse than, the others. Interestingly, Recognizer A’s variance is also noticeably high at low noise levels, suggesting that its performance may be a bit bimodal at low noise – either near-perfect or only so-so – while the others fall more consistently in between on error rates. Low error rates are valuable, but low variance is probably quite useful too, as it helps users have more stable expectations of their voice-based systems. 

Speech recognition is making steady progress, but it is far from ideal. Further progress on accuracy and consistency, especially under difficult noise conditions, is critically important. But the current enthusiasm over speech recognition may gloss over other big opportunities and important problems in front of us. The opportunities stem from the fact that speech recognition is just the tip of the iceberg of powerful new capabilities and applications for deep speech processing. I see a complex constellation of possibilities in speech analysis for language, accent and emotion, speech enhancement for noise reduction and reconstruction, speech generation, speaker identification and authentication, speech thread separation and speech search. Some of these are built on mapping back and forth between speech audio and speech text, but speech is important even outside of ASR. After all, we still use our phones for live calls – or try to, in the face of often miserable speech communications environments. We produce video and audio recordings for playback, and leave voice mail; we want to know who is speaking; we even want to know how they are feeling. There is a rich world above and below the water line!



November 29, 2017
Building a speech start-up

A startup is a source of excitement at any time, but we don't live at just "any time".  Right now, we are experiencing an era of particularly rapid evolution of computing technology, and a period of particularly dramatic evolution of new business models and real-world applications.   The blanket term, "AI", has captured the world's imagination, and it is tempting to dismiss much of the breathless enthusiasm - and doomsaying - as just so much hype.  While there is a large dose of hype circulating, we must not overlook the very real and very potent emergence of deep learning or neural network methods as a substantially fresh approach to computing.  The pace of improvement of algorithms, applications and computing platforms over just the past five years shows that this approach - more statistical, more parallel, and more suitable for complex, essentially ambiguous problems - really is a big deal.


Not surprisingly, a great deal of today's deep learning work is being directed at problems in computer vision - locating, classifying, segmenting, tagging and captioning images and videos.  Roughly half of all deep learning startup companies are focused on one sort of vision problem or another.  Deep learning is a great fit for these problems.  Other developers and researchers have fanned out across a broad range of other complex, data-intensive tasks in modeling financial markets, network security, recruiting, drug discovery, and transportation logistics.  One domain is showing particular promise: speech processing.  Speech has all the key characteristics of big data and deep complexity that suggest potential for neural networks.  And researchers and industry have made tremendous progress on some major tasks like automatic speech recognition and text-to-speech synthesis.  We might even think that the interesting speech problems are all getting solved.


In fact, we have just scratched the surface on speech, especially on the exploitation of the combination of audio signal processing and neural network methods to extract more information, remove ambiguity and improve quality of speech-based systems.  Deep networks can follow not just the sounds, but the essential semantics of the audio trace, providing powerful means to overcome conflicting voices, audio impairments and confusion of meaning.  The applications of improved speech understanding extend well beyond the applications we know today - smart speakers, cloud-based conversation bots and limited vocabulary device command systems.  We expect to see semantic speech processing permeate into cars, industrial control systems, smart home appliances, new kinds of telephony and a vast range of new personal gadgets.  We fully expect to see lots of new cloud APIs for automated speech processing services, new packaged software and new speech-centric devices.  Speech is so natural and so accessible for humans, that I can predict it will be a preferred interface mechanism for hundreds of billions of electronic systems over time.


Ironically, the entrepreneurial world is not pursuing speech opportunities in the way it has been chasing vision applications.  In Cognite Ventures' list of the top 300 deep learning startups, for example, about 160 are focused on vision, but only 16 on speech (see www.cogniteventures.com).  While vision IS a fertile ground, speech is equally fertile.  Moreover, most of the innovations in new deep learning network structures, reinforcement learning methods, generative adversarial networks, accelerated neural network training and inference chips, model distillation and new intuitive training frameworks all apply to speech as completely and easily as to vision tasks.  

So this is the reason I am starting a deep learning applications company, focused on advanced speech processing.  The core technology concepts are here.  The neural network and signal processing expertise is here.  The computing power is here.  The market opportunities are here.  Babblabs is here!