Blog

May 22, 2018 What does it sound like to you?

The Internet has been abuzz this past week with discussion of a single utterance. People everywhere are talking about “Laurel versus Yanny”, an utterance that sounds more like “Laurel” to some, and more like “Yanny” to others. See the NY Times article. In fact, you can get from something very clearly “Laurel” to something clearly “Yanny” by changing the gain on different frequency bands. “Laurel” is emphasized when the high frequencies are attenuated, and “Yanny” stands out when the high frequencies are boosted. Find a fun comparison tool here. Whether you hear “Laurel” or “Yanny” is partially determined by your auditory system’s sensitivity to high frequencies.

This lively but inconsequential debate exposes the tip of an iceberg of more substance to speech experts. People want clearer speech in every situation where intelligibility or comfort is important. Nobody wants to listen to noise so overwhelmingly loud, or reverberation so heavy, that all intelligibility is lost. Alexander Graham Bell famously first transmitted voice on a wired system in 1875, and Reginald Aubrey Fessenden transmitted voice by wireless in 1900. Needless to say, the voice quality was lousy. Since then, speech and communications engineers working on speech-based systems have developed a range of metrics to allow for better comparison of the level of noise and degree of intelligibility of electronic speech reproduction.

What is “good” sound in speech? Low noise level? High clarity? Good comprehensibility? Speech captured under ideal conditions – good microphones, an anechoic recording environment, and zero additive noise – certainly ranks well on these metrics. Many environments, however, make speech capture, transmission and reproduction very difficult. We need to be prepared for imperfect speech, and to analyze it in a way that gives insights on how to improve our speech systems. Good metrics are one key to good speech.

Broadly speaking, there are two sets of criteria we may want to apply to speech:

  • Comfort/Quality: How does it feel to listen to the speech?  Is it annoying or uncomfortable? Is the noise or reverberation distracting?
  • Intelligibility:  Regardless of how noisy or unpleasant the speech may be, how completely can you actually make out the words and the speaker’s intent? Can you understand? 

As you’d expect, intelligible speech is often also comfortable to listen to, but the two metrics are not the same. High noise, especially outside of the typical speech frequency spectrum, can be quite uncomfortable and annoying, but may not interfere much with comprehension. And impairments don’t need to be annoying to interfere with comprehension. Moderate noise in the speech frequency bands, for example from other speakers in the background, can ruin intelligibility. Reverberation of rapid speech can also make it hard to comprehend, even without disturbing levels of noise. Over time, many metrics have emerged. Let me briefly introduce and compare a few of the more relevant ones. Among the many metrics, the following categories reflect the needs well:

  • MOS: The Mean Opinion Score is the gold standard for speech assessment because, ironically, it is based on human judgment. It is a method for aggregating actual human listening experience to determine which speech is more satisfactory. The 1-5 scale is strikingly subjective, and that goes to the real heart of human perception of speech – how annoying is it? Many other metrics are based on trying to predict MOS without the expense of collecting human data. There are a number of variants of MOS, including comparative scoring of two speech samples, continuous scoring, and scoring of specific impairments (for example, echo).
Score   Quality     Impairments
5       Excellent   Imperceptible
4       Good        Perceptible but not annoying
3       Fair        Slightly annoying
2       Poor        Annoying
1       Bad         Very annoying
  
  • SNR: Everyone working on audio or speech is familiar with Signal to Noise Ratio. It measures the ratio of the power of the intended signal to the power of the noise. It typically uses the average power over a time period and is expressed on a log scale, so that a doubling of the ratio translates to a 3-decibel improvement in the SNR. This is really a measure of how badly the noise distorts the waveform relative to the clean signal. It is not well correlated with perception of signal quality or intelligibility. Usually SNR is directly measured by comparing the noisy speech with the same speech without noise, when that clean version is available (see the short sketch after this list). Sometimes SNR is estimated by trying to get an approximate magnitude of the noise, for example during speech gaps, to compare to the total audio magnitude.
  • PESQ: Perceptual Evaluation of Speech Quality is a group of standards from the telecommunications industry, standardized as ITU-T recommendation P.862. It attempts to model more directly the assessment of voice quality by humans, so it tries to estimate results on roughly the same scale as MOS. PESQ results are computed with a “full-reference” or “intrusive” algorithm: it needs access to the original clean version of each speech sample as the ideal, and compares the noisy version to it. A newer variation of PESQ-like metrics is captured in the P.863 standard, Perceptual Objective Listening Quality Assessment (POLQA). There are analogous “no-reference” or non-intrusive metrics for some quality measures. These use no information about the clean version, and can only estimate the degradation relative to a hypothetical clean version.
  • NCM and STOI: Normalized Covariance Metric (NCM) and Short-Time Objective Intelligibility (STOI) are metrics specifically designed to correlate well with human intelligibility. They are distant descendants of French and Steinberg’s Articulation Index, developed in the 1940s. NCM was developed in the 1990s and STOI a decade later, as different approximations to intelligibility on a 0-1 scale. Both are full-reference algorithms – they compare clean to dirty signals to estimate the degree of intelligibility. An improved version of STOI, Extended Short-Time Objective Intelligibility (ESTOI), has come into use in the past couple of years. ESTOI is designed to handle highly modulated and variable noise sources, so it may provide good insights for some of the most challenging noise types.
  • WER: On the surface, the emergence of good speech recognition systems seems to offer a fine way to measure speech comprehensibility. Some researchers have tried to use Word Error Rate (WER), for example, as a stand-in for human comprehension. Unfortunately, automatic recognition systems are trained on finite distributions of speech and noise, and their performance depends heavily on whether the input speech has similar distributions. Speech processing of the voice signal may actually improve the comfort and the true intelligibility of the speech, but shift the signal pattern, often in inaudible ways, outside the scope of the recognition training, so that the WER metric degrades even while subjective and other objective metrics of intelligibility improve.
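To make the SNR definition above concrete, here is a minimal full-reference sketch in Python with NumPy (the function name and the test signal are my own illustrative choices, not part of any standard toolkit):

    import numpy as np

    def snr_db(clean, noisy):
        """Full-reference SNR: power of the clean signal relative to the
        power of the residual (noisy minus clean), in decibels."""
        noise = noisy - clean                      # isolate the additive noise
        signal_power = np.mean(clean ** 2)
        noise_power = np.mean(noise ** 2) + 1e-12  # guard against divide-by-zero
        return 10.0 * np.log10(signal_power / noise_power)

    # A 440 Hz tone plus white noise: halving the noise amplitude quarters its
    # power, which shows up as roughly a 6 dB improvement in SNR.
    rng = np.random.default_rng(0)
    clean = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)
    noise = 0.1 * rng.standard_normal(16000)
    print(snr_db(clean, clean + noise))        # about 17 dB
    print(snr_db(clean, clean + 0.5 * noise))  # about 23 dB

This only works when a time-aligned clean reference is available; the no-reference estimates mentioned above have to infer the noise level indirectly, for example from speech gaps.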

I must mention an important caveat about all these objective or algorithmic metrics.  The correlation between the computed metric - say, PESQ - and the human perception tests – say, MOS – depends on the impairments in the speech under assessment. The correlation is excellent for the kinds of impairments originally used in characterizing and tuning the metric, but the correlation accuracy must degrade as new and different distortions are measured. 

The proliferation of speech analysis metrics reflects an inherent conflict of goals for speech work. On one hand, we want metrics that are easy to compute, reproducible and objective. On the other hand, we want perfect prediction of human experience. No algorithmic metric captures all the complexity of human perception for all the possible forms of naturally occurring and artificial impairments. The only robust way to determine if humans can comprehend a speech sample is to get a statistically significant group of humans to judge it. The “Laurel” vs. “Yanny” debate further highlights the variability of human perception, so sample sizes must be significant to be truly accurate. The expense and time of subjective scoring force researchers to keep going back to come up with better methods – simple formulas that correlate well to real MOS. And in the meantime, researchers and developers of real speech systems use an ensemble of metrics of comfort, quality and intelligibility, together with human listening, to make real progress on real speech experiences.

BabbleLabs is driving to create a speech-powered world, where humans are universally understood and in command.  We use the full spectrum of metrics, subjective and objective, to measure and optimize speech systems. Using metrics for intelligibility, like NCM and STOI, is essential to improving real intelligibility.  Optimizing only for SNR and PESQ may yield speech with a good sound, but it won’t do much for human intelligibility. The result of all the effort with metrics is worth it: clearer voice, greater comfort, higher comprehension – and ultimately, better understanding among people.


May 4, 2018 What's the Big Deal with Speech Enhancement?


Writer and physician Oliver Wendell Holmes once said, “Speak clearly, if you speak at all; carve every word before you let it fall.“  He, of course, was campaigning for thoughtfulness in speech, but comprehensibility of speech remains crucially important.  We all want dearly to understand and be understood by other people - and increasingly by our devices that connect us to our world. 

Speech has always been humans’ preferred interface method – in fact the emergence of hominid speech a couple of hundred thousand years ago coincided with the origins of the species Homo sapiens. But we now live in sonic chaos – a noisy world that is getting noisier every day. We talk in cars, in restaurants, on the street, in crowds, in echo-y rooms and in busy kitchens – there are always kids yelling, horns honking or loud music playing. What can we do to combat the noise, and make speech and understanding easier?

Part of the emerging answer is speech enhancement, the electronic processing of natural human speech to pare back the noise that makes speech hard to comprehend and unpleasant to listen to. Electronic engineers have been concerned about speech quality for almost a century, really since the emergence of radio broadcasts. Most of the speech improvement methods have evolved relatively slowly over the last couple of decades, representing gradual refinement of DSP algorithms that try to separate speech sounds from background noise. The emergence of deep neural network methods is now helping algorithm designers take a fresh look at the problem, and develop much more ambitious and effective speech enhancement methods to reduce noise, remove reverberation and improve the intelligibility of speech.

Speech Enhancement ≠ Speech Recognition

Speech enhancement is distinct from speech recognition. Speech recognition is focused on the problem of rendering human speech audio into equivalent text for the purpose of capturing transcriptions and controlling devices and information systems. It is also being effectively extended into automated translation of languages. Speech enhancement concentrates instead on the human auditory experience.

That experience suggests two goals for speech enhancement. First, the enhancement should make the speech easier to understand, bringing out the distinguishing sounds from the background of noise and acoustic impairments. Second, the enhancement should make the speech more pleasant to listen to, cutting back the distracting, annoying and even painful interference of noise. These two goals are not as inevitably linked as you might imagine.

Intelligibility is actually quite challenging to improve, since the human auditory system and language centers are remarkably adept at piecing together the most likely meaning of even heavily impaired speech sounds. Intelligibility is best measured by human (aka subjective) testing, but it turns out to be well correlated with objective metrics like the Normalized Covariance Metric (NCM) and several flavors of the Short-Time Objective Intelligibility (STOI) measure. Ironically, many of the classical signal processing algorithms for speech enhancement do relatively little for intelligibility. Listening comfort is better modeled by metrics like Perceptual Evaluation of Speech Quality (PESQ), a family of methods captured in the ITU-T recommendation P.862. PESQ correlates well with reduction of the perceived noise level, but noise reduction is not the same thing as intelligibility improvement.

To illustrate this, the BabbleLabs team has implemented some 19 different speech enhancement algorithms from the literature and looked at the results on noisy speech. Here are two examples, comparing the original noisy sample to the “enhanced” results of the algorithms on PESQ and NCM.

(The benefits are so modest, especially on intelligibility, that I haven’t broken them out by name, but we summarize the 19 methods as Martin, MCRA, MCRA2, IMCRA, Doblinger, Hirsch, Conn_FreqKLT, PKLT, Wiener AS, Audnoise, MMSE, LogMMSE, stsa_weuclid, stsa_wcosh, and stsa_mis. Let me know if you want to dig deeper on this.)
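For readers who want a feel for what these classical DSP methods do, here is a minimal sketch of simple spectral subtraction, an ancestor of several of the algorithms named above. It is an illustration only, not one of the 19 implementations, and it assumes the first few frames of the recording contain noise but no speech:

    import numpy as np
    from scipy.signal import stft, istft

    def spectral_subtraction(noisy, fs, noise_frames=10, floor=0.02):
        """Estimate the noise magnitude spectrum from the first few frames,
        subtract it from every frame, and resynthesize with the noisy phase."""
        f, t, Z = stft(noisy, fs, nperseg=512)
        mag, phase = np.abs(Z), np.angle(Z)
        noise_mag = mag[:, :noise_frames].mean(axis=1, keepdims=True)
        # A spectral floor limits the "musical noise" artifacts of full subtraction.
        clean_mag = np.maximum(mag - noise_mag, floor * mag)
        _, enhanced = istft(clean_mag * np.exp(1j * phase), fs, nperseg=512)
        return enhanced

Methods in this family can raise PESQ by reducing the perceived noise level but, as the comparison above suggests, they do relatively little for intelligibility metrics like NCM.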

Noise plays a leading role as the main villain in speech understanding by both humans and machines. As I’ve written about before, today’s speech recognition systems are often extremely sensitive to the presence of noise, and some of the same neural network training methods for noise resilience in deep learning-based speech enhancement can be effectively applied to speech recognition systems too. Unless we make better speech systems – ones with higher speech quality, better intelligibility, more noise robustness and improved speaker identification – all the talk about a revolution in speech interfaces is just talk. A good example is the relatively low satisfaction of smartphone users with voice assistants: PwC reports that only 38% of users are “very satisfied” with mobile voice assistants. We must deploy significantly improved algorithms to address the serious limitations we see today!

Who Needs Speech Enhancement?

The applications of speech enhancement are endless. It can play a major role in better telephony systems, public address systems, and hearing aids. Even if we focus on just one small subset – speech enhancement in internet-based systems – the range is impressive.

  • Call centers capture voice dialog for lots of reasons: for live interaction, for automated response using speech recognition, for logging and training, and for off-line assessment and response.
  • Home and office voice mail used to be something hosted on a box owned by the family or business, but that service has now moved almost entirely to the cloud. Voice calls are often captured with both nasty background noise and channel impairments that make voice mails notoriously difficult to understand.
  • Video and audio production – amateur and professional – often confronts tough issues in noisy and reverberant speech material. When audio professionals have the time, they can manually filter the speech track second-by-second or even phoneme-by-phoneme to reduce the noise using tools like Adobe Audition, or even re-record the audio under studio conditions. Few amateurs have the tools and skills to do sophisticated speech improvement, so automated speech enhancement makes sense for situations ranging from professional editing of field-recorded audio and video, to parents fixing up the cellphone video of the kids’ birthday party before posting it for relatives on YouTube. In fact, an increasing fraction of all social media revolves around video, usually amateur video, on Snapchat, Instagram, Twitter, YouTube, and WeChat.
  • Audio archiving is becoming a critical information access and compliance method in business. Customers, investors and partners all value having ready access to the best possible recordings of conference calls and earnings calls. In Europe, new MiFID II/MiFIR finance rules require greater transparency and access by investors to key business information, and audio archiving is likely to be an element of this new regime.
  • Games and entertainment have developed rich interactive speech elements. The Talking Tom franchise and apps like “My Pet Can Talk” turn voice recordings into animal animations. Massive multiplayer online games (MMOGs) are turning to enhanced speech to make game-play more fun, more comprehensible and more immersive.

The list goes on and on, and the direction is clear.  Enhanced speech is rapidly growing as a key ingredient in speech interaction systems of all kinds.   Better speech systems will both be used by billions of people, and make a critical business difference in any situation where better understanding is needed.

Humans are extremely demanding about the quality of speech they will listen to – after all, we’ve been refining our tastes for hundreds of thousands of years. Better speech enhancement, especially methods that can improve intelligibility rather than just turn down the noise, is critical to progress across these applications.

Let us "carve every word before we let it fall".


March 24, 2018
Living in a speech-triggered world

Like many people in the developed world, I spend every day surrounded by technology  - laptops, phones, car touch-screens, networked gadgets and smart TV.  Of course I want to interact with all these things to get my work done and to enjoy their services and entertainment.  And I have learned to use them pretty well – I have learned to type on keyboards, move mice, finger touchpads, pinch, swipe and peck. The devices themselves have not learned much of anything! 

A wave of change is sweeping towards us, with the potential to dramatically shift the basic nature of interactions between people and our electronic environment. Deep learning-based algorithms are transforming many aspects of computation, by allowing good solutions to bigger, more complex, more ambiguous problems than before. The transformations are particularly striking across vision, language analysis, pattern finding in large data sets, and speech. I have written a lot about vision and the potential for neural network vision algorithms to meet or exceed human capabilities in recognizing and tracking objects, summarizing scenes and sequences, and finding safe navigational channels in two and three dimensions. These vision systems are becoming viable substitutes for human vision in tasks like driving, surveillance and inspection.

Deep learning advances in speech have a completely different character from vision – these advances are rarely about substituting for humans, but about paying more attention to humans. Speech recognition, speaker identification, and speech enhancement are all functions that enhance the human-machine interaction. The net effect of better human-machine interaction is less about displacing human-to-human interaction and more about displacing the keyboards, mice, remote controls and touchpads we have learned, albeit painfully, to use. In a very real sense, the old interfaces required us to retrain the neurological networks in our brains – new speech interfaces move that neural network effort over onto the computers!

Speech is coming into use in many forms – speech enhancement, generation, translation, and identification – but various flavors of speech recognition are getting the most attention. Speech recognition seems to be used in at least three distinct ways.

  1. Recognition is used for large-scale transcription, often for archival purposes, in medical records and in business and legal operations.   This is typically done in the cloud or on PCs. 
  2. We have seen the rapid rise, with Alexa, Siri, Google Voice and others, of recognition for browser-like information queries.  This is almost always cloud-based, not just because the large vocabulary recognizers require, for now, server-based compute resources, but also because the information being sought naturally lives in the cloud.
  3. Recognition drives local system controls, where the makers of phones, cars, air conditioners and TVs want the convenience of voice command. In some cases, these may also need voice responses, but for simple commands, proper recognition leads to obvious natural responses.

Voice has some key advantages as a means of user-interface control, due to the richness of the information it carries. Not only do the words themselves carry meaning, but the tone and emotion, the speaker, and the sound environment may bring additional clues to the user’s intent and context. But leveraging all this latent information is tricky. Noise is particularly disruptive to understanding, as it masks the words and tone, sometimes irretrievably. We also want the speech for user interfaces to be concise and unambiguous, yet flexible. This places a sophisticated set of demands on even these “simple” voice command UIs.

  • Command sets that comprehensively cover all relevant UI operations, sometimes including obscure ones.
  • Coverage of multiple phrases for each operation, for example: “On”, “Turn on”, “Turn on TV”, “Turn on the TV”, “Turn on television”, “Turn on the television” (and maybe even “Turn on the damn TV”). These lists can get pretty long.
  • Tolerance of background noise, competing voices and, most especially, the noise of the device itself, particularly for music and video output devices. In some cases, the device can know what interfering noise it is generating and attempt to subtract it out of the incoming speech, but complex room acoustics make this surprisingly tricky.
  • Ability to distinguish commands directed to a particular device from similar commands for other voice-triggered devices in the room.

The dilemma of multiple voice-triggered devices in one room is particularly intriguing. Devices often use a trigger phrase (“Hey Siri” or “Alexa”) to get the attention of the system, before a full set of commands or language can be recognized. This may be particularly practical for cloud-based systems that don’t want or need to be always streaming audio to the cloud. Listening for a single wake-up phrase can take significantly less compute than always listening for a whole vocabulary of phrases. However, it does increase the human speech effort to get even the simplest service. Moreover, most device users don’t really want to endure the essential latencies – 1-2 seconds for the simplest commands – inherent in cloud-based recognition.

As the number of devices in a space increases, it will be more and more difficult to remember all the trigger phrases and the available set of commands for each device. In theory, a unified system could be created, in which commands are recognized via a single rational interface and then routed to the specific gadget to be controlled. While Amazon or Google might be delighted with this approach, it seems somewhat unlikely that all potential voice-triggered devices will be built or controlled by a single vendor. Instead, we are likely to see some slightly-contained chaos, in which each device vendor dispenses with wake words for local devices and attempts to make their commands as disjoint as possible from the commands for other devices. In the case of conflicts – e.g. when both the music player and TV support a command “Turn volume down” – the devices will work from context, and may even signal to one another that a relevant command has been captured and acted on. (And rather than relying on a network handshake for synchronization, the devices may signal audibly to indicate a successful command capture. Both other devices and the user may like discrete notification.)

One great challenge of ubiquitous speech UIs is endowing each device with sufficient smarts. Most devices need far less than the full continuous speech recognition needed for transcription, but even fairly simple devices will likely need a vocabulary of tens of phrases, often comprising hundreds of words. Accurately spotting all those words and phrases, particularly in a noisy environment with multiple speakers and the audio blaring, is tricky, especially given reasonable expectations for accuracy. We simply don’t want to have to repeat ourselves. This means running fairly sophisticated recognition neural networks, either to pick out words for matching against command patterns or to directly recognize entire command phrases. Fortunately, neural network inference hardware is getting dramatically more efficient. Creators of neural network processors and accelerators for embedded systems are now routinely touting efficiencies above 1T operations per watt, often driven by the demands of computer vision. Short command recognition is likely to take 3-5 orders of magnitude less compute than that. This implies that as neural network-specific acceleration grows common in embedded platforms, we may well see rich speech interfaces implemented in microwatts of power. In addition, we are likely to see hybrid systems that follow a 90/10 rule – 90% of commands are recognized and acted on locally, to minimize latency, computing cost and network bandwidth consumption, and the most obscure and difficult phrases are passed on to the cloud for interpretation. A well-designed hybrid system will have the latency of local and the sophistication of cloud. All this is profoundly good news for users, who hate to get up off the couch or to look for the remote!
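To put rough numbers on that claim, here is a back-of-envelope calculation in Python; the workload figures are illustrative assumptions, not measurements of any particular network:

    # Always-on command recognition on a hypothetical 1 TOPS/W neural accelerator.
    accelerator_efficiency = 1e12   # operations per second per watt
    ops_per_inference = 1e6         # assume ~1M ops for a small command-spotting network
    inferences_per_second = 50      # score the audio stream every 20 ms
    workload = ops_per_inference * inferences_per_second   # 5e7 ops per second
    power_watts = workload / accelerator_efficiency
    print(f"{power_watts * 1e6:.0f} microwatts")            # prints: 50 microwatts

Under these assumptions, the always-on recognizer sits four to five orders of magnitude below a watt-scale vision workload, which is why microwatt-class speech interfaces look plausible.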

In the last couple of years, handwringing over the potential downsides of AI has become a surefire attention getter.  Some of the concerns are real – we should be worrying about implicit bias in training sets, about the fragility and incomprehensibility of systems in mission-critical situations, and the potential economic disruption from rapid change in some classes of work.   I am less worried about the most apocalyptic “rise of the robot overlords” scenarios.  In fact, I see the end of our fifty years of forced learning to type, swipe, mouse and point.  Perhaps instead of computers forcing us to learn their preferred interface methods, we can teach them ours.  Perhaps we can now finally throw off our robot overlords!


 
January 16, 2018
How well does speech recognition cope with noise?

Speech is rapidly becoming a primary interface for the connected world. Apple’s Siri, Amazon’s Alexa services running on Echo devices, and a wealth of cloud-based speech recognition APIs are all bringing speech-based user interfaces into the mainstream. Moreover, the spectrum of information services is likely to expand rapidly over the next couple of years, bringing speech up to the level of mobile touch screen web browsing and phone apps.

The speech interface revolution is starting for good reason – typing on keyboards, clicking mice and touching glass are neither natural nor efficient for humans. Most importantly, they usually need our eyes engaged too, not just to get the responses, but even to make accurate inputs. Speaking is convenient and quite high bandwidth. Humans can routinely talk at 140-160 words per minute, but few of us can type that fast, especially while driving (though we seem to keep trying). With the advent of good automatic speech recognition (ASR) based on deep neural networks, we can start to exploit speech interfaces in all the situations where eyes and hands aren’t available, and where natural language can plausibly express the questions, commands, responses and discussion at the heart of key interactions. Many of the worst impediments to using speech widely come from noise problems. Noise impacts the usability of all interfaces and all likely speech application types. Siri and Alexa-based services work remarkably well in a quiet room, but performance degrades significantly as noise goes up.

My team at BabbleLabs has just completed a basic assessment of the noise tolerance of three popular speech recognition cloud APIs. We are not using the data from this experiment to test, develop, train or distribute any product. We present the data simply as a way to highlight some of the interesting issues. We are not proposing it as a way to choose the “best” speech recognizer. In fact all three recognizers are remarkably powerful tools for a huge range of language-centric applications. Nevertheless, teasing out small differences can trigger some useful discussion about the nuances in behavior and use. As a further caveat, we do not know for certain whether we have used each API with optimal settings, so the performance of the recognizers with other settings, or with other speech data or noise distributions, might vary from these results. Finally, we do not know if any of these systems used the UW PNNC corpus in their training – such a system might look a little better as a result.

Our methodology was to take short snippets of speech from the University of Washington’s PNNC spoken English corpus (https://depts.washington.edu/phonlab/resources/pnnc/pnnc2/), add non-stationary noise samples of many types, and scale them to various levels. The noise is taken from the MUSAN dataset (https://arxiv.org/pdf/1510.08484.pdf). That corpus contains 929 files of assorted noises, ranging from technical noises, such as DTMF tones, dialtones and fax machine noises, to ambient sounds, such as cars idling, thunder, wind, footsteps, paper rustling, rain, animal noises and crowd noises.

The generated noisy versions ranged in signal-to-noise ratio (SNR) from about 30dB (barely audible noise) down to almost -5dB (noise overwhelms the original clean signal). We called the APIs with each sound snippet for both the original, clean version and the additive noise case, and compared the returned text transcript against the reference text. We used the scdiff utility from the NIST SCTK Scoring Toolkit to generate a word error rate (WER) for each clean or noisy snippet version. This is not necessarily the same error rate metric used in other publications, so we must be careful in comparing these numbers to other WER figures. We ran the same 1000 snippets on each of three ASR APIs, labeled here as Recognizer A, Recognizer B and Recognizer C.
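For reference, the noise-mixing step works roughly like this; a minimal sketch in Python, assuming clean speech and noise arrays sampled at the same rate (the function is illustrative, not the exact code we ran):

    import numpy as np

    def mix_at_snr(clean, noise, snr_db):
        """Add noise to clean speech, scaled so the mixture hits a target SNR in dB."""
        reps = int(np.ceil(len(clean) / len(noise)))
        noise = np.tile(noise, reps)[:len(clean)]   # loop/trim the noise to the speech length
        clean_power = np.mean(clean ** 2)
        noise_power = np.mean(noise ** 2) + 1e-12
        # Choose a gain so that clean_power / (gain**2 * noise_power) equals 10**(snr_db / 10).
        gain = np.sqrt(clean_power / (noise_power * 10 ** (snr_db / 10)))
        return clean + gain * noise

    # e.g. noisy = mix_at_snr(speech_snippet, babble_noise, snr_db=5.0)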

Here are the results for all 1000 noisy samples, showing our calculation of word error rate for each sample plotted against the injected noise level for that snippet. The graph also shows the average word error rate on the clean versions for that API. I have also included a linear curve fit, with the R² value of the fit.







This data permits some observations: 

1. All of the speech recognizers do quite well at low noise levels (high SNR). These word error rates may not match the figures cited by these APIs’ publishers, but those small differences are easily explained by differences in the test cases. 

2. Error rates rise significantly as noise increases. In many cases the recognizers are quite confused by the noisy speech and produce nothing or completely garbled versions. 

3. The performance at high noise levels has very high variance. For some samples, the transcripts remain quite accurate. In others the ASR gets quite lost. 

4. It looks like Recognizer A’s system is currently slightly more noise resilient than Recognizer C, which in turn, scores somewhat better than Recognizer B at high noise levels (SNR near 0dB). This suggests that Recognizer A may have used somewhat higher levels of noise and/or more variety of noise during their ASR speech training process. Nevertheless, none of the systems is spectacular when the noise level starts to rival or exceed the level of the clean voice signal. 

We get a little more insight if we look at how each of the systems behaves in each range of noise. We can split the input SNR range of 20dB to -5dB into six equal bins and look at the noisy ASR results in each bucket.
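A quick sketch of that binning step, using hypothetical snr_values and wer_values arrays as stand-ins for the per-snippet SNR and word error rate of one recognizer:

    import numpy as np

    # Synthetic stand-ins for illustration only; the real arrays come from the experiment.
    rng = np.random.default_rng(1)
    snr_values = rng.uniform(-5, 20, 1000)
    wer_values = np.clip(0.6 - 0.02 * snr_values + 0.1 * rng.standard_normal(1000), 0, 1)

    edges = np.linspace(-5, 20, 7)           # six equal-width bins from -5 dB to 20 dB
    bins = np.digitize(snr_values, edges)    # bin index (1..6) for each snippet
    for b in range(1, len(edges)):
        in_bin = bins == b
        if in_bin.any():
            print(f"{edges[b-1]:5.1f} to {edges[b]:5.1f} dB: "
                  f"mean WER {wer_values[in_bin].mean():.1%}")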



This shows that Recognizer A does better across most of the range, except at the lower noise levels.   Recognizer C shows the highest word error rate across all noise levels.

Speech recognition is making steady progress, but it is far from ideal. Further progress on accuracy and consistency, especially under difficult noise conditions, is critically important. But the current enthusiasm over speech recognition may gloss over other big opportunities and important problems in front of us. The opportunities stem from the fact that speech recognition is just the tip of the iceberg of powerful new capabilities and applications for deep speech processing. I see a complex constellation of possibilities in speech analysis for language, accent and emotion, speech enhancement for noise reduction and reconstruction, speech generation, speaker identification and authentication, speech thread separation and speech search. Some of these are built on mapping back and forth between speech audio and speech text, but speech is important even outside of ASR. After all, we still use our phones for live calls – or try to in the face of often miserable speech communications environments. We produce video and audio recordings for playback, and leave voice mail; we want to know who is speaking; we even want to know how they are feeling. There is a rich world above and below the water line!



November 29, 2017 Building a speech start-up

A startup is a source of excitement at any time, but we don't live at just "any time".  Right now, we are experiencing an era of particularly rapid evolution of computing technology, and a period of particularly dramatic evolution of new business models and real-world applications.   The blanket term, "AI", has captured the world's imagination, and it is tempting to dismiss much of the breathless enthusiasm - and doomsaying - as just so much hype.  While there is a large dose of hype circulating, we must not overlook the very real and very potent emergence of deep learning or neural network methods as a substantially fresh approach to computing.  The pace of improvement of algorithms, applications and computing platforms over just the past five years shows that this approach - more statistical, more parallel, and more suitable for complex, essentially ambiguous problems - really is a big deal.


Not surprisingly, a great deal of today's deep learning work is being directed at problems in computer vision - locating, classifying, segmenting, tagging and captioning images and videos. Roughly half of all deep learning startup companies are focused on one sort of vision problem or another. Deep learning is a great fit for these problems. Other developers and researchers have fanned out across a broad range of other complex, data-intensive tasks in modeling financial markets, network security, recruiting, drug discovery, and transportation logistics. One domain is showing particular promise: speech processing. Speech has all the key characteristics of big data and deep complexity that suggest potential for neural networks. And researchers and industry have made tremendous progress on some major tasks like automatic speech recognition and text-to-speech synthesis. We might even think that the interesting speech problems are all getting solved.


In fact, we have just scratched the surface on speech, especially on the exploitation of the combination of audio signal processing and neural network methods to extract more information, remove ambiguity and improve quality of speech-based systems.  Deep networks can follow not just the sounds, but the essential semantics of the audio trace, providing powerful means to overcome conflicting voices, audio impairments and confusion of meaning.  The applications of improved speech understanding extend well beyond the applications we know today - smart speakers, cloud-based conversation bots and limited vocabulary device command systems.  We expect to see semantic speech processing permeate into cars, industrial control systems, smart home appliances, new kinds of telephony and a vast range of new personal gadgets.  We fully expect to see lots of new cloud APIs for automated speech processing services, new packaged software and new speech-centric devices.  Speech is so natural and so accessible for humans, that I can predict it will be a preferred interface mechanism for hundreds of billions of electronic systems over time.


Ironically, the entrepreneurial world is not pursuing speech opportunities the way it has been chasing vision applications. In Cognite Ventures' list of the top 300 deep learning startups, for example, about 160 are focused on vision, but only 16 on speech (see www.cogniteventures.com). While vision IS a fertile ground, speech is equally fertile. Moreover, most of the innovations in new deep learning network structures, reinforcement learning methods, generative adversarial networks, accelerated neural network training and inference chips, model distillation and new intuitive training frameworks apply to speech as completely and easily as to vision tasks.
So this is the reason I am starting a deep learning applications company, focused on advanced speech processing.  The core technology concepts are here.  The neural network and signal processing expertise is here.  The computing power is here.  The market opportunities  are here.  BabbleLabs is here!