Many of us have gotten used to talking with our gadgets in a more or less natural way, and expecting the machine to understand. But it's not like our computers or smartphones have grammars inside that work the same as ours. So how does speech recognition work? We can break speech down into its waveform, measure that signal's properties, and then teach the computers to track those. But there are tiny variations in the way we produce sounds, both because we're not perfect, and because sounds get influenced by the phonetic environment they show up in. Keeping track of all of that requires a lot of processing power, but it does mean the program can get more and more accurate. And for extra accuracy, you can add in syntactic information, to help decide between acoustically similar, but semantically different, strings of sounds.
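To picture that first step - chopping the signal up and tracking its properties - here's a toy sketch in Python. Everything in it (the `frame_features` helper, the frame size, the made-up "vowel" and "fricative" signals) is invented for illustration; real recognizers use much richer features, but the framing idea is the same:

```python
import math

def frame_features(samples, frame_size=160):
    """Split a waveform into frames and compute two crude per-frame
    features: energy (roughly, loudness) and zero-crossing rate
    (roughly, noisiness vs. voicing)."""
    features = []
    for start in range(0, len(samples) - frame_size + 1, frame_size):
        frame = samples[start:start + frame_size]
        energy = sum(s * s for s in frame) / frame_size
        crossings = sum(
            1 for a, b in zip(frame, frame[1:]) if (a < 0) != (b < 0)
        )
        features.append((energy, crossings / frame_size))
    return features

# A toy "vowel": a loud, slowly oscillating sine wave...
vowel = [math.sin(2 * math.pi * 5 * t / 160) for t in range(160)]
# ...vs. a toy "fricative": quiet, rapidly alternating noise.
fricative = [0.1 * (-1) ** t for t in range(160)]

v_energy, v_zcr = frame_features(vowel)[0]
f_energy, f_zcr = frame_features(fricative)[0]
print(v_energy > f_energy, v_zcr < f_zcr)  # → True True
```

Even features this crude separate the two toy sounds; the real trick is that the numbers wobble from speaker to speaker and context to context, which is exactly the variation described above.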
But it requires a lot of human assistance to plug in all the relevant rules, which makes it very time consuming. Fortunately, of late, speech recognition software has been making use of deep learning. In deep learning models, computers go through vast quantities of data to sort out for themselves what the relevant patterns are. These models need to be watched to make sure they don't go off the rails, but they have the potential to be much more accurate, since the functions that make them up loosely mimic what our own brains do. So while things might not be perfect yet for speech recognition, we've come a long way, and we can guess computers will continue to get better.
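Deep learning itself is too big to sketch in a few lines, but the core idea - letting the data define the patterns instead of hand-writing every rule - can be caricatured with something much simpler: a nearest-centroid classifier over made-up acoustic features. All the numbers and labels here are invented for illustration, and this is a stand-in for learning from data, not a real speech model:

```python
def centroid(points):
    """Mean of a list of feature vectors."""
    n = len(points)
    return [sum(p[i] for p in points) / n for i in range(len(points[0]))]

def classify(x, centroids):
    """Pick the label whose learned centroid is closest to x."""
    def dist(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))
    return min(centroids, key=lambda label: dist(x, centroids[label]))

# Made-up (energy, zero-crossing-rate) examples for two sound classes.
examples = {
    "vowel":     [[0.9, 0.05], [0.8, 0.08], [0.85, 0.06]],
    "fricative": [[0.1, 0.90], [0.15, 0.85], [0.12, 0.95]],
}
# "Training": the data itself defines what each class looks like.
centroids = {label: centroid(pts) for label, pts in examples.items()}

print(classify([0.82, 0.07], centroids))  # → vowel
print(classify([0.11, 0.88], centroids))  # → fricative
```

Nobody wrote a rule saying what a vowel's energy should be - the examples did. Deep networks take that same idea and scale it up by many orders of magnitude.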
Speech recognition is a very interesting topic, but it's only one side of the coin. Anyone who’s used text readers or Siri or such knows that they don’t really sound all that natural. So why is that?
This is a more involved question than you might think! The easiest part to explain is the earliest approach: initial synthesized speech was made by measuring the acoustics of different sounds, generating each one artificially, and then glomming them together. But one problem with this is that though the hallmark acoustic features of phonemes are usually fairly constant, what actually makes it sound like a person is the finer acoustic structure shaped by the specific vocal tract of the person talking. And keeping that constant, so it sounds like the same person throughout, is hard. So this gives you the Speak & Spell kind of thing - you can hear what it’s saying, but it’s weird and not rich.
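That measure-the-acoustics-and-rebuild approach can be caricatured in a few lines: take a couple of resonant frequencies of the vocal tract (formants) and buzz them together. The `formant_vowel` helper is made up for this sketch, and the frequencies are only rough approximations of an /a/-like vowel's first two formants, not values from any real synthesizer:

```python
import math

SAMPLE_RATE = 8000

def formant_vowel(formants, duration=0.1):
    """Crude formant-style synthesis: a vowel-ish buzz built by
    summing sine waves at the given resonant frequencies."""
    n = int(SAMPLE_RATE * duration)
    samples = []
    for t in range(n):
        value = sum(math.sin(2 * math.pi * f * t / SAMPLE_RATE)
                    for f in formants)
        samples.append(value / len(formants))  # keep amplitude in [-1, 1]
    return samples

# Roughly /a/-like first two formants (approximate textbook values).
ah = formant_vowel([700, 1100])
print(len(ah))  # → 800
```

You get something identifiably vowel-like, but with none of the rich, speaker-specific detail a real voice carries - which is exactly the Speak & Spell problem.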
So most speech synthesis now is built off actual people’s speech; for example, Siri’s American female voice is based off of Susan Bennett’s. For this, the speaker is recorded speaking a ton, and then those recordings are torn apart into their different phonemes and allophones, stored, and then put back together to make new speech. The more speech you have to begin with, the better this can sound. This is also the technology behind the voice Roger Ebert used in the later years of his life, when he couldn’t speak himself anymore.
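In miniature, that record-chop-and-reassemble pipeline looks something like the sketch below, where each "phoneme" is just a sine tone standing in for a stored snippet of recorded speech (the `VOICE_BANK`, the labels, and the tone frequencies are all made up for illustration):

```python
import math

SAMPLE_RATE = 8000

def tone(freq, duration=0.05):
    """Stand-in for a recorded phoneme: here it's just a sine tone,
    where a real system would store a snippet of human speech."""
    n = int(SAMPLE_RATE * duration)
    return [math.sin(2 * math.pi * freq * t / SAMPLE_RATE)
            for t in range(n)]

# A tiny "voice bank": phoneme label -> stored waveform.
VOICE_BANK = {
    "k": tone(300),
    "ae": tone(700),
    "t": tone(400),
}

def synthesize(phonemes):
    """Concatenative synthesis in miniature: look up each phoneme's
    stored waveform and glue them end to end."""
    out = []
    for p in phonemes:
        out.extend(VOICE_BANK[p])
    return out

cat = synthesize(["k", "ae", "t"])
print(len(cat))  # three 0.05 s snippets back to back → 1200
```

The bigger the bank of recorded snippets, the more contexts you have covered and the smoother the joins - which is why more source speech sounds better.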
While clearly this’ll make the voice sound more human - after all, it was human at one point - it doesn’t solve the bigger part of the problem. The really big problem is that we also care about intonation patterns over larger sets of words. We need to understand the tone of a sentence (someone saying “It snowed all night” will probably feel differently depending on whether they now get a day off from school or have to drive into work), and produce the right patterns. And that has to be layered on top of the actual sounds themselves - so all the individual sounds need to be shaded correctly, and they have to sound correct when put together. And you have to understand the environment the sentence is being produced in.
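That layering idea - a contour swept across already-assembled sounds - can be sketched like this, using loudness as a crude stand-in for pitch (the `apply_contour` helper and its numbers are invented for illustration):

```python
def apply_contour(samples, start_level, end_level):
    """A cartoon of prosody: sweep a contour across a whole
    utterance, scaling each sample by a level interpolated from
    start_level to end_level. Real systems sweep pitch and timing,
    not just loudness, but the layering is the same."""
    n = len(samples)
    out = []
    for i, s in enumerate(samples):
        level = start_level + (end_level - start_level) * i / max(1, n - 1)
        out.append(s * level)
    return out

utterance = [1.0] * 5  # stand-in for concatenated speech samples
statement = apply_contour(utterance, 1.0, 0.4)  # falling, statement-like
question = apply_contour(utterance, 0.4, 1.0)   # rising, question-like
print(statement[0] > statement[-1], question[0] < question[-1])  # → True True
```

The hard part isn't applying a contour - it's deciding which contour, which means the system first has to figure out whether that snowy night is good news or bad.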
So you put that all together, and we’re talking massive, stupendous amounts of processing. Beyond which, we don’t even have all the ideas in place for recognizing everything that’s potentially required in a given situation - emotion, the kind of sentence, words that could have multiple pronunciations, and so on. And even where we do, it’s just so much information to sort through and put together quickly that it’s still too big a task. Having the voice sound flat or degraded or otherwise unnatural is the price we pay for having the technology work at all for now. It’s cool that we can even do as much as we have so far! But we’ve still got quite a long way to go.
(Originally published as a Tumblr post. It seemed so topical here, it was a shame not to reuse it.)
So how about it? What do you all think? Let us know below, and we’ll be happy to talk with you about how our computers understand us, and how they've changed over time. There’s a lot of interesting stuff to say, and we want to hear what interests you!