The good news is that we were able to improve intelligibility slightly compared with filling with zeros (it's also a lot less annoying to listen to). The bad news is that you can only do so much with PLC, which is why we then pursued the Deep Redundancy (DRED) idea.
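For anyone curious what the baseline looks like: here's a rough sketch of zero-fill versus the simplest possible concealment, repeating the last good frame with a decaying gain (plain NumPy, my own illustration; the actual PLC discussed above is a neural model, not this).

    import numpy as np

    FRAME = 320  # 20 ms frames at 16 kHz (illustrative choice)

    def conceal_zeros(frames, lost):
        """Baseline: replace lost frames with silence."""
        return [np.zeros(FRAME) if i in lost else f for i, f in enumerate(frames)]

    def conceal_repeat(frames, lost):
        """Naive PLC: repeat the last good frame, fading out over a loss burst."""
        out, last_good, gain = [], np.zeros(FRAME), 1.0
        for i, f in enumerate(frames):
            if i in lost:
                gain *= 0.7            # decay so long bursts fade to silence
                out.append(last_good * gain)
            else:
                last_good, gain = f, 1.0
                out.append(f)
        return out

Even that repeat-and-fade trick sounds much less jarring than hard silence, which gives a feel for why a model that actually predicts the missing frames can do better still.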
When it all comes together, it's kind of a nightmare for an ASR model. There were plenty of times while reviewing the recordings and ASR output when I'd listen to the audio and have no idea what they said.
I'm not sure which factor contributes most, but I know from my prior experience with ASR for telephony that even clean speech on pristine connections does much worse when a model trained on 16 kHz audio is fed native 8 kHz audio that gets resampled.
I've done some early work with Whisper in the telephony domain (transcribing voicemails on Asterisk and FreeSWITCH), and the accuracy already seems to be quite a bit worse.
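In case it saves anyone else the trouble: Whisper expects 16 kHz input, so telephony audio has to come up from 8 kHz first. A minimal sketch of the boilerplate I mean (file name and model size are placeholders, nothing specific to any particular pipeline):

    import librosa
    import whisper  # openai-whisper

    # Load the voicemail at its native telephony rate (typically 8 kHz, mono).
    audio, sr = librosa.load("voicemail.wav", sr=None, mono=True)

    # Upsample to the 16 kHz Whisper was trained on. Resampling adds no new
    # high-frequency content -- everything above 4 kHz is still gone, which is
    # part of why telephony audio transcribes worse.
    audio_16k = librosa.resample(audio, orig_sr=sr, target_sr=16000)

    model = whisper.load_model("base")
    print(model.transcribe(audio_16k)["text"])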
Actually, what we're doing with DRED isn't that far from what you're suggesting. The difference is that we keep more information about the voice/intonation, and we don't need the latency that an ASR would otherwise add. In the end, the output is still synthesized from higher-level, efficiently compressed information.
Added an enhancement request, issue 1064 [1], to GitHub, asking for the clients to support longer audio clips.
I can't promise when we'll get to it, as from now until the new year is a bit of a wash.
I don't know of any detailed comparisons of commercial solutions. However, with respect to pure word error rate, the article [2] compares several engines circa 2015.
Uh... isn't that pretty much what things like LPCNet, WaveNet, and Codec 2 (discussed here) are doing? These are speech-specific codecs. I don't understand what unique idea you're trying to "spitball" here.
Lots of advanced research has been done in this area. These are the results as of now.
Very cool achievement! IIUC, the author was targeting a digital replacement for analog SSB voice comms.
I don't expect to see widespread adoption in other areas though, such as cellular phones and similar. To my ears, I doubt the codec would score well on common voice intelligibility metrics such as the Diagnostic Rhyme Test [1] or PESQ [2].
However, it may be an improvement over SSB voice intelligibility!
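PESQ at least is easy to compute offline if you want to check that hunch; a minimal sketch with the pesq package from PyPI (file names are placeholders), comparing a decoded recording against the clean reference:

    from scipy.io import wavfile
    from pesq import pesq  # pip install pesq (ITU-T P.862 implementation)

    # Reference and degraded files must share a sample rate:
    # 8 kHz for narrowband ('nb') mode, 16 kHz for wideband ('wb').
    rate, ref = wavfile.read("reference.wav")
    _, deg = wavfile.read("decoded.wav")

    print("PESQ:", pesq(rate, ref, deg, "wb"))  # higher is better

The DRT, on the other hand, is a human listening test, so there's nothing to script there.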
Thanks so much for pointing me to that. If I had an always-air-gapped ORWL, I'd probably put a generator on it to be able to quickly generate them; I generate an absurd number of accounts each week, since I don't tie into SSO offerings on public sites.
The only concern I need to solve for is sifting these lists for confusing words, like homophones, that are difficult to say clearly over an audio-only connection. That drives down the convenience and increases the time needed to call out the passphrase, since I then revert to the phonetic alphabet. Unfortunately, I've only found various word lists [1], but not something that takes a list, looks up the ExtIPA pronunciation [2] for each word, and then applies rules to rank the clarity of each word when spoken over the phone.
That last part is where I'm stumbling. I can't find the rules that govern what we know about clarity over telephone links, though AT&T must have studied this at some point. I can come close with this article [3] about the factors that explain why the sounds "f", "th", and "s" are difficult for hearing-impaired listeners to hear. Ideally, such a systematic categorization would yield not just a clarity ranking of the individual words in a chosen Diceware list, but also a clarity ranking of a chosen passphrase as a whole.
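Here's roughly the kind of thing I have in mind as a starting point, sketched with the CMU Pronouncing Dictionary (ARPAbet rather than ExtIPA) and a made-up penalty table that just downweights the fricatives the article calls out ("f", "th", "s" and their voiced pairs); the weights are placeholders, not derived from any telephony study:

    import nltk
    from nltk.corpus import cmudict

    nltk.download("cmudict", quiet=True)
    PRON = cmudict.dict()

    # Illustrative penalties only: ARPAbet phones for the weak fricatives
    # get marked down; everything else is treated as clear.
    PENALTY = {"F": 3, "TH": 3, "S": 2, "V": 2, "DH": 2, "Z": 1}

    def clarity_score(word):
        """Lower is clearer; returns None for words missing from the dictionary."""
        prons = PRON.get(word.lower())
        if not prons:
            return None
        phones = [p.rstrip("012") for p in prons[0]]  # strip vowel stress digits
        return sum(PENALTY.get(p, 0) for p in phones)

    def rank_wordlist(words):
        scored = [(clarity_score(w), w) for w in words]
        return sorted((s, w) for s, w in scored if s is not None)

    print(rank_wordlist(["faith", "thistle", "banana", "seven", "mango"]))

Ranking a whole passphrase would then be as simple as summing (or taking the maximum of) the per-word scores, but the hard part is still replacing that penalty table with rules actually backed by telephone-intelligibility data.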
I'd be a little worried about the feature that fills in missing voice chunks. It sounds ripe for accidentally replacing one lost word with a completely different word that could also make sense in context, almost like the issue where Xerox copiers would sometimes replace one character with another. Hopefully the filling in of missing chunks is done in a way that doesn't allow it to fill in whole words, but only short, sub-syllable chunks of audio?
I'm guessing that, since characterizing voiced speech is still inaccurate, the more difficult task of translating neural impulses to speech will degrade performance even further. Still unsolved: at the moment, the device has a limited vocabulary of 150 words and phrases.
Thanks for all the hard work you've put in so far, @reubenmorais.
+1000 to @mostlyjason's comment - Great latency figures mean nothing if the word error rate is high, since it dents confidence in the output (so why use DeepSpeech?) and (as the parent comment notes) necessitates manual error correction.
I would love to see a future release focus on optimizing WER for these reasons.
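For what it's worth, tracking WER on your own test set is cheap; a small self-contained sketch (standard word-level edit distance, nothing DeepSpeech-specific):

    def wer(reference: str, hypothesis: str) -> float:
        """Word error rate: (substitutions + insertions + deletions) / reference length."""
        ref, hyp = reference.split(), hypothesis.split()
        # Classic Levenshtein dynamic-programming table over words.
        d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
        for i in range(len(ref) + 1):
            d[i][0] = i
        for j in range(len(hyp) + 1):
            d[0][j] = j
        for i in range(1, len(ref) + 1):
            for j in range(1, len(hyp) + 1):
                cost = 0 if ref[i - 1] == hyp[j - 1] else 1
                d[i][j] = min(d[i - 1][j] + 1,        # deletion
                              d[i][j - 1] + 1,        # insertion
                              d[i - 1][j - 1] + cost) # substitution or match
        return d[len(ref)][len(hyp)] / max(len(ref), 1)

    print(wer("turn the lights off", "turn the light off"))  # 0.25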
You can't just drop compatibility. We will have AI-trained voice systems that mimic natural speech just enough to be understood by Duplex while compressing the exchange to a minimum. Data transfer will be measured in microwords per second. Future versions of Duplex will of course detect this kind of compressed speech and reply in kind, falling back to normal speech only if the immediate response resembles North American confusion.
There are other studies that show we can understand speech at a much higher rate. The critical bandwidth constraint seems to be on stream out, not stream in.
The low WER numbers you've probably seen are for conversational telephone speech with constrained subjects. ASR is much harder when the audio source is farther away from the microphone and when the topic isn't constrained.
> There are alternatives that share Whisper internal states with an LLM to improve ASR, as well as approaches that sample N-best hypotheses from Whisper and fine-tune an LLM to distill the hypotheses into a single output. Haven't looked too much into these yet given how expensive each component is to run independently.
Tangent here: really? I've found base Whisper has concerning error rates for non-US English accents; I imagine the same is true for other languages where the source dataset has a dominant regional accent.
Whisper + an LLM can recover some of the gaps by filling in contextually plausible bits, but then it's not a transcript and may contain hallucinations.
There are alternatives that share Whisper internal states with an LLM to improve ASR, as well as approaches that sample N-best hypotheses from Whisper and fine-tune an LLM to distill the hypotheses into a single output. Haven't looked too much into these yet given how expensive each component is to run independently.
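If anyone wants to poke at the N-best half of that, here's roughly what the sampling side looks like with the Hugging Face Whisper checkpoint; the distill_with_llm function at the end is a stand-in for whatever fine-tuned LLM does the merging, not a real API:

    import torch
    from transformers import WhisperProcessor, WhisperForConditionalGeneration

    processor = WhisperProcessor.from_pretrained("openai/whisper-small")
    model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")

    def n_best(audio_16k, n=5):
        """Return n beam-search hypotheses for a 16 kHz mono waveform."""
        features = processor(audio_16k, sampling_rate=16000,
                             return_tensors="pt").input_features
        with torch.no_grad():
            ids = model.generate(input_features=features,
                                 num_beams=n, num_return_sequences=n)
        return processor.batch_decode(ids, skip_special_tokens=True)

    def distill_with_llm(hypotheses):
        # Placeholder: in the approaches mentioned above, a fine-tuned LLM takes
        # the hypothesis list and emits a single corrected transcript.
        prompt = "Pick or merge the most likely transcript:\n" + "\n".join(hypotheses)
        raise NotImplementedError(prompt)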
Once automatic speech recognition (ASR) gets closer to bulletproof, I expect this to become a huge thing, but right now it seems like you're getting better error rates than are typical.
Any input method where you frequently have to repeat yourself and undo things won't get mainstream. I'd bet people's mainstream tolerance for errors would have to be like one per five to ten minutes before you could get them to really adopt something like this (barring disability reasons, like RSI). Until then, the tech and market don't match.