The good news is that we were able to improve intelligibility slightly compared with filling with zeros (it's also a lot less annoying to listen to). The bad news is that you can only do so much with PLC, which is why we then pursued the Deep Redundancy (DRED) idea.
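For anyone curious what the baseline looks like: here's a rough sketch of zero-fill versus the simplest possible concealment, repeating the last good frame with a decaying gain (plain NumPy, my own illustration; the actual PLC discussed above is a neural model, not this).

    import numpy as np

    FRAME = 320  # 20 ms frames at 16 kHz (illustrative choice)

    def conceal_zeros(frames, lost):
        """Baseline: replace lost frames with silence."""
        return [np.zeros(FRAME) if i in lost else f for i, f in enumerate(frames)]

    def conceal_repeat(frames, lost):
        """Naive PLC: repeat the last good frame, fading out over a loss burst."""
        out, last_good, gain = [], np.zeros(FRAME), 1.0
        for i, f in enumerate(frames):
            if i in lost:
                gain *= 0.7            # decay so long bursts fade to silence
                out.append(last_good * gain)
            else:
                last_good, gain = f, 1.0
                out.append(f)
        return out

Even that repeat-and-fade trick sounds much less jarring than hard silence, which gives a feel for why a model that actually predicts the missing frames can do better still.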
When it all comes together, it's kind of a nightmare for an ASR model. There were plenty of times while reviewing the recordings and ASR output when I'd listen to the audio and have no idea what they said.
I'm not sure which factor contributes most, but I know from my prior experience with ASR for telephony that even clean speech on pristine connections does much worse when a model trained on 16 kHz audio is fed native 8 kHz audio that gets resampled.
I've done some early work with Whisper in the telephony domain (transcribing voicemails on Asterisk and FreeSWITCH), and the accuracy already seems to be quite a bit worse.
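In case it saves anyone else the trouble: Whisper expects 16 kHz input, so telephony audio has to come up from 8 kHz first. A minimal sketch of the boilerplate I mean (file name and model size are placeholders, nothing specific to any particular pipeline):

    import librosa
    import whisper  # openai-whisper

    # Load the voicemail at its native telephony rate (typically 8 kHz, mono).
    audio, sr = librosa.load("voicemail.wav", sr=None, mono=True)

    # Upsample to the 16 kHz Whisper was trained on. Resampling adds no new
    # high-frequency content -- everything above 4 kHz is still gone, which is
    # part of why telephony audio transcribes worse.
    audio_16k = librosa.resample(audio, orig_sr=sr, target_sr=16000)

    model = whisper.load_model("base")
    print(model.transcribe(audio_16k)["text"])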
Actually, what we're doing with DRED isn't that far from what you're suggesting. The difference is that we keep more information about the voice/intonation, and we don't need the latency that an ASR would otherwise add. In the end, the output is still synthesized from higher-level, efficiently compressed information.
Added an enhancement request, issue 1064 [1], to GitHub, asking for the clients to support longer audio clips.
I can't promise when we'll get to it, as from now until the new year is a bit of a wash.
I don't know of any detailed comparisons of commercial solutions. However, with respect to pure word error rate, the article [2] compares several engines circa 2015.
Uh... isn't that pretty much what things like LPCNet, WaveNet, and Codec 2 (discussed here) are doing? These are speech-specific codecs. I don't understand what unique idea you're trying to "spitball" here.
Lots of advanced research has been done in this area. These are the results as of now.
Very cool achievement! IIUC, the author was targeting a digital replacement for analog SSB voice comms.
I don't expect to see widespread adoption in other areas though, such as cellular phones and similar. To my ears, I doubt the codec would score well on common voice intelligibility metrics such as the Diagnostic Rhyme Test [1] or PESQ [2].
However, it may be an improvement over SSB voice intelligibility!
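PESQ at least is easy to compute offline if you want to check that hunch; a minimal sketch with the pesq package from PyPI (file names are placeholders), comparing a decoded recording against the clean reference:

    from scipy.io import wavfile
    from pesq import pesq  # pip install pesq (ITU-T P.862 implementation)

    # Reference and degraded files must share a sample rate:
    # 8 kHz for narrowband ('nb') mode, 16 kHz for wideband ('wb').
    rate, ref = wavfile.read("reference.wav")
    _, deg = wavfile.read("decoded.wav")

    print("PESQ:", pesq(rate, ref, deg, "wb"))  # higher is better

The DRT, on the other hand, is a human listening test, so there's nothing to script there.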
Thanks so much for pointing me to that. If I had an always-air-gapped ORWL, I'd probably put a generator on it to be able to quickly generate them; I generate an absurd number of accounts each week, since I don't tie into SSO offerings on public sites.
The only concern I need to solve for is sifting these lists for confusing words, like homophones, that are difficult to say clearly over an audio-only connection. That drives down the convenience and increases the time needed to call out the passphrase, since I then revert to the phonetic alphabet. Unfortunately, I've only found various word lists [1], but not something that takes a list, looks up the ExtIPA pronunciation [2] for each word, and then applies rules to rank the clarity of each word when spoken over the phone.
That last part is where I'm stumbling. I can't find the rules that govern what we know about clarity over telephone links, though AT&T must have studied this at some point. I can come close with this article [3] about the factors that explain why the sounds "f", "th", and "s" are difficult for hearing-impaired listeners to hear. Ideally, such a systematic categorization would yield not just a clarity ranking of the individual words in a chosen Diceware list, but also a clarity ranking of a chosen passphrase as a whole.
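Here's roughly the kind of thing I have in mind as a starting point, sketched with the CMU Pronouncing Dictionary (ARPAbet rather than ExtIPA) and a made-up penalty table that just downweights the fricatives the article calls out ("f", "th", "s" and their voiced pairs); the weights are placeholders, not derived from any telephony study:

    import nltk
    from nltk.corpus import cmudict

    nltk.download("cmudict", quiet=True)
    PRON = cmudict.dict()

    # Illustrative penalties only: ARPAbet phones for the weak fricatives
    # get marked down; everything else is treated as clear.
    PENALTY = {"F": 3, "TH": 3, "S": 2, "V": 2, "DH": 2, "Z": 1}

    def clarity_score(word):
        """Lower is clearer; returns None for words missing from the dictionary."""
        prons = PRON.get(word.lower())
        if not prons:
            return None
        phones = [p.rstrip("012") for p in prons[0]]  # strip vowel stress digits
        return sum(PENALTY.get(p, 0) for p in phones)

    def rank_wordlist(words):
        scored = [(clarity_score(w), w) for w in words]
        return sorted((s, w) for s, w in scored if s is not None)

    print(rank_wordlist(["faith", "thistle", "banana", "seven", "mango"]))

Ranking a whole passphrase would then be as simple as summing (or taking the maximum of) the per-word scores, but the hard part is still replacing that penalty table with rules actually backed by telephone-intelligibility data.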
I'd be a little worried about the feature that fills in missing voice chunks. It sounds ripe for accidentally replacing one lost word with a completely different word that could also make sense in context, almost like the issue where Xerox copiers would sometimes replace one character with another. Hopefully the filling in of missing chunks is done in a way that doesn't allow it to fill in whole words, but only short, sub-syllable chunks of audio?
I'm guessing that, since characterizing voiced speech is still inaccurate, the more difficult task of translating neural impulses to speech will degrade performance even further. Still unsolved: at the moment, the device has a limited vocabulary of 150 words and phrases.
Thanks for all the hard work you've put in so far, @reubenmorais.
+1000 to @mostlyjason's comment - Great latency figures mean nothing if the word error rate is high, since it dents confidence in the output (so why use DeepSpeech?) and (as the parent comment notes) necessitates manual error correction.
I would love to see a future release focus on optimizing WER for these reasons.
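For what it's worth, tracking WER on your own test set is cheap; a small self-contained sketch (standard word-level edit distance, nothing DeepSpeech-specific):

    def wer(reference: str, hypothesis: str) -> float:
        """Word error rate: (substitutions + insertions + deletions) / reference length."""
        ref, hyp = reference.split(), hypothesis.split()
        # Classic Levenshtein dynamic-programming table over words.
        d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
        for i in range(len(ref) + 1):
            d[i][0] = i
        for j in range(len(hyp) + 1):
            d[0][j] = j
        for i in range(1, len(ref) + 1):
            for j in range(1, len(hyp) + 1):
                cost = 0 if ref[i - 1] == hyp[j - 1] else 1
                d[i][j] = min(d[i - 1][j] + 1,        # deletion
                              d[i][j - 1] + 1,        # insertion
                              d[i - 1][j - 1] + cost) # substitution or match
        return d[len(ref)][len(hyp)] / max(len(ref), 1)

    print(wer("turn the lights off", "turn the light off"))  # 0.25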
You can't just drop compatibility. We will have AI-trained voice systems that mimic natural speech just enough to be understood by Duplex while compressing the exchange to a minimum. Data transfer will be measured in microwords per second. Future versions of Duplex will of course detect this kind of compressed speech and reply in kind, falling back to normal speech only if the immediate response resembles North American confusion.
There are other studies that show we can understand speech at a much higher rate. The critical bandwidth constraint seems to be on stream out, not stream in.
The low WER numbers you've probably seen are for conversational telephone speech with constrained subjects. ASR is much harder when the audio source is farther away from the microphone and when the topic isn't constrained.
> There are alternatives that share Whisper internal states with an LLM to improve ASR, as well as approaches that sample N-best hypotheses from Whisper and fine-tune an LLM to distill the hypotheses into a single output. Haven't looked too much into these yet given how expensive each component is to run independently.
Tangent here: really? I've found base Whisper has concerning error rates for non-US English accents; I imagine the same is true for other languages where the source dataset has a dominant regional accent.
Whisper + an LLM can recover some of the gaps by filling in contextually plausible bits, but then it's not a transcript and may contain hallucinations.
There are alternatives that share Whisper internal states with an LLM to improve ASR, as well as approaches that sample N-best hypotheses from Whisper and fine-tune an LLM to distill the hypotheses into a single output. Haven't looked too much into these yet given how expensive each component is to run independently.
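If anyone wants to poke at the N-best half of that, here's roughly what the sampling side looks like with the Hugging Face Whisper checkpoint; the distill_with_llm function at the end is a stand-in for whatever fine-tuned LLM does the merging, not a real API:

    import torch
    from transformers import WhisperProcessor, WhisperForConditionalGeneration

    processor = WhisperProcessor.from_pretrained("openai/whisper-small")
    model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")

    def n_best(audio_16k, n=5):
        """Return n beam-search hypotheses for a 16 kHz mono waveform."""
        features = processor(audio_16k, sampling_rate=16000,
                             return_tensors="pt").input_features
        with torch.no_grad():
            ids = model.generate(input_features=features,
                                 num_beams=n, num_return_sequences=n)
        return processor.batch_decode(ids, skip_special_tokens=True)

    def distill_with_llm(hypotheses):
        # Placeholder: in the approaches mentioned above, a fine-tuned LLM takes
        # the hypothesis list and emits a single corrected transcript.
        prompt = "Pick or merge the most likely transcript:\n" + "\n".join(hypotheses)
        raise NotImplementedError(prompt)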
Once automatic speech recognition (ASR) gets closer to bulletproof, I expect this to become a huge thing, but right now it seems like you're getting better error rates than are typical.
Any input method where you frequently have to repeat yourself and undo things won't get mainstream. I'd bet people's mainstream tolerance for errors would have to be like one per five to ten minutes before you could get them to really adopt something like this (barring disability reasons, like RSI). Until then, the tech and market don't match.