Is this in the realm of aspiration, or something you've actually worked on? Because Whisper is incredibly difficult (I'd say impossible) to use in a real-time conversational setting. The transcription speed is too slow for interactive use, even on a GPU, once you step up above tiny or base. And when you step down that low, the accuracy is atrocious (especially in noisy settings or with accented voices), and then you have to post-process the output with a good NLP model to make it usable in whatever actions you're driving.
Look, it's nice that it's out there and free to use. For the sake of my wallet I hope it gets really good. But it isn't competitive with top-of-the-line commercial offerings if you need to ship something today.
And there are open-source alternatives, but I don't think their quality is very good.
There's also enough information out there to do this yourself with a bunch of GPU time; I have some ideas I want to try out, but I don't have the (GPU) time.
Whisper is pretty good for speech-to-text, and can be run in a resource-constrained environment. I tried a demo running in a browser using WASM on my phone, and even the tiny model is not bad.
The accuracy of whisper seems to be the best available for mere mortals, but the resource requirements are fairly high. It doesn't seem like it is possible to use it for real-time transcription.
What solutions are Twitter spaces and the like using for real-time live captions?
One thing I've found challenging about the Whisper APIs is that they perform quite poorly when trying to do "realtime transcription". I played around with some of the whisper.cpp stuff to get it running, and with the tiny model I was almost able to get reliable transcriptions. But for anything other than static mp3 files, it seems to be a Hard Problem [tm] that will need further work to get really good.
My use case was to try to make an AI assistant that would transcribe my audio requests and then turn them into a payload for one of the GPT-X APIs.
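For what it's worth, a minimal sketch of that kind of pipeline, assuming the openai-whisper package and the older openai Python SDK's completion endpoint; the model choices, file name, and prompt framing are all my own placeholders, not anything from the actual project:

    # Rough sketch: transcribe a voice request with Whisper, then forward it
    # to a GPT completion endpoint as the payload.
    # pip install openai-whisper openai  (expects OPENAI_API_KEY in the env)
    import whisper
    import openai

    model = whisper.load_model("base")  # "base" is an arbitrary choice here
    request_text = model.transcribe("request.wav", fp16=False)["text"]

    response = openai.Completion.create(
        model="text-davinci-003",
        prompt=f"You are a helpful assistant. Respond to: {request_text}",
        max_tokens=256,
    )
    print(response.choices[0].text)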
Someone else in this thread[0] said Whisper was running at 17x real time for them (at that rate, a minute of audio transcribes in roughly 3.5 seconds). So even a weak machine might be able to do an acceptable approximation of real time with Whisper.
Also, I feel like shipping to the cloud and back has been shown to be just as fast as on device transcription in a lot of scenarios. Doing it on device is primarily a benefit for privacy and offline, not necessarily latency. (Although, increasingly powerful smartphone hardware is starting to give the latency edge to local processing.)
Siri's dictation has had such terrible accuracy for me (an American English speaker without a particularly strong regional accent) and everyone else I know for so many years that it is just a joke in my family. Google and Microsoft have much higher accuracy in their models. The bar is so low for Siri that I automatically wonder how much Whisper is beating Siri in accuracy... because I assume it has to be better than that.
I really wish there was an easy demo for Whisper that I could try out.
Whisper democratises high-quality transcription of the languages I personally care about, whether using a CPU or a GPU. It's FOSS, self-hostable, and very easy to use.
That's a big deal for someone who wants to build something using speech recognition (voice assistant, transcription of audio or video) without resorting to APIs.
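For anyone who wants the easy demo mentioned upthread: the happy path really is short. A minimal sketch with the openai-whisper package, where the model size and file name are just placeholders:

    # pip install openai-whisper  (ffmpeg must be installed for decoding)
    import whisper

    model = whisper.load_model("small")  # tiny / base / small / medium / large
    result = model.transcribe("interview.mp3")
    print(result["text"])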
I would not use Whisper as a yardstick for English-language transcription. I'm not sure what the hubbub is all about, but as a non-native English speaker I don't find Whisper very impressive. There are engines out there that produce a far better word error rate on my speech than Whisper does.
Maybe it works well with native speakers? But since it's supposed to be so multilingual I hoped that it would work well with my accented speech... maybe that's a wrong conclusion to draw.
I recently swapped out the AI model for voice transcription on revoldiv.com and replaced it with Whisper. The results have been truly impressive - even the smaller models outperform and generalize better than any other options on the market. If you want to give it a try, our model is capable of faster transcription by utilizing multiple GPUs and some other enhancements, and it is all free
In most real-world settings, at least in my personal use, latency to a remote AI makes up most of the usability difficulty with automated speech recognition. The larger Whisper models can be run directly on a laptop using multithreading and achieve speech-to-text transcription that is fully sufficient for writing whole emails, papers, and documents. In fact, I've written most of this comment using an ASR system on my phone that uses Whisper. While the smaller models (like the one used here) can need some correction, the bigger ones are almost perfect. They are more than sufficient, and for realtime interactive use I see no future market for paid APIs.
Yesterday I wrote virtually all the prose in the manuscript while walking around with a friend and discussing it. We didn't even look at the phone.
Obviously there's an academic element here, because I'm saying I'm using it for writing. But it's more of a human-centric computing thing. I'm replacing the time my thumbs spend tapping on keys, my fingers spend on the keyboard, and my eyes spend staring at the words as they appear, looking for typographical errors to correct, with time spent organizing my thoughts in a coherent way that can be spoken and read easily. I'm basically using Whisper to create a new way to write that's more fluid and direct, and flows exactly as my speech does. I've tried this for years with all of the various ASR models on all the phones I've had and never been satisfied in the same way.
We use Whisper for automatic translation, supposedly SotA, but we have to fix its output, I would say, very often. It repeats things, translates things for no reason, and has trouble with numbers... It's improved in leaps and bounds, but I'd say speech recognition doesn't seem to be there yet.
The problem with Whisper is that it's not really optimized for command recognition, as opposed to general dictation.
- Whisper processes 30-second audio chunks, so if you process 5 seconds of audio you have to pad it out with 25 seconds of silence. Hence a loss of efficiency, with CPU/GPU cycles wasted on 25 seconds of silence per chunk in the case above (see the sketch after this list).
- Whisper most likely can't handle hundreds of commands, much less a thousand, performantly.
- Whisper doesn't handle short commands with any real degree of accuracy, and post-processing commands out of free dictation utterances doesn't work well either.
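To make the fixed-window point concrete, here's a minimal sketch using the lower-level API of the openai-whisper package; pad_or_trim() zero-pads any shorter clip out to the full 30 seconds before decoding (the file name is a placeholder):

    import whisper

    model = whisper.load_model("base")
    audio = whisper.load_audio("five_second_command.wav")
    # pad_or_trim() zero-pads (or truncates) to Whisper's fixed 30 s window,
    # so a 5 s command still pays for a full 30 s of encoder compute
    audio = whisper.pad_or_trim(audio)
    mel = whisper.log_mel_spectrogram(audio).to(model.device)
    result = whisper.decode(model, mel, whisper.DecodingOptions(fp16=False))
    print(result.text)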
Command dictation should be weighted higher than general dictation when decoding.
I work with a little under 1,500 commands in Dragon NaturallySpeaking. DNS is hot garbage as a program, even though it has the best accuracy to date and supports commands and dictation in one utterance. You get to pay $750 for the privilege.
I've yet to see a free and open source speech recognition engine that can handle both dictation and commands with a high degree of accuracy.
Please, please let me know if there are alternatives out there. I would definitely pay to support an open-source project like this that focuses on commands and dictation.
Most open-source solutions out there nowadays focus so much on IoT command recognition with intents. That's not well suited to controlling your computer with grammars containing voice commands.
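For what it's worth, the crude workaround I've seen is to post-process Whisper's free-dictation output against the command list with fuzzy matching. A standard-library sketch (the command set is made up, and this doesn't solve the decoder-weighting problem described above):

    import difflib

    COMMANDS = ["open browser", "close window", "scroll down", "select all"]

    def match_command(transcript: str, cutoff: float = 0.6):
        """Return the closest known command, or None to treat it as dictation."""
        hits = difflib.get_close_matches(
            transcript.lower().strip(), COMMANDS, n=1, cutoff=cutoff)
        return hits[0] if hits else None

    print(match_command("open browsed"))              # -> "open browser"
    print(match_command("the weather is nice today")) # -> None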
True, but note that they're using the tiny model. In my experimentation, you need at least the small model to get transcription I'd call "good", which is still a bit slower than you'd like on a moderately fast laptop from 2019.
That said, whisper is incredible and the era of very good local speech-to-text on moderate hardware is basically here, or will be in the next year.
Rev AI will also create a transcription separated by speaker, which it doesn't appear Whisper can do (yet). I expect that Whisper will overtake the alternatives soon, given that it's open source, but today it's not there yet.
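One pattern I've seen for filling that gap is to run a separate diarization model alongside Whisper and merge the two by timestamp. A rough sketch with pyannote.audio; the pretrained pipeline name and the Hugging Face token are assumptions about that library, and the file name is a placeholder:

    # pip install pyannote.audio
    from pyannote.audio import Pipeline

    pipeline = Pipeline.from_pretrained(
        "pyannote/speaker-diarization", use_auth_token="hf_...")
    diarization = pipeline("meeting.wav")

    # Print who spoke when; each turn can then be matched against Whisper's
    # timestamped segments by overlapping start/end times.
    for turn, _, speaker in diarization.itertracks(yield_label=True):
        print(f"{turn.start:5.1f}s - {turn.end:5.1f}s  {speaker}")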
Have you tried it?
I mean, for fun it wouldn't hurt, for sure, and ggerganov is doing amazing stuff. Kudos to him.
But Whisper is designed to process audio files in 30-second batches, if I'm not mistaken (it's been a while since Whisper was released). These workarounds make the window smaller, but that doesn't change the fact that they're workarounds: you can adjust, modify, or manipulate the model, but you can't rewrite or retrain it from scratch. Check out the issues referring to real-time transcription in the repo.
Can you use it? Yes.
Would it perform better than Deepgram (which is an API, and probably not the best API)? I'm not sure.
Would I use it in my money-generating application? Absolutely not.
Thanks! I actually initially had it doing transcription in 5- and 10-second chunks for close-to-realtime results, but the CPU usage on my laptop (which admittedly doesn't have the best specs) was a bit higher than I wanted. 30-second blocks gave me the best balance of semi-real-time and good performance, especially since the Whisper model is built for 30-second chunks. If you get real-time working smoothly though, I'd love to take a look!
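If anyone wants to poke at that chunking tradeoff themselves, here's a bare-bones sketch of the loop I'm describing, using the sounddevice package for capture; the chunk length and model size are just the knobs to play with, not exactly what my app does:

    # Near-real-time loop: record a fixed-length chunk, transcribe, repeat.
    # pip install openai-whisper sounddevice
    import sounddevice as sd
    import whisper

    SAMPLE_RATE = 16_000   # Whisper expects 16 kHz mono audio
    CHUNK_SECONDS = 30     # try 5 or 10 for lower latency at higher CPU cost

    model = whisper.load_model("tiny")

    while True:
        chunk = sd.rec(CHUNK_SECONDS * SAMPLE_RATE, samplerate=SAMPLE_RATE,
                       channels=1, dtype="float32")
        sd.wait()  # block until the chunk is fully recorded
        result = model.transcribe(chunk.flatten(), fp16=False)
        print(result["text"])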