Rev AI will also produce a transcript separated by speaker, which it doesn't appear Whisper can do (yet). I expect Whisper will overtake the alternatives soon, given that it's open source, but today it's not there yet.
Is this in the realm of aspiration, or something you've actually worked on? Because Whisper is incredibly difficult (I'd say impossible) to use in a real-time conversational setting. The transcription speed is too slow for interactive use, even on a GPU, once you step up above tiny or base. And when you step down that low, the accuracy is atrocious (especially in noisy settings or with accented voices), and then you have to post-process the output with a good NLP pipeline to make it usable in whatever actions you're driving.
Look, it's nice that it's out there and free to use. For the sake of my wallet I hope it gets really good. But it isn't competitive with top of the line commercial offerings if you need to ship something today.
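For anyone who wants to sanity-check the latency claim, here's a rough sketch using the open-source `openai-whisper` package (the clip path and model list are placeholders; numbers vary wildly by hardware):

```python
# Rough latency check: time each Whisper model size on the same short clip.
# "clip.wav" is a placeholder; run this on the hardware you plan to ship on.
import time
import whisper

for size in ["tiny", "base", "small", "medium"]:
    model = whisper.load_model(size)
    start = time.perf_counter()
    result = model.transcribe("clip.wav")
    elapsed = time.perf_counter() - start
    print(f"{size:>6}: {elapsed:5.1f}s  {result['text'][:60]!r}")
```

In my experience the jump from `base` to `small` is roughly where interactive use stops being viable on commodity GPUs.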
I recently swapped out the AI model for voice transcription on revoldiv.com and replaced it with Whisper. The results have been truly impressive: even the smaller models outperform and generalize better than any other option on the market. If you want to give it a try, our deployment achieves faster transcription by utilizing multiple GPUs and some other enhancements, and it's all free.
Whisper doesn’t do streaming transcription or translation. It’s one of its architectural disadvantages over more commonly used ASR approaches, such as transducers (see e.g. Microsoft Research’s system trained on ~350k hours that does streaming transcription and translation: https://arxiv.org/abs/2211.02499).
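The usual workaround is to fake streaming by re-running Whisper over a sliding audio buffer. A minimal sketch (assuming 16 kHz mono float32 chunks arriving from some capture loop; this is not a real streaming decoder):

```python
# Pseudo-streaming hack: Whisper has no streaming mode, so re-run it on a
# sliding buffer of the most recent audio every time a new chunk arrives.
import numpy as np
import whisper

model = whisper.load_model("base")
SAMPLE_RATE = 16000
WINDOW_SECONDS = 10  # re-transcribe only the trailing 10 s each tick

buffer = np.zeros(0, dtype=np.float32)

def on_audio_chunk(chunk: np.ndarray) -> str:
    """Append new audio and re-transcribe the trailing window."""
    global buffer
    buffer = np.concatenate([buffer, chunk])[-SAMPLE_RATE * WINDOW_SECONDS:]
    result = model.transcribe(buffer, fp16=False)
    return result["text"]
```

Every tick re-decodes the whole window, which is exactly why this can't compete with a transducer that consumes audio incrementally.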
With so much recent focus by OpenAI/Google on AI's visual capabilities, does anyone know when we might see an OCR product as good as Whisper is for voice transcription? (Or has that already happened?) I had to convert some PDFs and MP3s to text recently and was struck by the vast difference in output quality. Whisper's transcription was near-flawless, while all the OCR tools I tried struggled with formatting, missed words, and made many errors.
I didn't know Whisper could differentiate voices for per-speaker transcription. Is that new? Is it also available in the command-line Whisper builds?
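As far as I know it can't, natively. Services that offer per-speaker transcripts typically pair Whisper with a separate diarization model and merge the two by timestamps. A hedged sketch assuming pyannote.audio (the HF token is a placeholder and the overlap matching is deliberately naive):

```python
# Whisper itself does not label speakers; a common recipe is to run a separate
# diarization model and assign each Whisper segment to the speaker whose turn
# overlaps it most. Assumes pyannote.audio and a Hugging Face access token.
import whisper
from pyannote.audio import Pipeline

model = whisper.load_model("small")
diarizer = Pipeline.from_pretrained("pyannote/speaker-diarization",
                                    use_auth_token="hf_...")  # placeholder token

segments = model.transcribe("meeting.wav")["segments"]
turns = [(turn.start, turn.end, speaker)
         for turn, _, speaker in diarizer("meeting.wav").itertracks(yield_label=True)]

for seg in segments:
    # Pick the diarization turn with the largest overlap with this segment.
    best = max(turns, key=lambda t: min(t[1], seg["end"]) - max(t[0], seg["start"]))
    print(f"[{best[2]}] {seg['text'].strip()}")
```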
Some of you have probably seen the recently released Whisper model from OpenAI. Having this go both ways would open up some neat conversational AI ideas, like having two AIs discuss with each other, say backed by GPT-3, but with the output being perfect audio/speech.
So is there something as "state of the art" as Whisper, but for text-to-speech/audio?
We use Whisper for automatic translation, supposedly SotA, but we have to fix its output very often. It repeats things, translates things for no reason, has trouble with numbers... It's improved in leaps and bounds, but I'd say speech recognition doesn't seem to be there yet.
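For what it's worth, a few `transcribe()` options in the open-source `openai-whisper` package reduce (though don't eliminate) exactly those failure modes; the language/task values below are just examples:

```python
# Knobs that, in my experience, help with repetition loops and surprise
# translation; these are real options of openai-whisper's transcribe().
import whisper

model = whisper.load_model("medium")
result = model.transcribe(
    "input.wav",
    language="fr",                     # pin the source language so it stops guessing
    task="translate",                  # or "transcribe"; be explicit about which you want
    condition_on_previous_text=False,  # cuts down on repeated-phrase loops
    temperature=(0.0, 0.2, 0.4),       # fallback temperatures when decoding degenerates
)
print(result["text"])
```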
Whisper's accuracy seems to be the best available to mere mortals, but the resource requirements are fairly high. It doesn't seem possible to use it for real-time transcription.
What solutions are Twitter spaces and the like using for real-time live captions?
Given Whisper is open source, I'd be surprised if it's not. It would be cool for Web Speech API's SpeechRecognition to simply use it, though that would make browser downloads a little beefier.
> Once we have individual tracks to work with, we begin transcription. This is the most resource intensive part of the process. We rely on the Whisper AI transcription model from OpenAI, via WhisperX. The WhisperX project also uses wav2vec2 to provide accurate word-level timestamps, which is important for sentence-level synchronization. The transcription process is fairly standard; the only interesting addition to the process that Storyteller makes is to supply an "initial prompt" to the transcription model, outlining its task as transcribing an audiobook chapter and providing a list of words from the book that don't exist in the English dictionary as hints.
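A minimal sketch of that initial-prompt trick using the plain `openai-whisper` API (WhisperX wraps the same idea; the book words below are hypothetical):

```python
# Supply rare/invented words from the book as decoding hints via initial_prompt.
import whisper

rare_words = ["Ged", "Roke", "Gont"]  # hypothetical out-of-dictionary names
prompt = ("The following is a chapter from an audiobook. "
          "It may contain these words: " + ", ".join(rare_words))

model = whisper.load_model("small")
result = model.transcribe("chapter01.mp3", initial_prompt=prompt)
print(result["text"][:200])
```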
For the transcription part, we are looking into W2v-BERT 2.0 as well and will make it available in a live-streaming context. That said, Whisper, especially the small model (<50 ms), is not that compute-heavy; right now, most of the compute is consumed by the LLM.