Rev AI will also produce a transcript separated by speaker, which it doesn't appear Whisper can do (yet). I expect Whisper will overtake the alternatives soon, given that it's open source, but today it's not there yet.
Is this in the realm of aspiration, or something you've actually worked on? Because Whisper is incredibly difficult (I'd say impossible) to use in a real-time conversational setting. The transcription speed is too slow for interactive use, even on a GPU, once you step up above tiny or base. And when you step down that low, the accuracy is atrocious (especially in noisy settings or with accented voices), and then you have to post-process the output with a good NLP pipeline to make it usable in whatever actions you're driving.
Look, it's nice that it's out there and free to use. For the sake of my wallet I hope it gets really good. But it isn't competitive with top of the line commercial offerings if you need to ship something today.
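For anyone who wants to sanity-check the latency claim, here's a rough sketch using the open-source `openai-whisper` package (the clip path and model list are placeholders; numbers vary wildly by hardware):

```python
# Rough latency check: time each Whisper model size on the same short clip.
# "clip.wav" is a placeholder; run this on the hardware you plan to ship on.
import time
import whisper

for size in ["tiny", "base", "small", "medium"]:
    model = whisper.load_model(size)
    start = time.perf_counter()
    result = model.transcribe("clip.wav")
    elapsed = time.perf_counter() - start
    print(f"{size:>6}: {elapsed:5.1f}s  {result['text'][:60]!r}")
```

In my experience the jump from `base` to `small` is roughly where interactive use stops being viable on commodity GPUs.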
I recently swapped out the AI model for voice transcription on revoldiv.com and replaced it with Whisper. The results have been truly impressive: even the smaller models outperform and generalize better than any other option on the market. If you want to give it a try, our deployment achieves faster transcription by utilizing multiple GPUs and some other enhancements, and it's all free.
Whisper doesn’t do streaming transcription or translation. It’s one of its architectural disadvantages over more commonly used ASR approaches, such as transducers (see e.g. Microsoft Research’s system trained on ~350k hours that does streaming transcription and translation: https://arxiv.org/abs/2211.02499).
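The usual workaround is to fake streaming by re-running Whisper over a sliding audio buffer. A minimal sketch (assuming 16 kHz mono float32 chunks arriving from some capture loop; this is not a real streaming decoder):

```python
# Pseudo-streaming hack: Whisper has no streaming mode, so re-run it on a
# sliding buffer of the most recent audio every time a new chunk arrives.
import numpy as np
import whisper

model = whisper.load_model("base")
SAMPLE_RATE = 16000
WINDOW_SECONDS = 10  # re-transcribe only the trailing 10 s each tick

buffer = np.zeros(0, dtype=np.float32)

def on_audio_chunk(chunk: np.ndarray) -> str:
    """Append new audio and re-transcribe the trailing window."""
    global buffer
    buffer = np.concatenate([buffer, chunk])[-SAMPLE_RATE * WINDOW_SECONDS:]
    result = model.transcribe(buffer, fp16=False)
    return result["text"]
```

Every tick re-decodes the whole window, which is exactly why this can't compete with a transducer that consumes audio incrementally.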
With so much recent focus by OpenAI/Google on AI's visual capabilities, does anyone know when we might see an OCR product as good as Whisper is for voice transcription? (Or has that already happened?) I had to convert some PDFs and MP3s to text recently and was struck by the vast difference in output quality. Whisper's transcription was near-flawless, while all the OCR tools I tried struggled with formatting, missed words, and made many errors.
I didn't know Whisper could differentiate voices for per-speaker transcription. Is that new? Is it also available in the command-line Whisper builds?
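As far as I know it can't, natively. Services that offer per-speaker transcripts typically pair Whisper with a separate diarization model and merge the two by timestamps. A hedged sketch assuming pyannote.audio (the HF token is a placeholder and the overlap matching is deliberately naive):

```python
# Whisper itself does not label speakers; a common recipe is to run a separate
# diarization model and assign each Whisper segment to the speaker whose turn
# overlaps it most. Assumes pyannote.audio and a Hugging Face access token.
import whisper
from pyannote.audio import Pipeline

model = whisper.load_model("small")
diarizer = Pipeline.from_pretrained("pyannote/speaker-diarization",
                                    use_auth_token="hf_...")  # placeholder token

segments = model.transcribe("meeting.wav")["segments"]
turns = [(turn.start, turn.end, speaker)
         for turn, _, speaker in diarizer("meeting.wav").itertracks(yield_label=True)]

for seg in segments:
    # Pick the diarization turn with the largest overlap with this segment.
    best = max(turns, key=lambda t: min(t[1], seg["end"]) - max(t[0], seg["start"]))
    print(f"[{best[2]}] {seg['text'].strip()}")
```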
Some of you have probably seen the recently released Whisper model from OpenAI. Having this go both ways would open up some neat conversational AI ideas, like having two AIs discuss with each other, say backed by GPT-3, but with the output being perfect audio/speech.
So is there something as "state of the art" as Whisper, but for text-to-speech/audio?
We use Whisper for automatic translation, supposedly SotA, but we have to fix its output very often. It repeats things, translates things for no reason, has trouble with numbers... It's improved in leaps and bounds, but I'd say speech recognition doesn't seem to be there yet.
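For what it's worth, a few `transcribe()` options in the open-source `openai-whisper` package reduce (though don't eliminate) exactly those failure modes; the language/task values below are just examples:

```python
# Knobs that, in my experience, help with repetition loops and surprise
# translation; these are real options of openai-whisper's transcribe().
import whisper

model = whisper.load_model("medium")
result = model.transcribe(
    "input.wav",
    language="fr",                     # pin the source language so it stops guessing
    task="translate",                  # or "transcribe"; be explicit about which you want
    condition_on_previous_text=False,  # cuts down on repeated-phrase loops
    temperature=(0.0, 0.2, 0.4),       # fallback temperatures when decoding degenerates
)
print(result["text"])
```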
Whisper's accuracy seems to be the best available to mere mortals, but the resource requirements are fairly high. It doesn't seem possible to use it for real-time transcription.
What solutions are Twitter spaces and the like using for real-time live captions?
Given Whisper is open source, I'd be surprised if it's not. It would be cool for Web Speech API's SpeechRecognition to simply use it, though that would make browser downloads a little beefier.
> Once we have individual tracks to work with, we begin transcription. This is the most resource intensive part of the process. We rely on the Whisper AI transcription model from OpenAI, via WhisperX. The WhisperX project also uses wav2vec2 to provide accurate word-level timestamps, which is important for sentence-level synchronization. The transcription process is fairly standard; the only interesting addition to the process that Storyteller makes is to supply an "initial prompt" to the transcription model, outlining its task as transcribing an audiobook chapter and providing a list of words from the book that don't exist in the English dictionary as hints.
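A minimal sketch of that initial-prompt trick using the plain `openai-whisper` API (WhisperX wraps the same idea; the book words below are hypothetical):

```python
# Supply rare/invented words from the book as decoding hints via initial_prompt.
import whisper

rare_words = ["Ged", "Roke", "Gont"]  # hypothetical out-of-dictionary names
prompt = ("The following is a chapter from an audiobook. "
          "It may contain these words: " + ", ".join(rare_words))

model = whisper.load_model("small")
result = model.transcribe("chapter01.mp3", initial_prompt=prompt)
print(result["text"][:200])
```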
For the transcription part, we are looking into W2v-BERT 2.0 as well and will make it available in a live-streaming context. That said, Whisper, especially the small model (<50 ms), is not that compute-heavy; right now, most of the compute is consumed by the LLM.