It sounds pretty good; this is my first time hearing about it, but it looks promising. Even if it does detect that separate entities are talking, how does it label them in a way that's helpful/useful for you? I guess it comes out as 'Speaker 1', 'Speaker 2', etc. in the end? And you can find/replace the speakers with the actual people?
Yeah, there are some models I've played with that can do this. They only work for 2 or 3 speakers currently, though. The term for this is "diarization".
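For anyone curious what that output looks like, here's a minimal sketch using pyannote.audio. It assumes you have a Hugging Face token for the gated pretrained pipeline; the audio file name is a placeholder, and the token argument name can vary between versions.

```python
# Minimal diarization sketch with pyannote.audio (pip install pyannote.audio).
# Assumes a Hugging Face access token and a WAV file; names are placeholders.
from pyannote.audio import Pipeline

pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1",
    use_auth_token="YOUR_HF_TOKEN",  # placeholder; kwarg name may differ by version
)

diarization = pipeline("meeting.wav")  # hypothetical recording

# Labels come out as SPEAKER_00, SPEAKER_01, ... which you could then
# find/replace with real names, as mentioned above.
for segment, _, speaker in diarization.itertracks(yield_label=True):
    print(f"{segment.start:.1f}s - {segment.end:.1f}s: {speaker}")
```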
Seems like multi-microphone beamforming and source-separation algorithms are getting pretty good these days; maybe just adding a lot more mics would help?
Could you have an AI model that extracts some characteristics of the speaker's voice for each individual word, then translates that to color and font?
If the model was not confident about a word, it could be shown slightly blurred; if it was loud, it could be bold; perhaps (although there are some stereotype issues) you could use different fonts for different pitches; whispers could be grey, and quiet speech could be transparent.
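A rough sketch of how that mapping could work, assuming the recognizer already gives you per-word confidence, loudness, and a whisper flag (all field names and thresholds here are made up):

```python
# Hypothetical per-word attributes -> inline CSS styling sketch.
# Field names and thresholds are invented for illustration.
from dataclasses import dataclass

@dataclass
class Word:
    text: str
    confidence: float  # 0.0 - 1.0
    loudness: float    # relative level, e.g. -60 (quiet) to 0 (loud)
    whispered: bool

def style(word: Word) -> str:
    css = []
    if word.confidence < 0.6:
        css.append("filter: blur(1px)")   # low confidence -> slightly blurred
    if word.loudness > -10:
        css.append("font-weight: bold")   # loud -> bold
    if word.whispered:
        css.append("color: grey")         # whisper -> grey
    elif word.loudness < -40:
        css.append("opacity: 0.5")        # quiet -> transparent
    return f'<span style="{"; ".join(css)}">{word.text}</span>'

words = [Word("hello", 0.95, -5, False), Word("there", 0.4, -45, True)]
print(" ".join(style(w) for w in words))
```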
Maybe there's a language model that can pick up overlapping words if you don't have the constraint of needing to sort out who said them: just show all the possible words that could have been said by anyone, stacked together in a "not sure" color, and maybe the wearer would eventually learn to figure it out without much effort?
You could also try to stay consistent so the same speaker gets the same colors if possible, and also not reuse colors for new speakers that have been used recently, to make the best use of the limited bits of data in font and color.
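As a purely illustrative sketch of that color assignment (the palette and speaker IDs are made up):

```python
# Sketch of consistent speaker-to-color assignment that avoids
# immediately reusing recently released colors. Palette is illustrative.
from collections import deque

PALETTE = ["red", "blue", "green", "orange", "purple", "teal"]

class SpeakerColors:
    def __init__(self):
        self.assigned = {}          # speaker id -> color
        self.free = deque(PALETTE)  # unused colors, longest-idle first

    def color_for(self, speaker: str) -> str:
        if speaker not in self.assigned:
            # Take the color unused the longest, so a recently released
            # color isn't recycled right away; fall back if we run out.
            self.assigned[speaker] = self.free.popleft() if self.free else "black"
        return self.assigned[speaker]

    def release(self, speaker: str) -> None:
        # Speaker left the conversation: return their color to the back of the queue.
        self.free.append(self.assigned.pop(speaker))

colors = SpeakerColors()
print(colors.color_for("SPEAKER_00"))  # red
print(colors.color_for("SPEAKER_01"))  # blue
colors.release("SPEAKER_00")
print(colors.color_for("SPEAKER_02"))  # green, not the just-released red
```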
Maybe just by showing all the words from every speaker all together like that, the wearer would be able to figure it out even if it made mistakes in the speaker identification?
Yeah, I get that. I tried another open-source voice input that was real-time and the quality was horrible. But this is something that can be worked around for Whisper. One thing that comes to mind is an option to append and reprocess the audio every few centiseconds (needs a fairly powerful device, though) and update the text output as needed. This could also open the door for an edit-by-voice feature.
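A rough sketch of that append-and-reprocess loop with the openai-whisper Python package; the audio capture is stubbed out (in practice you'd pull chunks from a mic library), and the chunk length, model size, and window cap are arbitrary choices:

```python
# Sketch of "append and reprocess" streaming transcription with openai-whisper.
# Assumes 16 kHz mono float32 audio chunks from some capture source (stubbed);
# chunk length, model size, and window cap are placeholders.
import numpy as np
import whisper

model = whisper.load_model("base.en")
buffer = np.zeros(0, dtype=np.float32)

def get_next_chunk() -> np.ndarray:
    """Placeholder: return the latest ~100 ms of mic audio as float32 @ 16 kHz."""
    return np.zeros(1600, dtype=np.float32)

while True:
    buffer = np.concatenate([buffer, get_next_chunk()])
    buffer = buffer[-16000 * 30:]            # cap at ~30 s so reprocessing stays bounded
    result = model.transcribe(buffer, fp16=False)
    # Each pass re-transcribes the whole buffer, so earlier words can be revised
    # as more context arrives -- which is also what an edit-by-voice feature needs.
    print("\r" + result["text"], end="", flush=True)
```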
My friend group and I have been playing with LLMs: https://news.ycombinator.com/item?id=39208451.
We tend to hang out in multi-user voice chat sometimes, and I've speculated that it would be interesting to hook an LLM to ASR and TTS and bring it into our voice chat. Yeah, naive, and to be honest, I'm not even sure where to start. Have you tried bringing your conversational LLM into a multi-person conversation?
> ML multi-speaker speech-to-text every conversation
Neat idea. Do you know of any software that's capable of taking an audio file and producing multi-user text from it? Seems like it would be useful in a wide variety of situations.
I actually just did something similar with whisper.cpp: hooked it up to GPT-3 Davinci-3 via the API, and then piped the answer through Microsoft's text-to-speech. The mic I'm using is a cheap USB omni mic designed for conference calls.
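For anyone wanting to try something similar, here's a rough sketch of the glue, assuming a locally built whisper.cpp binary, the legacy OpenAI completions endpoint, and the Azure Speech SDK for TTS. All paths, keys, model names, and CLI flags are placeholders and may differ from the exact setup described above (whisper.cpp's binary name and flags in particular vary by version).

```python
# Rough ASR -> LLM -> TTS glue sketch. Paths, keys, and model names are placeholders.
import subprocess
import openai                                    # legacy openai<1.0 style API
import azure.cognitiveservices.speech as speechsdk

openai.api_key = "YOUR_OPENAI_KEY"

def transcribe(wav_path: str) -> str:
    # Call the whisper.cpp example binary and capture its stdout transcript.
    # Binary name and flags differ between whisper.cpp versions.
    out = subprocess.run(
        ["./main", "-m", "models/ggml-base.en.bin", "-f", wav_path, "-nt"],
        capture_output=True, text=True, check=True,
    )
    return out.stdout.strip()

def ask_llm(prompt: str) -> str:
    resp = openai.Completion.create(
        model="text-davinci-003", prompt=prompt, max_tokens=200
    )
    return resp["choices"][0]["text"].strip()

def speak(text: str) -> None:
    config = speechsdk.SpeechConfig(subscription="YOUR_AZURE_KEY", region="eastus")
    speechsdk.SpeechSynthesizer(speech_config=config).speak_text_async(text).get()

speak(ask_llm(transcribe("question.wav")))  # question.wav recorded from the USB mic
```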
Using the Whisper API to do real-time transcription of what your interlocutor is saying, feeding it into wingmanGPT with a system prompt to output what you should say to score, then sending it back to an earpiece via Bark with a real chad voice prompt.
I'm interested in applying TTS to a chat system, and one important feature for that is having as many distinct voices as possible, so that each person can have their own.
Are you looking into making fine-tuning easy for the speech-to-text model? I feel like that's the only thing missing before it's perfect.
My entity names can contain English or other text that just won't be picked up by the speech-to-text, so if there were a way to record a couple of pronunciations for an entity name, and Home Assistant would fine-tune Whisper on them in the background, it would be wonderful!
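Fine-tuning Whisper on a handful of recorded pronunciations is at least technically straightforward with Hugging Face Transformers. Here's a heavily simplified sketch: the audio/text pairs, model size, and training hyperparameters are placeholders, and a real run would want a proper dataset, data collator, and evaluation.

```python
# Minimal Whisper fine-tuning sketch with Hugging Face Transformers.
# Assumes a few (16 kHz float32 audio array, transcript) pairs recorded for your
# entity names; model size, learning rate, and epoch count are arbitrary.
import torch
from transformers import WhisperProcessor, WhisperForConditionalGeneration

processor = WhisperProcessor.from_pretrained("openai/whisper-small")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

# Placeholder data: replace with real recordings of your entity names.
samples = [(torch.zeros(16000).numpy(), "turn on the lumière salon light")]

model.train()
for _ in range(3):                               # a few epochs over the tiny set
    for audio, text in samples:
        inputs = processor(audio, sampling_rate=16000, return_tensors="pt")
        labels = processor.tokenizer(text, return_tensors="pt").input_ids
        loss = model(input_features=inputs.input_features, labels=labels).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

model.save_pretrained("whisper-small-entities")  # hypothetical output directory
```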