
Did anyone find a solution to have whisper differentiate between multiple speakers in a conversation and mark them in the written output?



This is my first time hearing about it, but it sounds promising. Even if it does detect that they are separate entities talking, how does it label them in a way that's helpful/useful for you? I guess it comes out as 'Speaker 1', 'Speaker 2', etc. in the end? And you can find/replace the speakers with the actual people?

We thought about doing this in Whisper itself, since it's already working in the audio space.

Very cool, how are you doing the speech-to-text part, with Whisper?

If you want a lot of people to use your solution as you described, recognition of who is speaking could add a lot of extra possibilities.

Yeah, there are some models I played with that can do this. They only work for 2 or 3 speakers currently though. The term for this is "diarization".

https://huggingface.co/pyannote/speaker-diarization
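
For anyone curious what that looks like in practice, here is a minimal sketch of running the pyannote speaker-diarization pipeline linked above. It assumes the pyannote.audio package, a Hugging Face access token, and a placeholder audio file:

    # Minimal diarization sketch with pyannote.audio.
    from pyannote.audio import Pipeline

    pipeline = Pipeline.from_pretrained(
        "pyannote/speaker-diarization",
        use_auth_token="YOUR_HF_TOKEN",  # placeholder: your Hugging Face access token
    )

    diarization = pipeline("conversation.wav")  # placeholder audio file

    # Each track is a time span labelled with an anonymous speaker id
    # such as SPEAKER_00, SPEAKER_01, ...
    for turn, _, speaker in diarization.itertracks(yield_label=True):
        print(f"{turn.start:.1f}s - {turn.end:.1f}s: {speaker}")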


Seems like the multiple microphone beamforming source separation algorithms are getting pretty good these days, maybe just adding a lot more mics would help?

Could you have an AI model that extracts some characteristics of the speaker's voice for each individual word, then translates that to color and font?

If the model was not confident about a word it could be shown slightly blurred; if it was loud it could be bold. Perhaps (although there are some stereotype issues) you could use different fonts for different pitches; whispers could be grey, quiet speech could be transparent.

Maybe there's a language model that can pick up overlapping words if you don't have the constraint of needing to sort out who said them: just show all the possible words that could have been said by anyone stacked together, in a "not sure" color, and maybe the wearer would eventually learn to figure it out without much effort?

You could also try to stay consistent so the same speaker gets the same colors if possible, and avoid reusing colors that have recently been used when new speakers appear, to best make use of the limited bits of data in font and color.

Maybe just by showing all the words from every speaker all together like that, the wearer would be able to figure it out even if it made mistakes in the speaker identification?
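
A rough sketch of the confidence-styling part of that idea, assuming the openai-whisper package with word_timestamps=True (each word then carries a 'probability' field); loudness and pitch styling would need extra audio analysis and are left out here:

    # Map per-word Whisper confidence to opacity/blur in HTML output.
    import whisper

    model = whisper.load_model("base")
    result = model.transcribe("conversation.wav", word_timestamps=True)

    html_words = []
    for segment in result["segments"]:
        for w in segment.get("words", []):
            p = w["probability"]          # 0..1 confidence for this word
            opacity = 0.4 + 0.6 * p       # low-confidence words fade out
            blur = 0 if p > 0.7 else 1    # and get slightly blurred
            html_words.append(
                f'<span style="opacity:{opacity:.2f};'
                f'filter:blur({blur}px)">{w["word"]}</span>'
            )

    print("<p>" + "".join(html_words) + "</p>")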


Has anyone combined Whisper with a model that can assign owners to voices yet?

For example, instead of:

    Hello there!
    Hi 
    How are you?
    Good, and you?

You get something like:

    Voice A: Hello there!
    Voice B: Hi 
    Voice A: How are you?
    Voice B: Good, and you?
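
One possible way to get that kind of output is to run Whisper and pyannote diarization separately, then assign each Whisper segment to the speaker whose turn overlaps it most. A rough sketch, not a polished tool; the file name and token are placeholders:

    # Label Whisper segments with pyannote speaker turns by timestamp overlap.
    import whisper
    from pyannote.audio import Pipeline

    model = whisper.load_model("base")
    transcription = model.transcribe("conversation.wav")

    pipeline = Pipeline.from_pretrained(
        "pyannote/speaker-diarization",
        use_auth_token="YOUR_HF_TOKEN",  # placeholder token
    )
    diarization = pipeline("conversation.wav")

    def speaker_at(start, end):
        """Return the diarization label with the largest overlap for [start, end]."""
        best_label, best_overlap = "UNKNOWN", 0.0
        for turn, _, label in diarization.itertracks(yield_label=True):
            overlap = min(end, turn.end) - max(start, turn.start)
            if overlap > best_overlap:
                best_label, best_overlap = label, overlap
        return best_label

    for seg in transcription["segments"]:
        print(f"{speaker_at(seg['start'], seg['end'])}: {seg['text'].strip()}")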

I've been working on an open source audio editor which uses Whisper to slice speech. Very exciting to see more capabilities on the horizon!

Yeah I get that. I tried another open source voice input that was real time and the quality was horrible. But this is something that can be worked around for Whisper. One thing that comes to mind is an option to append and reprocess the audio every few centiseconds (needs a fairly powerful device though), and update the text output as needed. This could also open the door for an edit-by-voice feature.
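
A very rough sketch of that append-and-reprocess idea, assuming the sounddevice and openai-whisper packages; the interval and chunk sizes are just illustrative:

    # Keep appending mic audio to a buffer and re-run Whisper on the whole
    # buffer at a fixed interval, replacing the displayed text each time.
    import numpy as np
    import sounddevice as sd
    import whisper

    SAMPLE_RATE = 16000       # Whisper expects 16 kHz mono
    CHUNK_SECONDS = 0.5       # how much new audio to grab per iteration

    model = whisper.load_model("base")
    buffer = np.zeros(0, dtype=np.float32)

    while True:
        # Record the next chunk and append it to the running buffer.
        chunk = sd.rec(int(CHUNK_SECONDS * SAMPLE_RATE),
                       samplerate=SAMPLE_RATE, channels=1, dtype="float32")
        sd.wait()
        buffer = np.concatenate([buffer, chunk.ravel()])

        # Re-transcribe everything so far and overwrite the previous output.
        result = model.transcribe(buffer, fp16=False)
        print("\r" + result["text"].strip(), end="", flush=True)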

WhisperSpeech – An open source text-to-speech system built by inverting Whisper https://news.ycombinator.com/item?id=39036796

My friend group and I have been playing with LLMs: https://news.ycombinator.com/item?id=39208451. We tend to hang out in multi-user voice chat sometimes, and I've speculated that it would be interesting to hook an LLM to ASR and TTS and bring it into our voice chat. Yeah, naive, and to be honest, I'm not even sure where to start. Have you tried bringing your conversational LLM into a multi-person conversation?

Free startup idea: Use Whisper with pyannote-audio[0]’s speaker diarization. Upload a recording, get back a multi-speaker annotated transcription.

Make a JSON API and I’ll be your first customer.

[0] https://github.com/pyannote/pyannote-audio
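
For what it's worth, here is a hedged sketch of what such a JSON API could look like with FastAPI: upload an audio file, get back Whisper segments plus pyannote speaker turns (merging them by overlap would work like the sketch earlier in the thread). The endpoint name and response shape are invented for illustration:

    # Toy diarized-transcription API: POST an audio file, get JSON back.
    import shutil, tempfile

    import whisper
    from fastapi import FastAPI, UploadFile
    from pyannote.audio import Pipeline

    app = FastAPI()
    model = whisper.load_model("base")
    diarizer = Pipeline.from_pretrained(
        "pyannote/speaker-diarization",
        use_auth_token="YOUR_HF_TOKEN",  # placeholder token
    )

    @app.post("/transcribe")
    async def transcribe(file: UploadFile):
        # Persist the upload so both models can read it from disk.
        with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as tmp:
            shutil.copyfileobj(file.file, tmp)
            path = tmp.name

        transcription = model.transcribe(path)
        diarization = diarizer(path)

        return {
            "transcript": [
                {"start": s["start"], "end": s["end"], "text": s["text"].strip()}
                for s in transcription["segments"]
            ],
            "speakers": [
                {"start": turn.start, "end": turn.end, "label": label}
                for turn, _, label in diarization.itertracks(yield_label=True)
            ],
        }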


> ML multi-speaker speech-to-text every conversation

Neat idea, do you know of any software that's capable of taking an audio file and producing multi-user text from it? Seems like it would be useful in a wide variety of situations.


I actually just did something similar with whisper.cpp, hooked it up to GPT-3 Davinci-3 via the API, and then piped the answer through Microsoft's text-to-speech. The mic I'm using is a cheap USB omni-mic designed for conference calls.
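
If it helps anyone reproduce something similar, here is a hedged sketch of that kind of pipeline: the whisper.cpp CLI for transcription, the legacy OpenAI completions client for the answer, and Azure's speech SDK for TTS. Paths, model names, and keys are placeholders, and the original setup may well differ:

    # Rough clip-in, speech-out pipeline: whisper.cpp -> GPT-3 -> Azure TTS.
    import subprocess

    import openai
    import azure.cognitiveservices.speech as speechsdk

    openai.api_key = "YOUR_OPENAI_KEY"  # placeholder

    # 1. Transcribe a recorded clip with whisper.cpp (-otxt writes question.wav.txt).
    subprocess.run(
        ["./main", "-m", "models/ggml-base.en.bin", "-f", "question.wav", "-otxt"],
        check=True,
    )
    question = open("question.wav.txt").read().strip()

    # 2. Ask GPT-3 (text-davinci-003 via the pre-1.0 openai client).
    completion = openai.Completion.create(
        model="text-davinci-003", prompt=question, max_tokens=200
    )
    answer = completion.choices[0].text.strip()

    # 3. Speak the answer with Microsoft's TTS.
    speech_config = speechsdk.SpeechConfig(
        subscription="YOUR_AZURE_KEY", region="YOUR_AZURE_REGION"  # placeholders
    )
    synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config)
    synthesizer.speak_text_async(answer).get()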

Here [1] is a video tutorial on building a web UI that accepts microphone input and runs it through Whisper for speech transcription.

[1] https://www.youtube.com/watch?v=ywIyc8l1K1Q&ab_channel=1litt...


Use the Whisper API to do real-time transcription of what your interlocutor is saying, feed it into wingmanGPT with a system prompt to output what you should say to score, then send the reply back to an earpiece via Bark with a real chad voice prompt.

How tunable is the voice?

I'm interested in applying TTS to a chat system, and one important feature for that is having as many distinct voices as possible, so that each person would have their own.

Would this, or something else be able to do that?


Are you looking into making fine-tuning easy for the speech-to-text model? I feel like that's the only thing missing before it's perfect.

My entity names can contain English or other text that just won't be picked up by the speech-to-text, so if there were a way to record a couple of pronunciations for an entity name and have Home Assistant fine-tune Whisper on them in the background, it would be wonderful!
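
Not the fine-tuning workflow you're asking for, but a lighter-weight trick that sometimes helps with unusual entity names is to pass them via Whisper's initial_prompt, so the decoder is biased toward those spellings. A sketch assuming the openai-whisper package, with made-up entity names:

    # Bias Whisper toward known entity names via the initial prompt.
    import whisper

    model = whisper.load_model("base")

    entity_names = ["Lumi Bedlamp", "Tvättstuga Switch"]  # hypothetical entities
    result = model.transcribe(
        "command.wav",
        initial_prompt="Home Assistant entities: " + ", ".join(entity_names),
    )
    print(result["text"])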


Will something like whisper.cpp be possible for WhisperSpeech?
