It sounds pretty good; this is my first time hearing about it, but it looks promising. Even if it does detect that separate entities are talking, how does it label them in a way that's helpful/useful for you? I guess it comes out as 'Speaker 1', 'Speaker 2', etc. in the end? And you can find/replace the speakers with the actual people?
Yeah, there are some models I've played with that can do this. They only work for 2 or 3 speakers currently, though. The term for this is "diarization".
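For anyone curious what that output looks like, here's a minimal sketch using pyannote.audio. It assumes you have a Hugging Face token for the gated pretrained pipeline; the audio file name is a placeholder, and the token argument name can vary between versions.

```python
# Minimal diarization sketch with pyannote.audio (pip install pyannote.audio).
# Assumes a Hugging Face access token and a WAV file; names are placeholders.
from pyannote.audio import Pipeline

pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1",
    use_auth_token="YOUR_HF_TOKEN",  # placeholder; kwarg name may differ by version
)

diarization = pipeline("meeting.wav")  # hypothetical recording

# Labels come out as SPEAKER_00, SPEAKER_01, ... which you could then
# find/replace with real names, as mentioned above.
for segment, _, speaker in diarization.itertracks(yield_label=True):
    print(f"{segment.start:.1f}s - {segment.end:.1f}s: {speaker}")
```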
Seems like multi-microphone beamforming and source-separation algorithms are getting pretty good these days; maybe just adding a lot more mics would help?
Could you have an AI model that extracts some characteristics of the speaker's voice for each individual word, then translates that to color and font?
If the model was not confident about a word, it could be shown slightly blurred; if it was loud, it could be bold; perhaps (although there are some stereotype issues) you could use different fonts for different pitches; whispers could be grey, and quiet speech could be transparent.
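A rough sketch of how that mapping could work, assuming the recognizer already gives you per-word confidence, loudness, and a whisper flag (all field names and thresholds here are made up):

```python
# Hypothetical per-word attributes -> inline CSS styling sketch.
# Field names and thresholds are invented for illustration.
from dataclasses import dataclass

@dataclass
class Word:
    text: str
    confidence: float  # 0.0 - 1.0
    loudness: float    # relative level, e.g. -60 (quiet) to 0 (loud)
    whispered: bool

def style(word: Word) -> str:
    css = []
    if word.confidence < 0.6:
        css.append("filter: blur(1px)")   # low confidence -> slightly blurred
    if word.loudness > -10:
        css.append("font-weight: bold")   # loud -> bold
    if word.whispered:
        css.append("color: grey")         # whisper -> grey
    elif word.loudness < -40:
        css.append("opacity: 0.5")        # quiet -> transparent
    return f'<span style="{"; ".join(css)}">{word.text}</span>'

words = [Word("hello", 0.95, -5, False), Word("there", 0.4, -45, True)]
print(" ".join(style(w) for w in words))
```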
Maybe there's a language model that can pick up overlapping words if you don't have the constraint of needing to sort out who said them: just show all the possible words that could have been said by anyone, stacked together in a "not sure" color, and maybe the wearer would eventually learn to figure it out without much effort?
You could also try to stay consistent so the same speaker gets the same colors if possible, and also not reuse colors for new speakers that have been used recently, to make the best use of the limited bits of data in font and color.
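As a purely illustrative sketch of that color assignment (the palette and speaker IDs are made up):

```python
# Sketch of consistent speaker-to-color assignment that avoids
# immediately reusing recently released colors. Palette is illustrative.
from collections import deque

PALETTE = ["red", "blue", "green", "orange", "purple", "teal"]

class SpeakerColors:
    def __init__(self):
        self.assigned = {}          # speaker id -> color
        self.free = deque(PALETTE)  # unused colors, longest-idle first

    def color_for(self, speaker: str) -> str:
        if speaker not in self.assigned:
            # Take the color unused the longest, so a recently released
            # color isn't recycled right away; fall back if we run out.
            self.assigned[speaker] = self.free.popleft() if self.free else "black"
        return self.assigned[speaker]

    def release(self, speaker: str) -> None:
        # Speaker left the conversation: return their color to the back of the queue.
        self.free.append(self.assigned.pop(speaker))

colors = SpeakerColors()
print(colors.color_for("SPEAKER_00"))  # red
print(colors.color_for("SPEAKER_01"))  # blue
colors.release("SPEAKER_00")
print(colors.color_for("SPEAKER_02"))  # green, not the just-released red
```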
Maybe just by showing all the words from every speaker all together like that, the wearer would be able to figure it out even if it made mistakes in the speaker identification?
Yeah, I get that. I tried another open-source voice input that was real-time and the quality was horrible. But this is something that can be worked around for Whisper. One thing that comes to mind is an option to append and reprocess the audio every few centiseconds (needs a fairly powerful device, though) and update the text output as needed. This could also open the door for an edit-by-voice feature.
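A rough sketch of that append-and-reprocess loop with the openai-whisper Python package; the audio capture is stubbed out (in practice you'd pull chunks from a mic library), and the chunk length, model size, and window cap are arbitrary choices:

```python
# Sketch of "append and reprocess" streaming transcription with openai-whisper.
# Assumes 16 kHz mono float32 audio chunks from some capture source (stubbed);
# chunk length, model size, and window cap are placeholders.
import numpy as np
import whisper

model = whisper.load_model("base.en")
buffer = np.zeros(0, dtype=np.float32)

def get_next_chunk() -> np.ndarray:
    """Placeholder: return the latest ~100 ms of mic audio as float32 @ 16 kHz."""
    return np.zeros(1600, dtype=np.float32)

while True:
    buffer = np.concatenate([buffer, get_next_chunk()])
    buffer = buffer[-16000 * 30:]            # cap at ~30 s so reprocessing stays bounded
    result = model.transcribe(buffer, fp16=False)
    # Each pass re-transcribes the whole buffer, so earlier words can be revised
    # as more context arrives -- which is also what an edit-by-voice feature needs.
    print("\r" + result["text"], end="", flush=True)
```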
My friend group and I have been playing with LLMs: https://news.ycombinator.com/item?id=39208451.
We tend to hang out in multi-user voice chat sometimes, and I've speculated that it would be interesting to hook an LLM to ASR and TTS and bring it into our voice chat. Yeah, naive, and to be honest, I'm not even sure where to start. Have you tried bringing your conversational LLM into a multi-person conversation?
> ML multi-speaker speech-to-text every conversation
Neat idea. Do you know of any software that's capable of taking an audio file and producing multi-user text from it? Seems like it would be useful in a wide variety of situations.
I actually just did something similar with whisper.cpp: hooked it up to GPT-3 Davinci-3 via the API, and then piped the answer through Microsoft's text-to-speech. The mic I'm using is a cheap USB omni mic designed for conference calls.
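For anyone wanting to try something similar, here's a rough sketch of the glue, assuming a locally built whisper.cpp binary, the legacy OpenAI completions endpoint, and the Azure Speech SDK for TTS. All paths, keys, model names, and CLI flags are placeholders and may differ from the exact setup described above (whisper.cpp's binary name and flags in particular vary by version).

```python
# Rough ASR -> LLM -> TTS glue sketch. Paths, keys, and model names are placeholders.
import subprocess
import openai                                    # legacy openai<1.0 style API
import azure.cognitiveservices.speech as speechsdk

openai.api_key = "YOUR_OPENAI_KEY"

def transcribe(wav_path: str) -> str:
    # Call the whisper.cpp example binary and capture its stdout transcript.
    # Binary name and flags differ between whisper.cpp versions.
    out = subprocess.run(
        ["./main", "-m", "models/ggml-base.en.bin", "-f", wav_path, "-nt"],
        capture_output=True, text=True, check=True,
    )
    return out.stdout.strip()

def ask_llm(prompt: str) -> str:
    resp = openai.Completion.create(
        model="text-davinci-003", prompt=prompt, max_tokens=200
    )
    return resp["choices"][0]["text"].strip()

def speak(text: str) -> None:
    config = speechsdk.SpeechConfig(subscription="YOUR_AZURE_KEY", region="eastus")
    speechsdk.SpeechSynthesizer(speech_config=config).speak_text_async(text).get()

speak(ask_llm(transcribe("question.wav")))  # question.wav recorded from the USB mic
```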
Using the Whisper API to do real-time transcription of what your interlocutor is saying, feeding it into wingmanGPT with a system prompt to output what you should say to score, then sending it back to an earpiece via Bark with a real chad voice prompt.
I'm interested in applying TTS to a chat system, and one important feature for that is having as many distinct voices as possible, so that each person can have their own.
Are you looking into making fine-tuning easy for the speech-to-text model? I feel like that's the only thing missing before it's perfect.
My entity names can contain English or other text that just won't be picked up by the speech-to-text, so if there were a way to record a couple of pronunciations for an entity name, and Home Assistant would fine-tune Whisper on them in the background, it would be wonderful!
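Fine-tuning Whisper on a handful of recorded pronunciations is at least technically straightforward with Hugging Face Transformers. Here's a heavily simplified sketch: the audio/text pairs, model size, and training hyperparameters are placeholders, and a real run would want a proper dataset, data collator, and evaluation.

```python
# Minimal Whisper fine-tuning sketch with Hugging Face Transformers.
# Assumes a few (16 kHz float32 audio array, transcript) pairs recorded for your
# entity names; model size, learning rate, and epoch count are arbitrary.
import torch
from transformers import WhisperProcessor, WhisperForConditionalGeneration

processor = WhisperProcessor.from_pretrained("openai/whisper-small")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

# Placeholder data: replace with real recordings of your entity names.
samples = [(torch.zeros(16000).numpy(), "turn on the lumière salon light")]

model.train()
for _ in range(3):                               # a few epochs over the tiny set
    for audio, text in samples:
        inputs = processor(audio, sampling_rate=16000, return_tensors="pt")
        labels = processor.tokenizer(text, return_tensors="pt").input_ids
        loss = model(input_features=inputs.input_features, labels=labels).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

model.save_pretrained("whisper-small-entities")  # hypothetical output directory
```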