For you or anyone else reading this: I recently ran across this video documenting how to set up and use Whisper. It's probably a little overdetailed, but I found the GitHub docs a little underdetailed, so it might be useful. Whisper is pretty powerful; one of the more useful open-source AI tools available right now.
But as you implied in your comment, it should be possible to do it quite well with any video by transcribing with Whisper and then sending the text to GPT or another LLM to summarize.
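A minimal sketch of that pipeline, assuming the openai-whisper package and OpenAI's chat API (the model name and file name are just examples):

    # Sketch: transcribe a video with Whisper, then ask an LLM to summarize.
    # Assumes `pip install openai-whisper openai`, ffmpeg on PATH,
    # and OPENAI_API_KEY set in the environment.
    import whisper
    from openai import OpenAI

    model = whisper.load_model("base")
    text = model.transcribe("lecture.mp4")["text"]  # ffmpeg decodes the video

    client = OpenAI()
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # assumption: any capable chat model works here
        messages=[
            {"role": "system",
             "content": "Summarize the transcript in five bullet points."},
            {"role": "user", "content": text},
        ],
    )
    print(resp.choices[0].message.content)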
This is awesome to see! Our team at Shipyard [1] has been creating a lot of solution videos on YouTube recently to show teams how they can build A -> B solutions in a few minutes. We've been meaning to provide captions or transcripts for the backlog, but the manual overhead was pretty high and the paid options too expensive.
Tested this out in the span of a few hours and got a solution up and running that downloads the video from YouTube, spits out the transcription, and uploads the resulting transcription file externally. We're still missing a piece to upload directly to YouTube, but it's a start!
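For anyone wanting to replicate the first two steps, here's a minimal sketch using yt-dlp plus the Whisper CLI (file names are placeholders; the external upload is omitted since it's platform-specific):

    # Sketch: pull a YouTube video's audio, then transcribe it.
    # Assumes `pip install yt-dlp openai-whisper` and ffmpeg on PATH.
    import subprocess

    url = "https://www.youtube.com/watch?v=XXXXXXXXXXX"  # placeholder
    subprocess.run(
        ["yt-dlp", "-x", "--audio-format", "mp3", "-o", "video.%(ext)s", url],
        check=True,
    )
    # Writes video.txt/.srt/.vtt etc. in the working directory.
    subprocess.run(["whisper", "video.mp3", "--model", "base"], check=True)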
As part of this experiment, we built some templates that will let anyone play around with Whisper on our platform. If you're interested, we made a video showing the process with our templates [2], and one doing it directly with Python [3].
You still interested in this? I'd be keen to chat. I worked on a searchable transcript provider for educational YouTube videos (likewise, unfortunately pre-Whisper, so I did a lot of work with sentence-completion perplexity and rpunct to try to improve transcript quality over YouTube's automatic transcriptions). I can be contacted at revision.ai and can demo what we were able to do so far; would be great to hear your thoughts.
Whisper (OpenAI speech-to-text) is already trained on YT content; amusingly, if you mumble incoherently, its most-probable completion for noise is “thanks for watching!”
If you're grappling with the slow march from cool tech demos to real-world language-model apps, you might wanna check out WhisperLive. It's this rad open-source project that's all about leveraging Whisper models for slick live transcription. Think real-time, on-the-fly translated captions for those global meetups. It's a neat example of practical, user-focused tech in action. Dive into the details on their GitHub page.
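If you just want a feel for the pattern, here's a naive chunked-streaming sketch; to be clear, this is not WhisperLive's actual API, just the general loop, assuming faster-whisper and sounddevice:

    # Sketch: naive chunked "live" transcription (not WhisperLive's API).
    # Assumes `pip install faster-whisper sounddevice`.
    # Naive: audio arriving while a chunk is transcribed gets dropped;
    # real implementations buffer continuously and use VAD.
    import sounddevice as sd
    from faster_whisper import WhisperModel

    model = WhisperModel("small", compute_type="int8")
    SAMPLE_RATE = 16000
    CHUNK_SECONDS = 5  # latency vs. accuracy trade-off

    while True:
        audio = sd.rec(int(CHUNK_SECONDS * SAMPLE_RATE),
                       samplerate=SAMPLE_RATE, channels=1, dtype="float32")
        sd.wait()  # block until the chunk is recorded
        segments, _ = model.transcribe(audio.flatten())
        for seg in segments:
            print(seg.text, flush=True)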
We built an AI video dubbing app by hacking together ASR, Google Translate, and TTS systems.
Added some features to sync the video with the audio to make the outputs consumable.
Couldn't resist after OpenAI released Whisper. Hacked it into our app in under a day, and now we're live.
Next step is to integrate it into our video dubbing flow, wherein we take a video, convert it into English text using Whisper, and then localize it in the required language.
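For the curious, the skeleton of that flow is short. A rough sketch, with deep-translator and gTTS standing in for whatever translation/TTS stack is actually used (the sync/muxing features are the hard part and aren't shown):

    # Sketch: ASR -> translate -> TTS dubbing pipeline.
    # Assumes `pip install openai-whisper deep-translator gTTS` and ffmpeg.
    import whisper
    from deep_translator import GoogleTranslator
    from gtts import gTTS

    # 1. Speech -> English text (Whisper's task="translate" emits English).
    english = whisper.load_model("small").transcribe(
        "input_video.mp4", task="translate")["text"]

    # 2. English -> target language (Hindi here, as an example).
    localized = GoogleTranslator(source="en", target="hi").translate(english)

    # 3. Text -> speech; re-muxing onto the video and timing sync not shown.
    gTTS(localized, lang="hi").save("dubbed_audio.mp3")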
Some great stuff here. Been thinking about doing this for enterprise-grade software at work. What I want to do is feed it docs (PDF), text from support queries with answers, and videos. What types of approaches should I be considering here? Have just started using Whisper to do ASR on videos.
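One common approach is retrieval: embed everything -- PDF text, support Q&A pairs, and Whisper transcripts -- into a single vector index, retrieve by similarity, and optionally hand the hits to an LLM for answer synthesis. A minimal sketch assuming sentence-transformers (model name and corpus are just examples):

    # Sketch: one embedding index over docs, tickets, and transcripts.
    # Assumes `pip install sentence-transformers`.
    import numpy as np
    from sentence_transformers import SentenceTransformer

    corpus = [
        "PDF: To rotate API keys, open Settings > Security ...",      # docs
        "Ticket: Login fails after SSO change. Answer: re-sync ...",  # support
        "Transcript: In this video we configure the webhook ...",     # Whisper
    ]

    model = SentenceTransformer("all-MiniLM-L6-v2")
    index = model.encode(corpus, normalize_embeddings=True)

    query = model.encode(["how do I rotate my API keys"],
                         normalize_embeddings=True)
    scores = index @ query.T  # cosine similarity (vectors are normalized)
    print(corpus[int(np.argmax(scores))])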
SubEasy.ai is an all-in-one platform where you can create automatic subtitles, AI translations, and transcriptions with speaker names, chat with the transcription, and export it as a video or a text document.
Transcribe:
1. Powered by Whisper: We leverage OpenAI’s Whisper model, which supports many languages with high accuracy, especially in multilingual scenarios. This gives us a competitive edge over ‘traditional’ transcription services.
2. Enhanced Accuracy and Readability: Whisper isn’t perfect, so we aimed to maximize its potential. We implemented the following:
- Clear+: Whisper can pick up background noise in audio/video, like passerby voices, music, and even honking. Using Clear+, we remove this noise with DEMUCS and normalize the audio before sending it to Whisper for transcription (a rough sketch of this kind of pre-processing follows this list).
- Subtitle Reflow: Many audio/video-to-subtitle applications group large blocks of text into the same timeframe, resulting in overly long subtitles on screen. With our exclusive Subtitle Reflow feature, you get context-aware cutting and time-aware segmentation, improving the viewing experience. We actually use smaller NLP models to achieve this, if you’re interested in the tech specs. (Just to say: don’t use an LLM everywhere; it’s too expensive and very unpredictable.) A toy reflow sketch also follows this list.
3. Enhanced Transcription View: We turn audio into well-constructed articles with punctuation, sentences, and paragraphs, useful for previewing podcasts, long audio and video, and meeting minutes.
- Speaker Recognition: This feature identifies different speakers in a multi-speaker conversation, making it easier to follow who’s speaking. We use the NVIDIA NeMo toolkit for state-of-the-art speaker recognition accuracy.
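SubEasy's exact Clear+ pipeline isn't public, but the general idea (vocal isolation, then loudness normalization, then transcription) is easy to sketch with off-the-shelf tools; file paths and model choices here are just examples:

    # Sketch: Clear+-style pre-processing -- isolate vocals, normalize,
    # then transcribe (illustrative only, not SubEasy's actual code).
    # Assumes `pip install demucs openai-whisper` and ffmpeg on PATH.
    import subprocess
    import whisper

    # 1. Split vocals from music/honking/passerby noise with Demucs.
    subprocess.run(["demucs", "--two-stems=vocals", "noisy_talk.wav"],
                   check=True)
    vocals = "separated/htdemucs/noisy_talk/vocals.wav"  # default layout

    # 2. Loudness-normalize with ffmpeg's EBU R128 filter.
    subprocess.run(["ffmpeg", "-y", "-i", vocals, "-af", "loudnorm",
                    "clean.wav"], check=True)

    # 3. Transcribe the cleaned audio.
    print(whisper.load_model("small").transcribe("clean.wav")["text"])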
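And a toy version of reflow: take Whisper's word-level timestamps and regroup them into short, sentence-aware cues. The character limit and break heuristic are arbitrary stand-ins for the smaller NLP models mentioned above:

    # Sketch: regroup word timestamps into short subtitle cues.
    # Assumes `pip install openai-whisper`; MAX_CHARS is a common
    # one-line subtitle limit, chosen arbitrarily here.
    import whisper

    MAX_CHARS = 42
    result = whisper.load_model("small").transcribe(
        "talk.mp4", word_timestamps=True)

    cues, line, start, end = [], "", None, None
    for seg in result["segments"]:
        for w in seg.get("words", []):
            full = len(line) + len(w["word"]) > MAX_CHARS
            done = line.rstrip().endswith((".", "?", "!"))
            if line and (full or done):          # close the current cue
                cues.append((start, end, line.strip()))
                line, start = "", None
            if start is None:
                start = w["start"]
            line += w["word"]
            end = w["end"]
    if line:
        cues.append((start, end, line.strip()))

    for i, (s, e, text) in enumerate(cues, 1):   # SRT-ish output
        print(f"{i}\n{s:.2f} --> {e:.2f}\n{text}\n")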
What Makes it Next-Gen?
1. Context-Aware AI Translation: Most translation services work sentence by sentence, missing context-specific meanings. Using modern AI models, we create context-aware and highly accurate translations. We also introduced a second round of refinement and proofreading, launching AI Plus translation, which can sometimes outperform human translators (a sketch of the two-pass idea follows this list).
2. Chat with the Transcript: We integrated GPTs with our platform, allowing users to interact with their documents in natural language. You can summarize and rewrite transcripts, and much more, on ChatGPT. Since ChatGPT has now rolled out many previously Plus-only features to free users, you can actually use this feature at no extra cost!
3. Integrated AI Companion: You can create summaries, meeting minutes, show notes, and social media content with one click, without leaving the page. Regardless of the transcript language, you can always get AI content in English (or another language you prefer).
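Their implementation isn't public, but the two-pass "draft, then proofread" idea is easy to sketch with any chat-completion API (model name is an assumption):

    # Sketch: two-pass context-aware translation (draft, then refine).
    # Assumes `pip install openai` and OPENAI_API_KEY in the environment.
    from openai import OpenAI

    client = OpenAI()

    def chat(system: str, user: str) -> str:
        resp = client.chat.completions.create(
            model="gpt-4o-mini",  # assumption: any capable chat model
            messages=[{"role": "system", "content": system},
                      {"role": "user", "content": user}],
        )
        return resp.choices[0].message.content

    transcript = open("transcript.txt").read()

    # Pass 1: translate with the whole document as context, not line by line.
    draft = chat("Translate this transcript into French. "
                 "Keep speaker turns intact.", transcript)

    # Pass 2: proofread the draft against the source for consistency.
    final = chat("You are a proofreader. Improve this French translation "
                 "for fluency and terminology consistency; return only the "
                 "corrected text.",
                 f"Source:\n{transcript}\n\nDraft translation:\n{draft}")
    print(final)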
What Makes the Product More Than Good:
We offer a WYSIWYG video preview with multiple subtitle styles, a lightning-fast subtitle/transcript editing interface, a document management system, search, video output, multi-format document export, and more. We believe we have the best overall performance and experience in this specific field.
Final Thoughts
Creating SubEasy.ai has been an incredible journey, inspired by a simple yet profound desire to make my wife's viewing experience more enjoyable. It started as a personal project but quickly evolved into something much larger, driven by the potential to help others facing similar challenges with transcriptions and subtitle translations.
For those who need reliable transcription and translation services, I invite you to give SubEasy.ai a try. You might be pleasantly surprised by its capabilities and the seamless experience it offers. Whether you're curious about the technical aspects, the cost, or just want to provide feedback, I'd love to hear from you. Your insights will help us continue to improve and innovate.
Thank you for taking the time to read about our journey and the creation of SubEasy.ai!
It shouldn't be too hard! From what I saw, Whisper is similar to BART, and we already have BART. The missing piece is a library for processing audio into tensors.
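That front end is essentially a log-mel spectrogram. Whisper's reference implementation uses 16 kHz audio, n_fft=400, hop_length=160, and 80 mel bins; a sketch with torchaudio (note torchaudio's default mel filterbank differs slightly from Whisper's librosa-style filters, so the output is close but not bit-identical):

    # Sketch: audio -> log-mel tensor with Whisper's front-end parameters,
    # mirroring openai/whisper's audio.py.
    # Assumes `pip install torch torchaudio`.
    import torch
    import torchaudio

    wav, sr = torchaudio.load("speech.wav")
    wav = torchaudio.functional.resample(wav.mean(0), sr, 16000)  # mono 16 kHz

    mel = torchaudio.transforms.MelSpectrogram(
        sample_rate=16000, n_fft=400, hop_length=160, n_mels=80, power=2.0,
    )(wav)

    # Whisper-style log compression and normalization.
    log_mel = torch.clamp(mel, min=1e-10).log10()
    log_mel = torch.maximum(log_mel, log_mel.max() - 8.0)
    log_mel = (log_mel + 4.0) / 4.0
    print(log_mel.shape)  # (80, n_frames): the tensor the encoder consumes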
I really liked the idea from an earlier post of continuous recording + Whisper for transcription + keyword-based actions. The drawback is asynchronous execution of your actions, but that setup seems very flexible!
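A toy version of that loop (the keyword table and chunk length are made-up examples; actions only fire after each chunk is transcribed, hence the asynchronous-execution drawback):

    # Sketch: transcribe rolling chunks, fire actions on keywords.
    # Assumes `pip install openai-whisper sounddevice`.
    import sounddevice as sd
    import whisper

    ACTIONS = {  # hypothetical keyword -> action table
        "lights on": lambda: print(">> turning lights on"),
        "play music": lambda: print(">> starting playlist"),
    }

    model = whisper.load_model("base")
    SR = 16000  # Whisper expects 16 kHz input

    while True:
        audio = sd.rec(int(5 * SR), samplerate=SR, channels=1,
                       dtype="float32")
        sd.wait()
        text = model.transcribe(audio.flatten(), fp16=False)["text"].lower()
        for phrase, action in ACTIONS.items():
            if phrase in text:
                action()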
This is awesome! I've been recording myself (video/audio) on and off for the last couple of years (thousands of hours) and have no efficient way of processing the info. I wasn't aware of Whisper, and what he's done is exactly what I'm looking to pull off.
The GPT-3 idea is scary and most certainly the future. I can't stand the world of never-ending 'Moviefone' menus and chatbots, but when it's me that gets to be the machine response, the future doesn't seem so annoying. It would be nice to have my own GPT-3 model that I can use to "get to a real person" when calling places.
Using the Whisper API to do real-time transcription of what your interlocutor is saying and feeding it into WingmanGPT with a system prompt to output what you should say to score, then sending it back to an earpiece via Bark with a real chad voice prompt.
Thanks! Whisper is a lot of fun, but it didn't take long before I wanted to build a frontend. Then I built something that I think came out super nice, so why not share it with people? I used to pay $100/month for transcriptions, and this works a lot better for me, so I might as well open-source transcription if I can. But I give all the credit to Whisper; that model they put out is amazing.
I too am excited about voice inferencing. I wrote my own WebSocket faster-whisper implementation before OpenAI's GPT-4o release; they steamrolled my interview coach concept, https://intervu.trueforma.ai, and my sales pitch coach, https://sales.trueforma.ai. I defaulted to a push-to-talk implementation because I couldn't get VAD to work reliably. I run it all on a LattePanda :) I was looking to implement Groq's hosted Whisper, and I love the idea of having uncensored Llama 3 on Groq as the LLM, as I'm tired of the boring corporate conversations. I hope to reduce my latency and learn from your examples; kudos on your efforts. I wish I could try the demo, but it seems to be oversubscribed, as I can't get in to talk to the bot. I'm sure my LattePanda would melt if just 3 people tried to inference at the same time :)
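On the VAD point, in case it helps: webrtcvad is the usual building block and is strict about framing, which may be why it felt unreliable. It only accepts 16-bit mono PCM at 8/16/32/48 kHz in frames of exactly 10/20/30 ms. A sketch of gating audio before it ever reaches Whisper:

    # Sketch: frame-level speech gating with webrtcvad.
    # webrtcvad only accepts 16-bit mono PCM at 8/16/32/48 kHz,
    # in frames of exactly 10/20/30 ms. Assumes `pip install webrtcvad`.
    import webrtcvad

    vad = webrtcvad.Vad(2)  # aggressiveness 0-3 (3 = most aggressive)
    SR = 16000
    FRAME_MS = 30
    FRAME_BYTES = SR * FRAME_MS // 1000 * 2  # samples * 2 bytes (int16)

    def speech_frames(pcm: bytes):
        """Yield only the frames the VAD flags as speech."""
        for i in range(0, len(pcm) - FRAME_BYTES + 1, FRAME_BYTES):
            frame = pcm[i:i + FRAME_BYTES]
            if vad.is_speech(frame, SR):
                yield frame

    # Usage: speech = b"".join(speech_frames(raw_pcm_bytes)),
    # then hand the buffered speech to Whisper.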
Whisper-UI is also looking really nice lately, but I think it's still pretty early in development. The ability to click on the transcript and hear the audio of that particular moment is great.
https://github.com/hayabhay/whisper-ui
https://www.youtube.com/watch?v=XX-ET_-onYU