
> In fact (as I'm sure you know), one of the most beloved speech synthesizers among English-speaking blind users is a closed-source product called ETI-Eloquence that has been basically dead for nearly 20 years.

This is exactly the speech synthesizer I use daily. I've gotten so used to it over the years that switching away from it is painful. On Apple platforms, though, using it is not an option, so I use Karen. I used to use Alex, but Karen appears to be slightly more responsive and tries to do less human stuff when reading. Responsiveness is a very important factor, actually, probably more so than people might realize. Eloquence and eSpeak react pretty much instantly, whereas other voices might take 100 ms or so. This is a very big deal for me. Just as you'd want instant visual feedback on your screen, it's the same for me with speech: the less latency, the better. My problem with eSpeak is that it sounds very rough and metallic, whereas Eloquence has a much warmer sound to it. I pitch mine down slightly to get an even warmer sound. Being pleasant on the ears is super important if you listen to the thing many, many hours a day.
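
If you want to experiment with the open-source side of this, eSpeak NG exposes pitch and rate as plain command-line flags, and you can time the synthesis yourself to see the latency difference. A rough sketch in Python, assuming the espeak-ng binary is installed and on your PATH:

```python
# Rough sketch: pitch eSpeak NG down for a warmer sound, speed it up,
# and time how long synthesis takes. Assumes `espeak-ng` is on PATH.
import subprocess
import time

TEXT = "Responsiveness matters more than people might realize."

start = time.monotonic()
subprocess.run(
    [
        "espeak-ng",
        "-p", "30",       # pitch, 0-99 (default is around 50); lower sounds warmer
        "-s", "300",      # speed in approximate words per minute (default ~175)
        "-g", "1",        # extra gap between words, in units of 10 ms
        "-w", "out.wav",  # write to a WAV file instead of speaking directly
        TEXT,
    ],
    check=True,
)
print(f"Synthesis took {(time.monotonic() - start) * 1000:.0f} ms")
```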




> This particular synth isn't great.

eSpeak has a lot of settings; in the demo I think I use the defaults, but maybe things can be tweaked for better results.

(I don't really know much about speech synthesis or eSpeak, I just compiled it.)


> I wonder if a style-transfer style algorithm could be used to map the intent of a sentence to a simulated voice.

There's definitely research/proprietary software that enables a person speaking in a desired manner to have their voice control the expression of the generated speech.

Here's a related issue on an Open Source text-to-speech project which I only learned of today: https://github.com/neonbjb/tortoise-tts/issues/34#issue-1229...

> I tend to view most of these things through the perspective of what would help mod-maker's for video games

Yeah, I think there's some really cool potential for indie creatives to have access to (even lower quality) voice simulation--for use in everything from the initial writing process (I find it quite interesting how engaging it is to hear one's words if that's going to be the final form--and even synthesis artifacts can prompt an emotion or thought to develop); to placeholder audio; and even final audio in some cases.

> (and I suspect various open source voice sample sets would become pretty popular).

That's definitely a powerful enabler for Free/Open Source speech systems. There's a list of current data sets for speech at the "Open Speech and Language Resources" site: https://openslr.org/resources.php

Encouraging people to provide their voice for Public Domain/Open Source use does come with some ethical aspects that I think people need to be made aware of so they can make informed decisions about it.

Given your interest in this topic you might be interested in this (rough) tool I finally released last week: https://rancidbacon.itch.io/dialogue-tool-for-larynx-text-to...


>The most impressive part is that the voice uses the right feelings and tonal language during the presentation.

Consequences of audio-to-audio (rather than audio-to-text followed by text-to-audio). Being able to manipulate speech nearly as well as it manipulates text is something else. This will be a revelation for language learning, amongst other things. And you can interrupt it freely now!


> The system also sounds more natural thanks to the incorporation of speech disfluencies (e.g. “hmm”s and “uh”s). These are added when combining widely differing sound units in the concatenative TTS or adding synthetic waits, which allows the system to signal in a natural way that it is still processing. (This is what people often do when they are gathering their thoughts.) In user studies, we found that conversations using these disfluencies sound more familiar and natural.

This part stuck out to me during the Google I/O demo, as an intentional deficiency is an interesting design decision.
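
The latency-masking half of that is easy to picture. A toy sketch of the idea, purely illustrative and nothing to do with Google's actual implementation (speak() and fetch_answer here are hypothetical stand-ins):

```python
# Toy sketch of "disfluency as latency mask": if the backend hasn't answered
# within a short window, speak a filler so the pause reads as thinking.
# speak() and fetch_answer are hypothetical stand-ins, not a real API.
import concurrent.futures
import random

FILLERS = ["hmm...", "uh,", "let me see,"]

def speak(text: str) -> None:
    print(f"[speaking] {text}")  # stand-in for actual TTS playback

def respond(question: str, fetch_answer, filler_after_s: float = 0.3) -> None:
    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(fetch_answer, question)
        try:
            answer = future.result(timeout=filler_after_s)
        except concurrent.futures.TimeoutError:
            speak(random.choice(FILLERS))  # mask the processing delay
            answer = future.result()       # now wait as long as it takes
        speak(answer)
```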


> For normal speech, text is essentially a 1:1 recreation.

More like 1:0.5 or even 1:0.25. A lot is lost going from speech to text, such as tone and volume, two important aspects of speech that affect what is said and what the speaker is trying to convey. Not to mention there are certain little things you can do in speech that don't translate losslessly into text. I don't disagree that voice notes take up more time than text, but speech and text are far from 1:1.


> Perhaps real human readers can help?

Certainly, and they do, but computers can annotate much faster, without attrition, and cost much less.

Generally you'll have human recordings for what you can, and TTS for anything missing.

There are a lot of books, live streams, podcasts, articles, etc. in the world.
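
In practice that pipeline is little more than a lookup with a fallback. A minimal sketch, where the RECORDINGS index and synthesize() are hypothetical placeholders:

```python
# Minimal sketch of "human recordings where available, TTS for the rest".
# RECORDINGS and synthesize() are hypothetical placeholders.
from pathlib import Path

RECORDINGS = {  # text -> path to a pre-recorded human reading
    "Chapter one.": Path("audio/ch1_intro.wav"),
}

def synthesize(text: str) -> Path:
    """Placeholder for whatever TTS engine is in use."""
    out = Path("tts") / f"{abs(hash(text))}.wav"
    # ... call the engine and write `out` here ...
    return out

def audio_for(text: str) -> Path:
    recording = RECORDINGS.get(text)
    return recording if recording is not None else synthesize(text)
```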

> There's no need to have model that can do that.

I wouldn't say there's no need; otherwise we wouldn't talk that way ourselves. It's a human element, and the generated speech is for human ears.

> A robotic accent is best.

Best is subjective because that's a human preference. That being said, I'd say your preference is a far outlier.

Most people's "best" would be what they are used to hearing; the speech of native speakers.


> you can make anyone say anything as long as you have some previous recordings of their voice.

That's not what this is doing. They're simply resynthesizing exactly what the person said, in the same voice. It's essentially cheating because they can use the real person's inflection. Generating correct inflection is the hardest part of speech synthesis because doing it perfectly requires a complete understanding of the meaning of the text.

The top two are representative of what it sounds like when doing true text to speech. The middle five are just resynthesis of a clip saying the exact same thing. And even in that case, it doesn't always sound good. The fourth one is practically unintelligible. But it's interesting because it demonstrates an upper bound on the quality of the voice synthesis possible with their system given perfect inflection as input.

To clarify, this is cool work, the real-time aspect sounds great, and I'm sure it will lead to even more impressive results in the future. But I don't want people to think that all of the clips on this page represent their current text-to-speech quality.


> Existing physical speech synthesis models are pretty crude and sound quite robotic.

Could you provide some links/publications please?


> Whether and how much the skill transfers to normal human speech, or even between synths, is person-specific. I can't do Youtube at much beyond 2x. Others can. It's definitely a learned skill.

I find that the maximum understandable rate varies a lot between speakers. For some speakers 2.5x is possible, but just 1.5x for others.

One advantage synths have is that they can control the speed at which words are spoken and the length of the pauses between words independently. When watching or listening to pre-recorded content I often find that I'd want to speed up the pauses more than the words (because speeding everything up until the pauses are sufficiently short makes the words unintelligible).

If someone knows of a program or algorithm that can play back audio/video using different rates for speech and silence, please share.
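
Something along these lines is roughly what I have in mind, sketched here with pydub; the silence thresholds are guesses and would need tuning per recording:

```python
# Rough sketch: export speech with pauses sped up more than the words.
# Uses pydub (pip install pydub; needs ffmpeg). Thresholds are guesses.
from pydub import AudioSegment, effects
from pydub.silence import detect_silence

SPEECH_SPEED = 1.5   # factor for the spoken parts
PAUSE_SPEED = 4.0    # factor for the pauses

def _speedup(seg: AudioSegment, factor: float) -> AudioSegment:
    # pydub's speedup() raises on very short segments; pass those through as-is
    return seg if len(seg) < 500 else effects.speedup(seg, factor)

audio = AudioSegment.from_file("talk.mp3")
pauses = detect_silence(audio, min_silence_len=500,
                        silence_thresh=audio.dBFS - 16)  # [[start_ms, end_ms], ...]

out, cursor = AudioSegment.empty(), 0
for start, end in pauses:
    out += _speedup(audio[cursor:start], SPEECH_SPEED)   # words
    out += _speedup(audio[start:end], PAUSE_SPEED)       # pause
    cursor = end
out += _speedup(audio[cursor:], SPEECH_SPEED)            # trailing words

out.export("talk_fast.wav", format="wav")
```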


> Bafflingly, though, Dessa said that its team created the Rogan replica voice with a text-to-speech deep learning system they developed called RealTalk, which generates life-like speech using only text inputs.

I'm missing something here. How could they replicate the sound of his voice using only text inputs[emphasis theirs]?


> Personal Voice: When my girlfriend texts me when I'm out with my AirPods, I think we'd both like me to hear her message in her actual voice rather than Siri's. (This feature doesn't allow for that yet, but the pieces are all there.)

Neat idea, and the same goes for CarPlay: having it reliably imitate the voice of the person who sent the message would make it a lot nicer.

Though they would need to get the TTS and intonation right first, which IME they haven't yet. I think having the right voice but entirely the wrong intonation would be one hell of an uncanny valley.


> especially so for audio books

Perhaps real human readers can help?

> The correct inflections, pauses

Because a model that can imitate a voice is still not capable of that. There's no need to have a model that can do that. A robotic accent is best. Or perhaps you'd like to see your politicians make all kinds of bizarre statements on YouTube.


> The synthesis still doesn't know where to place emphasis.

True. And yet, these samples aren't in a monotone! That's an enormous improvement.

> You may be unable to distinguish between POOR human voice work and TTS, but not GOOD human reading.

I think these are indistinguishable from AVERAGE human voice work. Keep in mind that the POOR voice work you may have in mind is probably still being done by someone who is, at least nominally, a paid professional.

> Try e.g. Michelle at: https://cloudpolly.berkine.space/

I'm not able to select a voice other than Oscar (which is definitely worse than this).


MacinTalk 3 (what you probably think of when you say early 90s Mac), i.e. voices like Fred, is technology closely related to DECTalk, and in fact the audio sample in the article sounds eerily close to a Fred-turned-orthodox.

But the article's description of how Siri synthesizes speech is grossly inaccurate.


>One thing these trained voices make clear is that it's a tts engine generating ChatGPT-4o's speech, same as before. The whole omni-modal spin suggesting that the model is natively consuming and generating speech appears to be bunk.

This doesn't make any sense. If it's a speech-to-speech transformer, then 'training' could just be a voice sample at the beginning of the context window. Or it could be one of several voices used for the instruction-tuning or RLHF process. Either way, it doesn't debunk anything.


>Am I hallucinating or didn't several of the examples have background audio artifacts, like it's been trained on speech with noisy backgrounds, I'm guessing audio from movies paired with subtitles? Having random background audio can make it quite hard to use in production.

The other side of that problem is an opportunity. That's why the same model can also generate music, background noise and sound effects, and not just when the prompt specifies those things explicitly. The input is truly semantic, so the output is rich and reflects that context. If your input text sounds like it came from a speech, there's a high chance your output audio will sound like a megaphone in a public space, with crowd reactions and maybe even applause.


> On the original recording the voice sounds soft, but after separation it sounds like it is synthesized or passed through a vocoder

Funny, the original recording seemed kind of robotic to me! Maybe not robotic, but like it's been filtered somehow. But that might just be my not-so-great headphones.


I wish that for the "NaturalSpeech versus recording" comparisons they'd used a different voice for the synthesized speech. Otherwise, the fact that we may not be able to tell them apart by ear (in a blind test) doesn't, on its own, tell us much about how good it is as a speech synthesis engine.

> Even if you do have the time to do it, training a current TTS engine in your voice takes a significant amount of time and the results are often poor.

I wonder what sort of recordings and other data you'd need to get this right, given what TTS might look like in a few years' time.

