Speech Synthesis on Linux (2020) (darkshadow.io)
147 points by ducktective | 2021-09-25 | 76 comments

Having played around with flite a decade ago, and feeling even then that it was nowhere close to the fidelity of other speech synthesis examples, I find it surprising that there still isn't anything better than festival/flite. It sounded like a clunky robot then, and it still does today. Surely some of the many research projects have released their work as open source?

Work like, say https://arxiv.org/abs/1806.04558 [paper]

https://github.com/CorentinJ/Real-Time-Voice-Cloning [repo]


The HTS voice from NIT recommended in the article (voice_cmu_us_slt_arctic_hts) actually sounds much better than the clunky robot from a decade ago. Hear it here:

https://youtu.be/MmcLFJQpv2o?t=85

Edit: or on the online demo; select "HMM-based method (HTS 2011) - Combilex" > "SLT (English American female)".

https://www.cstr.ed.ac.uk/projects/festival/onlinedemo.html


It does indeed sound much better, yes. But that voice was already there a decade ago. It's not... hm. Let me just say that I don't wish to disparage the work done on those projects; I do think it is great. Maybe my point is better illustrated by listening to this video, which showcases the project I mentioned, since machine learning techniques have progressed immensely over the last decade: https://www.youtube.com/watch?v=-O_hYhToKoA

There are of course great benefits to something simple to use. I remember cross-compiling flite to run on a custom Android/Windows/Linux project based on SDL, to generate voice lines for an in-game robot companion (nothing came of it though). It probably would not be nearly as feasible to do the same with some dependency-heavy machine learning library.

Now, I haven't done any research to find better example projects. I was just surprised at how identical the options the article describes are to what was available 12 years ago.


Yes, I think that consumer-level state-of-the-art speech synthesis is still pretty far from acceptable. Amazon Polly doesn't sound too great, and presumably it has more than enough big data to leverage and cloud computing to work with.

https://aws.amazon.com/polly/ https://www.youtube.com/watch?v=00D0YZ9GQX4

Either we're just not there yet technologically (hard to believe), or there isn't a will to make good speech synthesis available to commoners.


I find these voices much more intelligible than say, Stephen Hawking's TTS.

Some of Amazon's voices sound amazing to me; I've actually tested a few of them and people couldn't tell they were synthetic. Watson's voices are nice too. (AFAIK Amazon bought the Polish company IVONA for its Polly TTS system; IVONA was long regarded as one of the best.)

How do you think ReadSpeaker's voices compare? https://www.readspeaker.com

I thought I'd just add another fun fact/data point here. This is obviously my personal opinion. I have to use TTS to use my computer with a screen reader, and for that, I mostly prefer more synthetic speech. When I read long form text like books, articles, etc. I do prefer more natural voices, but for doing actual work like reading code or simply using user interfaces, I like the predictability of more synthetic/algorithmic speech. Apple added the neural Siri voices to the new VoiceOver. They sound incredible but the quality of the voice also brings latency with it. Something like ESpeak is much, much more performant and predictable, and it speeds up much better. I use my TTS at a very fast rate and I find that the more natural a voice, the harder it is to understand at that speech rate. Neural voices speak the same phrase of text differently every time it's uttered. Slightly different intonation, slightly different speech rhythm. This makes it hard to listen out for patterns. So for me there's definitely still a place for synthetic speech.

In fact (as I'm sure you know), one of the most beloved speech synthesizers among English-speaking blind users is a closed-source product called ETI-Eloquence that has been basically dead for nearly 20 years. (It was ported to Android several years ago, but that port was discontinued because they couldn't update it for 64-bit.) No recent speech synthesizer has quite matched its consistent intelligibility, particularly at high speeds. espeak-ng comes close, but it has a bad reputation (mostly, I think, leftover from earlier versions of espeak that really weren't very good).

Edit: Sample of ETI-Eloquence at my preferred speed: https://mwcampbell.us/audio/eloquence-sample-2021-09-25.mp3 (yes, it mispronounces "espeak")

Edit 2: To elaborate on what I mean by "mostly dead": In 2009 I was tasked with adding support for ETI-Eloquence to a Windows screen reader I developed. At that time, Nuance was still selling Eloquence to companies like the one I worked for back then. When I got the SDK, the timestamps on the files, particularly the main DLLs, were from 2002. As far as I know, an updated SDK for Windows was never released. I'm thankful for Windows's legendary emphasis on backward compatibility, particularly compared to Apple platforms and even Android.

Finally, a sample of espeak-ng (in the NVDA screen reader) at my preferred speed: https://mwcampbell.us/audio/espeak-ng-sample-2021-09-25.mp3 I use the default British pronunciation even though I'm American, because the American pronunciation is noticeably off.


> In fact (as I'm sure you know), one of the most beloved speech synthesizers among English-speaking blind users is a closed-source product called ETI-Eloquence that has been basically dead for nearly 20 years.

This is exactly the speech synthesizer I use daily. I've gotten so used to it over the years that switching away from it is painful. On Apple platforms, though, using it is not an option. So I use Karen. I used to use Alex, but Karen appears to be slightly more responsive and tries to do less human stuff when reading. Responsiveness is a very important factor, actually. Probably more so than people might realize. Eloquence and ESpeak react pretty much instantly whereas other voices might take 100 ms or so. This is a very big deal for me. Just like how one would like instant visual feedback on their screen, it's the same for me with speech. The less latency, the better. My problem with ESpeak is that it sounds very rough and metallic whereas Eloquence has a much warmer sound to it. I pitch mine down slightly to get an even warmer sound. Being pleasant on the ears is super important if you listen to the thing many, many hours a day.


I agree with you that Eloquence sounds warmer than eSpeak. I wish there was an open-source speech synthesizer comparable to Eloquence or even DECtalk. That approach to speech synthesis is old enough now that I'm sure there are published algorithms whose patents have expired. The problem, of course, would be funding the work on a good open-source implementation.


That is a bit too fast for me. I am surprised your brain can process at such high speed. I guess it is a matter of practice.

I’ve heard exactly this from a couple of blind people I’ve interacted with too.

Great insight. It is monotonous, but it really helps to be in the flow for a few hours of productive work.

I don't understand this approach when audio deepfakes exist that can quite realistically make Ayn Rand read arbitrary texts [1]. Is it simply a matter of processing power?

[1]: https://www.youtube.com/watch?v=hDVuh4A-q3Q&ab_channel=Vocal...


https://github.com/coqui-ai/TTS, the continuation of Mozilla TTS, produces quite nice results if you pick the right models.
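
For anyone curious, a minimal sketch of its Python API, assuming the TTS package is installed via pip (the model name is one example from their published catalog and may change between releases):

    # Minimal sketch of the Coqui TTS Python API.
    # Assumes "pip install TTS"; the model name is an example and may differ by release.
    from TTS.api import TTS

    tts = TTS(model_name="tts_models/en/ljspeech/tacotron2-DDC")
    tts.tts_to_file(text="Speech synthesis on Linux has come a long way.",
                    file_path="output.wav")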

How far away are we from a debian package that will work with speechd?

A real Debian package is unlikely. To start with, Debian doesn't have a GPU farm for retraining the model from the source data; beyond that, the training probably requires proprietary GPU drivers, and a subset of the source data is probably proprietary.

https://salsa.debian.org/deeplearning-team/ml-policy

Stuffing an existing model into a .deb is of course fairly easy.


Speaking of retraining, I think it's also potentially a bit hairy with regard to reproducible builds, don't you think? My impression is that machine learning models are often initialized with random values before training begins, and some of them may use additional random data during training as well.

You can fix the seed for the pseudorandom number generator.
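
In a PyTorch-based training script, that would look roughly like the sketch below; note that pinning seeds alone does not make training bitwise-reproducible, as the reply points out:

    # Sketch: pin the usual sources of randomness in a PyTorch training run.
    # This alone does not guarantee bitwise-identical results across different hardware.
    import random
    import numpy as np
    import torch

    SEED = 42
    random.seed(SEED)
    np.random.seed(SEED)
    torch.manual_seed(SEED)
    torch.cuda.manual_seed_all(SEED)
    # Opt into deterministic kernels where available (may raise if an op has none).
    torch.use_deterministic_algorithms(True)
    torch.backends.cudnn.benchmark = False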

I think a bigger issue is that retraining isn't deterministic (i.e. bitwise identical) across different hardware, IIRC due to differing float behaviour. I guess folks interested in reproducible ML builds will have to ignore small differences in model float values, but I wonder what effect those differences have on model outputs.

I can confirm this. The setup was also relatively easy for me.

Cool article; I do like the extensibility provided via the Unix philosophy. Another thing you can do nowadays is use off-the-shelf deep neural networks to do TTS, e.g.:

https://github.com/NVIDIA/tacotron2

https://github.com/mozilla/TTS

https://github.com/CorentinJ/Real-Time-Voice-Cloning

https://github.com/coqui-ai/TTS

They're not all easy to set up, however.


Why do so many ML-based projects not release a binary or package?

Because they wouldn't know the specific ABI of your system, which on many systems also changes in a rolling way.

They would have to release a great many different ones or alternatively bundle all libraries with it.


That's why all major software is released exclusively as source code /s

They have the time and money to research this issue on every system, and they typically do bundle libraries.

I once saw a comparison with LibreOffice showing that the package Debian itself provided was 20% of the size of the package LibreOffice provided targeting Debian. The latter would not receive the same security bugfixes to libraries, but of course also not the same problems that often arise on Debian when they arrogantly patch libraries they barely understand and create their own unique security problems.


An overwhelming number of reasons.

Because you need data, trained models, etc.

Because data scientists aren't typically product people or software engineers with UX in mind.

Because ML packages are brittle and tied to specific hardware configurations.

Because the ML world is evolving rapidly. It's quick, dirty, and messy.

View these as stepping stones for research and product development.

(I created https://vo.codes using a lot of these, fwiw, in an attempt to make it easy.)


How can I get some decent-sounding English voices into Firefox, to use in reader view and with the Speech API? By default on Fedora 34, Firefox offers hundreds of variations (?) of the same-sounding robotic voice.

I had the same question some months ago, as TTS helps me focus and read long articles. You have to do some fiddling, but it's doable.

https://askubuntu.com/questions/953509/how-can-i-change-the-...
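
Firefox's Speech API on Linux goes through speech-dispatcher, so a useful first step is to talk to speech-dispatcher directly and confirm which output module sounds acceptable. A minimal sketch, assuming the python3-speechd bindings are installed (the module name and rate are illustrative):

    # Minimal sketch: drive speech-dispatcher directly, the same backend Firefox uses on Linux.
    # Assumes the python3-speechd bindings are installed; module name and rate are illustrative.
    import speechd

    client = speechd.SSIPClient("voice-test")
    client.set_output_module("espeak-ng")  # try another installed module, e.g. "festival"
    client.set_rate(20)                    # -100 (slowest) to 100 (fastest)
    client.speak("If this module sounds better, configure speech-dispatcher to default to it.")
    client.close()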


This article would be better if they provided some speech examples. Demonstrate to the reader what they will get before they go through all the trouble of installing the software.

I followed the instructions and I'm getting:

  Warning: HTS_fopen: Cannot open hts/htsvoice.
  aplay: main:666: bad speed value 0
  aplay: main:666: bad speed value 0
  nil
I'm using Pop OS 20.04 (based on the same version of Ubuntu) and Festival 2.5.0.

Perhaps these lines from Debian 11's /etc/festival.scm are missing from your install; adding them may help:

    ;; Debian-specific: Use aplay to play audio
    (Parameter.set 'Audio_Command "aplay -q -c 1 -t raw -f s16 -r $SR $FILE")
    (Parameter.set 'Audio_Method 'Audio_Command)
In any case, I do not see this error on Debian.

I did, however, go through the motions of installing the Nitech voices before reading that they only work with older versions of Festival. Doh!
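
Either way, a quick sanity check that Festival can reach your audio setup at all, independent of any HTS voice, is to pipe it a sentence. A sketch assuming the festival binary is on PATH:

    # Quick sanity check: pipe a sentence to Festival's --tts mode.
    # Assumes the festival binary is installed and on PATH.
    import subprocess

    subprocess.run(["festival", "--tts"], input="Hello from Festival.", text=True, check=True)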


Yep. This is one of the primary reasons I keep a 2010-era Ubuntu 10.04 box around: a working platform for festival 1.9x and high-quality TTS voices. Modern distros packaging 2.x are a big step backwards. I've tried to get things working as well as they did in Ubuntu 10.04 on Ubuntu 14.04 and it's not really possible; it's even worse on Debian 10/11.

Those lines were already in the config file. I'll try compiling Festival from source when I get home. Some people say that worked for them.

The current state of open source (or even freeware) speech synthesis is pretty sad, to be honest.

You have eSpeak, which is GPLv3, so including it in your own software is a problem. RH Voice can be compiled without GPL code, but its language support is pretty limited. There's also SAM, which is incredibly easy to port and incredibly light on resources, but its licensing status is unknown, it's English-only, and it just sounds bad, even to somebody used to robotic synths.

If you're developing for a popular platform, it probably has something built-in, but if you're developing for embedded, you need to pay thousands of dollars to Cerence (formerly Nuance) to even get started.


Do you happen to know if ETI-Eloquence is now owned by Cerence, or if Microsoft got it with the Nuance acquisition? I'm afraid I was a little too eager to suggest that Microsoft open-source Eloquence when the news of the Nuance acquisition first came out.

Vocalizer is definitely Cerence, that I'm sure of. All the automotive stuff, which Vocalizer is a part of, was spun off before the acquisition.

As an aside, because of Vocalizer's use in automotive, it will probably be the only high-ish quality speech engine that won't become fully cloud-based. VFO's claims about the continued use of Vocalizer in JAWS seem to confirm that.

Regarding Eloquence itself, its status is not really known. I would be extremely surprised if it was owned by Microsoft, though. There's a hypothesis that nobody really knows who actually owns it; multiple companies assisted in its development, including IBM. The product had become so unimportant to Nuance that they might not even have considered it when doing the spinoff, leaving its ownership uncertain. If this hypothesis is untrue, though, I'd strongly suspect that Cerence is the owner, not Microsoft.


> You have eSpeak, which is GPL V3, so including it in your own software is a problem.

Can't you just include it as a separate module and contribute any improvements to that module upstream?


If you're doing this on Linux, you get IPC-related (or worse, process-creation-related) latency. This is not a problem for, say, occasional weather announcements, but it is a big issue when creating things for the blind. If the speech synthesizer speaks each time you press a key, and you generally need to know what it said to decide whether you want to scroll further or open the focused item, every bit of latency matters. That's one of the reasons why blind people prefer robotic-sounding speech synthesizers; they're usually less CPU-intensive, which increases responsiveness.

If the device you're developing for uses some proprietary firmware, a custom module might not even be an option.


If you make a product, why not run the GPL code as a background application and send it commands about what to speak? It would be fair to contribute back any improvements if you were able to add to the GPL stuff.
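
A minimal sketch of that split, using espeak-ng as a stand-in for the GPL component (it speaks text as it arrives on stdin when started without a text argument), so the product only pays process-creation cost once; whether the remaining IPC latency is acceptable is exactly the concern raised above:

    # Minimal sketch: keep a GPL speech engine (here espeak-ng, as a stand-in) running as a
    # separate process and feed it utterances over stdin, avoiding per-phrase process creation.
    # Assumes the espeak-ng binary is installed; illustration only, not a licensing opinion.
    import subprocess

    class BackgroundSpeaker:
        def __init__(self, wpm: int = 300):
            self.proc = subprocess.Popen(
                ["espeak-ng", "-s", str(wpm)],
                stdin=subprocess.PIPE, text=True,
            )

        def say(self, text: str) -> None:
            # One utterance per line; espeak-ng speaks the text as it arrives.
            self.proc.stdin.write(text.strip() + "\n")
            self.proc.stdin.flush()

        def close(self) -> None:
            self.proc.stdin.close()
            self.proc.wait()

    speaker = BackgroundSpeaker()
    speaker.say("Window opened.")
    speaker.say("Window closed.")
    speaker.close()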

I did a lot of work researching all the available text to speech systems a couple of years ago.

The cloud based systems from Google, Microsoft, Amazon and IBM are much better than anything else, and within them, the neural network based systems, which appear to be a sort of different product category, are far and away the best of all. The neural voices are approaching natural voice intonation and have an almost believable ability to read text.

The ones that sounded most natural were IBM Watson and Google's neural voices.

Amazon Polly appeared to be the furthest behind of all the cloud systems… a really average-sounding product.

Of the local TTS systems, the one built into macOS sounds the best… but they were all very average at best. All the Linux ones frankly sounded like garbage relative to the state of the art.

Things might have advanced with the cloud systems over the past couple of years but I didn’t get the impression the cloud companies were putting much effort into research and development.
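
For reference, calling one of the neural cloud voices mentioned above takes only a few lines. A sketch using Google Cloud Text-to-Speech, assuming the google-cloud-texttospeech package is installed and credentials are configured (the voice name is an example; check the published voice list):

    # Sketch: synthesize a sentence with a Google Cloud WaveNet voice.
    # Assumes "pip install google-cloud-texttospeech" and GOOGLE_APPLICATION_CREDENTIALS set;
    # the voice name is an example and should be checked against the current voice list.
    from google.cloud import texttospeech

    client = texttospeech.TextToSpeechClient()
    synthesis_input = texttospeech.SynthesisInput(text="Hello from a neural cloud voice.")
    voice = texttospeech.VoiceSelectionParams(language_code="en-US", name="en-US-Wavenet-D")
    audio_config = texttospeech.AudioConfig(
        audio_encoding=texttospeech.AudioEncoding.LINEAR16  # 16-bit PCM with a WAV header
    )
    response = client.synthesize_speech(
        input=synthesis_input, voice=voice, audio_config=audio_config
    )
    with open("cloud_voice.wav", "wb") as f:
        f.write(response.audio_content)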


I searched for a TTS service recently and found wellsaidlabs. It's a SaaS product, but the quality is astonishing. It's also fast to render the audio, approximately 2 times the length of the audio file. Here is an article from the MIT Technology Review magazine about it: https://www.technologyreview.com/2021/07/09/1028140/ai-voice...

I had reason to sample the IBM performance recently. It is impressive. Do you know if NN-based systems have been trained on, say, audiobooks for which the text is also available?

I find that people who have never used text to speech think that the closer it is to real speech the better it is.

Which is simply not the case.

Artificial speech is to human speech what typography is to handwriting.

For example, espeak is by far my number one choice for reading anything, because the voice models it uses can be sped up to 1k wpm and still be understandable. This is basically a superpower when skimming boring documentation of any type. Throw in basic tesseract OCR, and in a 45-minute sitting I can go through 30k words of any document that can be displayed on a computer screen.

It's not that I'm stuck with a terrible robotic voice; it's that I don't want anything "better", in the same way that I don't see much value in going past the command line for most tools when you can use ncurses.
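
A rough sketch of that kind of pipeline, assuming the tesseract and espeak-ng binaries are installed (the 1000 wpm rate and the image path are illustrative; the commenter's actual espeak setup may differ):

    # Rough sketch: OCR an image of text and read it back very fast with espeak-ng.
    # Assumes the tesseract and espeak-ng binaries are installed; path and rate are illustrative.
    import subprocess

    def read_image_fast(image_path: str, wpm: int = 1000) -> None:
        # Run tesseract and capture the recognized text from stdout.
        ocr = subprocess.run(
            ["tesseract", image_path, "stdout"],
            capture_output=True, text=True, check=True,
        )
        # Feed the text to espeak-ng at a very high speaking rate (-s is words per minute).
        subprocess.run(["espeak-ng", "-s", str(wpm)], input=ocr.stdout, text=True, check=True)

    read_image_fast("page.png")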


Same here. For accessibility purposes the current voices are good enough, but I admit that at the beginning I lost a lot of time trying to find good voices, until I trained myself on faster speeds.

So probably most people here researched this topic not for accessibility reasons but for "commercial" uses, like creating some kind of service where chat bots speak to you, or reading articles aloud for regular people (without eye problems) to listen to.


It depends on the use case. A lot of people who don't use TTS directly are still exposed to it through public announcement systems in transit stations and airports, phone systems, or sometimes voice verification codes.

More natural speech patterns would be useful in those venues.


I wonder if people will hit an uncanny valley here: my experience with video animation was that, sometime around “The Polar Express”, animated-movie makers realized that audiences didn't really want more and more realistic animation.

I think there's certainly a good-enough point, yeah. But the last time I was on a CalTrain platform, their TTS announcement still couldn't pronounce CalTrain, unless it's expected to be pronounced "call train".

Would this be true for things like audiobooks and particularly interesting long-form articles on the web that you actually want to listen to as opposed to "getting through"?

I find that listening to them multiple times is much better value for time than just doing it once.

Does anyone know how to get that voice to work in Reader Mode in Firefox on Debian?

There's also pico2wave (libttspico), which, to me, with the "-l=en-GB" flag, sounds the best by far of any offline TTS that I've tried.

You can hear it in this video: https://www.youtube.com/watch?v=tfcme7maygw&t=131s
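
A minimal sketch of that usage, assuming pico2wave (from libttspico-utils) and aplay are installed; the text is illustrative:

    # Minimal sketch: synthesize a phrase with pico2wave (British English) and play it with aplay.
    # Assumes pico2wave (libttspico-utils) and aplay are installed; text and path are illustrative.
    import subprocess, tempfile, os

    def say(text: str, lang: str = "en-GB") -> None:
        wav = os.path.join(tempfile.mkdtemp(), "speech.wav")
        subprocess.run(["pico2wave", "-l", lang, "-w", wav, text], check=True)
        subprocess.run(["aplay", "-q", wav], check=True)

    say("Hello from pico2wave.")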


I agree. I use pico2wave too, after testing other TTS systems; it has the best voice of the offline options. I use it combined with Home Assistant: whenever a window trigger fires, pico2wave generates a wav file, which is played by the aplay command and sent to a '90s stereo hi-fi. The result is something like: "window X is opening because Y".

The Italian voice sounds great.


A nice enhancement for the system is having TTS read out the currently selected text, triggered by a key shortcut.

I tried Festival, but it was too complicated and my version couldn't run the better voice models.

Instead I've used this repo to use upgraded flite: https://github.com/kastnerkyle/hmm_tts_build/

I have mapped keyboard shortcuts Win+1 for normal speed, Win+2 for faster and Win+3 for really fast reading speed. I can use it while reading, to enhance my focus. Neat.
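
A minimal sketch of that kind of read-the-selection helper, using espeak-ng here rather than the flite build from the linked repo, purely for illustration; it assumes xsel and espeak-ng are installed, and the rate argument mirrors the speed tiers above:

    # Minimal sketch: read the current X11 primary selection aloud with espeak-ng.
    # Assumes the xsel and espeak-ng binaries are installed; bind this script to key shortcuts.
    import subprocess
    import sys

    def speak_selection(wpm: int) -> None:
        # Grab the primary selection (the text currently highlighted).
        selection = subprocess.run(["xsel", "-o"], capture_output=True, text=True).stdout
        if selection.strip():
            subprocess.run(["espeak-ng", "-s", str(wpm)], input=selection, text=True)

    if __name__ == "__main__":
        # e.g. bind "speak_selection.py 300" to Win+1, "450" to Win+2, "600" to Win+3
        speak_selection(int(sys.argv[1]) if len(sys.argv) > 1 else 300)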


I created Larynx (https://github.com/rhasspy/larynx) to address shortcomings I saw in Linux speech synthesis:

* Licensing (MIT)

* Quality (judge for yourself: https://rhasspy.github.io/larynx/)

* Speed (faster than real-time on amd64/aarch64)

* Voices/language support (9 languages, 50 voices)

I'm working now on integrating Larynx with speech-dispatcher/Orca. The next version of Larynx will also support a subset of SSML :)


Can it run on a Raspberry Pi?

Yes! There's a Docker image and Debian package for both 32-bit and 64-bit ARM. The 64-bit version is significantly faster (especially with low quality set).

That's fantastic, someone ought to integrate this with Mycroft, stat.

Some of those sound really good.

I was going to comment that you didn't have any en_gb listed, but it seems there's a bunch under en_us :)

Some rather good brit'ish accents in there me old mate!


Thanks! They seemed to work fine with en_us phonemes, so I haven't created a separate en_gb set yet. Maybe someday :)

cmu_jmk (glow_tts) under en_us seems to be quite nice

I like "ek", but I'm not a native speaker, so my preference for British English voice might not be the same as for anglophones.

Thought I'd give this a go, but I'm getting lots of errors along the lines of 'Expected shape from model of {...} does not match actual shape of {...} for output audio'. I tried the Debian and Python methods of installation on an AMD Ryzen X13.

EDIT: despite those errors I can create output.wav. However, interactive mode crashes with "No such file or directory: 'play'".


The shape warnings don't seem to matter (something to do with the ONNX runtime). Interactive mode needs sox installed, or for you to specify a --play-command.

Sweet, I've been hoping for good Linux TTS!

It's super awesome. Just wondering if it can use the play command on one sample while simultaneously rendering the next, to eliminate pauses? Also wondering if it can work through spd-say (speech-dispatcher). I will probably be able to figure out both; just checking in case there are ready solutions.

Thanks! Try the --raw-stream option for listening to long texts: https://github.com/rhasspy/larynx#long-texts

For speech-dispatcher, I'd start a Larynx HTTP server and use curl to get audio. I have an undocumented --daemon flag that does something like this.


Thank you! Streaming output to aplay works pretty well for me.

I did a bunch of work with TTS about 15 years ago for a project and had landed on Festival and the Nitech voices as the best free option at that time. It's interesting that this seems to still be the best free non-cloud option available.

What a lot of people don't realize is that Festival is intended for creating new TTS voices based on your own voice. The fact that it generates TTS is an artifact of its main function. I've never messed with that functionality myself, but I always wonder if someone could train a synthetic voice to sound better with a larger sample set. The Nitech voices are definitely better, so it's certainly possible to coax Festival into doing a better job.


Listening to the 2001 examples (I'm sorry, Dave...), I wonder: would it be possible to train an AI to copy a voice based only on a few samples? It'd have to "model" the voice from only a few minutes of speech... But I'd love my computer to use HAL's voice, for sure!

I worked with this a bit not that long ago. For cloud services, the quality of Google and Azure "neural" voices is tough to beat. Interestingly, I experienced significant latency for all of the Azure services regardless of region, configuration, etc. I never dug deep enough to figure out what was going on there. Also of note: Azure will let you run their implementation in a local container, with the usual "contact us" stuff. Not sure of the terms and pricing on that.

For local, Mozilla TTS was best from a quality standpoint but the GPU inference support was a bit dicey and (possibly) not really supported at all.

For more complex and bespoke applications, the Nvidia (I know, I know) NeMo toolkit [0] is very powerful but requires more effort than most to get up and running. However, it provides the ability to do very interesting things with additional training and all things speech.

In the Nvidia world there's also their Riva [1] (formerly Jarvis) solution that works with Triton [2] to build out an architecture for extremely performant and high-scale speech applications with things like model management, revision control, deployment, etc.

[0] https://github.com/NVIDIA/NeMo

[1] https://developer.nvidia.com/riva

[2] https://developer.nvidia.com/nvidia-triton-inference-server

