Speech Synthesis on Linux (2020) (darkshadow.io)
147 points by ducktective | 2021-09-25 | 76 comments

Having played around with flite a decade ago, and feeling even then that it was nowhere close to the fidelity of other speech synthesis examples, I find it surprising that there still isn't anything better than festival/flite. It sounded like a clunky robot then, and it still does today. Surely some of the many research projects have released their work as open source?

Work like, say https://arxiv.org/abs/1806.04558 [paper]

https://github.com/CorentinJ/Real-Time-Voice-Cloning [repo]


The HTS voice from NIT recommended in the article (voice_cmu_us_slt_arctic_hts) actually sounds much better than the clunky robot from a decade ago. Hear it here:

https://youtu.be/MmcLFJQpv2o?t=85

Edit: or on the online demo; select "HMM-based method (HTS 2011) - Combilex" > "SLT (English American female)".

https://www.cstr.ed.ac.uk/projects/festival/onlinedemo.html


It does indeed sound much better, yes. But that voice was already there a decade ago. It's not... hm. Let me just say that I don't wish to disparage the work done on those projects; I do think it is great. Maybe my point is better illustrated by listening to this video, which showcases the project I mentioned, since machine learning techniques have progressed immensely over the last decade: https://www.youtube.com/watch?v=-O_hYhToKoA

There are of course great benefits to something simple to use. I remember cross-compiling flite to run on a custom Android/Windows/Linux project based on SDL, to generate voice lines for an in-game robot companion (nothing came of it though). It probably would not be nearly as feasible to do the same with some dependency-heavy machine learning library.

Now, I haven't done any research to find better example projects. I was just surprised at how identical the options the article describes are to what was available 12 years ago.


Yes, I think that consumer-level state-of-the-art speech synthesis is still pretty far from acceptable. Amazon Polly doesn't sound too great, and presumably it has more than enough big data to leverage and cloud computing to work with.

https://aws.amazon.com/polly/ https://www.youtube.com/watch?v=00D0YZ9GQX4

Either we're just not there yet technologically (hard to believe), or there isn't a will to make good speech synthesis available to commoners.


I find these voices much more intelligible than say, Stephen Hawking's TTS.

Some of Amazon's voices sound amazing to me; I've actually tested a few of them and people couldn't tell they were synthetic. Watson's voices are nice too. (AFAIK Amazon bought the Polish company IVONA for its Polly TTS system; IVONA was long regarded as one of the best.)

How do you think ReadSpeaker's voices compare? https://www.readspeaker.com

I thought I'd just add another fun fact/data point here. This is obviously my personal opinion. I have to use TTS to use my computer with a screen reader, and for that, I mostly prefer more synthetic speech. When I read long form text like books, articles, etc. I do prefer more natural voices, but for doing actual work like reading code or simply using user interfaces, I like the predictability of more synthetic/algorithmic speech. Apple added the neural Siri voices to the new VoiceOver. They sound incredible but the quality of the voice also brings latency with it. Something like ESpeak is much, much more performant and predictable, and it speeds up much better. I use my TTS at a very fast rate and I find that the more natural a voice, the harder it is to understand at that speech rate. Neural voices speak the same phrase of text differently every time it's uttered. Slightly different intonation, slightly different speech rhythm. This makes it hard to listen out for patterns. So for me there's definitely still a place for synthetic speech.

In fact (as I'm sure you know), one of the most beloved speech synthesizers among English-speaking blind users is a closed-source product called ETI-Eloquence that has been basically dead for nearly 20 years. (It was ported to Android several years ago, but that port was discontinued because they couldn't update it for 64-bit.) No recent speech synthesizer has quite matched its consistent intelligibility, particularly at high speeds. espeak-ng comes close, but it has a bad reputation (mostly, I think, leftover from earlier versions of espeak that really weren't very good).

Edit: Sample of ETI-Eloquence at my preferred speed: https://mwcampbell.us/audio/eloquence-sample-2021-09-25.mp3 (yes, it mispronounces "espeak")

Edit 2: To elaborate on what I mean by "mostly dead": In 2009 I was tasked with adding support for ETI-Eloquence to a Windows screen reader I developed. At that time, Nuance was still selling Eloquence to companies like the one I worked for back then. When I got the SDK, the timestamps on the files, particularly the main DLLs, were from 2002. As far as I know, an updated SDK for Windows was never released. I'm thankful for Windows's legendary emphasis on backward compatibility, particularly compared to Apple platforms and even Android.

Finally, a sample of espeak-ng (in the NVDA screen reader) at my preferred speed: https://mwcampbell.us/audio/espeak-ng-sample-2021-09-25.mp3 I use the default British pronunciation even though I'm American, because the American pronunciation is noticeably off.


> In fact (as I'm sure you know), one of the most beloved speech synthesizers among English-speaking blind users is a closed-source product called ETI-Eloquence that has been basically dead for nearly 20 years.

This is exactly the speech synthesizer I use daily. I've gotten so used to it over the years that switching away from it is painful. On Apple platforms, though, using it is not an option. So I use Karen. I used to use Alex, but Karen appears to be slightly more responsive and tries to do less human stuff when reading. Responsiveness is a very important factor, actually. Probably more so than people might realize. Eloquence and ESpeak react pretty much instantly whereas other voices might take 100 ms or so. This is a very big deal for me. Just like how one would like instant visual feedback on their screen, it's the same for me with speech. The less latency, the better. My problem with ESpeak is that it sounds very rough and metallic whereas Eloquence has a much warmer sound to it. I pitch mine down slightly to get an even warmer sound. Being pleasant on the ears is super important if you listen to the thing many, many hours a day.


I agree with you that Eloquence sounds warmer than eSpeak. I wish there was an open-source speech synthesizer comparable to Eloquence or even DECtalk. That approach to speech synthesis is old enough now that I'm sure there are published algorithms whose patents have expired. The problem, of course, would be funding the work on a good open-source implementation.


That is a bit too fast for me. I am surprised your brain can process at such high speed. I guess it is a matter of practice.

I’ve heard exactly this from a couple of blind people I’ve interacted with too.

Great insight. It is monotonous, but it really helps to be in the flow for a few hours of productive work.

I don't understand this approach when audio deepfakes exist that can quite realistically make Ayn Rand read arbitrary texts [1]. Is it simply a matter of processing power?

[1]: https://www.youtube.com/watch?v=hDVuh4A-q3Q&ab_channel=Vocal...


https://github.com/coqui-ai/TTS, the continuation of Mozilla TTS, produces quite nice results if you pick the right models.
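
For anyone curious, a minimal sketch of its Python API, assuming the TTS package is installed via pip (the model name is one example from their published catalog and may change between releases):

    # Minimal sketch of the Coqui TTS Python API.
    # Assumes "pip install TTS"; the model name is an example and may differ by release.
    from TTS.api import TTS

    tts = TTS(model_name="tts_models/en/ljspeech/tacotron2-DDC")
    tts.tts_to_file(text="Speech synthesis on Linux has come a long way.",
                    file_path="output.wav")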

How far away are we from a debian package that will work with speechd?

A real Debian package is unlikely. To start with, Debian doesn't have a GPU farm for retraining the model from the source data; beyond that, the training probably requires proprietary GPU drivers, and a subset of the source data is probably proprietary.

https://salsa.debian.org/deeplearning-team/ml-policy

Stuffing an existing model into a .deb is of course fairly easy.


Speaking of retraining, I think it's also potentially a bit hairy with regard to reproducible builds, don't you think? My impression is that machine learning models are often initialized with random values before training begins, and some of them may use additional random data during training as well.

You can fix the seed for the pseudorandom number generator.
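
In a PyTorch-based training script, that would look roughly like the sketch below; note that pinning seeds alone does not make training bitwise-reproducible, as the reply points out:

    # Sketch: pin the usual sources of randomness in a PyTorch training run.
    # This alone does not guarantee bitwise-identical results across different hardware.
    import random
    import numpy as np
    import torch

    SEED = 42
    random.seed(SEED)
    np.random.seed(SEED)
    torch.manual_seed(SEED)
    torch.cuda.manual_seed_all(SEED)
    # Opt into deterministic kernels where available (may raise if an op has none).
    torch.use_deterministic_algorithms(True)
    torch.backends.cudnn.benchmark = False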

I think a bigger issue is that retraining isn't deterministic (i.e. bitwise identical) across different hardware, IIRC due to differing float behaviour. I guess folks interested in reproducible ML builds will have to ignore small differences in model float values, but I wonder what effect those differences have on model outputs.

I can confirm this. The setup was also relatively easy for me.

Cool article; I do like the extensibility provided via the Unix philosophy. Another thing you can do nowadays is use off-the-shelf deep neural networks to do TTS, e.g.:

https://github.com/NVIDIA/tacotron2

https://github.com/mozilla/TTS

https://github.com/CorentinJ/Real-Time-Voice-Cloning

https://github.com/coqui-ai/TTS

They're not all easy to set up, however.


Why do so many ML-based projects not release a binary or package?

Because they wouldn't know the specific ABI of your system, which on many systems also changes in a rolling way.

They would have to release a great many different ones or alternatively bundle all libraries with it.


That's why all major software is released exclusively as source code /s

They have the time and money to research this issue on every system, and they typically do bundle libraries.

I once saw a comparison with LibreOffice showing that the package Debian itself provided was 20% of the size of the package LibreOffice provided targeting Debian. The latter would not receive the same security bugfixes to libraries, but of course also not the same problems that often arise on Debian when they arrogantly patch libraries they barely understand and create their own unique security problems.


An overwhelming number of reasons.

Because you need data, trained models, etc.

Because data scientists aren't typically product people or software engineers with UX in mind.

Because ML packages are brittle and tied to specific hardware configurations.

Because the ML world is evolving rapidly. It's quick, dirty, and messy.

View these as stepping stones for research and product development.

(I created https://vo.codes using a lot of these, fwiw, in an attempt to make it easy.)


How can I get some decent-sounding English voices into Firefox, to use in reader view and with the Speech API? By default on Fedora 34, Firefox offers hundreds of variations (?) of the same-sounding robotic voice.

I had the same question some months ago, as TTS helps me focus and read long articles. You have to do some fiddling, but it's doable.

https://askubuntu.com/questions/953509/how-can-i-change-the-...
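
Firefox's Speech API on Linux goes through speech-dispatcher, so a useful first step is to talk to speech-dispatcher directly and confirm which output module sounds acceptable. A minimal sketch, assuming the python3-speechd bindings are installed (the module name and rate are illustrative):

    # Minimal sketch: drive speech-dispatcher directly, the same backend Firefox uses on Linux.
    # Assumes the python3-speechd bindings are installed; module name and rate are illustrative.
    import speechd

    client = speechd.SSIPClient("voice-test")
    client.set_output_module("espeak-ng")  # try another installed module, e.g. "festival"
    client.set_rate(20)                    # -100 (slowest) to 100 (fastest)
    client.speak("If this module sounds better, configure speech-dispatcher to default to it.")
    client.close()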


This article would be better if they provided some speech examples. Demonstrate to the reader what they will get before they go through all the trouble of installing the software.

I followed the instructions and I'm getting:

  Warning: HTS_fopen: Cannot open hts/htsvoice.
  aplay: main:666: bad speed value 0
  aplay: main:666: bad speed value 0
  nil
I'm using Pop OS 20.04 (based on the same version of Ubuntu) and Festival 2.5.0.

Perhaps these lines from Debian 11's /etc/festival.scm are missing from your install; adding them may help:

    ;; Debian-specific: Use aplay to play audio
    (Parameter.set 'Audio_Command "aplay -q -c 1 -t raw -f s16 -r $SR $FILE")
    (Parameter.set 'Audio_Method 'Audio_Command)
In any case, I do not see this error on Debian.

I did, however, go through the motions of installing the Nitech voices before reading that they only work with older versions of Festival. Doh!
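
Either way, a quick sanity check that Festival can reach your audio setup at all, independent of any HTS voice, is to pipe it a sentence. A sketch assuming the festival binary is on PATH:

    # Quick sanity check: pipe a sentence to Festival's --tts mode.
    # Assumes the festival binary is installed and on PATH.
    import subprocess

    subprocess.run(["festival", "--tts"], input="Hello from Festival.", text=True, check=True)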


Yep. This is one of the primary reasons I keep a 2010-era Ubuntu 10.04 box around: a working platform for festival 1.9x and high-quality TTS voices. Modern distros packaging 2.x are a big step backwards. I've tried to get things working as well as they did in Ubuntu 10.04 on Ubuntu 14.04 and it's not really possible; it's even worse on Debian 10/11.

Those lines were already in the config file. I'll try compiling Festival from source when I get home. Some people say that worked for them.

The current state of open source (or even freeware) speech synthesis is pretty sad, to be honest.

You have eSpeak, which is GPLv3, so including it in your own software is a problem. RH Voice can be compiled without GPL code, but its language support is pretty limited. There's also SAM, which is incredibly easy to port and incredibly light on resources, but its licensing status is unknown, it's English-only, and it just sounds bad, even to somebody used to robotic synths.

If you're developing for a popular platform, it probably has something built-in, but if you're developing for embedded, you need to pay thousands of dollars to Cerence (formerly Nuance) to even get started.


Do you happen to know if ETI-Eloquence is now owned by Cerence, or if Microsoft got it with the Nuance acquisition? I'm afraid I was a little too eager to suggest that Microsoft open-source Eloquence when the news of the Nuance acquisition first came out.

Vocalizer is definitely Cerence, that I'm sure of. All the automotive stuff, which Vocalizer is a part of, was spun off before the acquisition.

As an aside, because of Vocalizer's use in automotive, it will probably be the only high-ish quality speech engine that won't become fully cloud-based. VFO's claims about the continued use of Vocalizer in JAWS seem to confirm that.

Regarding Eloquence itself, its status is not really known. I would be extremely surprised if it was owned by Microsoft, though. There's a hypothesis that nobody really knows who actually owns it; multiple companies assisted in its development, including IBM. The product had become so unimportant to Nuance that they might not even have considered it when doing the spinoff, leaving its ownership uncertain. If this hypothesis is untrue, though, I'd strongly suspect that Cerence is the owner, not Microsoft.


> You have eSpeak, which is GPL V3, so including it in your own software is a problem.

Can't you just include it as a separate module and contribute any improvements to that module upstream?


If you're doing this on Linux, you get IPC-related (or worse, process-creation-related) latency. This is not a problem for, say, occasional weather announcements, but it is a big issue when creating things for the blind. If the speech synthesizer speaks each time you press a key, and you generally need to know what it said to decide whether you want to scroll further or open the focused item, every bit of latency matters. That's one of the reasons why blind people prefer robotic-sounding speech synthesizers; they're usually less CPU-intensive, which increases responsiveness.

If the device you're developing for uses some proprietary firmware, a custom module might not even be an option.


If you make a product, why not run the GPL code as a background application and send it commands about what to speak? It would be fair to contribute back any improvements if you were able to add to the GPL stuff.
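
A minimal sketch of that split, using espeak-ng as a stand-in for the GPL component (it speaks text as it arrives on stdin when started without a text argument), so the product only pays process-creation cost once; whether the remaining IPC latency is acceptable is exactly the concern raised above:

    # Minimal sketch: keep a GPL speech engine (here espeak-ng, as a stand-in) running as a
    # separate process and feed it utterances over stdin, avoiding per-phrase process creation.
    # Assumes the espeak-ng binary is installed; illustration only, not a licensing opinion.
    import subprocess

    class BackgroundSpeaker:
        def __init__(self, wpm: int = 300):
            self.proc = subprocess.Popen(
                ["espeak-ng", "-s", str(wpm)],
                stdin=subprocess.PIPE, text=True,
            )

        def say(self, text: str) -> None:
            # One utterance per line; espeak-ng speaks the text as it arrives.
            self.proc.stdin.write(text.strip() + "\n")
            self.proc.stdin.flush()

        def close(self) -> None:
            self.proc.stdin.close()
            self.proc.wait()

    speaker = BackgroundSpeaker()
    speaker.say("Window opened.")
    speaker.say("Window closed.")
    speaker.close()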

I did a lot of work researching all the available text to speech systems a couple of years ago.

The cloud based systems from Google, Microsoft, Amazon and IBM are much better than anything else, and within them, the neural network based systems, which appear to be a sort of different product category, are far and away the best of all. The neural voices are approaching natural voice intonation and have an almost believable ability to read text.

The ones that sounded most natural were IBM Watson and Google's neural voices.

Amazon Polly appeared to be the furthest behind of all the cloud systems… a really average-sounding product.

Of the local TTS systems, the one built into macOS sounds the best… but they were all very average at best. All the Linux ones frankly sounded like garbage relative to the state of the art.

Things might have advanced with the cloud systems over the past couple of years but I didn’t get the impression the cloud companies were putting much effort into research and development.
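
For reference, calling one of the neural cloud voices mentioned above takes only a few lines. A sketch using Google Cloud Text-to-Speech, assuming the google-cloud-texttospeech package is installed and credentials are configured (the voice name is an example; check the published voice list):

    # Sketch: synthesize a sentence with a Google Cloud WaveNet voice.
    # Assumes "pip install google-cloud-texttospeech" and GOOGLE_APPLICATION_CREDENTIALS set;
    # the voice name is an example and should be checked against the current voice list.
    from google.cloud import texttospeech

    client = texttospeech.TextToSpeechClient()
    synthesis_input = texttospeech.SynthesisInput(text="Hello from a neural cloud voice.")
    voice = texttospeech.VoiceSelectionParams(language_code="en-US", name="en-US-Wavenet-D")
    audio_config = texttospeech.AudioConfig(
        audio_encoding=texttospeech.AudioEncoding.LINEAR16  # 16-bit PCM with a WAV header
    )
    response = client.synthesize_speech(
        input=synthesis_input, voice=voice, audio_config=audio_config
    )
    with open("cloud_voice.wav", "wb") as f:
        f.write(response.audio_content)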


I searched for a TTS service recently and found wellsaidlabs. It's a SaaS product, but the quality is astonishing. It's also fast to render the audio, approximately 2 times the length of the audio file. Here is an article from the MIT Technology Review magazine about it: https://www.technologyreview.com/2021/07/09/1028140/ai-voice...

I had reason to sample the IBM performance recently. It is impressive. Do you know if NN-based systems have been trained on, say, audiobooks for which the text is also available?

I find that people who have never used text to speech think that the closer it is to real speech the better it is.

Which is simply not the case.

Artificial speech is to human speech what typography is to handwriting.

For example, espeak is by far my number one choice for reading anything, because the voice models it uses can be sped up to 1k wpm and still be understandable. This is basically a superpower when skimming boring documentation of any type. Throw in basic tesseract OCR, and in a 45-minute sitting I can go through 30k words of any document that can be displayed on a computer screen.

It's not that I'm stuck with a terrible robotic voice; it's that I don't want anything "better", in the same way that I don't see much value in going past the command line for most tools when you can use ncurses.
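
A rough sketch of that kind of pipeline, assuming the tesseract and espeak-ng binaries are installed (the 1000 wpm rate and the image path are illustrative; the commenter's actual espeak setup may differ):

    # Rough sketch: OCR an image of text and read it back very fast with espeak-ng.
    # Assumes the tesseract and espeak-ng binaries are installed; path and rate are illustrative.
    import subprocess

    def read_image_fast(image_path: str, wpm: int = 1000) -> None:
        # Run tesseract and capture the recognized text from stdout.
        ocr = subprocess.run(
            ["tesseract", image_path, "stdout"],
            capture_output=True, text=True, check=True,
        )
        # Feed the text to espeak-ng at a very high speaking rate (-s is words per minute).
        subprocess.run(["espeak-ng", "-s", str(wpm)], input=ocr.stdout, text=True, check=True)

    read_image_fast("page.png")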


Same here. For accessibility purposes the current voices are good enough, but I admit that at the beginning I lost a lot of time trying to find good voices, until I trained myself on faster speeds.

So probably most people here researched this topic not for accessibility reasons but for "commercial" uses, like creating some kind of service where chat bots speak to you, or reading articles aloud for regular people (without eye problems) to listen to.


It depends on the use case. A lot of people who don't use TTS directly are still exposed to it through public announcement systems in transit stations and airports, phone systems, or sometimes voice verification codes.

More natural speech patterns would be useful in those venues.


I wonder if people will hit an uncanny valley here: my experience with video animation was that, sometime around “The Polar Express”, animated-movie makers realized that audiences didn't really want more and more realistic animation.

I think there's certainly a good-enough point, yeah. But the last time I was on a CalTrain platform, their TTS announcement still couldn't pronounce CalTrain, unless it's expected to be pronounced "call train".

Would this be true for things like audiobooks and particularly interesting long-form articles on the web that you actually want to listen to as opposed to "getting through"?

I find that listening to them multiple times is much better value for time than just doing it once.

Does anyone know how to get that voice to work in Reader Mode in Firefox on Debian?

There's also pico2wave (libttspico), which, to me, with the "-l=en-GB" flag, sounds the best by far of any offline TTS that I've tried.

You can hear it in this video: https://www.youtube.com/watch?v=tfcme7maygw&t=131s
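
A minimal sketch of that usage, assuming pico2wave (from libttspico-utils) and aplay are installed; the text is illustrative:

    # Minimal sketch: synthesize a phrase with pico2wave (British English) and play it with aplay.
    # Assumes pico2wave (libttspico-utils) and aplay are installed; text and path are illustrative.
    import subprocess, tempfile, os

    def say(text: str, lang: str = "en-GB") -> None:
        wav = os.path.join(tempfile.mkdtemp(), "speech.wav")
        subprocess.run(["pico2wave", "-l", lang, "-w", wav, text], check=True)
        subprocess.run(["aplay", "-q", wav], check=True)

    say("Hello from pico2wave.")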


I agree. I use pico2wave too, after testing other TTS systems; it has the best voice of the offline options. I use it combined with Home Assistant: whenever a window trigger fires, pico2wave generates a wav file, which is played by the aplay command and sent to a '90s stereo hi-fi. The result is something like: "window X is opening because Y".

The Italian voice sounds great.


A nice enhancement for the system is having TTS read out the currently selected text, triggered by a key shortcut.

I tried Festival, but it was too complicated and my version couldn't run the better voice models.

Instead I've used this repo to use upgraded flite: https://github.com/kastnerkyle/hmm_tts_build/

I have mapped keyboard shortcuts Win+1 for normal speed, Win+2 for faster and Win+3 for really fast reading speed. I can use it while reading, to enhance my focus. Neat.
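
A minimal sketch of that kind of read-the-selection helper, using espeak-ng here rather than the flite build from the linked repo, purely for illustration; it assumes xsel and espeak-ng are installed, and the rate argument mirrors the speed tiers above:

    # Minimal sketch: read the current X11 primary selection aloud with espeak-ng.
    # Assumes the xsel and espeak-ng binaries are installed; bind this script to key shortcuts.
    import subprocess
    import sys

    def speak_selection(wpm: int) -> None:
        # Grab the primary selection (the text currently highlighted).
        selection = subprocess.run(["xsel", "-o"], capture_output=True, text=True).stdout
        if selection.strip():
            subprocess.run(["espeak-ng", "-s", str(wpm)], input=selection, text=True)

    if __name__ == "__main__":
        # e.g. bind "speak_selection.py 300" to Win+1, "450" to Win+2, "600" to Win+3
        speak_selection(int(sys.argv[1]) if len(sys.argv) > 1 else 300)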


I created Larynx (https://github.com/rhasspy/larynx) to address shortcomings I saw in Linux speech synthesis:

* Licensing (MIT)

* Quality (judge for yourself: https://rhasspy.github.io/larynx/)

* Speed (faster than real-time on amd64/aarch64)

* Voices/language support (9 languages, 50 voices)

I'm working now on integrating Larynx with speech-dispatcher/Orca. The next version of Larynx will also support a subset of SSML :)


Can it run on a Raspberry Pi?

Yes! There's a Docker image and Debian package for both 32-bit and 64-bit ARM. The 64-bit version is significantly faster (especially with low quality set).

That's fantastic, someone ought to integrate this with Mycroft, stat.

Some of those sound really good.

I was going to comment that you didn't have any en_gb listed, but it seems there's a bunch under en_us :)

Some rather good brit'ish accents in there me old mate!


Thanks! They seemed to work fine with en_us phonemes, so I haven't created a separate en_gb set yet. Maybe someday :)

cmu_jmk (glow_tts) under en_us seems to be quite nice

I like "ek", but I'm not a native speaker, so my preference for British English voice might not be the same as for anglophones.

Thought I'd give this a go, but I'm getting lots of errors along the lines of 'Expected shape from model of {...} does not match actual shape of {...} for output audio'. I tried the Debian and Python methods of installation on an AMD Ryzen X13.

EDIT: despite those errors I can create output.wav. However, interactive mode crashes with "No such file or directory: 'play'".


The shape warnings don't seem to matter (something to do with the ONNX runtime). Interactive mode needs sox installed, or for you to specify a --play-command.

Sweet, I've been hoping for good Linux TTS!

It's super awesome. Just wondering if it can use the play command on one sample while simultaneously rendering the next, to eliminate pauses? Also wondering if it can work through spd-say (speech-dispatcher). I will probably be able to figure out both; just checking in case there are ready solutions.

Thanks! Try the --raw-stream option for listening to long texts: https://github.com/rhasspy/larynx#long-texts

For speech-dispatcher, I'd start a Larynx HTTP server and use curl to get audio. I have an undocumented --daemon flag that does something like this.


Thank you! Streaming output to aplay works pretty well for me.

I did a bunch of work with TTS about 15 years ago for a project and had landed on Festival and the Nitech voices as the best free option at that time. It's interesting that this seems to still be the best free non-cloud option available.

What a lot of people don't realize is that Festival is intended for creating new TTS voices based on your own voice. The fact that it generates TTS is an artifact of its main function. I've never messed with that functionality myself, but I always wonder if someone could train a synthetic voice to sound better with a larger sample set. The Nitech voices are definitely better, so it's certainly possible to coax Festival into doing a better job.


Listening to the 2001 examples (I'm sorry, Dave...), I wonder: would it be possible to train an AI to copy a voice based only on a few samples? It'd have to "model" the voice from only a few minutes of speech... But I'd love my computer to use HAL's voice, for sure!

I worked with this a bit not that long ago. For cloud services, the quality of Google and Azure "neural" voices is tough to beat. Interestingly, I experienced significant latency for all of the Azure services regardless of region, configuration, etc. I never dug deep enough to figure out what was going on there. Also of note: Azure will let you run their implementation in a local container, with the usual "contact us" stuff. Not sure of the terms and pricing on that.

For local, Mozilla TTS was best from a quality standpoint but the GPU inference support was a bit dicey and (possibly) not really supported at all.

For more complex and bespoke applications, the Nvidia (I know, I know) NeMo toolkit [0] is very powerful but requires more effort than most to get up and running. However, it provides the ability to do very interesting things with additional training and all things speech.

In the Nvidia world there's also their Riva [1] (formerly Jarvis) solution that works with Triton [2] to build out an architecture for extremely performant and high-scale speech applications with things like model management, revision control, deployment, etc.

[0] https://github.com/NVIDIA/NeMo

[1] https://developer.nvidia.com/riva

[2] https://developer.nvidia.com/nvidia-triton-inference-server

