CPU usage in the decoder is currently higher than Vorbis, though most of the difference is due to having fewer optimizations. As for the encoder, it's already faster than Vorbis and should become even faster in the future. I haven't checked, but I suspect the same is true when compared to MP3. That being said, the complexity is pretty much negligible on a desktop machine.
There is no release yet, but Cisco is saying they will be releasing the source code of the full implementation, not just nvidia-style hooks to a binary blob. AFAIK, the plan is for users to be able to replicate the build and see that the binary they got from Cisco indeed corresponds to the source code.
Well, DRM is completely independent from the codec. You can put DRM on a free codec and you can have a non-free codec without DRM. As for Hollywood... note that the Alliance also includes Netflix.
There is an alternative: you host your thesis (and/or papers) on your own website, or you put them on arXiv.org. It's called self-archiving and it's allowed by most publishers. Funny thing is: making your papers available for free also increases the number of citations, something academics really care about.
Obviously, the rates for these images are the same across all codecs. Choosing the rate for such a comparison is also a little tricky. If you choose a rate that people would use to make their image look really good (which is how most people compress images), then you have a realistic use case, but nobody can see much difference. If you choose too low a rate, then all images look really ugly. So I tried picking a rate where the resulting images are tolerable, but low enough that differences across codecs are easily noticeable.
At low bit-rate, LAME applies a low-pass and resamples to a lower sampling rate. If it didn't do that, MP3 would end up completely starved of bits (trying to code all frequencies) and have really atrocious artefacts. Based on personal experience and the results from a few listening tests I saw, I think you can assume that Opus can now give you about the same quality as MP3 for about 60% of the bits. For example, 76 kb/s Opus should be around the same quality as 128 kb/s MP3.
The threshold for "transparency" varies a great deal across users. Some people I know cannot ABX 64 kb/s Opus from the uncompressed original, yet once in a while we hear of someone being able to ABX one particular sample all the way up to 192 kb/s. There's also the issue of whether you have the reference to compare. Personally, I stop being able to ABX Opus somewhere around 128 kb/s, but I won't be able to tell that a song has been compressed with Opus at 96 kb/s unless I have the original to compare it to.
All Opus releases since 1.0 are fully compatible with the specification (RFC 6716) and all future releases will also be. That's why we have standards. If we ever decided to make something that's not compatible, it would be called a different name.
Even that should happen relatively quickly since so far we've also preserved API and ABI compatibility (since 1.0). Also, we're working closely with the Firefox and Chrome teams so they know what's going on with Opus.
The "distortion" you're hearing is actually just the leftover noise. When there's noise and speech at the same time and frequency, then you can hear some of both and it sounds a bit harsh -- like distortion.
In general, this problem is tough because when there's a change in the signal, you have just 10 ms to decide whether it's the noise or the signal changing.
Interesting -- and unexpected. I also wrote the Speex suppressor and one of the things that specifically annoyed me about it was the robotic noise and the pseudo-reverberation it adds to the speech, but it seems some people (like you) actually prefer that. Trying to understand exactly what you don't like about RNNoise... is it how the remaining background noise sounds or how sharply it turns on/off?
I did a quick hack to RNNoise to smooth out the attenuation and prevent it from cancelling more than 30 dB. I'd be curious if it improves or makes things worse for you (compared to the samples in the demo):
https://jmvalin.ca/misc_stuff/rnn_hack1/
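For the curious, the kind of change involved is roughly this: clamp the per-band gains so no band is attenuated by more than 30 dB, and smooth the gains over time so the suppression doesn't switch on and off abruptly. Below is only a minimal sketch under those assumptions -- the function name, constants and smoothing factor are made up for illustration, not the actual patch:

```c
/* Illustrative sketch (not the actual hack): limit per-band attenuation
   to 30 dB and smooth gain changes over time to avoid abrupt on/off. */
#define NB_BANDS 22          /* RNNoise uses 22 Bark-like bands */
#define MIN_GAIN 0.0316f     /* -30 dB = 10^(-30/20) */
#define SMOOTH   0.6f        /* fraction of the previous gain that is kept */

void limit_and_smooth_gains(float g[NB_BANDS], float g_prev[NB_BANDS]) {
   int i;
   for (i = 0; i < NB_BANDS; i++) {
      /* Never attenuate a band by more than 30 dB. */
      if (g[i] < MIN_GAIN) g[i] = MIN_GAIN;
      /* One-pole smoothing so the suppression doesn't turn on/off sharply. */
      g[i] = SMOOTH*g_prev[i] + (1.f - SMOOTH)*g[i];
      g_prev[i] = g[i];
   }
}
```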
I'm not sure how Audacity works exactly, but keep in mind that one goal of RNNoise is real-time, so it cannot look ahead when denoising. OTOH, if you're denoising an entire file, then you should look at the whole file at once. That makes it easier to make accurate decisions about what to keep and what to discard.
For training I've had to use some non-free data, but there's also some free stuff around. The speech from the examples is from SQAM (https://tech.ebu.ch/publications/sqamcd) and I've also used a free speech database from McGill (http://www-mmsp.ece.mcgill.ca/Documents/Data/). Hopefully if a lot of people "donate their noise", I can make a good free noise database.
What you're describing is more or less why noise suppression algorithms in general cannot really improve intelligibility of the speech. Unless they're given extra cues (like with a microphone array), there's nothing they can do in real-time that will beat what the brain is capable of with "delayed decision" (sometimes you'll only understand a word 1-2 seconds after it's spoken). So the goal of noise suppression is really just making the speech less annoying when the SNR is high enough not to affect intelligibility.
That being said, I still have control over the tradeoffs the algorithm makes by changing the loss function, i.e. how different kinds of mistakes are penalized.
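To give a concrete (if simplified) idea of what "changing the loss function" means here: you can weight errors asymmetrically so that over-suppression (eating into the speech) costs more than leaving some noise behind, or vice versa. The sketch below is illustrative only -- the names and the weight are made up, and it's not the actual training code:

```c
/* Illustrative sketch of an asymmetric loss on per-band gains:
   over-suppression (removing speech) is penalized more heavily than
   under-suppression (leaving noise). The weight is made up. */
float gain_loss(const float *g_est, const float *g_ref, int nb_bands) {
   const float OVER_SUPPRESS_WEIGHT = 4.f;  /* applied when speech gets eaten */
   float loss = 0.f;
   int i;
   for (i = 0; i < nb_bands; i++) {
      float e = g_est[i] - g_ref[i];
      float w = (e < 0) ? OVER_SUPPRESS_WEIGHT : 1.f;
      loss += w*e*e;
   }
   return loss/nb_bands;
}
```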
My comment about intelligibility refers to a human (with normal hearing) directly listening to the output. When the output is used in a hearing aid, a cochlear implant, or a low bitrate vocoder, then noise suppression may be able to help intelligibility too.
The main problem here is that you're depending on implementation-specific behaviour. If you train on a device, you have to run on a device with exactly the same behaviour. On top of that, some FPUs have very slow (trapping) denormal handling. I'm also unsure how accurate the gradient computation can be when the signal itself has numerical issues.
I don't deny it's a cool hack, but beyond that I don't think I see the point or the problem this is trying to solve.
A lot of people get the impression it's only cancelling where there's no speech, but it's also cancelling during speech -- just not as much. If you look at the spectrogram at the top of the demo, you can see HF noise being attenuated when there's LF speech and vice versa.
There's a good reason all the listening tests have stopped at 96 kb/s. Above that, the quality of Opus, Vorbis and AAC is so close to transparency that it's pretty much impossible to get statistically significant results. Even the latest 96 kb/s test was really stretching things.
The C part is mostly for low-level functions and was brought in to help bootstrap development (it's easier to work on improving a working encoder than one that doesn't work yet). The amount of Rust code is expected to increase a lot over time, while the amount of C code is expected to either decrease or remain constant.
Getting something like AOM would have been easy back in 1993 because the costs would have been much lower. Back then complexity had to be really low, which means most of the complicated modern tools were off the table. Coming up with something equivalent to MPEG-1 would have required just a handful of engineers over maybe a year. In terms of IPR, there would also have been much less to check than today. OTOH, the minefield was moving really fast at the time, which could have added some complications. In the end, I think the main reason nobody bothered with something AOM-like is that few people realized the huge problem of patents on standards.
The reason we are not calling it Opus 2 is that it could confuse some people into thinking we broke compatibility. Opus 1.3 is perfectly compatible with Opus 1.0, and all future releases will keep that compatibility.
Like many other audio codecs, Opus lets the encoder decide how to spend the bits it has -- on what frame and on what frequency bands. On top of that it has a few special features that also require decisions from the encoder. So while the decoder doesn't change, the encoder can be improved to make better decisions. While the format itself is not perfect, I have not come across any particular thing that would be worth breaking compatibility over. I prefer working within the constraints of the bitstream to keep improving the quality.
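To make the split concrete, here's a minimal sketch using the public libopus API (error handling mostly omitted): every knob below steers an encoder-side decision, and none of them changes what a spec-compliant decoder needs to understand.

```c
#include <opus.h>

/* Minimal sketch: all the knobs below steer encoder-side decisions;
   the resulting packets stay decodable by any spec-compliant decoder. */
int encode_frame(const opus_int16 *pcm, unsigned char *packet, int max_bytes) {
   int err;
   OpusEncoder *enc = opus_encoder_create(48000, 2, OPUS_APPLICATION_AUDIO, &err);
   if (err != OPUS_OK) return -1;

   opus_encoder_ctl(enc, OPUS_SET_BITRATE(64000));   /* how many bits to spend */
   opus_encoder_ctl(enc, OPUS_SET_VBR(1));           /* let the encoder move bits between frames */
   opus_encoder_ctl(enc, OPUS_SET_COMPLEXITY(10));   /* how hard to search for good decisions */

   /* 20 ms at 48 kHz = 960 samples per channel */
   int len = opus_encode(enc, pcm, 960, packet, max_bytes);
   opus_encoder_destroy(enc);
   return len;
}
```

A newer encoder release can make smarter choices behind those same calls (better bit allocation, better mode decisions) while the packets remain decodable by any 1.0-era decoder.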
Actually, this won't work at all for music because it makes fundamental assumptions that the signal is speech. For normal conversations, it should work, though for now the models are not yet as robust as I'd like in the presence of noise and reverberation. That's next on the list of things to improve.
Keep in mind that the very first CELP speech codec (in 1984) used to take 90 seconds to encode just 1 second of speech... on a Cray supercomputer. Ten years later, people had that running in their cell phones. It's not just that hardware keeps getting faster, but algorithms are also getting more efficient. LPCNet is already 1/100 the complexity of the original WaveNet (which is just 2 years old) and I'm pretty sure it's still far from optimal.
Well, in the case of music, what happens is that due to the low bit-rate there are many different signals that can produce the same features. The LPCNet model is trained to reproduce whatever is most likely to be a single person speaking. The more advanced the model, the more speech-like the music is likely to turn out.
When it comes to noisy speech, it should be possible to improve things by actually training on noisy speech (the current model is trained only on clean speech). Stay tuned :-)
Iridium appears to be using a vocoder called AMBE. Its quality is similar to that of the MELP codec from the demo and it also runs at 2.4 kb/s. LPCNet at 1.6 kb/s is a significant improvement over that -- if you can afford the complexity, of course (at least it'll work on a phone now).
Actually, what's in the demo already includes pruning (through sparse matrices) and indeed, it does keep just 1/10 of the weights as non-zero. In practice it's not quite a 10x speedup because the network has to be a bit bigger to get the same performance. It's still a pretty significant improvement. Of course, the weights are pruned by 16x1 blocks to avoid hurting vectorization (see the first LPCNet paper and the WaveRNN paper for details).
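For anyone wondering how 16x1 blocks keep vectorization happy: each surviving block is 16 consecutive output rows sharing a single input column, so the inner loop is one contiguous multiply-add that a compiler can turn into SIMD. Here's a simplified sketch of such a block-sparse matrix-vector product -- the index format and names are made up for illustration, not the actual LPCNet code:

```c
/* Simplified sketch of a block-sparse matrix-vector product with 16x1
   blocks (16 consecutive output rows, one input column). Only the
   blocks that survived pruning are stored and visited. */
void sparse_gemv_16x1(float *y, const float *weights, const int *idx,
                      int nb_blocks, const float *x)
{
   int b, k;
   for (b = 0; b < nb_blocks; b++) {
      int row = idx[2*b];       /* first of 16 output rows (multiple of 16) */
      int col = idx[2*b + 1];   /* the single input column for this block  */
      float xj = x[col];
      /* This inner loop is what the compiler can map to wide fused
         multiply-adds without any scatter/gather. */
      for (k = 0; k < 16; k++)
         y[row + k] += weights[16*b + k]*xj;
   }
}
```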
The cepstrum that takes up most of the bits (or the LSPs in other codecs) is actually a model of the larynx -- another reason why it doesn't do well on music. Because of the accuracy needed to exactly represent the filter that the larynx makes, plus the fact that it can move relatively quickly, there's indeed a significant number of bits involved here.
The bitrate could definitely be reduced (possibly by 50%+) by using 1-second packets along with entropy coding, but the resulting codec would not be very useful for voice communication. You want packets short enough to get decent latency, and if you use RF, then VBR makes things a lot more complicated (and less robust).
In theory, it wouldn't be too hard to implement with a neural network. In theory. In practice, the problem is figuring out how to do the training because I don't have 2 hours of your voice saying the same thing as the target voice and with perfect alignment. I suspect it's still possible, but it's not a simple thing either.
I didn't say "impossible", merely "not simple". The minute you bring in a GAN, things are already not simple. Also, I'm not aware of any work on a GAN that works with a network that does conditional sampling (like LPCNet/WaveNet), so it would mean starting from scratch.
All major browsers now implement WebRTC, including Opus support. Also, most browsers now support Opus playback in HTML5, though AFAIK Safari only supports it in the CAF container. See https://caniuse.com/#search=opus
No, exactly none of that data was used for training. The training was done before the demo asking for noise contributions. The contributions are CC0, but they were never used (i.e. their quality as a dataset is totally unknown).
> This is entirely my fault, and I take all the blame for that.
You shouldn't be blaming yourself, it was the best thing to do. Some people may have been confused over who the "good guys" were in this mess. By taking over all these channels you made everything perfectly clear. No amount of arguing could have made things clearer than your actions.
The good news is that we were able to improve intelligibility slightly compared with filling with zeros (it's also a lot less annoying to listen to). The bad news is that you can only do so much with PLC, which is why we then pursued the Deep Redundancy (DRED) idea.
Quoting from our paper, training was done using "205 hours of 16-kHz speech from a combination of TTS datasets including more than 900 speakers in 34 languages and dialects". Mostly tested with English, but part of the idea of releasing early (none of that is standardized) is for people to try it out and report any issues.
There are about equal numbers of male and female speakers, though codecs always have slight perceptual quality biases (in either direction) that depend on the pitch. Oh, and everything here is speech only.
Well, there are different ways to make things up. We decided against using a pure generative model to avoid making up phonemes or words. Instead, we predict the expected acoustic features (using a regression loss), which means the model is able to continue a vowel. If it's unsure, it'll just pick the "middle point", which won't be something recognizable as a new word. That's in line with how traditional PLCs work; it just sounds better. The only generative part is the vocoder that reconstructs the waveform, but it's constrained to match the predicted spectrum so it can't hallucinate either.
What the PLC does is (vaguely) equivalent to momentarily freezing the image rather than showing a blank screen when packets are lost. If you're in the middle of a vowel, it'll continue the vowel (trying to follow the right energy) for about 100 ms before fading out. It's explicitly designed not to make up anything you didn't say -- for obvious reasons.
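In very rough terms, the behaviour is a hold-then-fade schedule: keep extending the current sound for a short while, then fade to silence. The sketch below is purely illustrative (not the actual implementation, and the durations are just the ballpark figures mentioned above):

```c
/* Rough illustration of a concealment gain as a function of how long
   the packet loss has lasted: hold for ~100 ms, then fade out.
   Numbers are illustrative, not the actual implementation. */
float plc_gain(float loss_ms) {
   const float HOLD_MS = 100.f;
   const float FADE_MS = 100.f;
   if (loss_ms <= HOLD_MS) return 1.f;                 /* keep continuing the vowel */
   if (loss_ms >= HOLD_MS + FADE_MS) return 0.f;       /* fully faded out */
   return 1.f - (loss_ms - HOLD_MS)/FADE_MS;           /* linear fade in between */
}
```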
Actually, what we're doing with DRED isn't that far from what you're suggesting. The difference is that we keep more information about the voice/intonation and we don't need the latency that would otherwise be added by an ASR. In the end, the output is still synthesized from higher-level, efficiently compressed information.