
"fine-tuned on images of spectrograms paired with text"

How many image/text training pairs did you use, and what was the source of your training data? Just curious how much fine-tuning was needed to get these results, and how broad the original sources had to be to achieve sufficient musical diversity.





Producing images of spectrograms is a genius idea. Great implementation!

A couple of ideas that come to mind:

- I wonder if you could separate the audio tracks of each instrument, generate separately, and then combine them. This could give more control over the generation. Alignment might be tough, though.

- If you could at least separate vocals and instrumentals, you could train a separate model for vocals (an LLM for the lyrics, then text-to-speech, maybe). The current implementation doesn't seem to handle vocals as well as TTS models do; a rough sketch of the separation step is below.
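A minimal sketch of that separation, assuming the `spleeter` package (just one possible tool, not something from the original project; the file names are placeholders):

```python
# Sketch only: split a track into vocal and accompaniment stems before
# building separate training sets. Assumes the `spleeter` package.
from spleeter.separator import Separator

separator = Separator("spleeter:2stems")          # pretrained vocals/accompaniment split
separator.separate_to_file("song.mp3", "stems/")  # writes stems/song/vocals.wav and
                                                  # stems/song/accompaniment.wav

# Each stem can then be converted to spectrograms and used to train its own model.
```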


Awesome work.

Would you be willing to share details about the fine-tuning procedure, such as the initialization, learning rate schedule, batch size, etc.? I'd love to learn more.

Background: I've been playing around with generating image sequences from sliding windows of audio. The idea roughly works, but the model training gets stuck due to the difficulty of the task.
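Roughly, what I mean by sliding windows is something like this (a sketch assuming librosa; the window and hop sizes here are arbitrary):

```python
# Slide a fixed-length window over the audio and turn each window into one
# mel-spectrogram "frame" of the image sequence.
import numpy as np
import librosa

y, sr = librosa.load("track.wav", sr=22050, mono=True)

win = int(5.0 * sr)    # 5 s window
hop = int(1.0 * sr)    # 1 s stride

images = []
for start in range(0, len(y) - win, hop):
    chunk = y[start:start + win]
    mel = librosa.feature.melspectrogram(y=chunk, sr=sr, n_mels=128)
    images.append(librosa.power_to_db(mel, ref=np.max))

print(len(images), images[0].shape)   # n_windows, (128, time_bins)
```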


Amazing work! Did you use CLIP or something like it to link the genre text with the mel spectrograms? What datasets did you use?

How would you use spectrogram diffs for training?

I'm not sure what the useful "subpatterns" of sound would be. In language modeling there are word-based and character-based models; given enough text, an RNN can be trained on either, and I'm not sure which approach is better. For music, the closest equivalent of a word is (probably) a chord and the closest equivalent of a character is (probably) a single note, but perhaps it should be something like a harmonic, I don't know.
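As a toy illustration of the two vocabularies (made-up note names, plain Python, nothing standardized):

```python
# One bar of toy "music": each tuple is a group of simultaneous notes.
melody = [("C4", "E4", "G4"), ("F4", "A4", "C5"), ("G4",)]

# "Character-level" vocabulary: one token per note.
note_tokens = [note for group in melody for note in group]
# ['C4', 'E4', 'G4', 'F4', 'A4', 'C5', 'G4']

# "Word-level" vocabulary: one token per chord (the whole group).
chord_tokens = ["+".join(group) for group in melody]
# ['C4+E4+G4', 'F4+A4+C5', 'G4']
```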

Unlike faces, music is a sequence (of sounds). It's closer to video than to an image, so we need to chop it up and encode each chunk.

Ultimately, I believe we just need a lot of data. Given enough data, we can train a model large enough to learn everything it needs in an end-to-end fashion. The primary achievement of the GPT-2 paper was training a big model on lots of data. In this work, it appears they used only a couple of the available MIDI datasets for training, which is probably not enough. Training on all available audio recordings (either raw or converted to a symbolic format) would probably be a game changer.


Somewhat silly question: to what extent does analyzing music/sound via spectrogram images provide "enough" information for use in deep learning systems (like ResNet), compared to something like MusicNet?

Once image models are good enough, you can train them to generate spectrogram images and convert those back to audio to listen to the generated music.
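Roughly, the reconstruction step might look like this (a sketch assuming librosa; actual projects may invert spectrograms differently):

```python
# Sketch: magnitude spectrogram -> audio via Griffin-Lim phase estimation.
import numpy as np
import librosa
import soundfile as sf

y, sr = librosa.load("clip.wav", sr=22050)

S = np.abs(librosa.stft(y, n_fft=2048, hop_length=512))  # the "image": magnitudes only
y_rec = librosa.griffinlim(S, hop_length=512)            # estimate the missing phase

sf.write("reconstructed.wav", y_rec, sr)
```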

I wonder if you could get interesting results by applying neural style transfer to spectrograms.

If it can do music, can we train better models for different kinds of music? Or would different models for different instruments make more sense? For different instruments we could get better resolution by making the spectrogram cover different frequency ranges. This is terribly exciting, what a time to be alive.
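For example (a sketch assuming librosa; the frequency ranges are just guesses):

```python
# Band-limited mel spectrograms: spend the same 128 bands on different ranges
# depending on the instrument being modeled.
import librosa

y, sr = librosa.load("take.wav", sr=44100)

bass_spec = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128, fmin=20, fmax=500)
lead_spec = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128, fmin=200, fmax=8000)
```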

The audio sounds a bit lossy. Would it be possible to create high-quality spectrograms from music, downsample them, and use that as training data for a spectrogram upscaler?

It might be the last step this AI needs to bring some extra clarity to the output.
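Something like this could generate the training pairs (a sketch assuming librosa and scikit-image; sizes are arbitrary):

```python
# Build (low-res, high-res) spectrogram pairs for training an upscaler.
import numpy as np
import librosa
from skimage.transform import resize

y, sr = librosa.load("track.wav", sr=44100)
hi = librosa.power_to_db(
    librosa.feature.melspectrogram(y=y, sr=sr, n_mels=256), ref=np.max
)                                                            # high-quality target
lo = resize(hi, (hi.shape[0] // 4, hi.shape[1] // 4), anti_aliasing=True)  # degraded input

# An upscaler trained on (lo, hi) pairs would then be applied to generated
# spectrograms before converting them back to audio.
```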


Fun! I tried something similar with DCGAN when it first came out, but that didn't exactly make nice noises. The conversion to and from Mel spectrograms was lossy (to put it mildly), and DCGAN, while impressive in its day, is nothing like the stuff we have today.

Interesting that it gets such good results from just fine-tuning the regular SD model. I assume most of the images it was trained on are useless for learning how to generate mel spectrograms from text, so a model trained from scratch could potentially do even better.

There's still the issue of reconstructing sound from the spectrograms. I bet it's responsible for the somewhat tinny sound we get from this otherwise very cool demo.


Did you have a data set for training the relationship between words and the resulting sound?

Wow, I find it incredible that this works. As I understand it, the approach is to do a Fourier transform on a couple seconds of the song to create a 128x128 pixel spectrogram. Each horizontal pixel represents a 20 ms slice in time, and each vertical pixel represents 1/128 of the frequency domain.

Then, treating these spectrograms as images, train a neural net to classify them using pre-labelled samples, and let it classify samples taken from the unknown songs. I find it incredible that 2.5 seconds of sound represented as a tiny picture captures enough information for reliable classification, but apparently it does!
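A back-of-the-envelope sketch of that representation (assuming librosa; the original work's exact parameters may differ):

```python
# 2.56 s of audio at 20 ms per column and 128 mel bands gives roughly a
# 128x128 "image".
import numpy as np
import librosa

sr = 22050
y, _ = librosa.load("song.wav", sr=sr, offset=30.0, duration=2.56)

hop = int(0.020 * sr)                    # 20 ms per horizontal pixel
S = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128, hop_length=hop)
img = librosa.power_to_db(S, ref=np.max)

print(img.shape)                         # about (128, 128); the exact column
                                         # count depends on padding
```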


I'm super excited about the Audio AI space, as it seems permanently a few years behind image stuff - so I think we're going to see a lot more of this.

If you're interested, the idea of applying image-processing techniques to spectrograms of audio is explored briefly in the first lesson of one of the most recommended AI courses on HN: Practical Deep Learning for Coders https://youtu.be/8SF_h3xF3cE?t=1632


This article approaches the topic at a pretty low level. For another take: I recently worked on a hobby project that built a multi-label deep learning classifier using CNNs and got about 95% accuracy on the validation set. I'm somewhat familiar with signal processing from my background, but really just wanted to scratch the surface of deep learning. To be clear, it was just a fun project for me and I don't pretend to understand everything I did.

My goal was to detect instruments in complex, full-length music, as opposed to single sources with no background noise. My approach was to generate mel spectrograms for small sections of songs and then run the deep learning classifier on these images to build a list of labels for later use, e.g. generating playlists. I more than doubled the accuracy by adding to and curating my own dataset. I used ResNet50 as the base model and got decent results even when it was applied to spectrograms!
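The core of the setup was roughly like this (a sketch using torchvision for illustration rather than my exact code; the multi-label loss and label count here are assumptions):

```python
# ResNet50 backbone with a multi-label head over spectrogram images.
import torch
import torch.nn as nn
from torchvision import models

n_labels = 11                             # number of instrument classes (arbitrary here)
model = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
model.fc = nn.Linear(model.fc.in_features, n_labels)

criterion = nn.BCEWithLogitsLoss()        # independent sigmoid per instrument
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

def train_step(spectrogram_batch, target_batch):
    # spectrogram_batch: (N, 3, 224, 224) spectrogram images
    # target_batch: (N, n_labels) multi-hot instrument labels
    optimizer.zero_grad()
    loss = criterion(model(spectrogram_batch), target_batch)
    loss.backward()
    optimizer.step()
    return loss.item()
```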

I still want to do more analysis to figure out what kind of small features the model was actually picking up on. I also want to try experiments like scrambling the spectrogram time-wise and seeing if the results are still as accurate.


"at around 3:50 it even attempts some singing."

Mmmmm... that sounds like overfitting. That's not "attempting some singing", that's "playing back one of the things it trained on". Which really raises questions about the rest of what you hear, too; it seems like what is being produced is probably in some sense the "average" of the training data, rather than something able to generate new samples from it. But it's a very interesting "average" full of interesting information.

Since I wouldn't expect this to produce much else, I'm not being critical about the effort, just pointing it out so others understand what they are hearing. It was an interesting and worthy experiment that I wondered about myself.


> image representations of audio are horrible representations of it

The spectrogram is just a series of FFTs taken over time; encoding it as a bitmap doesn't really change this, aside from precision issues. Any other representation of the audio is derived from either the original time-domain signal or the FFT.

Indeed, humans can't reliably map raw waveforms or spectrograms to intuitive musical phenomena. But a CNN should be able to derive meaningful features from these basic representations on its own.
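In plain numpy, the "image" is literally just stacked FFT magnitudes (illustrative frame/hop sizes, not anything specific to this project):

```python
# Each spectrogram column is the magnitude of one windowed FFT frame.
import numpy as np

def spectrogram(signal, n_fft=1024, hop=256):
    window = np.hanning(n_fft)
    frames = [
        signal[start:start + n_fft] * window
        for start in range(0, len(signal) - n_fft, hop)
    ]
    return np.abs(np.fft.rfft(frames, axis=1)).T   # shape: (n_fft // 2 + 1, n_frames)

sig = np.random.randn(22050)       # 1 s of noise as a stand-in signal
print(spectrogram(sig).shape)      # (513, n_frames)
```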


I used the spectrogram just to get the audio in a 2d representation that was insensitive to timeshifts. The image processing part was the `match_template` function from skimage.

Normalized correlation (which is what `match_template` uses) has been applied outside of image processing before, though. I also tried other image-processing techniques like Harris corner detection and SURF feature detection. I didn't write them up, but you can see the code on GitHub:

https://github.com/jminardi/audio_fingerprinting
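The matching step is essentially this (a sketch assuming scikit-image; the arrays below are synthetic placeholders rather than real spectrograms):

```python
# Slide the snippet's spectrogram over the full track's spectrogram and take
# the position with the highest normalized correlation.
import numpy as np
from skimage.feature import match_template

full_spec = np.random.rand(128, 1000)      # placeholder for the reference track
snippet_spec = full_spec[:, 400:460]       # placeholder for the query snippet

corr = match_template(full_spec, snippet_spec)
best = np.unravel_index(np.argmax(corr), corr.shape)
print("best match at time bin", best[1])   # 400 for this synthetic example
```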


Is it easy to recreate audio from, say, an image file of the spectrogram results?