Finding the genre of a song with Deep Learning

nkurz | karma 53095 | avg karma 9.3 · 2016-12-02 01:04:53

Wow, I find it incredible that this works. As I understand it, the approach is to do a Fourier transform on a couple seconds of the song to create a 128x128 pixel spectrogram. Each horizontal pixel represents a 20 ms slice in time, and each vertical pixel represents 1/128 of the frequency domain.

Then treating these spectrograms as images, train a neural net to classify them using pre-labelled samples. Then take samples from the unknown songs, and let it classify them. I find it incredible that 2.5 seconds of sound represented as a tiny picture captures information enough for reliable classification, but apparently it does!

reply

iammyIP | karma 165 | avg karma 0.61 · 2016-12-02 02:04:05+00:00

One reason might be that the mentioned genres are highly formulaic to begin with. The standard rap song contains about 2 bars of unique music stretched out over 3 minutes with slight variations. Same with dubstep and techno. All highly repetitive. Classical music got no drums, so you can detect that. Metal got guitar distortion all over the spectrum. So with these examples the spectral images should have enough distinctive features that can be learned. Why should it be different than with 'normal' pictures? Also it looks like they take four 128x128 guesses per song.

dagw | karma 25328 | avg karma 2.79 · 2016-12-02 07:49:19

If they can write some code that can classify metal into one of its 72 sub-genres then I'll be truly impressed :)

Although I wonder what that would do to the metal scene if their main topic of discussion and contention got completely solved.

reply

Despoisj | karma 162 | avg karma 5.23 · 2016-12-02 09:22:13

Haha that would be awesome ! I guess we'd need a lot of data, and probably use much more detailed spectrogram (time-wise and frequency-wise).

Despoisj | karma 162 | avg karma 5.23 · 2016-12-02 09:26:40

It's true that having very different genres helps the model a lot. It would be much more difficult to distinguish between closer genres, especially when people don't really know which is which and argue all the time about it.

mjn | karma 9588 | avg karma 5.24 · 2016-12-02 13:52:20

It's quite possible that it's mainly using even more surface-level audio features, before getting to whether the genres are formulaic or not. For example, if specific mastering studios have telltale production features visible in the audio (choice of dynamic range compression algorithms, mixing approaches, etc.), and some mastering studios mainly master, say, country, you can learn to classify country with pretty high accuracy by just recognizing a half-dozen studios' production signatures, without learning anything fundamental about the genre. Whether this happens depends a lot on your choice of data set and validation method.

There's more on that (and some other pitfalls) in a paper linked elsewhere in the comments here: https://news.ycombinator.com/item?id=13085651

reply

kimburgess | karma 2583 | avg karma 6.46 · 2016-12-02 08:59:41

From the description in the walkthrough, it doesn't. The final output looks to be based on 5 of these slices, with each providing a probability distribution that ultimately influences the final classification.

Despoisj | karma 162 | avg karma 5.23 · 2016-12-02 09:20:15

Sorry if the 5 slices are misleading it was only for readability, the average song has 70 slices, which are all classified and used or voting.

Despoisj | karma 162 | avg karma 5.23 · 2016-12-02 09:19:17

Yep you got it right, except the voting system adds tons of reliability because we cannot trust the slice classification (2.5s) too much.

nkurz | karma 53095 | avg karma 9.3 · 2016-12-02 20:13:06

I wonder if training another net on top of the slices would work better than voting for a single winner. I'd presume that there are genres that are well characterized by the distribution and progression of their spectrograms. Probably expand/compress the collection of slices to a standard length before training?

(Nice to see you show up for the discussion. I was worried that you'd given up hope before your article hit the front page.)

reply

amelius | karma 42902 | avg karma 1.63 · 2016-12-02 12:59:44+00:00

I guess the spectogram behaves like an image in that translation of any feature by an arbitrary distance (dx,dy) preserves its predicting quality.

But please correct me if I'm wrong.

reply

iverjo | karma 167 | avg karma 3.71 · 2016-12-02 01:23:25+00:00

Nice approach, and well explained! By the way, Niland is a startup that also does music labeling with the help of deep learning.

Demo available here: http://demo.niland.io/

For example, it can output Drum Machine: 87%, House: 88%, Female Voice: 55%, Groovy: 93%

reply

Despoisj | karma 162 | avg karma 5.23 · 2016-12-02 09:28:41

Thanks for the kind words, I'll take a look !

iverjo | karma 167 | avg karma 3.71 · 2016-12-02 01:47:19+00:00

To the author: Have you tried to use a logarithmic frequency scale in the spectrogram? [1] That representation is closer to the way humans perceive sound, and gives you finer resolution in the lower frequencies. [2] If you want to make your representation even closer to the human's perception, take a look at Google's CARFAC research. [3] Basically, they model the ear. I've prepared a Python utility for converting sound to Neural Activity Pattern (resembles a spectrogram when you plot it) here: https://github.com/iver56/carfac/tree/master/util

[1] https://sourceforge.net/p/sox/feature-requests/176/

[2] https://en.wikipedia.org/wiki/Mel_scale

[3] http://research.google.com/pubs/pub37215.html

reply

jschmitz28 | karma 309 | avg karma 2.92 · 2016-12-02 03:58:12+00:00

Mel scale spectrograms are the approach taken in a research paper which uses roughly the same technique as is described in this post: https://dl.dropboxusercontent.com/u/19706734/paper_pt.pdf

Terribledactyl | karma 494 | avg karma 2.73 · 2016-12-02 04:03:03+00:00

I don't think this problem is bound by absolute frequency resolution, the tightest distance between two notes on a typical piano is ~2hz and if you assume a doubling between octaves you're at <90 notes. The temporal changes and relative chord progressions probably give more info.

Despoisj | karma 162 | avg karma 5.23 · 2016-12-02 09:46:36+00:00

Thanks for your insights! I agree that log/mel spectrograms could be even more detailed and effective, and could be used with the SoX patch discussed here https://sourceforge.net/p/sox/feature-requests/176/.

Despoisj | karma 162 | avg karma 5.23 · 2016-12-02 09:48:26+00:00

I didn't intend to go that far in the human genre recognition parallel, but thanks for the references ! Good job on the script too

tunesmith | karma 8404 | avg karma 3.06 · 2016-12-02 02:14:21

That's pretty cool, I'd like to use something like this to tell me what genre my own songs are. It's annoying to write a song and then upload it to some service or another and have no idea what genre to pick. :-) My stuff is somewhere in the jazz-influenced singer-songwriter american piano pop realm which is a combination that works for me but it generally feels like I'm selling the song short if I have to pick only one.

Despoisj | karma 162 | avg karma 5.23 · 2016-12-02 09:31:36+00:00

Yeah that's a problem I know - I used to make some Electro/Dubstep/Trap music - and I feel people will always disagree with the genre you pick anyway.

GFK_of_xmaspast | karma 3102 | avg karma 0.89 · 2016-12-02 03:05:23+00:00

See also Bob Sturm's work on genre classification: http://link.springer.com/article/10.1007/s10844-013-0250-y

Despoisj | karma 162 | avg karma 5.23 · 2016-12-02 09:37:14+00:00

Wow, that's an in-depth analysis of the task ! Thanks for sharing

jschmitz28 | karma 309 | avg karma 2.92 · 2016-12-02 03:55:17

Unless I'm misunderstanding the validation set, I'm skeptical of the ability of this classifier to tag unlabeled tracks, given that it is only being trained and tested on tracks which are already known to belong to one of the few trained genres. I'd be curious to see the performance if you were to additionally test on tracks which are not any of (Hardcore, Dubstep, Electro, Classical, Soundtrack and Rap), with a correct prediction being no tag.

Despoisj | karma 162 | avg karma 5.23 · 2016-12-02 09:23:02+00:00

It's true that the validation set only contains genres I used for the training. I'll try this out this evening ;)

dkarapetyan | karma 1 | avg karma 0.0 · 2016-12-02 04:33:06

Hmm, convolution is perfectly good operation to run on wave forms as well. In fact the wikipedia article (https://en.wikipedia.org/wiki/Convolution) shows the operation on functions which would correspond to time-domain wave forms. What is the point of converting everything to pictures and then using 2D convolutions when that step could have been skipped entirely?

Converting to pictures is unnecessary. It makes the processing harder. The pooling should just happen on segments of the wave form instead of the fourier transform (frequency-domain) picture spectrograms.

reply

highd | karma 526 | avg karma 3.35 · 2016-12-02 04:46:49

The idea is that the vertical axis of the spectrogram is basically already an hierarchical set of features (in scale/frequency). Then convolutions on that is a lot like how DenseNets combine hierarchical features.

I agree it seems a little jank, but the features are pretty good - and a lot of network architectures / training techniques are most practiced in an image processing context.

reply

Despoisj | karma 162 | avg karma 5.23 · 2016-12-02 09:35:33+00:00

Thanks for your inputs, it's true that we can use convolutions on raw waveform, however the main reason I've used a spectrogram was to work on precomputed relevant features as highd pointed out, instead of running the convolution on lots of data.

return0 | karma 5191 | avg karma 1.91 · 2016-12-02 08:37:48+00:00

Good luck convincing musicians that "THAT's your genre"

Despoisj | karma 162 | avg karma 5.23 · 2016-12-02 09:24:19+00:00

Haha, I should have titled the article "How to build an internet music-genre-troll-bot"

chestervonwinch | karma 1317 | avg karma 2.46 · 2016-12-02 12:46:45+00:00

1. I wonder how the continuous wavelet transform would compare to the windowed Fourier transform used here. See [1] an python implementation, for example.

2. The size of frequency analysis blocks seems arbitrary. I wonder if there is a "natural" block size based on a song's tempo, say 1 bar. This would of course require a priori tempo knowledge or a run-time estimate.

[1]: https://docs.scipy.org/doc/scipy-0.15.1/reference/generated/...

reply

Despoisj | karma 162 | avg karma 5.23 · 2016-12-02 12:59:06+00:00

The slice size is indeed quite arbitrary, and knowing the BPM would help, but isn't reliable either (various tempos, rubato for classical etc.)

jtmarmon | karma 780 | avg karma 3.44 · 2016-12-02 13:17:27

i'm not super familiar with deep learning so forgive me if i'm missing some nuance, but what's the purpose of writing/reading to/from images? seems like it would add a ton of processing time. could the CNN not just read from a 50 item array of tuples representing the data from the 20ms slice?

Despoisj | karma 162 | avg karma 5.23 · 2016-12-02 15:15:34+00:00

I'm not sure what you mean, but I have chosen to store slices on the disk so that I could still take a look at them, and not store the data only in numpy arrays. That could be optimize for a better processing time!

maxerickson | karma 36996 | avg karma 2.02 · 2016-12-02 14:35:28

See also http://everynoise.com/ which is a view into how Spotify classifies music.

The creator wrote about it here:

http://blog.echonest.com/post/52385283599/how-we-understand-...

and writes a lot about it on their blog:

http://www.furia.com/page.cgi?terms=noise&type=search

Of course those are going in the other direction, not generating the classification from the data, but it's probably one of the best data sets as far as classifying existing music.

reply

Despoisj | karma 162 | avg karma 5.23 · 2016-12-02 15:11:40

Thanks for the awesome ressources ! :D