
Template matching normally does a pixelwise correlation between the source (shifted by each offset) and the destination, in 2D.

It's a really basic algorithm, and even more basic in 1D. So it would be pretty trivial to just compare the shifted spectrogram profiles for a massive gain in performance.

In fact, you probably don't even need the "shift" bit of the algorithm, because you will end up comparing one frequency to a different frequency, which does not make much sense (outside of Doppler shift calcs). So it's a really long-winded approach to taking the cross-correlation of two spectrograms, one that introduces a load of unwanted homomorphism.

see also http://dsp.stackexchange.com/questions/736/how-do-i-implemen...

I would like to add that just because an algorithm is in a vision processing library doesn't make it "clever". A basic template matching implementation is 4 nested for loops. Clever template matching uses an FFT to save one or two of those loops, but the results are the same. In audio processing you don't need all those loops because the signal is 1D. So going via a vision processing library just introduces pointless loops that do nothing but worsen results and slow down the code.
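
For what it's worth, the 1D comparison really is tiny. A rough sketch with numpy, assuming two equal-length magnitude spectra (the array names are hypothetical, not anyone's actual code):

```python
import numpy as np

def spectrum_similarity(ref_spec, query_spec):
    """Normalized correlation of two 1D magnitude spectra, with no shifting,
    since shifting would compare one frequency bin against a different one."""
    a = ref_spec - ref_spec.mean()
    b = query_spec - query_spec.mean()
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(np.dot(a, b) / denom) if denom else 0.0
```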




I used the spectrogram just to get the audio into a 2D representation that was insensitive to time shifts. The image processing part was the `match_template` function from skimage.

Normalized correlation (which is what `match_template` uses) has been applied outside of image processing before, though. I also tried other image processing techniques like Harris corner detection and SURF feature detection. I didn't write them up, but you can see the code on GitHub:

https://github.com/jminardi/audio_fingerprinting
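
The pipeline is roughly this shape (a sketch only; the window sizes here are placeholders, not the values used in the repo):

```python
import numpy as np
from scipy import signal
from skimage.feature import match_template

def find_clip_in_song(song, clip, fs=44100):
    """Match a short clip against a longer recording via their spectrograms."""
    _, _, song_spec = signal.spectrogram(song, fs=fs, nperseg=1024)
    _, _, clip_spec = signal.spectrogram(clip, fs=fs, nperseg=1024)
    # Normalized cross-correlation of the clip spectrogram over the song spectrogram.
    scores = match_template(song_spec, clip_spec)
    offset = np.unravel_index(np.argmax(scores), scores.shape)
    return offset, float(scores.max())
```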


> Instead, we use the fact that "scoring all alignments" is a convolution operation and can be implemented with the Fast Fourier Transform (FFT), bringing the complexity down to O(n log n).

Not sure I understand this right - is this basically treating both binary strings as square waves, converting them to the frequency domain and determining the offset as a pitch shift between the two spectrograms?
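
If I sketch it out, I think the convolution part just means computing the correlation at every shift with one FFT, though I may be misreading the article. A minimal sketch of that idea (not the article's actual scoring code):

```python
import numpy as np

def alignment_scores(a_bits, b_bits):
    """Count, for every shift, the positions where both bit strings are 1.
    The sliding comparison is a cross-correlation, so one FFT-based
    convolution gives all shifts in O(n log n)."""
    a = np.asarray(a_bits, dtype=float)
    b = np.asarray(b_bits, dtype=float)
    n = len(a) + len(b) - 1
    # Correlation = convolution with one of the signals reversed.
    spectrum = np.fft.rfft(a, n) * np.fft.rfft(b[::-1], n)
    return np.rint(np.fft.irfft(spectrum, n)).astype(int)
```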


I only have a cursory knowledge of DSP, but it sounds like this was implemented by using the 5 second clip to build a matched filter kernel. This would require convolving the samples of the 5 second clip with the entire song to find out where it matches, which should work super well, but I think it would be really tough to scale this up to millions of songs. Is this how it was implemented?
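
What I have in mind is roughly this (only a sketch of the matched-filter idea, not a claim about how the article actually implemented it):

```python
import numpy as np
from scipy import signal

def matched_filter_offset(song, clip):
    """Slide the clip over the whole song with an FFT-based correlation and
    return the sample offset of the strongest response."""
    response = signal.fftconvolve(song, clip[::-1], mode="valid")
    return int(np.argmax(np.abs(response)))
```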

Any idea if that article is still available? Or do you have any good resources for implementing FFT? I'm working on a project to compare sound clips for similarity, but I haven't grasped an effective way to accomplish that yet.

I guess it just does pattern matching for spectrograms with some basic filters for background noises? At least that would be sufficient, IMO.

Not an identical match, but perhaps as an intuition pump you can think about the Fourier transform of audio that is "clipping".
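
E.g., hard-clipping a pure tone pushes energy into odd harmonics that weren't in the original signal (quick numpy sketch):

```python
import numpy as np

fs = 8000
t = np.arange(fs) / fs
tone = np.sin(2 * np.pi * 440 * t)
clipped = np.clip(tone, -0.3, 0.3)   # hard clipping

spectrum = np.abs(np.fft.rfft(clipped))
top_bins = np.sort(np.argsort(spectrum)[-5:]) * fs / len(clipped)
print(top_bins)   # roughly [440, 1320, 2200, 3080, 3960] Hz: odd harmonics appear
```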

I meant the spectrogram encoded as a 2d array, but I guess there isn't a big difference when the db query is the most expensive part.

I've always wondered: Is there a way to compare fingerprints with humming sounds or live recordings?

Those fingerprinting techniques don't seem to be suitable for those tasks, do you know of any methods to accomplish this?


Very cool stuff! It seems that all those solutions are based on the analysis of visual representations of spectrograms. Is this common or could you just use 2d arrays which encode the same information - would this be more performant?

Nice blog post about this stuff: http://willdrevo.com/fingerprinting-and-audio-recognition-wi... - https://github.com/worldveil/dejavu


Okay, so I am not a lawyer, but I don't get it. One of the best-known audio matching systems is from Kalker and Haitsma, published in 2002. http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.103...

> robust to noise but not warping in time or frequency domains.

Could you use something like a dynamic time warping algorithm for this? (I'm not super acquainted with the technique and not sure if you could get away with it in the frequency domain used for the matching.)
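
For concreteness, the DTW I have in mind is the textbook recurrence over sequences of spectral frames; whether it survives the frequency-domain matching stage is exactly what I'm unsure about:

```python
import numpy as np

def dtw_distance(x_frames, y_frames):
    """Plain dynamic time warping over two sequences of spectral frames
    (one frame per row)."""
    n, m = len(x_frames), len(y_frames)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(x_frames[i - 1] - y_frames[j - 1])
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return cost[n, m]
```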


This seems to be targeted at signals that are already quite close. Is there anything similar for broad ballpark similarity?

Whenever I have searched for such things, I have more often encountered techniques designed to detect re-use for copyright reasons.

I have played around with generating instrument sounds from a blend of very few basic waveforms with attack, decay, sustain, release, pitch sliding and bell modulation.

While it is quite fun just trying to make things by tweaking parameters, your ear/perception drifts as you hear the same thing over and over.

It would be really nice to have an automated "how close is this abomination?" score. I'd even give evolution a go to try and make some more difficult matches.
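
Something as crude as a log-spectrogram distance against a target sample might already work as the fitness function for that evolution. Sketching the kind of thing I mean (the ADSR parameters and window sizes are arbitrary placeholders, not tuned values):

```python
import numpy as np
from scipy import signal

def adsr_envelope(n, fs, attack=0.01, decay=0.1, sustain=0.6, release=0.2):
    """Piecewise-linear ADSR envelope of length n samples."""
    a, d, r = int(attack * fs), int(decay * fs), int(release * fs)
    s = max(n - a - d - r, 0)
    return np.concatenate([
        np.linspace(0, 1, a, endpoint=False),
        np.linspace(1, sustain, d, endpoint=False),
        np.full(s, sustain),
        np.linspace(sustain, 0, r),
    ])[:n]

def spectral_distance(candidate, target, fs=44100):
    """Mean squared difference between log-magnitude spectrograms; a crude
    'how close is this?' score (a perceptual model would track the ear better)."""
    _, _, s1 = signal.spectrogram(candidate, fs=fs, nperseg=1024)
    _, _, s2 = signal.spectrogram(target, fs=fs, nperseg=1024)
    frames = min(s1.shape[1], s2.shape[1])
    return float(np.mean((np.log1p(s1[:, :frames]) - np.log1p(s2[:, :frames])) ** 2))
```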


Given a text and waveform, how do you know how to match them up exactly?

That's fine. The goal is to map "similar" 1 second chunks to similar vectors. I'm sure this can be done and uniqueness of sound won't be a problem.

It’s definitely similar, though should be a lot easier when one of the things you’re trying to distinguish and remove has a known high quality reference track you can use, right?

You mean 2d arrays containing the raw audio signal? No, this would not work because you do not know the phase along the y dimension when you want to compare to another signal.

Another method to detect an audio pattern is cross correlation on the raw audio signal. But it is very expensive in computation power and memory.

The longest operation with fingerprinting is often the associated DB query. Lots of work to do there. In that space, Will Drevo's work is really good. I will share my DB implementation later.


One nontrivial part is transforming the spectrogram into some representation that is robust to the things that can affect the query audio, like background noise. Another nontrivial part is figuring out, given this representation, how to quickly match the query with a song or database of songs.
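
As an illustration of the first part, one common shape is the landmark/constellation approach (along the lines of what dejavu does): keep local spectrogram peaks, which are fairly robust to broadband noise, then hash pairs of nearby peaks so the matching step becomes an exact lookup. A very rough sketch, with arbitrary constants:

```python
import numpy as np
from scipy import signal
from scipy.ndimage import maximum_filter

def peak_hashes(audio, fs=44100, fan_out=5):
    """Local spectrogram peaks paired up and hashed into (hash, time) landmarks."""
    _, _, spec = signal.spectrogram(audio, fs=fs, nperseg=2048)
    log_spec = np.log1p(spec)
    peaks = (log_spec == maximum_filter(log_spec, size=15)) & (log_spec > log_spec.mean())
    freqs, times = np.nonzero(peaks)
    order = np.argsort(times)
    freqs, times = freqs[order], times[order]
    hashes = []
    for i in range(len(times)):
        for j in range(i + 1, min(i + 1 + fan_out, len(times))):
            dt = times[j] - times[i]
            hashes.append((hash((int(freqs[i]), int(freqs[j]), int(dt))), int(times[i])))
    return hashes
```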

I wrote up some of my experiments attempting to do what you are describing. I explain why you can't simply use a 2D array of an audio file. You can find my post here:

http://jack.minardi.org/software/computational-synesthesia/

You can also see the code behind it here:

https://github.com/jminardi/audio_fingerprinting

I am by no means an expert in this area and a few people have since told me I did a few stupid things in my analysis. But you might find it interesting.


This reminds me of some people working on a cat translator [0].

The paper this dataset links to [1] seems to be using statistical techniques to compare spectrograms of meows, something that seems pretty easy with fastaudio[2].

[0]: https://github.com/FrogBoss74/RealCatTranslator

[1]: https://www.mdpi.com/2076-2615/9/8/543

[2]: https://github.com/fastaudio/fastaudio


Can we diff spectrograms to define the "distance" between two chunks of sound and use this measure to guide the ML learning process?

Would it help to decompose sound into subpatterns with Fourier transform?

Afaik, there is a similar technique for recognizing faces: a face picture is mapped to a "face vector". Yet this technique doesn't need the notion of "sequence of faces" to train the model. Can we use it to get "sound vectors"?
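
As a baseline before any learning, I imagine even a hand-rolled "sound vector" pooled from a spectrogram would give some notion of distance; a learned embedding (as with face vectors) would replace the pooling with a trained network. A sketch, with entirely arbitrary pooling choices:

```python
import numpy as np
from scipy import signal

def sound_vector(audio, fs=16000, bands=64):
    """Crude hand-crafted descriptor: average the log spectrogram over time,
    then pool frequency bins into a fixed-length vector."""
    _, _, spec = signal.spectrogram(audio, fs=fs, nperseg=512)
    profile = np.log1p(spec).mean(axis=1)              # average over time
    pooled = np.array_split(profile, bands)
    return np.array([p.mean() for p in pooled])

def sound_similarity(a, b):
    """Cosine similarity between two sound vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
```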

