"Here is my personal answer to the second question: deep neural networks are more useful than traditional neural networks for two reasons:
The automatic encoding of features that previously had to be hand-engineered.
The exploitation of structurally/spatially associated features.
At the risk of sounding bold, that’s it — if you believe there is another benefit which is not somehow encompassed by these two traits, please let me know."
Let me ask a very simple question. What set of hand-engineered features gives <5% error on ImageNet?
Exactly --- none. But those features were born out of brute-forcing the exploitation of spatial structure, not some magical connection that we humans had never thought of before, which reinforces the point.
I would go farther and say that the success of deep learning comes mostly from one thing: putting convolution layers in neural nets, instead of just random connections or fully-connected layers.
Google's image and video recognition? Deep Dream? That's all convolution.
Speech-to-text? That's convolution.
AlphaGo? That's convolution.
Convnets are a great advance in machine learning, don't get me wrong. I hope that soon we get a generalizable way to apply convolution layers to text or music.
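To make the convolution-vs-fully-connected point concrete, here is a back-of-the-envelope sketch in Python (the image and layer sizes are hypothetical, chosen only to show the scale of the difference):

    # Why a convolutional layer exploits spatial structure cheaply:
    # weight sharing collapses the parameter count.
    H, W, C_in, C_out = 224, 224, 3, 64   # hypothetical image and channel sizes
    k = 3                                  # hypothetical 3x3 kernel

    fc_params = (H * W * C_in) * (H * W * C_out)  # every pixel wired to every output
    conv_params = (k * k * C_in) * C_out          # one small filter slid over the image

    print(f"fully-connected: {fc_params:,} weights")   # ~483 billion
    print(f"convolutional:   {conv_params:,} weights")  # 1,728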
Speech-to-text? Not really. There have been good gains from adding convolutional layers, particularly for noisy speech, but the really big breakthroughs have been deep fully-connected layers and, more recently, exotic recurrent designs (LSTMs and the like).
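For reference, a minimal sketch of the recurrent design mentioned above, using PyTorch as an illustration (the frame and layer sizes are hypothetical, not taken from any particular speech system):

    import torch
    import torch.nn as nn

    frames = torch.randn(1, 100, 40)   # (batch, time, 40-dim acoustic features)
    lstm = nn.LSTM(input_size=40, hidden_size=256, num_layers=3, batch_first=True)
    out, (h, c) = lstm(frames)         # one 256-dim hidden state per time step
    print(out.shape)                   # torch.Size([1, 100, 256])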
Yeah, I should not make such sweeping statements. (Or maybe I should, because it's a great way to find out what other people's perspectives are when they show up to correct me.)
My concern about the uses of convnets on text that I've seen is that I don't think they can deal with little things like the word "not". (The Stanford movie review thing can definitely handle the word "not", but that's different.) I'm not yet convinced that we're convolving over the right thing. But maybe the right thing is on the way, especially now that Google has given us a pretty good parser. And maybe the right thing involves other things like RNNs, sure.
I guess image recognition could have similar cases, and the results just look more impressive to me because I work with text and not with images.
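For anyone unfamiliar with what convolution over text looks like, here is a minimal sketch in PyTorch (all sizes hypothetical). It also makes the negation worry concrete: a width-3 filter sees only three words at a time, so short-range negation is plausibly learnable while longer-range dependencies fall outside any single window.

    import torch
    import torch.nn as nn

    vocab, emb_dim, n_filters = 10_000, 128, 100   # hypothetical sizes
    embed = nn.Embedding(vocab, emb_dim)
    conv = nn.Conv1d(in_channels=emb_dim, out_channels=n_filters, kernel_size=3)

    tokens = torch.randint(0, vocab, (1, 20))      # one 20-word sentence
    x = embed(tokens).transpose(1, 2)              # (batch, emb_dim, seq_len)
    features = torch.relu(conv(x))                 # each filter scans 3-word windows
    pooled = features.max(dim=2).values            # max-over-time pooling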
I think there are some papers out of the IBM Watson group on question answering where they use ConvNets. I don't remember them looking at the negation case specifically, but question answering generally has cases where that is important.
Again: convolutional filters existed long before CNNs did.
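Concretely: a hand-designed convolutional filter, decades older than CNNs. A Sobel kernel detects vertical edges with no learning at all; what CNNs changed is that the kernel weights are learned from data rather than designed. (A sketch; the random array stands in for a real grayscale image.)

    import numpy as np
    from scipy.signal import convolve2d

    sobel_x = np.array([[-1, 0, 1],
                        [-2, 0, 2],
                        [-1, 0, 1]])   # classic hand-engineered edge filter

    image = np.random.rand(64, 64)     # stand-in for a real grayscale image
    edges = convolve2d(image, sobel_x, mode="same")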
Yet no one solved ImageNet with some convolutions and hand-engineered features. You can't get that performance even if you slide a set of hand-engineered features across an image in a convolutional fashion. No one solved it with any previous approach. (Are we to believe that CNNs are the very first method to ever try to exploit some spatial structure...?)
So deep networks do bring things to the table beyond hand-engineered features, and you are simply wrong.
I think there is a categorization error here. Deep architectures are possible for many classes of learning algorithms: deep Bayesian nets, restricted Boltzmann machines, graphical models, multilayer perceptrons, etc. Convolutional neural nets and LSTMs are just two such types, and they exploit spatial and temporal structure explicitly.
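That distinction in code, using PyTorch as an illustration (both nets are hypothetical toys): "deep" is about stacking layers and is orthogonal to the layer type. Both networks below are deep; only one has a spatial prior built in.

    import torch.nn as nn

    deep_mlp = nn.Sequential(          # deep, but no spatial structure assumed
        nn.Linear(784, 512), nn.ReLU(),
        nn.Linear(512, 512), nn.ReLU(),
        nn.Linear(512, 10),
    )

    deep_convnet = nn.Sequential(      # deep AND spatially structured
        nn.Conv2d(1, 32, 3), nn.ReLU(),
        nn.Conv2d(32, 64, 3), nn.ReLU(),
        nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        nn.Linear(64, 10),
    )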