
2) Why do you think DL doesn't scale? I'm curious. It can easily leverage thousands of GPUs, training on 300 million images (https://ai.googleblog.com/2017/07/revisiting-unreasonable-ef...). No other method comes close to leveraging that amount of computational power. I don't really know about CFD, but at least in ML land, dealing with ML problems, DL is very scalable, perhaps second only to random-forest-style algorithms, where the trees effectively share nothing.

3) It does matter. In fact, the most valuable startups around DL are CV-based startups; they are mainly located in China, though.




Leon Bottou isn't a humanities professor but an ML researcher. In fact, not just any ML researcher, but arguably one of the ML researchers who most anticipated the current DL scaling era.

Bottou was arguing for the virtues of SGD on the grounds of "CPUs [GPUs] go brrr" literally 2 decades ago in 2003: https://papers.nips.cc/paper/2003/file/9fb7b048c96d44a0337f0... or here he is in 2007/2012 explaining why larger models/data can scale and keep getting better: https://gwern.net/doc/ai/scaling/2012-bottou.pdf https://gwern.net/doc/ai/scaling/2013-bottou.pdf

Which is not to say that he necessarily has anything worthwhile to say about 'Borges and AI' but I'm going to at least give it a read to see if there's something I might want to know 20 years from now. :)


I know; DL works only if you have massive datasets, and DRL is even worse in terms of the number of training episodes. Maybe you've heard about the recent craze of using GANs to generate training-set fillers when your training set is small, i.e. you have only 1,000 examples but need 10,000 for reasonable performance. Instead of gathering more examples, you use a GAN to create believable training data, and it seems to work quite well (e.g. a bump from 60% accuracy to 80%, whereas a bigger training set of real examples would bump you to 90%).
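Roughly, the trick looks like the sketch below: train a GAN on the small real set, then sample synthetic examples until the training set reaches the target size. This is only a minimal illustration; the generator architecture, dimensions, and data here are all placeholder assumptions, not details from any particular paper.

```python
# Minimal sketch of GAN-based training-set augmentation (PyTorch).
# Assumes `generator` was already trained on the ~1,000 real examples;
# all names, dimensions, and data below are illustrative placeholders.
import torch
import torch.nn as nn

latent_dim = 100                 # size of the generator's noise input (assumption)
n_real, n_target = 1_000, 10_000

real_x = torch.randn(n_real, 64)          # stand-in for the real feature matrix
real_y = torch.randint(0, 2, (n_real,))   # stand-in for the real labels

class Generator(nn.Module):
    """Placeholder generator; in practice this is the trained GAN generator."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim, 128), nn.ReLU(),
            nn.Linear(128, 64),
        )
    def forward(self, z):
        return self.net(z)

generator = Generator()  # would normally be loaded from a checkpoint

with torch.no_grad():
    z = torch.randn(n_target - n_real, latent_dim)
    fake_x = generator(z)                            # synthetic "filler" examples
    fake_y = torch.randint(0, 2, (fake_x.size(0),))  # from a conditional GAN or a labelling step

# Train the downstream model on the padded dataset.
aug_x = torch.cat([real_x, fake_x])
aug_y = torch.cat([real_y, fake_y])
```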

What I've observed is that many ML companies now run two pipelines in parallel, one based on deep learning and the other on classical ML, then cherry-pick whichever solution works best for the problem and scale they have.
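In practice that pattern can be as simple as the sketch below: fit both pipelines on the same split and keep whichever validates better. The models, data, and metric are placeholders chosen for illustration, not what any particular company uses.

```python
# Sketch of the "two parallel pipelines" pattern: a classical model and a
# neural model trained on the same data, with the better validator kept.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=2_000, n_features=20, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.25, random_state=0)

candidates = {
    "classical": GradientBoostingClassifier(random_state=0),
    "neural": MLPClassifier(hidden_layer_sizes=(64, 64), max_iter=500, random_state=0),
}

# Cherry-pick the pipeline that performs best on held-out data.
scores = {name: m.fit(X_tr, y_tr).score(X_val, y_val) for name, m in candidates.items()}
best = max(scores, key=scores.get)
print(scores, "->", best)
```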


This is not really true. ML applications in general do not scale linearly. There is the systems (scaling) overhead, and then there are the algorithmic payoffs, which start to diminish depending on the algorithm. If they went from 4 to 8 GPUs, the only thing they can guarantee to double is how fast they burn power.

I agree that scale is an important factor in deep learning's success, but that Google experiment ended up being a good example of how not to do it. They used 16,000 CPU cores to get that cat detector. A short while later, a group at Baidu was able to replicate the same network with only 3 computers with 4 GPUs each. (The latter group was also led by Andrew Ng.)

You know, there is more to deep learning research than meta-learning/architecture exploration. Sure, you can explore the hyperparameter space faster with 500 GPUs and get yet another 0.05% better test score on ImageNet (or more, I don't actually know), but there are other ways to do something meaningful in DL without that kind of compute power.

If size were the major factor, then Amazon with its GPU farms would be king. Google also would have nothing to worry about. So size alone is not enough to explain it. I think:

a) it is a really complex thing with many components

b) multimodality adds different components, but, I think, it's not limited to text and images

c) they found a way to use 'addons' (algorithms or external models) to improve the results; this would explain why it is good at some tasks but not others

d) using 'addons' requires model (re)training; however, some addons can be updated separately

e) they put quite a lot of work into design and implementation, besides the training itself


Great work, well presented too.

In my experience working on applied DL research problems, the training bottleneck is almost always data loading. Datasets for real-world problems usually don't fit in GPU memory (not even close) and often require expensive pre-processing. The latter can obviously be driven down using methods similar to those here, but the former seems insurmountable. It's not clear to me what the path is to getting these workloads into this rapid-experimental-iteration regime.
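For what it's worth, the usual mitigation is to overlap the pre-processing with GPU compute rather than eliminate it; a minimal PyTorch sketch is below. The dataset class and all the numbers are placeholder assumptions, since real pipelines read and decode from disk or object storage.

```python
# Sketch of hiding pre-processing cost behind GPU compute with a multi-worker
# PyTorch DataLoader. The dataset and every number here are placeholders.
import torch
from torch.utils.data import DataLoader, Dataset

class OnDiskDataset(Dataset):
    """Pretend each item needs expensive decoding/augmentation on the CPU."""
    def __init__(self, n_items=100_000):
        self.n_items = n_items
    def __len__(self):
        return self.n_items
    def __getitem__(self, idx):
        x = torch.randn(3, 224, 224)   # stand-in for decode + augment work
        y = idx % 10
        return x, y

loader = DataLoader(
    OnDiskDataset(),
    batch_size=256,
    num_workers=8,            # pre-process in parallel worker processes
    pin_memory=True,          # faster host-to-GPU copies
    prefetch_factor=4,        # keep batches queued ahead of the GPU
    persistent_workers=True,  # avoid re-spawning workers every epoch
)

for x, y in loader:
    # x = x.cuda(non_blocking=True)  # overlap the copy with compute when a GPU is present
    pass
```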


I'm curious to know what the limitations of the technology are. Are the machine learning/CV algorithms not accurate enough to run it at scale?

"...Worry about scaling; worry about vectorization; worry about data locality...." http://www.hpcwire.com/2016/01/21/conversation-james-reinder...

http://icri-ci.technion.ac.il/events-2/presentation-files-ic...

http://icri-ci.technion.ac.il/files/2015/05/00-Boris-Ginzbur...

Nvidia's Chief Scientist for Deep Learning was poached from Intel's ICRI-CI group: https://il.linkedin.com/in/boris-ginsburg-2249545?trk=pub-pb...

http://www.cs.tau.ac.il/~wolf/deeplearningmeeting/speaker.ht... Look for quote: "...In a very interesting admission, LeCun told The Next Platform ..." http://www.nextplatform.com/2015/08/25/a-glimpse-into-the-fu...

Yann LeCun stated in Nov 2015 (29:00 mark) that GPUs will be short-lived in deep learning / CNNs / NNs: https://www.youtube.com/watch?v=R7TUU94ir38

https://www.altera.com/en_US/pdfs/literature/solution-sheets...


Yikes. There is just so much progress with ML and DL.

Can anyone estimate how much CPU $$ we need to get some results with this?

I train on about 20M samples (1K data points each).


Sorry, but even though the big companies produce a lot of interesting research, I challenge you to look through recent publications (the majority coming from academia) and not find interesting models trained on a single GPU. Actually, it's very rare to find a paper where largely distributed training is necessary (i.e. the training would fail or would take unreasonably long otherwise). Yes, having more money helps you scale your experiments; that's nothing new and it's not specific to AI.

OpenAI and the other research labs (FAIR, Google Brain, MS Research) are heavily focused on image and speech models, but the reality is that the vast majority of models deployed in industry don't need DL and benefit more from intelligent feature engineering and simpler models with good hyperparameter tuning. It's definitely the exception rather than the rule that more compute automatically yields more performance.
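As a toy illustration of that recipe, here is a minimal sketch on synthetic data: one hand-engineered interaction feature plus a tuned linear model. The data, the feature, and the parameter grid are all assumptions chosen for the example.

```python
# Sketch of "feature engineering + simple model + hyperparameter tuning"
# on synthetic data; everything here is illustrative.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
raw = rng.normal(size=(5_000, 3))
y = (raw[:, 0] * raw[:, 1] > 0).astype(int)      # target hides in an interaction

# Hand-engineered feature: expose the interaction the raw columns hide.
X = np.column_stack([raw, raw[:, 0] * raw[:, 1]])

model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1_000))
search = GridSearchCV(model, {"logisticregression__C": [0.01, 0.1, 1, 10]}, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```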

why wouldn't you do it?

Perhaps because I don't have hundreds of GPUs lying around? I'm a DL researcher fortunate enough to have a 4x Titan X workstation for my experiments. Most of my peers at other labs only have shared access to such workstations, and with the current trends in GPU pricing, that's not likely to change in the near future.

More importantly, the lack of compute power is rarely the bottleneck in my research. Reading papers, brainstorming new ideas, and writing/debugging code are the most time-consuming parts; it's not like I'm sitting idle while waiting for my models to train.

Distributed TF is the standard inside Google, and it will be the standard for everybody else within a couple of years.

By "everyone else" you mean a few dozen corporations, which can justify spending millions of dollars on hardware to speed up their mission critical ML research?


I don't think it is thanks to wasteful software development. The libraries used for LLMs do a lot to squeeze the full potential out of GPUs.

I think it is more of an information problem: how can we store enough information in the weights so that it is possible to train models without a budget similar to OpenAI's?


The problem is that, currently, large ML models need to be trained on clusters of tightly-connected GPUs/accelerators. So it's kinda useless having a bunch of GPUs spread all over the world with huge latency and low bandwidth between them. That may change though - there are people working on it: https://github.com/learning-at-home/hivemind
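A quick back-of-envelope sketch of why the latency/bandwidth point bites: naive data parallelism exchanges the full gradient every step. The model size and link speeds below are assumptions for illustration, not measurements of any real system.

```python
# Back-of-envelope: time per gradient sync under naive data parallelism.
# Model size and link speeds are illustrative assumptions.
params = 1_000_000_000        # a 1B-parameter model (assumption)
bytes_per_grad = 2            # fp16 gradients
sync_bits = params * bytes_per_grad * 8

for name, gbit_per_s in [("datacenter interconnect (~100 Gbit/s)", 100),
                         ("home broadband (~0.1 Gbit/s)", 0.1)]:
    seconds = sync_bits / (gbit_per_s * 1e9)
    print(f"{name}: ~{seconds:.1f} s per gradient sync")
```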

I'm not saying that there are no very big models, just that they are a minority of publications; for any trivial example of a big model, I can show you 10x as many trivial examples of relevant non-big models.

Also, you are talking about a model which was specifically designed for TPUs (the dimensionality of the networks is specifically fine-tuned for them).

And even so, BERT_large still fits in the memory of a single GPU (with a very small batch size); there is a PyTorch implementation. I don't understand: are people complaining that deep learning actually scales (reasonably)? Isn't that good news?
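For context, the usual way to make the tiny per-step batch workable on one GPU is gradient accumulation, roughly as sketched below with the Hugging Face `transformers` package. The batch size, accumulation steps, and toy data are assumptions for illustration.

```python
# Sketch of fine-tuning BERT-large on a single GPU: tiny per-step batches plus
# gradient accumulation. All sizes and the toy data are illustrative.
import torch
from transformers import BertForSequenceClassification, BertTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"
tok = BertTokenizer.from_pretrained("bert-large-uncased")
model = BertForSequenceClassification.from_pretrained("bert-large-uncased").to(device)
opt = torch.optim.AdamW(model.parameters(), lr=2e-5)

texts = ["an example sentence"] * 32
labels = torch.zeros(32, dtype=torch.long)
micro_bs, accum_steps = 2, 16       # effective batch of 32 without the memory cost

opt.zero_grad()
for i in range(0, len(texts), micro_bs):
    batch = tok(texts[i:i + micro_bs], return_tensors="pt", padding=True).to(device)
    loss = model(**batch, labels=labels[i:i + micro_bs].to(device)).loss
    (loss / accum_steps).backward() # accumulate scaled gradients across micro-batches
opt.step()
```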


Scale is unfortunately not the focus of the current implementation; we will address this aspect in future releases. Regarding speed and memory requirements, the current considerations are:

1. Hashing methods: generation of hashes is quick (a couple of seconds on about 10K images). The tricky part is the retrieval of duplicates, which on the same 10K dataset takes a few minutes. (I would refrain from giving exact numbers since this was done on a local system, not a very good environment for benchmarking.)

2. CNN: here, the time-consuming part is encoding generation, which, in the absence of a GPU, takes much longer (a couple of minutes on 10K images). The retrieval part is pretty quick, but requires memory.

So, at this point, using this package on more than a couple of thousand images locally is not a good idea. We will, however, address the scale aspect of the problem in future releases. Thanks for your question.
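To make the hashing-versus-retrieval split concrete, here is a minimal sketch of the general approach (a simple difference hash plus pairwise Hamming-distance retrieval). It is illustrative only and not the package's actual implementation; the threshold and helper names are my own.

```python
# Illustrative duplicate detection: hash each image (fast), then retrieve
# near-duplicate pairs by Hamming distance (the O(n^2) part that gets slow).
from itertools import combinations
from PIL import Image

def dhash(path, size=8):
    """64-bit difference hash: compare adjacent pixels of a downscaled grayscale image."""
    img = Image.open(path).convert("L").resize((size + 1, size))
    px = list(img.getdata())
    bits = [px[r * (size + 1) + c] > px[r * (size + 1) + c + 1]
            for r in range(size) for c in range(size)]
    return sum(b << i for i, b in enumerate(bits))

def find_duplicates(paths, max_hamming=5):
    hashes = {p: dhash(p) for p in paths}              # hashing: seconds for ~10K images
    return [(a, b) for a, b in combinations(paths, 2)  # retrieval: quadratic in dataset size
            if bin(hashes[a] ^ hashes[b]).count("1") <= max_hamming]

# find_duplicates(["img1.jpg", "img2.jpg", "img3.jpg"])
```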

Jumbo CNNs are not the battleground. The real battleground is distribution. The first framework that scales out without placing much onus on the programmer will win, IMO. Facebook has already shown that Caffe2 scales to 256 GPUs for ImageNet. TensorFlow needs to show it can scale as well. PyTorch needs to work on usability: model serving, integration into ecosystems like Hadoop, etc.
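As a rough illustration of what "scale out without much onus on the programmer" looks like in today's PyTorch: wrap the model in DistributedDataParallel and launch one process per GPU with torchrun. The model and data below are placeholders, and the `gloo` backend is chosen only so the sketch runs on CPU.

```python
# Minimal data-parallel sketch with PyTorch DistributedDataParallel, meant to
# be launched as: torchrun --nproc_per_node=N ddp_sketch.py
# The model and data are placeholders.
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group(backend="gloo")   # use "nccl" on GPU nodes
    model = DDP(torch.nn.Linear(32, 1))       # gradients are all-reduced automatically
    opt = torch.optim.SGD(model.parameters(), lr=0.1)

    x, y = torch.randn(64, 32), torch.randn(64, 1)
    for _ in range(10):
        opt.zero_grad()
        loss = torch.nn.functional.mse_loss(model(x), y)
        loss.backward()                       # triggers the cross-process all-reduce
        opt.step()

    if dist.get_rank() == 0:
        print("final loss:", loss.item())
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```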

What is the bottleneck in building the "largest LLM" today for any interested party? AI expertise? Training data? GPUs?
