I appreciate the engagement in making this argument more concrete. I understand that you are talking about returns on compute power.
However, your last paragraph about how investors view deep learning does not describe anyone in the community of academics, practitioners and investors that I know. People understand that the limiting inputs to improved performance are data, followed closely by PhD labor. Compute power is relevant mainly because it shortens the feedback loop on that PhD labor, making it more efficient.
Folks investing in AI believe the returns are worth it due to the potential to scale deployment, not (primarily) training. They may be wrong, but this is a straw man definition of scalability that doesn't contribute to that thesis.
This concern is vague, frequently raised, and not especially useful. What are you advocating, specifically? That the amount of compute being thrown at problems is excessive relative to the economic benefit that Google (or whoever) derives from training models as long as they do? This is an empirical question, and it’s well understood how much compute is useful. Are you saying that advances in M.L. theory are coming too slowly because we focus too much on hardware? I spend maybe a third of my day job reading journal articles, and the rate of ideas coming out is blistering.
This isn’t people banging GPUs together in desperation because nothing works. This is what happens when everything works better and better the more money you throw at it with no apparent end.
> It sounds a lot like you’re saying “small models are just as good” … which is false. No one believes that. … a fully trained larger model is going to be better.
You're absolutely right, a fully trained larger model _will_ be better. This is meant in the context of the OP's premise of limited compute; the statement I'm trying to make is "a fully trained small model is just as good as an undertrained large model".
> …but surely, the laws of diminishing returns applies here?
They do, but it's diminishing in the sense that the performance gains from larger models become smaller and smaller while the required training time grows a lot. If I'm reading the first chart of figure 2, page 5 correctly, comparing 5B vs 10B, the 10B model needs almost 10x the training time for a ~10% gain in loss, and it's a similar jump from 1B to 5B. My understanding is that this also starts flattening out, and the loss gain from each additional 10x becomes gradually lower and lower.
> Over all of time that compute budget is not fixed.
Realistically there is an upper bound to your compute budget. If you needed 1,000 GPUs for 30 days for a small model, you need 1,000 GPUs for 300 days for that ~10% gain at these smaller sizes, or 10,000 GPUs for 30 days... You're going to become limited very quickly by time and/or money. There's a reason OpenAI said they aren't training a model larger than GPT-4 at the moment - I don't think they can scale beyond what I believe is a ~1-2T parameter model.
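As a back-of-the-envelope illustration (the per-GPU throughput and the budget below are assumed numbers, not measurements), a fixed training-FLOP budget simply trades GPU count against wall-clock time:

```
# Illustrative only: a fixed training-FLOP budget trades GPU count against
# wall-clock time. The sustained throughput per GPU is an assumed number.
SECONDS_PER_DAY = 86_400
SUSTAINED_FLOPS_PER_GPU = 4e13   # hypothetical effective FLOP/s per GPU

def wall_clock_days(total_flops: float, num_gpus: int) -> float:
    return total_flops / (num_gpus * SUSTAINED_FLOPS_PER_GPU * SECONDS_PER_DAY)

# Hypothetical budget sized so that 1,000 GPUs need ~300 days.
budget = 1_000 * 300 * SUSTAINED_FLOPS_PER_GPU * SECONDS_PER_DAY

for gpus in (1_000, 10_000, 100_000):
    print(f"{gpus:>7,} GPUs -> ~{wall_clock_days(budget, gpus):.0f} days")
```

Either axis costs real money, which is the point: you run out of time or budget long before the curve stops rewarding you.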
More compute -> more precision is just one field's definition of scalable... Saying that DNNs can't get better just by adding GPUs is like complaining that an apple isn't very orange.
To generalize notions of scaling, you need to look at the economics of consumed resources and generated utility, and you haven't begun to make the argument that data acquisition and PhD student time hasn't created ROI, or that ROI on those activities hasn't grown over time.
Data acquisition and labeling is getting cheaper all the time for many applications. Plus, new architectures give ways to do transfer learning or encode domain bias that let you specialize a model with less new data. There is substantial progress and already good returns on these types of scalability which (unlike returns on more GPUs) influence ML economics.
>if, to produce meaningful results, you need way-too-expensive hardware...
If you are in a team that is looking at problems that justify massive hardware (in the sense that solving them will pay back the capital and environmental cost) then you will have access to said hardware.
Most (almost all) AI and data science teams are not working on that kind of problem, though, and it's often the case that we are working on cloud infrastructure where GPUs and TPUs can be accessed on demand and a few hundred dollars can train pretty good models. Obviously models need to be trained many times, so the crossover from a few hundred dollars to a few thousand can be painful - but actually many/most engagements really only need models that cost <$100 to train.
Also, many of the interesting problems out there can use transfer learning over shared large pretrained models such as ResNet or GPT-2 (I know that in the dizzyingly paced modern world these no longer count as large or modern, but they are examples...). So for image and natural language problems we can get around the intractable demand for staggeringly expensive compute.
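As a rough sketch of what that looks like in practice (assuming PyTorch/torchvision; the class count and hyperparameters below are placeholders), you freeze the pretrained backbone and train only a small head:

```
# Minimal transfer-learning sketch: freeze a pretrained ResNet backbone and
# train only a new classification head, which is cheap on a single GPU.
import torch
import torch.nn as nn
from torchvision import models

num_classes = 10  # hypothetical downstream task

# Newer torchvision versions use the weights enum; older ones use pretrained=True.
model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
for param in model.parameters():
    param.requires_grad = False               # keep the expensive pretrained features fixed

model.fc = nn.Linear(model.fc.in_features, num_classes)  # small trainable head

optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# Training loop sketch; `train_loader` would come from your own labeled data.
# for images, labels in train_loader:
#     optimizer.zero_grad()
#     loss = loss_fn(model(images), labels)
#     loss.backward()
#     optimizer.step()
```

Only the head's parameters get gradients, so the compute bill is a tiny fraction of training the backbone from scratch.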
Imagine that you had gotten a degree in Aeronautical Engineering and were watching the Apollo program, wondering how you would get a job at NASA or somewhere similar... but there are lots of jobs at Boeing and Lockheed and Cessna and so on.
> This article reports on the computational demands of Deep Learning applications in five prominent application areas and shows that progress in all five is strongly reliant on increases in computing power.
I don't agree with the conclusion of the paper. Computing architectures have been improving dramatically over the last few years, and almost any task that was achievable with deep learning 5 years ago is now orders of magnitude cheaper to train.
The energy consumed by deep learning is increasing because of the huge ROI for companies, but it will probably slow down as compute costs approach the cost of software engineers (or a company's profit), because at that point researching improvements to the models becomes relatively cheaper again.
+1 on calling this BS, even though I think it is only partly BS:
While it is true that training very large language models is very expensive, pre-trained models + transfer learning allows interesting NLP work on a budget. For many types of deep learning a single computer with a fast and large memory GPU is enough.
It is easy to underappreciate the importance of having a lot of human time to think, be creative, and try things out. I admit that new model architecture research is helped by AutoML (AdaNet, etc.), and being able to run many experiments in parallel becomes important.
Teams that make breakthroughs can provide lots of human time, in addition to compute resources.
There is another cost besides compute that favors companies: being able to pay very large salaries for top tier researchers, much more than what universities can pay.
To me the end goal of what I have been working on since the 1980s is flexible general AI, and I don’t think we will get there with deep learning as it is now. I am in my 60s and I hope to see much more progress in my lifetime, but I expect we will need to catch several more “waves” of new technology like DL before we get there.
Agreed with @roenxi and I’d like to propose a variant of your comment:
All evidence is that “more is better”. Everyone involved professionally is of the mind that scaling up is the key.
However, like you said, just a single invention could cause the AI winds to blow the other way and instantly crash NVIDIA’s stock price.
Something I’ve been thinking about is that the current systems rely on global communications which requires expensive networking and high bandwidth memory. What if someone invents an algorithm that can be trained on a “Beowulf cluster” of nodes with low communication requirements?
For example the human brain uses local connectivity between neurons. There is no global update during “training”. If someone could emulate that in code, NVIDIA would be in trouble.
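I don't know whether anything like this could actually displace backprop, but as a toy illustration of a purely local update rule (Oja's variant of Hebbian learning, where each weight changes using only its own pre- and post-synaptic activity, with no global gradient):

```
# Purely illustrative sketch of a local learning rule (Oja's rule), not a claim
# that it matches backprop in practice. No global gradient is communicated.
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(16, 64))     # 16 output units, 64 input features

def local_update(W: np.ndarray, x: np.ndarray, lr: float = 0.01) -> np.ndarray:
    y = W @ x                                 # post-synaptic activity
    # Hebbian term minus a local decay that keeps the weights bounded.
    return W + lr * (np.outer(y, x) - (y ** 2)[:, None] * W)

for _ in range(1000):
    x = rng.normal(size=64)                   # stand-in for a data sample
    W = local_update(W, x)
```

Because each node only needs its own activations and weights, a cluster running a rule like this would not need the all-reduce-style gradient synchronization that makes current training so bandwidth-hungry.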
I agree that scale is an important factor in deep learning's success, but that Google experiment ended up being a good example of how not to do it. They used 16,000 CPU cores to get that cat detector. A short while later, a group at Baidu was able to replicate the same network with only 3 computers with 4 GPUs each. (The latter group was also led by Andrew Ng.)
I think you are overgeneralizing the applicability of Neural Architecture Search etc. and cherry-picking individual examples. There is an enormous gap between what gets published in academia and what's actually useful.
E.g. compute wars have only intensified with TPUs and FPGAs. Sure, for training you might be okay with a few 1080 Tis, but good luck building any reliable, cheap, and low-latency service that uses DNNs. Similarly, "big data" for academia is a few terabytes, while real big data is petabytes of street-level imagery, video/audio, etc.
One point that I've not seen mentioned yet is that the neural revolution somewhat aggravates economic inequality. What recent progress has basically shown is that deep learning works better with more layers and more resources. Geoffrey Hinton has also recently conjectured that there exists a pretty much unexplored regime of applying gradient descent to huge models with strong regularization, trained on relatively small (but still big) data. This inequality is alleviated to some extent by the fact that the machine learning community fully embraces online education and open science, but still, you need 50-150 GPUs to play human-level Go, and having several grad students who explore a wide variety of complex and huge models is key for progress. I can only see this aspect getting worse in the years to come.
Sorry, but even though the big companies produce a lot of interesting research, I challenge you to look through recent publications (the majority coming from academia) and not find interesting models trained on a single GPU. Actually, it's very rare to find a paper where largely distributed training is necessary (i.e., where the training would fail or take unreasonably long without it). Yes, having more money helps you scale your experiments; that's nothing new and it's not specific to AI.
>"Deep neural networks (DNN) are a powerful form of artificial intelligence that can outperform humans at some tasks. DNN training is typically a series of matrix multiplication operations, an ideal workload for graphics processing units (GPUs), which cost about three times more than general purpose central processing units (CPUs).
"The whole industry is fixated on one kind of improvement—faster matrix multiplications," Shrivastava said. "Everyone is looking at specialized hardware and architectures to push matrix multiplication. People are now even talking about having specialized hardware-software stacks for specific kinds of deep learning. Instead of taking an expensive algorithm and throwing the whole world of system optimization at it, I'm saying, 'Let's revisit the algorithm.'"
Shrivastava's lab did that in 2019, recasting DNN training as a search problem that could be solved with hash tables. Their "sub-linear deep learning engine" (SLIDE) is specifically designed to run on commodity CPUs, and Shrivastava and collaborators from Intel showed it could outperform GPU-based training when they unveiled it at MLSys 2020."
PDS: Quote: "All programming is an exercise in caching." - Terje Mathisen
Not to mention that training a model is a "one time deal": a successfully trained model can be reused by a lot of clients (devices).
Considering the way AI can potentially bring benefits to humanity, I see it more as an investment.
For comparison, Bitcoin in 2021 used 110 TWh, solving a problem we either solved millennia ago or could solve using much less power with premined coins.
As I've come to understand it from implementing deep learning, the hardware required to run a trained neural network efficiently is much lower than what is required to train it. I've also read somewhere that if an instance of AlphaGo running on a powerful modern desktop competed against the cluster used in the Lee Sedol match, which consisted of something close to 2,000 CPUs and more than 100 GPUs, the win ratio would only be about 3:1 in favor of the cluster. That is far smaller than the ratio of raw computational power, which gives you a hint of how much of the intelligence is about previously gained knowledge versus brute computing power. To me that also implies that a lot of the innovation at DeepMind is about improving the speed at which a neural net learns and acquires knowledge, and less about the running time of applying the model. And that's probably an important distinction in the discussion of this technology.
Money quote for those who don't want to read the whole thing:
'''
When people talk about training a Chinchilla-optimal model, this is what they mean: training a model that matches their estimates for optimality. They estimated the optimal model size for a given compute budget, and the optimal number of training tokens for a given compute budget.
However, when we talk about “optimal” here, what is meant is “what is the cheapest way to obtain a given loss level, in FLOPS.” In practice though, we don’t care about the answer! This is exactly the answer you care about if you’re a researcher at DeepMind/FAIR/AWS who is training a model with the goal of reaching the new SOTA so you can publish a paper and get promoted. If you’re training a model with the goal of actually deploying it, the training cost is going to be dominated by the inference cost. This has two implications:
1) there is a strong incentive to train smaller models which fit on single GPUs
2) we’re fine trading off training time efficiency for inference time efficiency (probably to a ridiculous extent).
Chinchilla implicitly assumes that the majority of the total cost of ownership (TCO) for a LLM is the training cost. In practice, this is only the case if you’re a researcher at a research lab who doesn’t support products (e.g. FAIR/Google Brain/DeepMind/MSR). For almost everyone else, the amount of resources spent on inference will dwarf the amount of resources spent during training.
'''
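To make "Chinchilla-optimal" concrete, here is a rough numeric sketch using the widely quoted approximations C ≈ 6·N·D and roughly 20 training tokens per parameter (round numbers, not the paper's exact fitted coefficients):

```
# Rough Chinchilla-style allocation sketch using common approximations
# (C ~= 6 * N * D, optimum at roughly D ~= 20 * N), not exact fitted constants.
import math

def chinchilla_optimal(compute_flops: float) -> tuple[float, float]:
    # Solve C = 6 * N * (20 * N)  =>  N = sqrt(C / 120)
    n_params = math.sqrt(compute_flops / 120)
    n_tokens = 20 * n_params
    return n_params, n_tokens

for c in (1e21, 1e23, 1e25):
    n, d = chinchilla_optimal(c)
    print(f"C = {c:.0e} FLOPs -> ~{n / 1e9:5.1f}B params, ~{d / 1e12:5.2f}T tokens")
```

The quoted post's point is that if inference dominates your costs, you would deliberately train a smaller-than-"optimal" model on more tokens than this rule suggests.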
You know, there is more to deep learning research than meta-learning/architecture exploration. Sure, you can explore the hyperparameter space faster with 500 GPUs and get yet another 0.05% improvement in test score on ImageNet (or more, I don't actually know), but there are other ways to do something meaningful in DL without that kind of compute power.
Can anyone give some color on to what extent advancements in AI are limited by the availability of compute, versus the availability of data?
I was under the impression that the size and quality of the training dataset had a much bigger impact on performance than the sophistication of the model, but I could be mistaken.
Agreed that "simply" scaling up with more compute will result in progress and useful systems, and work in that direction is interesting and valuable. But, while we may not need new architectures or training objectives to make progress, we do need them to approach human level sample complexity. Humans don't need to read through 40 GB of text multiple times to learn to write.
I feel like it was a bit misleading to mention the large amount of energy needed to train a model and then show a chip which can only be used for inference, not training. In general, people are optimizing for power use at the edge, not for training.