SparseGPT: Language Models Can Be Accurately Pruned in One-Shot (arxiv.org)
209 points by tosh | 2023-05-03 | 66 comments




Wow, these results are amazing! This might be extremely helpful in reducing the memory consumption of large models in the future.

Wow, this could make the fabled 65-billion-parameter LLaMA, sparsified and pruned, runnable on a 3060.

How would this be any different from running one of the lower parameter models?

It says in the abstract

> at minimal loss of accuracy

Suggesting that there is a lot of redundancy in the weights.


I wonder how far we can take this. Is 1B parameters theoretically "expressive" enough for GPT-4 like performance? I wonder how far off "theoretically optimal" we are in terms of performance/parameters ratio.

Good open questions. I suspect we'll see models distilled and compressed down to retain most of their "common sense", "core knowledge", and reasoning ability, while stripping out the gigabytes of random trivia most people will never need.

This makes me wonder what would happen if you took the sparse model, reset its weights, then trained it with the original training data.

Very interesting idea. I'd hypothesize that it won't achieve the same(ish) accuracy, and that pruning might be required (similar to how humans go through a heavy pruning phase at an early age[0]). Would be worth setting up an experiment on a smaller scale.

As another commenter stated, there's currently a lot of low-hanging fruit in optimizing NNs.

0. https://en.m.wikipedia.org/wiki/Synaptic_pruning


Larger models seem to handle introspection much better, which makes them a better backend for sourced knowledge extraction.

It looks like pruning by a factor of 0.5 reduces the size of the model by 50%? In practice, what is the expected observed change in the output before and after pruning?

>what is the expected observed change in the output before and after pruning?

The expected and observed change is virtually none. That's the whole point!

Notably, quantizing weights from 16-bit to 4-bit (reducing the size by 75%) also causes almost no change in output quality when using modern algorithms like GPTQ.
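
For intuition on where that 75% comes from, here's a minimal sketch of round-to-nearest, group-wise 4-bit quantization in PyTorch. This is not GPTQ itself (GPTQ additionally uses approximate second-order information to adjust the remaining weights and keep the error low); it just shows the storage arithmetic and the kind of error the naive approach gives you:

    import torch

    def quantize_4bit(w, group_size=128):
        # Symmetric round-to-nearest, per-group 4-bit quantization.
        w = w.reshape(-1, group_size)
        scale = w.abs().amax(dim=1, keepdim=True) / 7.0       # int4 range is -8..7
        q = torch.clamp(torch.round(w / scale), -8, 7).to(torch.int8)
        return q, scale

    def dequantize_4bit(q, scale):
        return (q.float() * scale).reshape(-1)

    w = torch.randn(4096 * 4096)                              # one weight matrix, flattened
    q, scale = quantize_4bit(w)
    print("mean abs error:", (w - dequantize_4bit(q, scale)).abs().mean().item())
    # 16 bits -> 4 bits per weight is the 75% reduction; in practice two int4
    # values are packed per byte, plus a small per-group scale overhead.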


What are the drawbacks of quantizing though?

You are losing information. Knowing whether you actually need that information is sometimes difficult. That's the whole lottery ticket hypothesis (mentioned elsewhere in the thread).

>What are the drawbacks of quantizing though?

A 0.01% loss in quality for a 4x speed up and 4x less VRAM/RAM requirement.

90GB models now fit and run on a $600 consumer video card, with quality so similar that the difference is only detectable in hours-long automated tests with tens of thousands of iterations.


Very good result, and awesome to see such great progress happen so fast.

Quantizing/pruning/deduplicating/compressing models and embeddings is still a vast orchard of low hanging fruit.

I personally think there are still quite a few multiple-orders-of-magnitude scale opportunities to accelerate inference, and we are fortunate to have strong economic incentives aligned with the problem.


So much low hanging fruit for those willing to pick it. It's a great time to be an LLM researcher.

Yeah, writing a thesis on it right now; this + adapters give so many options to play with, and Meta was nice enough to give me access to their research models.

The robustness with which these models can be quantized, and now pruned, makes one wonder whether they could be easily implemented in some form of analog (or optical) hardware.

Isn't the inspiration behind these models our own analog hardware?

not sure i would call neurons analog, they are very nonlinear and capricious beasts.

analog and linear are not the same thing.

> adjective: analog

> relating to or using signals or information represented by a continuously variable physical quantity such as spatial position, voltage, etc.


yeah neurons use discontinuous signals (spikes)

Perhaps even by biological cells!

Xenobots, even.

Once you prune your model, can you get even better performance by re-training it? I've heard theories that this is the function of sleep in brains.

It sounds nice (maybe too nice?). I always wanted it to turn out that a "sleeping" phase would be necessary in AI.

It always felt weird that we have to sleep; it doesn't seem to give any evolutionary advantage.


It must give an evolutionary advantage, or we wouldn't sleep.

It may be hard to pinpoint exactly what that advantage is, but since we do it, it must have given us one!


Especially considering that it is so widespread in nearly every creature with a brain. And it’s not simply a period of motionless energy conservation but has very specific neural patterns. The science is definitely zeroing in on a connection to learning.

I have an unbaked theory, but the very short version is:

- Animals that have peaks of energy use outcompete animals that have a steady-state energy use. Catch the animal, then rest and recover. For any given amount of energy, this means we can recruit more in a smaller window compared to an animal that plods along with no recuperative phase.

- Many things happen when you're sleeping. Rather than having everything running 24/7, having different phases means we can specialise action and recovery. Since the time is already driven by energy demands, many parts of our body and mind leverage it for different purposes.


1 day of in-context learning and 1 night of fine-tuning on context. that's my pet theory, just shooting from the hip as a total layperson.

Continual learning. When models do that, they will have to sleep as well to avoid catastrophic forgetting.

[dead]

[dead]

You've, intentionally or not, ignored a whole body of research. An extremely cursory dive into sleep will show all sorts of functional reasons, related to memory formation.

Retraining a large GPT is very expensive. The goal of this paper is to help limit the need for retraining after pruning.

Extremely expensive, though as I understand it, the goal is to get maximum performance out of a given model size so that you can actually run inference at product scale. A few extra million for training is expensive, but then consider what it costs to run inference for something like Bing with 100 million daily active users.

If they can develop new methods to “overtrain” these models they will get more bang out of the smaller parameter model buck.


To copy a reddit meme: text-generation-webui plugin when? But seriously, this seems like an incredible upgrade.

Better yet, text-generation-webui-docker when?

Is Serge what you are looking for? I've been using it to play around with prompting a few large language models.

https://github.com/nsarrazin/serge


This is interesting. OPT and BLOOM are significantly Chinchilla under-trained, and I can't help but wonder if this is related to their compressibility here. I would like to see the results for something Chinchilla over-trained, like Llama - my gut is that the 'free lunch' they see will get slightly more expensive.

Implementation q - can torch or other inference runtimes take advantage of the memory savings delivered by a sparsification like this? Or do you need a special implementation to not malloc out all the memory implied by each tensor layer?
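
On the implementation question, here's a quick sketch of the memory trade-off with stock PyTorch sparse tensors (COO format, as an assumption for illustration). At 50% unstructured sparsity the naive sparse representation is actually larger than the dense tensor because of the index overhead, which is why real savings tend to need much higher sparsity, compressed index formats, or hardware-supported structured sparsity (e.g. 2:4), plus kernels that understand them:

    import torch

    dense = torch.randn(4096, 4096)                  # fp32 for simplicity
    mask = torch.rand_like(dense) < 0.5              # ~50% unstructured sparsity
    dense = dense * mask

    sparse = dense.to_sparse().coalesce()            # COO: int64 indices + values

    dense_bytes = dense.nelement() * dense.element_size()
    sparse_bytes = (sparse.values().nelement() * sparse.values().element_size()
                    + sparse.indices().nelement() * sparse.indices().element_size())
    print(f"dense:  {dense_bytes / 2**20:.1f} MiB")
    print(f"sparse: {sparse_bytes / 2**20:.1f} MiB")  # larger, due to index overhead

    x = torch.randn(4096, 8)
    y = torch.sparse.mm(sparse, x)                    # needs a sparse-aware kernel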


SparseGPT-for-LLaMA[0] exists. Pruning more than 30% of the weights of the 30B model starts to show significant perplexity losses. 50% is a total disaster, while it was not for OPT or BLOOM. So your intuition seems to be good here.

[0] https://github.com/AlpinDale/sparsegpt-for-LLaMA


Has anyone figured out what the optimal pruning is for 65b? I don't really know what that matrix in your link is saying, but it didn't seem to show optimal pruning.

Could you explain what you mean by "Chinchilla under-trained" or "Chinchilla over-trained"? I assume it refers to some measure of trained-ness, but Googling yielded nothing relevant.

My memory says that there’s a “Chinchilla” paper showing how to make the best model with a given training budget. There’s a trade-off between the amount of training data and the size of the model itself. Chinchilla under-training would mean that the model is too big for the amount of training data used. Llama is Chinchilla over-trained in that there is a ton of data relative to the small size of the model.

Note that this is still desirable for inference because you want the most possible training on whatever model you can actually fit in your memory.


It's from this paper - https://arxiv.org/abs/2203.15556

Like the sibling comment said - the proportion of training tokens to parameter count is very important, and a certain threshold needs to be met for the model to be "fully trained".

Usually you have a fixed amount of compute (budget/time essentially) - and in that case you want to pick the largest parameter count that you can fully train, and not the largest parameter count your hardware can support and then train that for less time.

tl;dr - Small models trained past the Chinchilla threshold can outperform large models that are undertrained.

EDIT: Figure 2 page 5, and Table 3 page 8 - might be worth checking out.
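
For a back-of-envelope feel, here's the commonly cited rule of thumb from the paper (roughly 20 training tokens per parameter, with training compute C ≈ 6·N·D FLOPs) applied to the publicly reported token counts for GPT-3 and LLaMA 65B. These are rough approximations of the paper's fitted scaling laws, not exact figures:

    def chinchilla_optimal_tokens(n_params):
        return 20.0 * n_params                 # ~20 tokens per parameter

    def train_flops(n_params, n_tokens):
        return 6.0 * n_params * n_tokens       # C ~ 6*N*D

    for name, n_params, n_tokens in [
        ("GPT-3 175B", 175e9, 300e9),          # ~300B training tokens reported
        ("LLaMA 65B",  65e9,  1.4e12),         # ~1.4T training tokens reported
    ]:
        opt = chinchilla_optimal_tokens(n_params)
        print(f"{name}: {n_tokens:.1e} tokens vs ~{opt:.1e} compute-optimal "
              f"({n_tokens / opt:.1f}x), training compute ~{train_flops(n_params, n_tokens):.1e} FLOPs")

GPT-3 comes out at roughly a tenth of the Chinchilla-optimal token count, while LLaMA 65B is at or slightly past it, which matches the under/over-trained framing above.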


Google's DeepMind put out the Chinchilla paper last year, showing that GPT-3 and others could have gotten better at the same size by just shoving more tokens at them in further training loops. The paper showed some snazzy curves where more training time and data equalled better quality, and speculated / demonstrated that a lot more training tokens and time could get better quality out of smaller models than GPT-3's 175B.

That was, for a minute, ignored, because the PaLM paper came out very shortly thereafter and seemed to show, pretty conclusively, that there are unusual and exciting emergent behaviours coming out of much larger models (PaLM is 540B parameters), so that was the hotter news.

In the meantime, some really smart folks looked at the Chinchilla curve and were like "hmm. One way to think about this is to see that if you are willing to put a LOT more compute in upfront on a model, then the inference costs go down according to some sub-linear function."

Llama's architectural instincts are that if you're going to give away a model, and it is going to get run on the edge, it might make sense to spend a whole, whole lot of compute, once, training something past what the paper considered optimal, and well into the point where the paper thought of it as "not worth it", precisely because the entire world might be able to run it if you can get something good and much smaller.

To conclude: OPT and LLMs from that era are significantly 'under-trained' even compared to GPT-3, which is itself under-trained by something like an order of magnitude relative to where the Chinchilla paper implies it should be.

I guess I made up the phrases "over-trained" and "under-trained"; there might be some other way to talk about it elsewhere. Sorry! :)


Really good insight. I'd love to see a study that goes over a bunch of models, prunes them back to some standard measure, and then compares the results to see if they collapse to the Chinchilla compute-optimal line.

If the abstract is accurate, then I'm very, very excited to try this on LLaMA 65B. We are tantalizingly close to ChatGPT performance parity on consumer hardware.

Hopefully this lowers the cost of doing instruct fine tuning on the larger models, and we see a Vicuna like model based on LLaMA 65B soon. This is exciting folks.


Neat paper. Planning on reading more in-depth over the weekend, but more fundamental than just applications to GPT their insights are:

- Existing pruners were written for models that are orders of magnitude smaller than anything in the modern GPT family. Their runtime grows at least linearly with the number of parameters, so they're unequipped to work on current architectures. The best existing pruner takes ~4.3 hours on a 1.3B-parameter model.

- The core scaling issue is the time to calculate the Hessian during prune analysis (effectively a matrix of second-order derivatives, famously expensive to compute).

- They follow the existing literature and use a local approach to each layer. By doing this (and doing it well), they preserve the input/output contract for the surrounding layers, which makes the whole thing parallelizable across machines.

- Their solution approximates the reconstruction loss with a quadratic loss and then runs an OBS update (with a few other optimizations around ordering and iteration on the side).

I'm particularly excited for these smaller models, mostly for inference efficiency gains in realtime applications. The general con of weight pruning is that it still requires incredibly large training clusters / upfront investment in training resources to get the original weights. But if the lottery ticket hypothesis holds true, this might be the best way we have at the moment to get models with the same performance and lower long-term operational costs.
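
To make the "local, layer-wise" framing concrete, here's a toy sketch. It is not the paper's OBS-based solver - just plain 50% magnitude pruning of a single Linear layer, plus a check of how well the pruned layer reproduces the dense layer's outputs on stand-in calibration inputs:

    import torch
    import torch.nn as nn

    @torch.no_grad()
    def prune_layer_magnitude(layer, sparsity=0.5):
        # Zero out the smallest-magnitude fraction of weights in this layer only.
        w = layer.weight
        k = int(w.numel() * sparsity)
        threshold = w.abs().flatten().kthvalue(k).values
        layer.weight[w.abs() <= threshold] = 0.0

    @torch.no_grad()
    def relative_output_error(dense, pruned, calib):
        ref, out = dense(calib), pruned(calib)
        return ((ref - out).pow(2).mean() / ref.pow(2).mean()).item()

    dense = nn.Linear(1024, 1024, bias=False)
    pruned = nn.Linear(1024, 1024, bias=False)
    pruned.load_state_dict(dense.state_dict())
    prune_layer_magnitude(pruned, sparsity=0.5)

    calib = torch.randn(256, 1024)               # stand-in for calibration inputs
    print("relative output error:", relative_output_error(dense, pruned, calib))
    # SparseGPT replaces the naive "just zero them" step with a Hessian-aware
    # (OBS-style) update of the surviving weights, which is what keeps this
    # per-layer error small even at 50%+ sparsity on 100B+ parameter models.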


Another random thought: most of these general-purpose pruning approaches rely on randomly generated X inputs against which they measure the layer's output loss. In theory it's possible to feed actual datasets into these models as well, which could be another way to get a sparse model that's more acutely optimized towards one task: the original model produces the X activations at each layer, and these are used as the optimization criterion for the pruned version.

It might be able to provide performance similar to fine-tuning but without the weight skew that you'll necessarily see in parameter values.


Statisticians have been using L1 regularization to estimate sparse models for a while; it seems reasonable to assume that you could fine-tune the model on a dataset while also pushing weak parameters to zero in a natural way in this domain as well.
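
A minimal sketch of that idea in PyTorch (the model and data here are stand-ins; note that plain SGD with an L1 penalty won't drive weights exactly to zero - you'd want a proximal/ISTA-style update for that - so small weights get thresholded afterwards):

    import torch
    import torch.nn as nn

    model = nn.Linear(512, 512)                  # stand-in for a real model
    opt = torch.optim.SGD(model.parameters(), lr=1e-2)
    l1_lambda = 1e-4

    for step in range(100):
        x = torch.randn(64, 512)                 # stand-in for task data
        target = torch.randn(64, 512)
        loss = nn.functional.mse_loss(model(x), target)
        l1 = sum(p.abs().sum() for p in model.parameters())   # lasso-style penalty
        opt.zero_grad()
        (loss + l1_lambda * l1).backward()
        opt.step()

    with torch.no_grad():
        near_zero = (model.weight.abs() < 1e-3).float().mean().item()
    print(f"fraction of near-zero weights: {near_zero:.2%}")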

I believe you are correct. I worked on a summer research project at NYU in 2018 based on https://arxiv.org/abs/1805.12185

As part of that project I constructed an API that took a small dataset and a model, launched a K8s pod and ran something like this from the paper:

> The pruning defense works as follows: the defender exercises the DNN received from the attacker with clean inputs from the validation dataset, D_valid, and records the average activation of each neuron. The defender then iteratively prunes neurons from the DNN in increasing order of average activations and records the accuracy of the pruned network in each iteration. The defense terminates when the accuracy on the validation dataset drops below a pre-determined threshold. We note that pruning has been proposed in prior work for n

Obviously this wasn't on transformers but the idea is similar.
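
Roughly, the loop from the quoted procedure looks like this (the names and the evaluate() callback are hypothetical placeholders, not the actual project code; bias terms are left alone for brevity):

    import torch
    import torch.nn as nn

    @torch.no_grad()
    def activation_prune(layer, clean_inputs, evaluate, min_accuracy):
        # Average post-ReLU activation of each output neuron on clean data.
        acts = torch.relu(layer(clean_inputs)).mean(dim=0)
        for neuron in torch.argsort(acts):       # least active first
            saved = layer.weight[neuron].clone()
            layer.weight[neuron] = 0.0           # prune this output neuron
            if evaluate() < min_accuracy:        # validation accuracy callback
                layer.weight[neuron] = saved     # undo the last prune and stop
                break
        return layer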


This is a great breakdown. I don’t know much about LLM internals but I could follow this easily.

Reduction in working memory for sparse models seems pretty huge.

> lottery ticket hypothesis

For those that, like me, didn't know the reference: https://arxiv.org/abs/1803.03635


Who’s gonna apply this to LLama weights?

It was done over two months ago: https://github.com/AlpinDale/sparsegpt-for-LLaMA

Eternal Sunshine of the Spotless Mind for LLMs.

I don't think so, from the abstract it's more like JPEG for LLMs.

One hopes it's more like PNG for LLMs?

It's lossy rather than lossless, and the impact can be dialled up or down depending on space requirements.

Can we call it LLMPeg?

For anyone interested in SparseGPT, on May 25th, the author of the SparseGPT paper will show you how you can download an optimized and open-sourced LLM and run it on CPUs at GPU speeds using DeepSparse.

Confirm your spot: https://neuralmagic.com/unlock-faster-and-more-efficient-lan...


This suggests that the effective number of parameters is far lower than the nominal number. My head canon for neural networks as overparametrized models still holds.
