SparseGPT: Language Models Can Be Accurately Pruned in One-Shot (arxiv.org)
209 points by tosh | 2023-05-03 | 66 comments




Wow, these results are amazing! This might be extremely helpful in reducing the memory consumption of large models in the future.

Wow, this could make the fabled 65-billion-parameter LLaMA, sparsified and pruned, runnable on a 3060.

How would this be any different from running one of the lower parameter models?

It says in the abstract

> at minimal loss of accuracy

Suggesting that there is a lot of redundancy in the weights.


I wonder how far we can take this. Is 1B parameters theoretically "expressive" enough for GPT-4 like performance? I wonder how far off "theoretically optimal" we are in terms of performance/parameters ratio.

Good open questions. I suspect we'll see models distilled and compressed down to retain most of their "common sense", "core knowledge", and reasoning ability, while stripping out the gigabytes of random trivia most people will never need.

This makes me wonder what would happen if you took the sparse model, reset its weights, then trained it with the original training data.

Very interesting idea. I'd hypothesize that it won't achieve the same(ish) accuracy, and that pruning might be required (similar to how humans go through a heavy pruning phase at an early age[0]). Would be worth setting up an experiment on a smaller scale.

As another commenter stated, there's currently a lot of low-hanging fruit in optimizing NNs.

0. https://en.m.wikipedia.org/wiki/Synaptic_pruning


Larger models seem to handle introspection much better, which makes them a better backend for sourced knowledge extraction.

It looks like pruning by a factor of 0.5 reduces the size of the model by 50%? In practice, what is the expected observed change in the output before and after pruning?

>what is the expected observed change in the output before and after pruning?

The expected and observed change is virtually none. That's the whole point!

Notably, quantizing weights from 16-bit to 4-bit (reducing the size by 75%) also causes almost no change in output quality when using modern algorithms like GPTQ.
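
For intuition on where that 75% comes from, here's a minimal sketch of round-to-nearest, group-wise 4-bit quantization in PyTorch. This is not GPTQ itself (GPTQ additionally uses approximate second-order information to adjust the remaining weights and keep the error low); it just shows the storage arithmetic and the kind of error the naive approach gives you:

    import torch

    def quantize_4bit(w, group_size=128):
        # Symmetric round-to-nearest, per-group 4-bit quantization.
        w = w.reshape(-1, group_size)
        scale = w.abs().amax(dim=1, keepdim=True) / 7.0       # int4 range is -8..7
        q = torch.clamp(torch.round(w / scale), -8, 7).to(torch.int8)
        return q, scale

    def dequantize_4bit(q, scale):
        return (q.float() * scale).reshape(-1)

    w = torch.randn(4096 * 4096)                              # one weight matrix, flattened
    q, scale = quantize_4bit(w)
    print("mean abs error:", (w - dequantize_4bit(q, scale)).abs().mean().item())
    # 16 bits -> 4 bits per weight is the 75% reduction; in practice two int4
    # values are packed per byte, plus a small per-group scale overhead.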


What are the drawbacks of quantizing though?

You are losing information. Knowing whether you actually need that information is sometimes difficult. That's the whole lottery ticket hypothesis (mentioned elsewhere in the thread).

>What are the drawbacks of quantizing though?

A 0.01% loss in quality for a 4x speed up and 4x less VRAM/RAM requirement.

90GB models now fit and run on a $600 consumer video card, with quality so similar that the difference is only detectable in hours-long automated tests with tens of thousands of iterations.


Very good result, and awesome to see such great progress happen so fast.

Quantizing/pruning/deduplicating/compressing models and embeddings is still a vast orchard of low hanging fruit.

I personally think there are still quite a few multiple-orders-of-magnitude scale opportunities to accelerate inference, and we are fortunate to have strong economic incentives aligned with the problem.


So much low hanging fruit for those willing to pick it. It's a great time to be an LLM researcher.

Yeah, writing a thesis on it right now; this + adapters give so many options to play with, and Meta was nice enough to give me access to their research models.

The robustness with which these models can be quantized, and now pruned, makes one wonder whether they could be easily implemented in some form of analog (or optical) hardware.

Isn't the inspiration behind these models our own analog hardware?

not sure i would call neurons analog, they are very nonlinear and capricious beasts.

analog and linear are not the same thing.

> adjective: analog

> relating to or using signals or information represented by a continuously variable physical quantity such as spatial position, voltage, etc.


yeah neurons use discontinuous signals (spikes)

Perhaps even by biological cells!

Xenobots, even.

Once you prune your model, can you get even better performance by re-training it? I've heard theories that this is the function of sleep in brains.

It sounds nice (maybe too nice?). I always wanted it to turn out that a "sleeping" phase would be necessary in AI.

It always felt weird that we have to sleep; it doesn't seem to give any evolutionary advantage.


It must give an evolutionary advantage, or we wouldn't sleep.

It may be hard to pinpoint exactly what that advantage is, but since we do it, it must have given us one!


Especially considering that it is so widespread in nearly every creature with a brain. And it’s not simply a period of motionless energy conservation but has very specific neural patterns. The science is definitely zeroing in on a connection to learning.

I have an unbaked theory, but the very short version is:

- Animals that have peaks of energy use outcompete animals that have a steady-state energy use. Catch the animal, then rest and recover. For any given amount of energy, this means we can recruit more in a smaller window compared to an animal that plods along with no recuperative phase.

- Many things happen when you're sleeping. Rather than having everything running 24/7, having different phases means we can specialise action and recovery. Since the time is already driven by energy demands, many parts of our body and mind leverage it for different purposes.


1 day of in-context learning and 1 night of fine-tuning on context. that's my pet theory, just shooting from the hip as a total layperson.

Continual learning. When models do that, they will have to sleep as well to avoid catastrophic forgetting.

[dead]

[dead]

You've, intentionally or not, ignored a whole body of research. An extremely cursory dive into sleep will show all sorts of functional reasons, related to memory formation.

Retraining a large GPT is very expensive. The goal of this paper is to help limit the need for retraining after pruning.

Extremely expensive, though as I understand it, the goal is to get maximum performance out of a given model size so that you can actually run inference at product scale. A few extra million for training is expensive, but then consider what it costs to run inference for something like Bing with 100 million daily active users.

If they can develop new methods to “overtrain” these models they will get more bang out of the smaller parameter model buck.


To copy a reddit meme: text-generation-webui plugin when? But seriously, this seems like an incredible upgrade.

Better yet, text-generation-webui-docker when?

Is Serge what you are looking for? I've been using it to play around with prompting a few large language models.

https://github.com/nsarrazin/serge


This is interesting. OPT and BLOOM are significantly Chinchilla under-trained, and I can't help but wonder if this is related to their compressibility here. I would like to see the results for something Chinchilla over-trained, like Llama - my gut is that the 'free lunch' they see will get slightly more expensive.

Implementation q - can torch or other inference runtimes take advantage of the memory savings delivered by a sparsification like this? Or do you need a special implementation to not malloc out all the memory implied by each tensor layer?
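
On the implementation question, here's a quick sketch of the memory trade-off with stock PyTorch sparse tensors (COO format, as an assumption for illustration). At 50% unstructured sparsity the naive sparse representation is actually larger than the dense tensor because of the index overhead, which is why real savings tend to need much higher sparsity, compressed index formats, or hardware-supported structured sparsity (e.g. 2:4), plus kernels that understand them:

    import torch

    dense = torch.randn(4096, 4096)                  # fp32 for simplicity
    mask = torch.rand_like(dense) < 0.5              # ~50% unstructured sparsity
    dense = dense * mask

    sparse = dense.to_sparse().coalesce()            # COO: int64 indices + values

    dense_bytes = dense.nelement() * dense.element_size()
    sparse_bytes = (sparse.values().nelement() * sparse.values().element_size()
                    + sparse.indices().nelement() * sparse.indices().element_size())
    print(f"dense:  {dense_bytes / 2**20:.1f} MiB")
    print(f"sparse: {sparse_bytes / 2**20:.1f} MiB")  # larger, due to index overhead

    x = torch.randn(4096, 8)
    y = torch.sparse.mm(sparse, x)                    # needs a sparse-aware kernel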


SparseGPT-for-LLaMA[0] exists. Pruning more than 30% of the weights of the 30B model starts to show significant perplexity losses. 50% is a total disaster, while it was not for OPT or BLOOM. So your intuition seems to be good here.

[0] https://github.com/AlpinDale/sparsegpt-for-LLaMA


Has anyone figured out what the optimal pruning is for 65b? I don't really know what that matrix in your link is saying, but it didn't seem to show optimal pruning.

Could you explain what you mean by "Chinchilla under-trained" or "Chinchilla over-trained"? I assume it refers to some measure of trained-ness, but Googling yielded nothing relevant.

My memory says that there’s a “Chinchilla” paper showing how to make the best model with a given training budget. There’s a trade-off between the amount of training data and the size of the model itself. Chinchilla under-training would mean that the model is too big for the amount of training data used. Llama is Chinchilla over-trained in that there is a ton of data relative to the small size of the model.

Note that this is still desirable for inference because you want the most possible training on whatever model you can actually fit in your memory.


It's from this paper - https://arxiv.org/abs/2203.15556

Like the sibling comment said - the proportion of training tokens to parameter count is very important, and a certain threshold needs to be met for the model to be "fully trained".

Usually you have a fixed amount of compute (budget/time essentially) - and in that case you want to pick the largest parameter count that you can fully train, and not the largest parameter count your hardware can support and then train that for less time.

tl;dr - Small models trained past the Chinchilla threshold can outperform large models that are undertrained.

EDIT: Figure 2 page 5, and Table 3 page 8 - might be worth checking out.
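
For a back-of-envelope feel, here's the commonly cited rule of thumb from the paper (roughly 20 training tokens per parameter, with training compute C ≈ 6·N·D FLOPs) applied to the publicly reported token counts for GPT-3 and LLaMA 65B. These are rough approximations of the paper's fitted scaling laws, not exact figures:

    def chinchilla_optimal_tokens(n_params):
        return 20.0 * n_params                 # ~20 tokens per parameter

    def train_flops(n_params, n_tokens):
        return 6.0 * n_params * n_tokens       # C ~ 6*N*D

    for name, n_params, n_tokens in [
        ("GPT-3 175B", 175e9, 300e9),          # ~300B training tokens reported
        ("LLaMA 65B",  65e9,  1.4e12),         # ~1.4T training tokens reported
    ]:
        opt = chinchilla_optimal_tokens(n_params)
        print(f"{name}: {n_tokens:.1e} tokens vs ~{opt:.1e} compute-optimal "
              f"({n_tokens / opt:.1f}x), training compute ~{train_flops(n_params, n_tokens):.1e} FLOPs")

GPT-3 comes out at roughly a tenth of the Chinchilla-optimal token count, while LLaMA 65B is at or slightly past it, which matches the under/over-trained framing above.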


Google's DeepMind put out the Chinchilla paper last year, showing that GPT-3 and others could have gotten better at the same size by just shoving more tokens at them in further training loops. The paper showed some snazzy curves where more training time and data equalled better quality, and speculated / demonstrated that a lot more training tokens and time could get better quality out of smaller models than GPT-3's 175B.

That was, for a minute, ignored, because the PaLM paper came out very shortly thereafter and seemed to show, pretty conclusively, that there are unusual and exciting emergent behaviours coming out of much larger models (PaLM is 540B parameters), so that was the hotter news.

In the meantime, some really smart folks looked at the Chinchilla curve and were like "hmm. One way to think about this is to see that if you are willing to put a LOT more compute in upfront on a model, then the inference costs go down according to some sub-linear function."

Llama's architectural instincts are that if you're going to give away a model, and it is going to get run on the edge, it might make sense to spend a whole, whole lot of compute, once, training something past what the paper considered optimal, and well into the point where the paper thought of it as "not worth it", precisely because the entire world might be able to run it if you can get something good and much smaller.

To conclude: OPT and LLMs from that era are significantly 'under-trained' even compared to GPT-3, which is itself under-trained by something like an order of magnitude relative to where the Chinchilla paper implies it should be.

I guess I made up the phrases "over-trained" and "under-trained"; there might be some other way to talk about it elsewhere. Sorry! :)


Really good insight. I'd love to see a study that goes over a bunch of models, prunes them back to some standard measure, and then compares the results to see if they collapse to the Chinchilla compute-optimal line.

If the abstract is accurate, then I'm very, very excited to try this on LLaMA 65B. We are tantalizingly close to ChatGPT performance parity on consumer hardware.

Hopefully this lowers the cost of doing instruct fine tuning on the larger models, and we see a Vicuna like model based on LLaMA 65B soon. This is exciting folks.


Neat paper. Planning on reading more in-depth over the weekend, but more fundamental than just applications to GPT their insights are:

- Existing pruners were written for models that are orders of magnitude smaller than anything in the modern GPT family. Their runtime grows at least linearly with the number of parameters, so they're unequipped to work on current architectures. The best existing pruner takes ~4.3 hours on a 1.3B-parameter model.

- The core scaling issue is the time to calculate the Hessian during prune analysis (effectively a matrix of second-order derivatives, famously expensive to compute).

- They follow the existing literature and use a local approach to each layer. By doing this (and doing it well), they preserve the input/output contract for the surrounding layers, which makes the whole thing parallelizable across machines.

- Their solution approximates the reconstruction loss with a quadratic loss and then runs an OBS update (with a few other optimizations around ordering and iteration on the side).

I'm particularly excited for these smaller models, mostly for inference efficiency gains in realtime applications. The general con of weight pruning is that it still requires incredibly large training clusters / upfront investment in training resources to get the original weights. But if the lottery ticket hypothesis holds true, this might be the best way we have at the moment to get models with the same performance and lower long-term operational costs.
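
To make the "local, layer-wise" framing concrete, here's a toy sketch. It is not the paper's OBS-based solver - just plain 50% magnitude pruning of a single Linear layer, plus a check of how well the pruned layer reproduces the dense layer's outputs on stand-in calibration inputs:

    import torch
    import torch.nn as nn

    @torch.no_grad()
    def prune_layer_magnitude(layer, sparsity=0.5):
        # Zero out the smallest-magnitude fraction of weights in this layer only.
        w = layer.weight
        k = int(w.numel() * sparsity)
        threshold = w.abs().flatten().kthvalue(k).values
        layer.weight[w.abs() <= threshold] = 0.0

    @torch.no_grad()
    def relative_output_error(dense, pruned, calib):
        ref, out = dense(calib), pruned(calib)
        return ((ref - out).pow(2).mean() / ref.pow(2).mean()).item()

    dense = nn.Linear(1024, 1024, bias=False)
    pruned = nn.Linear(1024, 1024, bias=False)
    pruned.load_state_dict(dense.state_dict())
    prune_layer_magnitude(pruned, sparsity=0.5)

    calib = torch.randn(256, 1024)               # stand-in for calibration inputs
    print("relative output error:", relative_output_error(dense, pruned, calib))
    # SparseGPT replaces the naive "just zero them" step with a Hessian-aware
    # (OBS-style) update of the surviving weights, which is what keeps this
    # per-layer error small even at 50%+ sparsity on 100B+ parameter models.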


Another random thought: most of these general-purpose pruning approaches rely on randomly generated X inputs against which they measure the layer's output loss. In theory it's possible to feed actual datasets into these models as well, which could be another way to get a sparse model that's more acutely optimized towards one task: the original model produces the X activations at each layer, and these are used as the optimization criterion for the pruned version.

It might be able to provide performance similar to fine-tuning but without the weight skew that you'll necessarily see in parameter values.


Statisticians have been using L1 regularization to estimate sparse models for a while; it seems reasonable to assume that you could fine-tune the model on a dataset while also pushing weak parameters to zero in a natural way in this domain as well.
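
A minimal sketch of that idea in PyTorch (the model and data here are stand-ins; note that plain SGD with an L1 penalty won't drive weights exactly to zero - you'd want a proximal/ISTA-style update for that - so small weights get thresholded afterwards):

    import torch
    import torch.nn as nn

    model = nn.Linear(512, 512)                  # stand-in for a real model
    opt = torch.optim.SGD(model.parameters(), lr=1e-2)
    l1_lambda = 1e-4

    for step in range(100):
        x = torch.randn(64, 512)                 # stand-in for task data
        target = torch.randn(64, 512)
        loss = nn.functional.mse_loss(model(x), target)
        l1 = sum(p.abs().sum() for p in model.parameters())   # lasso-style penalty
        opt.zero_grad()
        (loss + l1_lambda * l1).backward()
        opt.step()

    with torch.no_grad():
        near_zero = (model.weight.abs() < 1e-3).float().mean().item()
    print(f"fraction of near-zero weights: {near_zero:.2%}")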

I believe you are correct. I worked on a summer research project at NYU in 2018 based on https://arxiv.org/abs/1805.12185

As part of that project I constructed an API that took a small dataset and a model, launched a K8s pod and ran something like this from the paper:

> The pruning defense works as follows: the defender exercises the DNN received from the attacker with clean inputs from the validation dataset, D_valid, and records the average activation of each neuron. The defender then iteratively prunes neurons from the DNN in increasing order of average activations and records the accuracy of the pruned network in each iteration. The defense terminates when the accuracy on the validation dataset drops below a pre-determined threshold. We note that pruning has been proposed in prior work for n

Obviously this wasn't on transformers but the idea is similar.
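
Roughly, the loop from the quoted procedure looks like this (the names and the evaluate() callback are hypothetical placeholders, not the actual project code; bias terms are left alone for brevity):

    import torch
    import torch.nn as nn

    @torch.no_grad()
    def activation_prune(layer, clean_inputs, evaluate, min_accuracy):
        # Average post-ReLU activation of each output neuron on clean data.
        acts = torch.relu(layer(clean_inputs)).mean(dim=0)
        for neuron in torch.argsort(acts):       # least active first
            saved = layer.weight[neuron].clone()
            layer.weight[neuron] = 0.0           # prune this output neuron
            if evaluate() < min_accuracy:        # validation accuracy callback
                layer.weight[neuron] = saved     # undo the last prune and stop
                break
        return layer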


This is a great breakdown. I don’t know much about LLM internals but I could follow this easily.

Reduction in working memory for sparse models seems pretty huge.

> lottery ticket hypothesis

For those that, like me, didn't know the reference: https://arxiv.org/abs/1803.03635


Who’s gonna apply this to LLama weights?

It was done over two months ago: https://github.com/AlpinDale/sparsegpt-for-LLaMA

Eternal Sunshine of the Spotless Mind for LLMs.

I don't think so, from the abstract it's more like JPEG for LLMs.

One hopes it's more like PNG for LLMs?

It's lossy rather than lossless, and the impact can be dialled up or down depending on space requirements.

Can we call it LLMPeg?

For anyone interested in SparseGPT, on May 25th, the author of the SparseGPT paper will show you how you can download an optimized and open-sourced LLM and run it on CPUs at GPU speeds using DeepSparse.

Confirm your spot: https://neuralmagic.com/unlock-faster-and-more-efficient-lan...


This suggests that the effective number of parameters is far lower than the nominal number. My head canon for neural networks as overparametrized models still holds.
