
Correct me if I am wrong: to use a LoRA fine-tuned model in inference you would still need the original model + trained additional layers, right?

If we can perfect methods to fine-tune large models for specific tasks while reducing the overall model size, then they can fit into more consumer-grade hardware for inference and be broadly used. The objective is to prune unnecessary trivia and memorization artifacts from the model and leverage LLMs purely for interpreting natural language inputs.




> to use a LoRA fine-tuned model in inference you would still need the original model + trained additional layers, right?

You don't need additional layers. After training, the product of the two low-rank matrices is added to the original weight matrix, so the model size remains the same as the original during inference.
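A minimal sketch of that merge in plain PyTorch (the shapes, rank, and scaling factor here are illustrative, not from any particular implementation):

    import torch

    d_out, d_in, r, alpha = 4096, 4096, 8, 16
    W = torch.randn(d_out, d_in)   # frozen pre-trained weight
    B = torch.zeros(d_out, r)      # trained LoRA factors
    A = torch.randn(r, d_in)

    # Merging collapses the rank-r product back into a d_out x d_in matrix,
    # so the merged model has exactly the same shape and size as the original.
    W_merged = W + (alpha / r) * (B @ A)
    assert W_merged.shape == W.shape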


Yup. I guess LoRA counts as fine-tuning. Except I've never seen inference engines that actually let you take the base model and the LoRA parameters as separate inputs (maybe it exists and I just haven't seen it). Instead, they bake the LoRA part into the bigger tensors as the final step of the fine-tune. That makes sense in terms of making inference faster, but it prevents the scenario where a host could run the base model with any fine-tune you like, maybe even switching them mid-conversation. Instead, if you want to host a fine-tuned model, you take the tensor blob and run a separate instance of the inference program on it.

Incidentally, this is the one place where OpenAI and Azure pricing differs: OpenAI just charges you a big per-token premium for fine-tuned 3.5, while Azure charges you for the server to host the custom model. Likewise, the hosts for the open-weights models will charge you more to run your fine-tuned model than a standard model, even though it's almost the same amount of GPU cycles, just because it needs to run on a separate server that won't be shared by multiple customers; that wouldn't be necessary if the overlays were kept separate.

I wouldn't be surprised if GPT-4's rumored mixture of many models does something like this overlay management internally.
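(For reference, outside of dedicated inference engines the "base model plus swappable adapters" pattern does exist in libraries; a rough sketch with Hugging Face PEFT, where the model name and adapter paths are placeholders and the exact API may vary by version:

    from transformers import AutoModelForCausalLM
    from peft import PeftModel

    base = AutoModelForCausalLM.from_pretrained("some-base-model")  # placeholder name

    # Attach a LoRA adapter without baking it into the base weights
    model = PeftModel.from_pretrained(base, "path/to/adapter-a", adapter_name="a")

    # Load a second adapter and switch between them on the fly
    model.load_adapter("path/to/adapter-b", adapter_name="b")
    model.set_adapter("b")

Whether a given production inference engine supports this is a separate question.)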


> more than a single model and a lot of finetunes/high rank LoRAs

I can imagine a way might be found to host a base model and a bunch of LoRAs while using barely more RAM than the base model alone.

The fine-tuning could perhaps be done in such a way that only around 0.1% of the weights are changed, and for every computation the difference is applied not to the weights themselves but to the layer's output activations.
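That is essentially how an unmerged LoRA forward pass already works: the low-rank correction is computed on the activations, and the full weight difference is never materialized. A rough sketch in plain PyTorch (shapes and rank are illustrative only):

    import torch

    d, r = 4096, 8
    W = torch.randn(d, d)   # frozen base weight, shared by every adapter
    A = torch.randn(r, d)   # small per-adapter factors
    B = torch.zeros(d, r)
    x = torch.randn(d)

    # Base output plus a correction applied to the activations;
    # the d x d weight difference itself is never formed.
    y = W @ x + B @ (A @ x)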


LoRA stands for Low-Rank Adaptation. It's a very clever way to fine-tune large models with a fraction of the compute, the idea being that the weight update needed for fine-tuning has a low "intrinsic rank", so you need much less capacity to estimate it.
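A quick back-of-the-envelope example of how little capacity that is, assuming a single 4096 x 4096 projection and rank 8 (both numbers chosen purely for illustration):

    d, r = 4096, 8
    full = d * d        # 16,777,216 trainable params to fully fine-tune this layer
    lora = 2 * d * r    # 65,536 trainable params for the LoRA factors A and B
    print(lora / full)  # ~0.004, i.e. roughly 0.4% of the layer's parameters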

I'm not sure to whom he is responding, since no one is claiming LoRA performs as well as traditional fine tuning. If you click through to the original Tweet he shared, it says "when you have a lot of data and limited compute go for LoRA, while with limited data and ample compute go for full finetuning" which I think is absolutely correct and few would disagree. As these models get bigger and bigger though, fewer and fewer people are going to have the "ample compute" required for full fine tuning.

What about fine-tuning using LoRA? That would introduce new layers and rearrange the data for additional uses.

Great, we can get authoritative answers. (I'm trying to understand the ML space and have mostly been reading; not an expert.)

I am assuming you can have n LoRA fine-tunings, say each specializing in one aspect of a coherent task, run them in parallel, and then sum/combine their outputs at the end? Or more generally, does LoRA enable a sort of modularizing around a core (un-merged) model?

And I'm curious whether you ever tried merging 2 or more fine-tunings and then testing the resulting single model (merge all) against the original tests to check retention?
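In weight terms, both variants are just sums of low-rank updates; whether the combined model actually retains each specialization is the empirical part. A hedged sketch with illustrative tensors (scaling factors omitted for brevity):

    import torch

    d, r = 4096, 8
    W = torch.randn(d, d)                                # shared frozen base weight
    adapters = [(torch.zeros(d, r), torch.randn(r, d))   # (B_i, A_i) per fine-tuning
                for _ in range(3)]

    # "Merge all": fold every adapter's low-rank delta into one set of weights.
    W_merged = W + sum(B @ A for B, A in adapters)

    # Modular alternative: keep the core un-merged and sum the adapter
    # contributions on the activations, per input.
    x = torch.randn(d)
    y = W @ x + sum(B @ (A @ x) for B, A in adapters)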


Aren't a lot of base models fine-tuned with (Q)LoRA on instruct-style datasets with good results? I thought this was a very common practice?

> For example, if one would want to use an LLM to answer questions regarding a large, private knowledge base, would it make sense to fine-tune a model on this knowledge base?

I initially also thought this would be one of the best use cases for fine-tuning (teaching the model new data), but I've seen quite a few people say fine-tuning should not be used to teach the model new data, but rather new formatting and response style. This blog post seems to concur.

I do wonder how OpenAI does fine-tuning. I'm guessing it doesn't use LoRA.


This study is great and addresses a question I've had about LoRA for a while.

In a continual learning paper from last year, I found LoRA was extremely effective for faster fine-tuning and not forgetting the original dataset:

https://arxiv.org/abs/2306.01904


LoRA is an alternative to traditional fine tuning (which is usually done on specific layers as you mentioned).

To quote the LoRA paper[1]:

> We hypothesize that the change in weights during model adaptation also has a low “intrinsic rank”, leading to our proposed Low-Rank Adaptation (LoRA) approach. LoRA allows us to train some dense layers in a neural network indirectly by optimizing rank decomposition matrices of the dense layers’ change during adaptation instead, while keeping the pre-trained weights frozen

It's truly revolutionary: it basically lets you create a very small "diff" which you apply to an existing model, and it is suddenly fine-tuned. These diff models are very small (5M, for example).

[1] https://arxiv.org/abs/2106.09685
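A minimal sketch of what such a diff contains, assuming a single linear layer (real adapters store one A/B pair per adapted layer; sizes and hyperparameters are illustrative):

    import torch
    import torch.nn as nn

    class LoRALinear(nn.Module):
        """Frozen pre-trained linear layer plus a trainable low-rank update."""
        def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
            super().__init__()
            self.base = base
            for p in self.base.parameters():
                p.requires_grad_(False)            # pre-trained weights stay frozen
            self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
            self.B = nn.Parameter(torch.zeros(base.out_features, r))
            self.scale = alpha / r

        def forward(self, x):
            # Only A and B are trained; they are the entire "diff" that gets saved.
            return self.base(x) + self.scale * ((x @ self.A.T) @ self.B.T)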


I don't want to get into the weeds of the subtleties of evaluation, hyperparameter tuning, and model comparisons, but let's just say that subsequent studies have shown that LoRA (consistent with most parameter-efficient tuning methods) underperforms full fine-tuning: https://arxiv.org/abs/2203.06904

A simple way to think about it is this: if LoRA really gives full fine-tuning performance, why would anyone ever fully fine-tune a model?


LoRA is one of the fine-tuning methods. Compared to other methods like DreamBooth, Textual Inversion, etc., it has benefits like smaller output file size, fewer training images required, etc.

LoRA is particularly valuable for beginners, as it simplifies the process of adapting complex models to your specific needs without requiring extensive expertise in AI. Whether you're looking to enhance the performance of an image generation model or fine-tune a model for a specific need, LoRA can be a valuable ally in your AI image generation journey.


LoRA (Low-Rank Adaptation) is a way to customize/finetune an LLM on a new dataset without needing to retrain the entire network, which makes it cheaper (and in theory easier) to do. It doesn't change the speed significantly, afaik.

Look at QLoRA. QLoRA adapters can be attached to all layers, allowing you to alter behavior with much less data than the original LoRA implementation. It seems to "stick" better.

I just fine-tuned a ~30B parameter model on my 2x 3090s to check it out. It worked fantastically. I should be able to fine-tune up to 65B parameter models locally, but I wanted to get my dataset right on a smaller model before trying.
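For anyone curious what that setup roughly looks like, here's a hedged sketch using bitsandbytes 4-bit quantization plus PEFT; the model name, rank, and target-module names are placeholders (the module list shown is LLaMA-style), and exact arguments may differ across library versions:

    import torch
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig
    from peft import LoraConfig, get_peft_model

    bnb = BitsAndBytesConfig(load_in_4bit=True,
                             bnb_4bit_quant_type="nf4",
                             bnb_4bit_compute_dtype=torch.bfloat16)

    model = AutoModelForCausalLM.from_pretrained("some-30b-model",  # placeholder
                                                 quantization_config=bnb,
                                                 device_map="auto")

    # QLoRA-style: attach adapters to all the linear projections, not just q/v
    lora = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                      target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                                      "gate_proj", "up_proj", "down_proj"],
                      task_type="CAUSAL_LM")
    model = get_peft_model(model, lora)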


During training, it's more efficient than full finetuning because you only update a fraction of the parameters via backprop. During inference, it can ...

1) ... be theoretically a tad slower if you add the LoRA values dynamically during the forward pass (however, this is also an advantage if you want to keep a separate small weight set per customer, for example; you run only one large base model and can apply the different LoRA weights per customer on the fly)

2) ... have the exact same performance as the base model if you merge the LoRA weights back with the base model.
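With PEFT, for example, the two modes are one call apart; a sketch, assuming model is already a PEFT-wrapped causal LM and inputs is a tokenized batch:

    # 1) Keep the adapter separate: applied on the fly each forward pass,
    #    so you can hold one base model and swap small per-customer weights.
    out = model.generate(**inputs)

    # 2) Fold the LoRA weights into the base matrices for inference;
    #    the merged model then runs at exactly the base model's speed.
    merged = model.merge_and_unload()
    out = merged.generate(**inputs)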


I'm writing a blog post with some more reasoning but my view is that it can be useful for certain simpler tasks (eg unstructured -> structured, basic summarization) and not more complex things (eg generation).

The tricky thing is that finetuning makes a big difference, and while it should be possible to hotswap LoRA adapters (at some cost to performance), I haven't figured that out yet.


>when it is well-known in the field that parameter-efficient fine-tuning always pays a cost in terms of performance relative to full fine-tuning

The LoRA paper clearly states the performance of the method: "LoRA performs on-par or better than fine-tuning in model quality on RoBERTa, DeBERTa, GPT-2, and GPT-3, despite having fewer trainable parameters, a higher training throughput, and, unlike adapters, no additional inference latency." https://arxiv.org/abs/2106.09685


I’m struggling to understand from this paper whether the approach is better in the general sense (all cases, with wider models seeing greater benefits) or purely for wider models (with narrower models seeing detriment)?

If it’s the former this could effectively halve finetuning cost overnight which would go a significant way towards enabling a wider array of use cases for LoRA.

