IIRC, you've mentioned before that you've used Private LLM. :) Please try the 4-bit OmniQuant quantized Mixtral 8x7B Instruct model in it. It runs circles around RTN Q3 models in speed and RTN Q8 models in text generation quality.
I'm curious how they keep LLM-generated text from turning into future training input and creating a feedback loop that probably isn't good for quality. Or is that not a problem?
Thanks for sharing! It's definitely a bit painstaking to get a real-world LLM running at all on an iPhone due to memory constraints. It's also quite compute-intensive since it has 7B parameters, but we're glad that it's generating text at a reasonable speed!
The model we are using is a quantized Vicuna-7b, which I believe is one of the best open-source models. Hallucination is a problem for all LLMs, but I believe research on the model side will gradually alleviate this problem :-)
Just gave it a try and it seems really, really good! I found that for the subjects I was writing about it was best used in notebook mode, generating about 2 tokens at a time so I could supervise and tune its output manually, but I imagine it'd be better at things it was actually trained on. And it was really easy to get it to generate long, detailed descriptions (even though it still obviously shows the fundamental lack of understanding intrinsic to all LLMs).
From what I understand, it can learn and execute the algorithm fairly reliably, though it won't be 100%. When the LLM generates text, the output is randomised a little, and there are tricks that prevent repetition, which would likely cause problems with numbers made up of the same digit repeated.
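To make the sampling randomisation and the anti-repetition trick concrete, here's a minimal Python sketch (my own illustration, not the code of any particular model), using a CTRL-style repetition penalty that pushes down already-emitted tokens and a temperature that keeps sampling non-deterministic. Note how the penalty is exactly what would hurt a number like 777777:

  import numpy as np

  def sample_next(logits, generated_ids, temperature=0.8, rep_penalty=1.3):
      # Repetition penalty: discourage tokens that have already been emitted.
      logits = logits.astype(float).copy()
      for tok in set(generated_ids):
          logits[tok] = logits[tok] / rep_penalty if logits[tok] > 0 else logits[tok] * rep_penalty
      # Temperature sampling: temperature > 0 keeps the choice randomised.
      z = (logits - logits.max()) / temperature
      probs = np.exp(z) / np.exp(z).sum()
      return np.random.choice(len(logits), p=probs)

With no penalty the model could happily repeat '7' as often as needed; with the penalty each additional '7' becomes progressively less likely.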
This is really interesting, thank you for the reference!
Having worked more with image-based NNs than language models before, I wonder: are LLMs inherently more suited to aggressive quantisation, due to their very large size? I see people here suggesting 4-bit is pretty good, and 3-bit should be the target.
I remember ResNets etc. can of course also be quantized, and down to 8-6 bits you get pretty good results with very little effort and low-ish degradation in performance. Going down to 4 bits is more challenging, though this paper claims that with quantisation-aware training 4-bit is indeed possible, but that means a lot of dedicated training compute (not just post-training finetuning): https://arxiv.org/abs/2105.03536
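For intuition on why the easy wins stop around 8 bits, here's a minimal sketch of plain symmetric round-to-nearest (RTN) post-training quantization; it's a deliberate simplification of what real schemes like OmniQuant or quantisation-aware training do, but it shows how the reconstruction error roughly doubles for every bit you drop:

  import numpy as np

  def quantize_rtn(w, bits):
      qmax = 2 ** (bits - 1) - 1            # e.g. 127 for 8-bit, 7 for 4-bit
      scale = np.abs(w).max() / qmax        # one scale per tensor (per-channel does better)
      q = np.clip(np.round(w / scale), -qmax - 1, qmax)
      return q * scale                      # dequantized weights

  w = np.random.randn(4096)                 # stand-in for one row of a weight matrix
  for bits in (8, 4, 3):
      err = np.abs(quantize_rtn(w, bits) - w).mean()
      print(f"{bits}-bit: mean abs error {err:.4f}")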
2 is just fancy talk for what high-performing LLMs do
3 has been done many times over
4 is achievable with ImageBind if we're going for an exotic solution. Otherwise GPT-4V with audio-to-text and TTS will do just fine (OpenAI has something similar set up)
5 is as simple as timed prompts sent by the company unbeknownst to the user.
6 is probably the most bespoke thing here. I'm guessing this is the "information as a quiz" thing they try to demonstrate.
7 is the same as 2
8 is a high-performing LLM with a specific prompt.
I'm not trying to discount your experience, but the technology making any of this possible is only a few years old, and the new state-of-the-art version, which is far ahead of everything else, is ~8 months old. So unless you've just worked on something like this, I'm not sure it's much of an indication of what is achievable.
Most STT systems also tend to still train on normalized text, which is free of the punctuation and capitalization complexities and other content you find in text LLMs. I suspect we continue this way partly due to the lack of large-scale training resources and partly due to quality issues - Whisper being an outlier here. Anecdotally, 8-bit quantization of larger pre-normalized STT models doesn't seem to suffer the same degradation you see with LLMs, but I can't speak to whether that's due to this issue.
When you say it's trivial to encode text in neural networks, what does that mean for LLMs? What makes it decide to encode certain texts and not others? Isn't it just one big network of neurons?
The prompt I've seen for it to verbatim reproduce the fast inverse square root from Quake was:
// fast inverse square root
float Q_
When I ask ChatGPT to give me code for a fast inverse square root, it doesn't reproduce it at all but gives me an implementation that looks completely different.
So, my original thought was that the prompt above, with the characteristic Quake III Q_ naming, is enough to push it into a corner where the path is reduced to just one possibility (with that path being the words of the code itself), rather than it merely copy-pasting the code from an encoded version of it. I.e. it still predicts it word by word, but with only one possible choice at each step. This is just my naive take on it, though; I really want to understand.
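One way to poke at the "only one possible path" intuition: with greedy decoding (do_sample=False, i.e. effectively temperature zero), the model always takes its single highest-probability token, so the continuation is fully determined by the prefix and the weights. A minimal sketch using Hugging Face transformers, with GPT-2 purely as a stand-in model (whether any given model has actually memorized the Quake code is a separate question):

  from transformers import AutoModelForCausalLM, AutoTokenizer

  tok = AutoTokenizer.from_pretrained("gpt2")            # stand-in model
  model = AutoModelForCausalLM.from_pretrained("gpt2")

  prompt = "// fast inverse square root\nfloat Q_"
  ids = tok(prompt, return_tensors="pt").input_ids

  # do_sample=False -> greedy decoding: at each step the argmax token is taken,
  # so the same prompt always yields the same continuation.
  out = model.generate(ids, max_new_tokens=60, do_sample=False)
  print(tok.decode(out[0]))

If training pushed that prefix hard enough toward the original function, greedy decoding reproduces it token by token without any literal copy-paste step.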
The goal of an LLM, before RLHF, is to accurately predict what token comes next. It cannot do better than that. The perfect outcome is text identical to the training set.
Let's say your LLM can generate text with the same quality as the input 98% of the time, and the other 2% of the time it's wrong. Each round of recursive training compounds that error: roughly 96% accuracy after the next round, and about 67% after 20 rounds.
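The arithmetic behind those figures, treating the quality loss as purely multiplicative (a back-of-the-envelope simplification):

  # assuming each round keeps 98% of the previous round's quality
  acc = 0.98
  print(acc ** 2)     # ~0.96 after the next round
  print(acc ** 20)    # ~0.67 after 20 rounds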
There's no way for it to get better without more real human input.
Accuracy aside, this is an interesting way to demonstrate "LLM as compression", since you can surely get an LLM to emit far more text than the size of the actual model.
Oh, I've been using language models since before a lot (or at least some significant chunk) of HN knew the word LLM, I think.
I remember when going from 6B to 13B was crazy good. We've just normalized our standards to the latest models of the era.
They do have their shortcomings but can be quite useful as well, especially the LLaMA-class ones. They're definitely not GPT-4 or Claude+, for sure, for sure.
Interesting! Do you think RLHF would be a necessity for smaller models to perform on par with state-of-the-art LLMs? In my view, instruction tuning will resolve any issues related to output structure, tonality, or domain understanding, but will it be enough to improve the reasoning capabilities of a smaller model?
LLMs are good at memorization, so yeah, if any of it included personal data I think you'd be able to get the model to print it. (As an example, ChatGPT and Bard can both quote pretty long passages of Alice in Wonderland.)
There aren't any techniques I know of to prevent it either; when training an image model the recommendation is to dedupe the input so nothing is weighted over anything else, but that's not an absolute defense.
I don’t think LLMs as a technology can get there, but I would happily be proven wrong. The issue is simply the quality of the output. I’ve tried several times, and the output looks reasonable while being riddled with subtle errors. Unfortunately, being forced to look for subtle errors is more time consuming than simply doing it manually.
One possibility might be improving the quality of the training data rather than aiming for such a huge bulk of amateur works. Unfortunately, there might not be enough quality writing to get an LLM to work.
If you don’t believe me, you can try training an LLM with just a single parameter, specified to an incredible precision using e.g. a trillion bytes. Hint: it won't perform very well.