
You can ask GPT-4 or another high-quality model to rate two chat logs for coherence etc. It's not as accurate as human evaluation, but you don't have to read thousands of lines of text when comparing many models.
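
A minimal sketch of that idea, assuming the openai Python client (the model name and rubric here are illustrative):

    from openai import OpenAI

    client = OpenAI()

    def judge(log_a: str, log_b: str) -> str:
        """Ask a strong model to compare two chat logs."""
        prompt = (
            "Rate the following two chat logs for coherence and helpfulness "
            "on a 1-10 scale each, then state which one is better overall.\n\n"
            f"Log A:\n{log_a}\n\nLog B:\n{log_b}"
        )
        resp = client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": prompt}],
        )
        return resp.choices[0].message.content

One caveat worth knowing: these judge setups tend to be order-sensitive, so scoring both (A, B) and (B, A) and averaging is a common precaution.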



There are certainly some effective language model benchmarks; however, they are not well-suited for evaluating a chat assistant. Some projects employ human evaluation, while this blog post explores an alternative approach based on GPT-4. Both methods have their advantages and disadvantages, making this blog post an intriguing case study that can inspire the future development of more comprehensive evaluations.

The compression can be lossy, though, and the model often “predicts” complete garbage, even with GPT-4. It would be great to get a probability of correctness, “Mr. Spock” style, in chat - especially in a case like a telephone number…
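
The chat interface doesn't expose this, but the underlying per-token probabilities do exist. A rough sketch of surfacing them, using GPT-2 via Hugging Face transformers as a stand-in (ChatGPT's weights aren't available); low-probability spans would flag likely garbage such as an invented phone number:

    import torch
    from transformers import GPT2LMHeadModel, GPT2Tokenizer

    tok = GPT2Tokenizer.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2")
    model.eval()

    text = "Mr. Spock's telephone number is 555-0100."
    ids = tok(text, return_tensors="pt").input_ids

    with torch.no_grad():
        logits = model(ids).logits

    # Probability the model assigned to each token that actually follows.
    probs = torch.softmax(logits[0, :-1], dim=-1)
    token_probs = probs[torch.arange(ids.shape[1] - 1), ids[0, 1:]]

    for token, p in zip(tok.convert_ids_to_tokens(ids[0, 1:].tolist()), token_probs):
        print(f"{token!r}: {p.item():.3f}")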

Yes, ChatGPT excels at comprehending and explaining things that have a consistent structure, restructuring them, and synthesising variations. If you keep it in its lane, it’s an excellent tool.

It’s really, really bad at counting, though. For example, try asking it to produce a line of 40 asterisks.
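
Easy to reproduce, e.g. with the openai Python client (model name illustrative):

    from openai import OpenAI

    client = OpenAI()
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{
            "role": "user",
            "content": "Output a line of exactly 40 asterisks and nothing else.",
        }],
    )
    line = resp.choices[0].message.content.strip()
    print(len(line), line)  # the count is frequently off by a few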


At first glance, ChatGPT seems extremely amazing. And it is. But this is one of the issues machine learning models still have: they can’t distinguish well between truth and fiction. They also have a hard time deciding when they are allowed to come up with new things (like “write a story”) and when they absolutely can’t.

Another problem is that they are mostly trained on texts from the internet, and a lot of those texts contain wrong information. They are not “smart” enough to fact-check that.


The limitation of GPT-4 and the chat models in general is that they are adamant about declaring themselves "as an AI language model".

Coercing them into specific output is becoming harder and harder, and postprocessing to cut the fat is tedious to maintain and unreliable; plus, who wants that in their pipeline?

So yeah, they are fine for having a chat about a text passage, but as authoring tools they are heavy, slow, and require a lot of manually moving strings back and forth.

(Never mind that you pay for the tokens to generate that "as an AI language model I can" warning.)


Not exactly. ChatGPT was absolutely trained to produce statistically likely output; it just had an extra training step added for human ratings. If they relied entirely on human ratings, there would not have been sufficient data to train the model.

I don't know about that. I believe if you had it calibrated correctly, no one could tell the difference between a single GPT-3 comment and a human comment.

Maybe after a bit of dialogue you'd have a higher chance, but even then I suspect (from playing around in AI Dungeon) that GPT-3 could do very well.


The author isn't entirely wrong here. However, his understanding of how human intervention is used in the system is wrong. ChatGPT uses GPT-3 as a baseline model and finetunes it using a machine learning technique called Reinforcement Learning from Human Feedback (RLHF). This method uses humans in the loop, but not in the way the author thinks. The baseline model is first finetuned on prompts and their answers (labels) provided by human labelers, which is a tedious task and doesn't scale well. Later, this model is used to generate multiple answers for selected input prompts, which human labelers rank based on what they think is the most appropriate response. It's much easier for labelers to rank responses than to write appropriate responses themselves.

Even though the whole process is more complicated than what I've explained above, the model is essentially trained this way.

One of the reasons for releasing it for free for now is so they can gather data for further fine-tuning of the model (this is why after each answer there are thumbs-up and thumbs-down buttons for you to rate ChatGPT's responses).
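
For anyone curious, a minimal sketch of how those rankings typically become a training signal: a reward model scores each response, and it's trained with a pairwise (Bradley-Terry style) loss so the labeler-preferred response gets the higher score. PyTorch here; the scalar rewards below are toy stand-ins for the output of a finetuned LM with a reward head:

    import torch
    import torch.nn.functional as F

    def pairwise_ranking_loss(reward_chosen: torch.Tensor,
                              reward_rejected: torch.Tensor) -> torch.Tensor:
        # -log(sigmoid(r_chosen - r_rejected)), averaged over the batch:
        # pushes the preferred response's reward above the rejected one's.
        return -F.logsigmoid(reward_chosen - reward_rejected).mean()

    # Toy usage: scalar rewards for a batch of four ranked pairs.
    r_chosen = torch.tensor([1.2, 0.3, 2.0, -0.5])
    r_rejected = torch.tensor([0.4, 0.5, 1.0, -1.0])
    print(pairwise_ranking_loss(r_chosen, r_rejected))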


The expectation seems a little harsh for the setup. GPT just generates acceptable text. At a minimum, you still need to model and verify object relationships, use facts from a knowledge base, and discriminate among the generated responses.

Any GPT model is just one component inside of a chatbot, not a chatbot itself.
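
Something like this shape, where every function below is a hypothetical stand-in rather than a real API:

    # Retrieve facts, let the LLM draft a reply, then verify before sending.
    KNOWLEDGE_BASE = {"capital of france": "Paris"}

    def lookup_facts(message: str) -> list[str]:
        return [v for k, v in KNOWLEDGE_BASE.items() if k in message.lower()]

    def gpt_generate(message: str, facts: list[str]) -> str:
        # Stand-in for an actual model call.
        return f"Based on {facts or 'general knowledge'}: ..."

    def verify(draft: str, facts: list[str]) -> bool:
        # Naive discriminator: every retrieved fact must appear in the draft.
        return all(f in draft for f in facts)

    def answer(message: str) -> str:
        facts = lookup_facts(message)
        draft = gpt_generate(message, facts)
        return draft if verify(draft, facts) else "I'm not sure about that."

    print(answer("What is the capital of France?"))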


I don't think you understand the scope of training data required for these models. We're talking thousands of lifetimes' worth of reading for ChatGPT (GPT-3, for example, was trained on 45 TB of textual data).
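
The claim checks out on a back-of-envelope basis; all constants below are rough assumptions:

    bytes_total = 45e12        # 45 TB of raw text
    chars_per_word = 5         # rough English average, including the space
    words = bytes_total / chars_per_word
    words_per_minute = 250     # typical adult reading speed
    minutes = words / words_per_minute
    years = minutes / (60 * 24 * 365)
    lifetimes = years / 80
    print(f"{years:,.0f} years of nonstop reading = {lifetimes:,.0f} 80-year lifetimes")
    # ~68,000 years, i.e. ~860 lifetimes of literally nonstop reading;
    # at a few hours of reading a day it's thousands of lifetimes.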

Exactly. A few hundred thousand interactions with ChatGPT is definitely too little for the model to learn a lot of new things. What that data does well is make models much better at following instructions. It also makes them much better at working with the given context.

It seems unlikely to me that ChatGPT is directly trained on chat data. If it were, we should see it know information past its knowledge cutoff; afaik that hasn't happened.

I assume the chat logs are instead used to train a reward model, which is then used as the reward function during RLHF training.


But they don't need a fancy AI classifier; they already have the text that ChatGPT outputs.

The model may be the same, but the chatbot is not. They have made the responses shorter to save on inference costs, which are huge for a model the size of GPT-4.

I've been enjoying playing with ChatGPT, but am starting to wonder if my brain is feverishly working to convince me of the value and meaning of ChatGPT's responses. Since we are programmed to communicate with other humans, and because human speech is so chaotic and inefficient, I wonder if we are overestimating what our AI buddies are providing.

I'm coming from a metrology and test-and-measurement background, so I'm always trying to boil things down to a clean, reproducible metric. While AI has fantastic use cases, I think it also forces us to think about how much we're overvaluing our achievements in this space.


ChatGPT does have counting logic. The math model is encoded inside the language model.

So we should test the performance, because with GPT, funnily enough, it's possible that this type of prompt could bias it in a better direction.

For example, if you need it to answer in a very specific way, it might be helpful for the chatbot to see what the other potential ways are, so it knows what to contrast against.

Yes, it influences the output of the model, but without extensive experimentation it's unclear to me in which direction, because it could be a positive influence as well.

As for efficiency: it doesn't add that much, in terms of magnitude, if you already have long conversations going on.

Finally: using external libraries and building out the things that you are mentioning will take time and trial and error. Just modifying the initial prompt is easy, and you can start using it immediately.

So the main positive is just that it's likely faster and easier to implement compared to building out those systems.
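
For example (contents purely illustrative), the contrastive version of a format-pinning prompt might look like:

    SYSTEM_PROMPT = (
        'Answer ONLY with a JSON object like {"verdict": "yes"}.\n'
        'Do NOT answer in prose, add explanations, or use markdown.\n'
        'Examples of WRONG answers: "Yes, because...", "The answer is yes."'
    )

    messages = [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": "Is the sky blue?"},
    ]

Whether showing the wrong formats helps or hurts is exactly the kind of thing that needs A/B testing, per the above.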


I didn't say "for inference", and neither did the person I replied to.

GPT uses the internet to connect to users, but more importantly, ChatGPT in particular has a layer on top of GPT that is trained from human feedback.

Keyword to search: "RLHF".

That feedback mechanism is, if anything, becoming more detailed as time passes, so I must infer that it's still considered highly important, probably even for the 3.5 model.


I would argue that ChatGPT has opinions, and these opinions are based on its training data. I don't think GPT has the type of reasoning skills needed to detect and resolve conflicts in its inputs, but it does hold opinions. It's a bit hard to tell because it can easily be swayed by a changing prompt, but it has opinions; it just doesn't hold strong ones.

The only thing stopping GPT from ingesting new information and forming opinions about it is that it is not being trained on new information (such as its own interactions).

