It will certainly decrease. Also, there are multiple ways to deal with hallucinations. You can sample GPT-4 not once, but 10, 100, 1000 times. The chance of it hallucinating the same thing every time asymptotically approaches zero. It all depends on how much money you are willing to invest in getting the right opinion, which, in the field of medicine, can be quite a lot.
> You can sample GPT-4 not once, but 10, 100, 1000 times.
Is there a study on improved outcomes from simply sampling GPT-4 repeatedly? I would be very interested in that study. I don't think GPT hallucinations are like human hallucinations, where if you ask someone again after a temporary hallucination they might get it right the next 9 times - but I could be wrong. That would be an interesting result.
GPT-4 hallucinates a lot less than 3.5. Same with the Claude models. This is from personal experience. There are also benchmarks (like TruthfulQA) that try to measure hallucinations and show the same thing.
The technical report[1] makes that claim at least:
>GPT-4 significantly reduces hallucinations relative to previous GPT-3.5 models (which have themselves been improving with continued iteration). GPT-4 scores 19 percentage points higher than our latest GPT-3.5 on our internal, adversarially-designed factuality evaluations
This is the first time I've personally heard someone claiming anything like that, though I don't tend to do anything with LLMs (due to this bullshit factor).
Do you think hallucinations will be solved with GPT-5? If so, that would be an amazing breakthrough. If not, it still won't be suitable for medical advice.
I'm curious if you're using GPT-4 ($)? I find a lot of the criticisms about hallucination come from users who aren't, and my experience with GPT-4 is that it's far less likely to make stuff up. Does it know all the answers? Certainly not, but it's self-aware enough to say "sorry, I don't know" instead of making a wild guess.
Haha, asking ChatGPT surely won't work. Everything can "feel" like a halting problem if you want perfect results with zero error while uncertain, ambiguous new data keeps being added.
My take: hallucinations can never be driven to exactly zero, but they can be reduced to the point where, 99.99% of the time, these systems hallucinate less than humans, and more often than not their divergences will turn out to be creative thought experiments (what I'd call healthy imagination). If it hallucinates less than a top human does, I say we win :)
My experience has been that hallucinations in GPT-4 are actually pretty rare. But in any case, if I choose to use code it suggests, I ask it for explanations and then verify those myself by other means, e.g. tests. I think it's too strong to call it a really bad TA. I'd say it's an imperfect TA and you need to check its work, but its work still has great value.
The issue of hallucinations is overblown. I use GPT-4 all the time and don't see any hallucinations at all. It's a big problem with Google Bard and GPT-3 and earlier models. But GPT-4 fixed the issue of hallucinations completely.
I honestly haven't found hallucination to be a problem on GPT-4 when asking it to analyze or parse a dataset but can acknowledge it being possible (I just haven't encountered it).
I think that if we consider the accuracy rate as measured in various ways being roughly that of a human, then you're trading human mistakes for AI mistakes in exchange for dramatically lower costs and a dramatically higher speed of processing. You might even say a higher level of reasoning. In my own interactions it's been fantastic at reasoning clearly and quickly outside of complex trick questions. Most scenarios in life aren't generally full of trick questions.
Hallucinations are a feature, not a bug. GPT pre-training teaches the model to always produce an answer, even when it has little or no relevant training experience, and on average it does well at that. Part of the point of RLHF in ChatGPT was to teach the model not to hallucinate when it doesn't have good supporting experience encoded in its weights. This helps but is not perfect. However, it seems like there might be a path to far fewer hallucinations with more RL training data.
As others pointed out, humans hallucinate all the time; we just have better training for what level of hallucination is appropriate given supporting evidence and context.