
Tried a few of my normal benchmarks; besides the tiny context window, it made mistakes I've never seen GPT-4 make.

Update: I see others had the same experience.




It was interesting to see GPT4 fail at it: https://chat.openai.com/share/02c12bbe-43cd-40da-b5df-33681d...

Not a bad benchmark problem. It didn't get very far, but maybe the next release will.


I'm really surprised your benchmark shows gpt-3.5-turbo-0301 outperforming gpt-4 (non-turbo) on first-try coding problems.

It says in the graphs in the announcement that it performs worse than GPT-4 on reasoning benchmarks.

No, it benchmarked around the original release of GPT-4, and only when given 32 attempts versus GPT-4's 5.

I've noticed that GPT4 performs worse in a lot of cases according to HN (I haven't tested it yet).

Wait, really? I've only been using GPT4 and it seemed like it's been getting incrementally better. Do you have any test cases?

Yeah, GPT is incredibly lazy; ironically, 3 is far better at not being lazy than 4.

I guess you benchmarked via API? I've heard even the datestamped models have been nerfed from time to time...
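
For what it's worth, pinning a datestamped snapshot over the API is the usual way to control for that. A minimal sketch, assuming the openai Python package (v1+), an OPENAI_API_KEY in the environment, and "gpt-4-0613" purely as an example snapshot name, not a recommendation:

    # Sketch only: assumes openai>=1.0 is installed and OPENAI_API_KEY is set.
    from openai import OpenAI

    client = OpenAI()

    # Pin a dated snapshot instead of the floating "gpt-4" alias, and use
    # temperature 0 so repeated benchmark runs are comparable.
    resp = client.chat.completions.create(
        model="gpt-4-0613",
        temperature=0,
        messages=[{"role": "user", "content": "Reverse a linked list in Python."}],
    )
    print(resp.choices[0].message.content)

Rerunning against the same pinned snapshot at temperature 0 at least rules out silent alias changes, though it can't rule out server-side changes to the snapshot itself.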


I still don't trust benchmarks, but they've come a long way.

It's genuinely outperforming GPT4 in my manual tests.


All those benchmarks are sooo bad that I'm not really convinced one way or another. Having tried out a few of them, I think I still roughly prefer gpt-4 (though I haven't worked with any of them long enough to make a clear judgement).

Why don't they evaluate GPT-4, like they evaluated GPT-3.5? And why not try longer context windows?

I've not used GPT-4, so it could be different, but regular old GPT-3.5 gets a _lot_ of things wrong.

I guess you're entitled to an opinion. I've had the same (error-prone, often unhelpful) experience since gpt-4 was released. It's a little faster now which is nice.

My coding benchmarks agree with the headline. GPT-4 stayed about the same between the March and June model releases.

https://aider.chat/docs/benchmarks.html


That's just GPT3.5 having pretty spotty attention. It becomes much less of a problem if you use gpt-4-32k.

If "beats GPT 4" is in the title, it's almost a guarantee that it's a bald-faced lie that includes benchmark overfitting.

The first time a model that actually matched GPT 4 launched (i.e. Command-R+) there was no mention of it at all. If your results speak for themselves, there's no need to shout.


I haven't tried it yet, but people in the /r/chatgpt subreddit are claiming GPT-4-Turbo seems to have issues with understanding/remembering longer pieces of code (say, 100 lines), whereas 3.5 and 4.0 seemed to handle things a bit better, implying that the context-window size isn't (currently) as large as claimed.

Anyone else seeing any evidence of this?
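
One cheap sanity check before blaming the window is to count the tokens actually being sent. A rough sketch with tiktoken, where "my_module.py" is a hypothetical ~100-line file and tiktoken's "gpt-4" encoding is used as a close-enough proxy for 4-Turbo (128k is the advertised GPT-4-Turbo window):

    # Sketch only: counts tokens in a prompt to compare against the claimed window.
    import tiktoken

    enc = tiktoken.encoding_for_model("gpt-4")
    with open("my_module.py") as f:   # hypothetical ~100-line file
        code = f.read()

    n_tokens = len(enc.encode(code))
    print(f"{n_tokens} tokens vs. the advertised 128,000-token window")

A 100-line file is usually only a few thousand tokens, far under the advertised limit, so if the model still loses track of it that points to recall/attention problems rather than the window literally being smaller than claimed.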


Still not as good at math as gpt-4o, judging from benchmarks and my own experience.

Ultra benchmarked around the original release of GPT-4, not the current model. My understanding is that this was fairly accurate: it's close to the current GPT-4 but not quite equal. However, close-to-GPT-4 but 4x cheaper and with 10x the context length would be very impressive and, IMO, useful.

No race has begun. GPT 4 is so far ahead in everything, even in their official metrics [1], and those report the numbers for the first version of GPT 4 from the paper. People have run the benchmarks again and found much better results, like 85% on HumanEval. It's like no one even thinks about comparing to GPT 4; it's just reported as the gold standard.

[1]: https://x.ai/
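
For context on numbers like "85% on HumanEval": those are usually pass@k scores computed with the unbiased estimator from the HumanEval/Codex paper. A small sketch, where the (n, c) counts below are made-up example values, not real benchmark data:

    # Unbiased pass@k estimator: pass@k = 1 - C(n - c, k) / C(n, k),
    # averaged over problems, where n = samples generated per problem
    # and c = samples that pass the unit tests.
    from math import comb

    def pass_at_k(n: int, c: int, k: int) -> float:
        if n - c < k:
            return 1.0
        return 1.0 - comb(n - c, k) / comb(n, k)

    # Made-up per-problem (n, c) counts, just to show the calculation.
    results = [(20, 17), (20, 20), (20, 0)]
    score = sum(pass_at_k(n, c, 1) for n, c in results) / len(results)
    print(f"pass@1 = {score:.1%}")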

