
Tried a few of my normal benchmarks; besides the tiny context window, it made mistakes I've never seen GPT-4 make.

Update: I see others had the same experience.




It was interesting to see GPT4 fail at it: https://chat.openai.com/share/02c12bbe-43cd-40da-b5df-33681d...

Not a bad benchmark problem. It didn't get very far, but maybe the next release will.


I'm really surprised your benchmark shows gpt-3.5-turbo-0301 outperforming gpt-4 (non-turbo) on first-try coding problems.

It says in the graphs in the announcement that it performs worse than GPT-4 on reasoning benchmarks.

No, it benchmarked around the original release of GPT-4, and only when given 32 attempts versus GPT-4's 5.

I've noticed that GPT4 performs worse in a lot of cases according to HN (I haven't tested it yet).

Wait, really? I've only been using GPT4 and it seemed like it's been getting incrementally better. Do you have any test cases?

Yeah, GPT is incredibly lazy; ironically, 3 is far better at not being lazy than 4.

I guess you benchmarked via API? I've heard even the datestamped models have been nerfed from time to time...
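
For what it's worth, pinning a datestamped snapshot over the API is the usual way to control for that. A minimal sketch, assuming the openai Python package (v1+), an OPENAI_API_KEY in the environment, and "gpt-4-0613" purely as an example snapshot name, not a recommendation:

    # Sketch only: assumes openai>=1.0 is installed and OPENAI_API_KEY is set.
    from openai import OpenAI

    client = OpenAI()

    # Pin a dated snapshot instead of the floating "gpt-4" alias, and use
    # temperature 0 so repeated benchmark runs are comparable.
    resp = client.chat.completions.create(
        model="gpt-4-0613",
        temperature=0,
        messages=[{"role": "user", "content": "Reverse a linked list in Python."}],
    )
    print(resp.choices[0].message.content)

Rerunning against the same pinned snapshot at temperature 0 at least rules out silent alias changes, though it can't rule out server-side changes to the snapshot itself.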


I still don't trust benchmarks, but they've come a long way.

It's genuinely outperforming GPT4 in my manual tests.


All those benchmarks are sooo bad that I'm not really convinced one way or another. Having tried out a few of them, I think I still roughly prefer gpt-4 (though I haven't worked with any of them long enough to make a clear judgement).

Why don't they evaluate GPT-4, like they evaluated GPT-3.5? And why not try longer context windows?

I've not used GPT-4, so it could be different, but regular old GPT-3.5 gets a _lot_ of things wrong.

I guess you're entitled to an opinion. I've had the same (error-prone, often unhelpful) experience since gpt-4 was released. It's a little faster now which is nice.

My coding benchmarks agree with the headline. GPT-4 stayed about the same between the March and June model releases.

https://aider.chat/docs/benchmarks.html


That's just GPT3.5 having pretty spotty attention. It becomes much less of a problem if you use gpt-4-32k.

If "beats GPT 4" is in the title, it's almost a guarantee that it's a bald-faced lie that includes benchmark overfitting.

The first time a model that actually matched GPT 4 launched (i.e. Command-R+) there was no mention of it at all. If your results speak for themselves, there's no need to shout.


I haven't tried it yet, but people in the /r/chatgpt subreddit are claiming GPT-4-Turbo seems to have issues with understanding/remembering longer pieces of code (say, 100 lines), whereas 3.5 and 4.0 seemed to handle things a bit better, implying that the context-window size isn't (currently) as large as claimed.

Anyone else seeing any evidence of this?
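
One cheap sanity check before blaming the window is to count the tokens actually being sent. A rough sketch with tiktoken, where "my_module.py" is a hypothetical ~100-line file and tiktoken's "gpt-4" encoding is used as a close-enough proxy for 4-Turbo (128k is the advertised GPT-4-Turbo window):

    # Sketch only: counts tokens in a prompt to compare against the claimed window.
    import tiktoken

    enc = tiktoken.encoding_for_model("gpt-4")
    with open("my_module.py") as f:   # hypothetical ~100-line file
        code = f.read()

    n_tokens = len(enc.encode(code))
    print(f"{n_tokens} tokens vs. the advertised 128,000-token window")

A 100-line file is usually only a few thousand tokens, far under the advertised limit, so if the model still loses track of it that points to recall/attention problems rather than the window literally being smaller than claimed.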


Still not as good at math as gpt-4o, judging from benchmarks and my own experience.

Ultra benchmarked around the original release of GPT-4, not the current model. My understanding is that this was fairly accurate: it's close to the current GPT-4 but not quite equal. However, close-to-GPT-4 but 4x cheaper and with 10x the context length would be very impressive and, IMO, useful.

No race has begun. GPT 4 is so far ahead in everything, even in their official metrics [1], and those report the numbers for the first version of GPT 4 from the paper. People have run the benchmarks again and found much better results, like 85% on HumanEval. It's like no one even thinks about comparing to GPT 4; it's just reported as the gold standard.

[1]: https://x.ai/
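
For context on numbers like "85% on HumanEval": those are usually pass@k scores computed with the unbiased estimator from the HumanEval/Codex paper. A small sketch, where the (n, c) counts below are made-up example values, not real benchmark data:

    # Unbiased pass@k estimator: pass@k = 1 - C(n - c, k) / C(n, k),
    # averaged over problems, where n = samples generated per problem
    # and c = samples that pass the unit tests.
    from math import comb

    def pass_at_k(n: int, c: int, k: int) -> float:
        if n - c < k:
            return 1.0
        return 1.0 - comb(n - c, k) / comb(n, k)

    # Made-up per-problem (n, c) counts, just to show the calculation.
    results = [(20, 17), (20, 20), (20, 0)]
    score = sum(pass_at_k(n, c, 1) for n, c in results) / len(results)
    print(f"pass@1 = {score:.1%}")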

