All those benchmarks are sooo bad that I'm not really convinced one way or another. Having tried out a few of them, I think I still roughly prefer GPT-4 (though I haven't worked with any of them long enough to make a clear judgement).
I guess you're entitled to an opinion. I've had the same (error-prone, often unhelpful) experience since GPT-4 was released. It's a little faster now, which is nice.
If "beats GPT 4" is in the title it's almost a guarantee that it's a bold faced lie that includes benchmark overfitting.
The first time a model that actually matched GPT-4 launched (i.e., Command-R+), there was no such claim in the title at all. If your results speak for themselves, there's no need to shout.
I haven't tried it yet, but people in the /r/chatgpt subreddit are claiming GPT-4-Turbo seems to have issues understanding/remembering longer stretches of code (say, 100 lines), whereas 3.5 and 4.0 seemed to handle them a bit better, implying that the context window isn't (currently) as large as claimed.
Ultra was benchmarked against the original release of GPT-4, not the current model. My understanding is that those numbers were fairly accurate: it's close to the current GPT-4 but not quite equal. Still, close-to-GPT-4 performance at 4x lower cost and 10x the context length would be very impressive and, IMO, useful.
No race has begun. GPT-4 is so far ahead in everything, even in their own official metrics[1], and those cite the official numbers for the first version of GPT-4 from the paper. People have run the benchmarks again since and found much better results, like 85% on HumanEval. It's as if no one even thinks about comparing against GPT-4 anymore; it's just reported as the gold standard.
Update: I see others had the same experience.