Asking these to GPT-3.5 has been an utterly frustrating experience, lol. I guess Gemini is at this level of intelligence right now, not GPT-4... rigged demos notwithstanding ;)
Curious, have you seen examples of someone convincing it of something clearly wrong? I think I've seen examples of that with GPT-3, but not GPT-4 that I can recall.
Counterpoint 1: Fewer people are using GPT-4 than are using all the other models, so it is subject to far fewer tests than the others.
Counterpoint 2: It is not a given that GPT-4 should fail in the same way as the older models. It likely has its own unique failure modes yet to be discovered. (See above.)
Yes and no. In the paper, they do compare apples to apples with GPT-4 (they directly test GPT-4's CoT@32 but cite its 5-shot score as "reported"). GPT-4 wins 5-shot and Gemini wins CoT@32. It also came off to me like they were implying something is off about GPT-4's MMLU numbers.
https://i.imgur.com/3sNr3LW.png https://i.imgur.com/EIj0nZg.png