The direct prompt comparison isn't quite fair due to the instruction tuning on GPT-3.5 and 4. It'd be interesting to see examples with prompts that would work better for the raw language models.
Yeah, it's hard to compare across models; interested in suggestions here.
We give all models a bunch of few-shot examples, which improves GPT-3 (davinci)'s question answering substantially. GPT-2 sometimes generates something that answers the question; sometimes it's just confused. Click "See full prompt" to see the few-shot examples that the models get.
Our goal was to exercise the full capabilities of each model.
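The few-shot setup described above might look something like this sketch. The Q/A pairs and the `build_prompt` helper here are invented for illustration, not the actual prompt from the post ("See full prompt" shows the real one):

```python
# Hypothetical sketch of a few-shot prompt: worked Q/A pairs are prepended
# so a raw (non-instruction-tuned) LM can pattern-match the format.
# These example pairs are made up, not the ones the post actually uses.
FEW_SHOT_EXAMPLES = [
    ("What is the capital of France?", "Paris."),
    ("How many legs does a spider have?", "Eight."),
]

def build_prompt(question: str) -> str:
    """Prepend the few-shot Q/A pairs, then pose the real question."""
    shots = "\n\n".join(f"Q: {q}\nA: {a}" for q, a in FEW_SHOT_EXAMPLES)
    return f"{shots}\n\nQ: {question}\nA:"

print(build_prompt("What color is the sky?"))
```

The trailing `A:` matters for base models: it cues the model to continue with an answer rather than generating more questions.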
I also found the riddle rather odd; I'm not convinced that 2 is actually the correct answer.
A problem with riddles is that they often rely on a hidden or secret context. Especially in our digital age, this one is closer to Bilbo's "What have I got in my pocket?" "riddle". Here are some other possible readings of 11+2 = 1. Sum the written digits and take mod 3: 1 + 1 + 2 = 4, mod 3 gives 1; then 9 + 5 = 14, mod 3 gives 2 (coincidentally the same answer as the clock). We could also replace the addition sign with equality and compare digits: 1+1 == 2? True (1). 9 == 5? False (0). There are a hundred solutions to this riddle when it has no context. In fact, I stumbled into the right answer thinking about mod 12 without ever considering a clock until I saw the answer. Maybe I'm just dumb though; I am known to overthink.
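The rules above can be checked mechanically. A quick sketch (function names are mine) showing that more than one rule satisfies the riddle's premise:

```python
def clock(a: int, b: int) -> int:
    """Clock arithmetic: add mod 12, showing 12 instead of 0."""
    r = (a + b) % 12
    return 12 if r == 0 else r

def digit_sum_mod3(a: int, b: int) -> int:
    """Alternative rule: sum every written digit, then take mod 3."""
    return sum(int(d) for d in f"{a}{b}") % 3

# Both rules satisfy the premise 11+2 = 1:
assert clock(11, 2) == 1
assert digit_sum_mod3(11, 2) == 1

print(clock(9, 5))           # 2, the intended clock answer
print(digit_sum_mod3(9, 5))  # also 2, coincidentally
# The equality reading gives yet another answer:
print(int(1 + 1 == 2), int(9 == 5))  # 1 0
```

With no stated context, each rule is as "correct" as the clock one; the premise alone doesn't pin down the answer.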