This is called self-consistency: sample the model multiple times and select the answer that appears most often. AlphaCode's paper uses a similar method. I think we can do even better, though. Over 1000 runs, it's likely that the correct answer appears at least once. Instead of selecting by majority vote, we could use the LLM to judge each answer individually. Since it's easier to verify than to generate, I think this method would work very well. I don't know whether any lab has tried it, though.
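A minimal sketch of both ideas, assuming a hypothetical `sample_llm` function standing in for a sampled model call (here faked with a fixed answer distribution) and a caller-supplied `verify` callback standing in for the LLM-as-judge:

```python
import random
from collections import Counter

def sample_llm(prompt: str) -> str:
    # Hypothetical stand-in for one sampled LLM call; any function that
    # returns a single candidate answer per invocation fits this slot.
    return random.choice(["42", "42", "42", "17"])

def self_consistency(prompt: str, n: int = 100) -> str:
    """Sample n candidate answers and return the most frequent one
    (majority vote)."""
    answers = [sample_llm(prompt) for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]

def verify_individually(prompt: str, verify, n: int = 100):
    """The proposal above: sample up to n answers and return the first
    one that the verifier accepts, instead of majority voting."""
    for _ in range(n):
        answer = sample_llm(prompt)
        if verify(prompt, answer):
            return answer
    return None  # no sampled answer passed verification

random.seed(0)
print(self_consistency("What is 6 * 7?"))
print(verify_individually("What is 6 * 7?", lambda p, a: a == "42"))
```

In practice `verify` would itself be an LLM call that checks a single candidate, which is where the verify-is-easier-than-generate asymmetry pays off.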