OpenAI has been collecting a ton of evals at https://github.com/openai/evals, and many of them include comments on how well GPT-4 does vs. GPT-3.5.
You could clone that repo, adapt the oaieval script to run against different APIs, then run the evals against both and compare the results.
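
For the final "compare the results" step, here is a rough sketch of what I mean. It assumes you have already pulled one accuracy number per eval out of each run into a plain dict. How you extract those numbers depends on what the run output looks like, so that part is left out. The `compare_scores` helper, the eval names, and the scores are all made up for illustration; none of them come from the repo.

```python
from typing import Dict, List, Tuple


def compare_scores(
    scores_a: Dict[str, float],
    scores_b: Dict[str, float],
) -> List[Tuple[str, float, float, float]]:
    """Return (eval_name, score_a, score_b, delta) for evals present in both runs.

    delta is score_b - score_a; rows are sorted with the largest gain first.
    """
    shared = sorted(set(scores_a) & set(scores_b))
    rows = [
        (name, scores_a[name], scores_b[name], scores_b[name] - scores_a[name])
        for name in shared
    ]
    rows.sort(key=lambda row: row[3], reverse=True)
    return rows


if __name__ == "__main__":
    # Placeholder numbers for two hypothetical runs, NOT real eval results.
    model_a = {"eval-one": 0.50, "eval-two": 0.75, "eval-three": 0.25}
    model_b = {"eval-one": 0.75, "eval-two": 0.75, "eval-four": 0.50}

    for name, a, b, delta in compare_scores(model_a, model_b):
        print(f"{name:<12} {a:.2f} {b:.2f} {delta:+.2f}")
```

Evals that only one of the two runs completed are dropped, so a failed or skipped eval doesn't show up as a fake zero for that model.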