I think that this critique is a bit overstated.
i) I would guess that human evaluation is in general better than most benchmarks. This is because it’s a mystery how much benchmark performance is explained by prompt leakage and by poor benchmark design (e.g. crowd-sourced benchmarks have issues with incorrect or unhelpful tests, and adversarially filtered benchmarks like TruthfulQA have selection effects on their content which, in my opinion, make their results harder to interpret).
ii) GPT-4 is the best model we have access to. Any competition with GPT-4 is competition with the SOTA available model! This is a much harder reference class to compare against than models trained with the same compute, models trained without fine-tuning, etc.