“A team of 3 top research ML engineers with fine-tuning access to GPT-4o (including SFT and RL), $10 million in compute, and 1 year of time could use GPT-4o to surpass typical naive MTurk performance at ARC-AGI on the test set while using less than $100 per problem at runtime (as denominated by GPT-4o API costs).”
But doing so would tune that GPT-4o to be less good at other tasks, wouldn’t it? Isn’t this way of solving it just Goodharting the metric and actually pushing the LLM away from being “General Intelligence”?
That being said, this is great work you have done in just over a week. Thank you for your contribution to science.
Isn’t this way of solving it just Goodharting the metric and actually pushing the LLM away from being “General Intelligence”?
I certainly agree that solving ARC-AGI in this way wouldn’t indicate general intelligence. (And I generally think that ARC-AGI probably doesn’t track general intelligence that well, and that it tracks it even less when no-holds-barred methods are considered.)
But doing so would tune that GPT-4o to be less good at other tasks, wouldn’t it?
I don’t think it would degrade performance on other tasks very much if you took basic precautions. LLMs have a lot of parameters.
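One standard precaution here (my illustration; the comment doesn’t specify which precautions are meant) is to mix a fraction of general-purpose “replay” data into the task-specific fine-tuning set, so the model keeps seeing ordinary tasks while it learns the new one. A minimal sketch of that data-mixing step, with hypothetical example data:

```python
import random

def mix_finetune_data(task_examples, general_examples, replay_frac=0.5, seed=0):
    """Interleave task-specific examples with 'replay' general-domain examples.

    replay_frac is the target fraction of the mixed dataset drawn from
    general data (must be < 1). General examples are cycled if there are
    fewer of them than needed.
    """
    rng = random.Random(seed)
    # Number of replay examples so that replay_frac of the mix is general data.
    n_replay = int(len(task_examples) * replay_frac / (1 - replay_frac))
    replay = [general_examples[i % len(general_examples)] for i in range(n_replay)]
    mixed = list(task_examples) + replay
    rng.shuffle(mixed)  # avoid long task-only runs during training
    return mixed

# Hypothetical usage: 10 ARC-style examples mixed 50/50 with general data.
task = [f"arc-{i}" for i in range(10)]
general = ["gen-a", "gen-b"]
mixed = mix_finetune_data(task, general, replay_frac=0.5)
```

With `replay_frac=0.5`, half of the fine-tuning batches come from general data, which is the usual cheap defense against the task-specific tuning overwriting broader capabilities.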