There’s been criticism somewhere or other (Twitter?) of BIG-Bench for including a lot of idiosyncratic tasks that stronger models won’t necessarily perform better on. The issue tracker might be of use in finding this criticism, though I don’t think it gives a full overview of the tasks of concern. In particular, some of the tasks are very opinion-based: even restricting to reasonable humans of similar intelligence and education, one might still see disagreement. That’s not to say such a task is useless, though.
Interesting. Even if only a small fraction of the tasks are poor estimates of general capability, it makes the test as a whole less trustworthy.
yeah, at least in terms of being a raw intelligence test. the easiest criticism is that the test has political content, which means that even to the degree the political content matches any one person’s views, the objectively true answer is significantly more ambiguous than ideal. alignment/friendliness and capabilities/intelligence can never be completely separated when the rubber hits the road. part of the problem with politics is that we can’t be completely sure we’ve correctly mapped our personal views onto the consequences of basic moral philosophies, so more reasoning power can cause a stronger model to behave arbitrarily differently on morality-loaded questions. and there may be significantly more arbitrary components than that in any closed-book, language-based capabilities test.
see also the old post on [[guessing the teacher’s “password”]], i.e. guessing what the teacher is thinking
True.
And while there might be some uses for benchmarks on politics etc., averaging them together with other benchmarks doesn’t really seem to yield a useful aggregate.
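Here’s a minimal toy simulation of that dilution point (my own sketch, not anything from the thread or from BIG-Bench itself; the task counts, loadings, and capability gap are all made-up assumptions): two models with a known capability gap are scored on a 100-task benchmark, and we measure how often the genuinely stronger model actually wins the aggregate as the share of opinion-loaded tasks grows.

```python
# Toy model: each task score = loading * capability + unit noise.
# Ordinary tasks have loading +1; "opinion" tasks either ignore
# capability (loading 0) or penalize the stronger model's divergent
# answers (loading -1). All parameters are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
n_tasks, n_trials = 100, 20_000
cap_weak, cap_strong = 1.0, 1.2  # hypothetical latent capability gap

def p_stronger_wins(n_bad: int, bad_loading: float) -> float:
    """Fraction of simulated benchmark runs where the stronger model
    ends up with the higher mean score across all tasks."""
    loadings = np.full(n_tasks, 1.0)   # ordinary tasks track capability
    loadings[:n_bad] = bad_loading     # opinion tasks load differently
    weak = loadings * cap_weak + rng.normal(size=(n_trials, n_tasks))
    strong = loadings * cap_strong + rng.normal(size=(n_trials, n_tasks))
    return float((strong.mean(axis=1) > weak.mean(axis=1)).mean())

print(p_stronger_wins(0, 0.0))    # all 100 tasks track capability: ~0.92
print(p_stronger_wins(30, 0.0))   # 30 tasks are pure noise:        ~0.84
print(p_stronger_wins(30, -1.0))  # 30 tasks reward the grader's
                                  # opinion over capability:        ~0.71
```

The trend is the point, not the exact numbers: tasks whose grading is uncorrelated with capability just add noise, while tasks whose “correct” answer a stronger model is more likely to reject actively push the aggregate the wrong way.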