Maybe a useful way to get feedback on how good you are at doing this would be to try making predictions based on your experience with language models:
- without looking at the results or running systematic experiments on the dataset, predict which tasks on BIG-bench will be doable
- make bets of the form “we’ll reach X% performance on task A before we reach Y% performance on task B”
- predict, for some prompt, what percentage of samples will satisfy some property, then take a large number of samples and rate them (a rough sketch of this exercise follows below)
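To make the third exercise concrete, here's a rough sketch. The `sample_fn` and `rate_fn` callables are placeholders for whatever model-sampling call and property check you actually use, not any particular API:

```python
import random

def run_prediction_exercise(sample_fn, rate_fn, prompt, predicted_fraction, n=200):
    """Draw n samples for the prompt, rate each one, and compare the observed
    fraction satisfying the property against the prediction you wrote down."""
    samples = [sample_fn(prompt) for _ in range(n)]
    observed = sum(bool(rate_fn(s)) for s in samples) / n
    print(f"predicted {predicted_fraction:.2f}, observed {observed:.2f} (n={n})")
    return observed

if __name__ == "__main__":
    # Placeholder model and rater so the sketch runs end to end; swap in a
    # real sampling call and your own property check.
    toy_model = lambda prompt: random.choice(["yes", "no", "maybe"])
    toy_rater = lambda sample: sample == "yes"
    run_prediction_exercise(toy_model, toy_rater, "Will it rain?", predicted_fraction=0.33)
```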
Yeah, I think things like this are reasonable. I think these are maybe too hard and high-level for a lot of the things I care about; I’m really interested in questions like “how much less reliable is the model at repeating names when the names are 100 tokens in the past instead of 50?”, which are much simpler and lower-level.
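To be concrete about the kind of low-level question I mean, here's a rough sketch of one such measurement. The `complete_fn` call and the filler-word prompt format are stand-ins I'm making up for illustration, not a fixed benchmark:

```python
def repetition_accuracy(complete_fn, names, gap_tokens, filler_word="the"):
    """Fraction of prompts where the completion contains the target name,
    with roughly gap_tokens filler words between the name and the query."""
    hits = 0
    for name in names:
        filler = " ".join([filler_word] * gap_tokens)
        prompt = f"My friend's name is {name}. {filler}. My friend's name is"
        completion = complete_fn(prompt)
        hits += name.lower() in completion.lower()
    return hits / len(names)

if __name__ == "__main__":
    # Placeholder completion function so the sketch runs; it always answers
    # "Alice", so only the prompt whose target name is Alice counts as a hit.
    toy_complete = lambda prompt: " Alice"
    names = ["Alice", "Bob", "Carol", "Dave"]
    for gap in (50, 100):
        print(f"gap={gap}: accuracy={repetition_accuracy(toy_complete, names, gap):.2f}")
```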