Love this idea. From the linked post on the BAIR website, the idea of “prompting” a Minecraft task with e.g. a brief sequence of video frames seems especially interesting.
Would you anticipate the benchmark version of this would ask participants to disclose metrics such as “amount of task-specific feedback or data used in training”? Or does this end up being too hard to quantify because you’re explicitly expecting folks to use a variety of feedback modalities to train their agents?
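As a very rough illustration of what "prompting with video frames" could look like at the interface level, here is a minimal Python sketch. The class and method names (PromptConditionedPolicy, set_prompt, act) are invented for illustration and are not part of MineRL or the BASALT competition code; the returned action dict is only a placeholder stand-in for a real MineRL action.

```python
import numpy as np


class PromptConditionedPolicy:
    """Hypothetical agent that is "prompted" with a few demonstration frames.

    The prompt frames play the role a reward function would normally play:
    they specify *which* task to do, while a (pretrained) policy is assumed
    to supply the general Minecraft competence.
    """

    def __init__(self, frame_shape=(64, 64, 3), max_prompt_frames=16):
        self.frame_shape = frame_shape
        self.max_prompt_frames = max_prompt_frames
        self.prompt = None  # set by set_prompt()

    def set_prompt(self, frames):
        """Store a short sequence of video frames describing the target task."""
        frames = np.asarray(frames, dtype=np.float32)
        assert frames.shape[1:] == self.frame_shape
        self.prompt = frames[: self.max_prompt_frames]

    def act(self, observation):
        """Return an action for the current frame, conditioned on the prompt.

        A real implementation would feed (prompt, observation) through a
        sequence model; here we only build the input and return a no-op
        placeholder (a real MineRL action dict has more keys).
        """
        assert self.prompt is not None, "call set_prompt() first"
        observation = np.asarray(observation, dtype=np.float32)
        _ = np.concatenate([self.prompt, observation[None]], axis=0)
        return {"camera": (0.0, 0.0), "forward": 0}


# Usage sketch: prompt with a few frames from a human demo, then roll out.
policy = PromptConditionedPolicy()
demo_frames = np.zeros((8, 64, 64, 3), dtype=np.float32)  # stand-in for real video
policy.set_prompt(demo_frames)
action = policy.act(np.zeros((64, 64, 3), dtype=np.float32))
```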
Would you anticipate the benchmark version of this would ask participants to disclose metrics such as “amount of task-specific feedback or data used in training”?
Probably not, just because it's pretty niche: I expect the vast majority of papers (at least in the near future) will use only task-specific feedback, so the extra reporting isn't worth the additional hassle. (The prompting approach in particular seems like it would require a lot of compute.)
To be clear, "amount of task-specific feedback" should still be inferable from research papers, since you're meant to provide enough detail that others could reproduce your work. It just wouldn't be as simple as looking up the "BASALT evaluation table" for your method of choice.
That makes sense, though I’d also expect that LfLH benchmarks like BASALT could turn out to be a better fit for superscale models in general. (e.g. a BASALT analogue might do a better job of capturing the flexibility of GPT-N or DALL-E type models than current benchmarks do, though you’d probably need to define a few hundred tasks for that to be useful. It’s also possible this has already been done and I’m unaware of it.)
That makes sense, though I’d also expect that LfLH benchmarks like BASALT could turn out to be a better fit for superscale models in general.
Oh yeah, it totally is, and I'd be excited for that to happen. But I think that will be a single project, whereas the benchmark reporting process is meant to apply to settings where there will be lots of projects that you want to compare in a reasonably apples-to-apples way. So when designing the reporting process I'm focused more on the small-scale projects that aren't GPT-N-like.
It’s also possible this has already been done and I’m unaware of it
I’m pretty confident that there’s nothing like this that’s been done and publicly released.