Edouard Harris comments on BASALT: A Benchmark for Learning from Human Feedback

Edouard Harris 12 Jul 2021 17:46 UTC
LW: 3 AF: 3
That makes sense, though I’d also expect that LfHF benchmarks like BASALT could turn out to be a better fit for superscale models in general. (E.g., a BASALT analogue might do a better job of capturing the flexibility of GPT-N or DALL-E type models than current benchmarks do, though you’d probably need to define a few hundred tasks for that to be useful. It’s also possible this has already been done and I’m unaware of it.)
- Rohin Shah 13 Jul 2021 6:47 UTC
  LW: 3 AF: 3
  > That makes sense, though I’d also expect that LfHF benchmarks like BASALT could turn out to be a better fit for superscale models in general.
  Oh yeah, it totally is, and I’d be excited for that to happen. But I think that will be a single project, whereas the benchmark reporting process is meant to apply to cases where there will be lots of projects that you want to compare in a reasonably apples-to-apples way. So when designing the reporting process, I’m focused more on the small-scale projects that aren’t GPT-N-like.
  > It’s also possible this has already been done and I’m unaware of it
  I’m pretty confident that there’s nothing like this that’s been done and publicly released.