That makes sense, though I’d also expect that LfLH benchmarks like BASALT could turn out to be a better fit for superscale models in general.
Oh yeah, it totally is, and I’d be excited for that to happen. But I think that would be a single project, whereas the benchmark reporting process is meant to apply to cases where there will be lots of projects that you want to compare in a reasonably apples-to-apples way. So when designing the reporting process, I’m focused more on the small-scale projects that aren’t GPT-N-like.
It’s also possible this has already been done and I’m unaware of it.
I’m pretty confident that there’s nothing like this that’s been done and publicly released.