It’s correct that, so far, Ought has been running small-scale experiments with people who know the research background. (What is amplification? How does it work? What problem is it intended to solve?)
We also think that, over time, it will be necessary to run larger-scale experiments. We’re planning to start by running more and longer experiments with contractors instead of volunteers, probably over the next month or two. Longer-term, it’s plausible that we’ll build a platform similar to what this post describes. (See here for related thoughts.)
The reason we’ve focused on small-scale experiments with a select audience is that it’s easy to do busywork that doesn’t tell you anything about the question of interest. The purpose of our experiments so far has been to get high-quality feedback on the setup, not to gather object-level data. As a consequence, the experiments have changed a lot from week to week. The biggest recent change is the switch from task decomposition (analogous to amplification with imitation learning as the distillation step) to decomposition of evaluation (analogous to amplification with RL as the distillation step); a rough sketch of the distinction is below. Based on these changes, I think that if we had stopped at any point so far and focused on scaling up instead of refining the setup, it would have been a mistake.
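To make the distinction concrete, here is a minimal, purely illustrative Python sketch (this is not Ought’s actual setup; all names, such as `task_decomposition` and `evaluation_decomposition`, are invented for this comment). The idea it tries to capture: in the first mode, the decomposition tree produces an answer directly, which a distilled model would imitate; in the second, the tree produces a score for candidate answers, which can serve as a reward signal for RL.

```python
# Illustrative sketch only; function names and structure are hypothetical,
# not Ought's actual experiment code.
from typing import Callable, List, Tuple

Question = str
Answer = str


def task_decomposition(question: Question,
                       decompose: Callable[[Question], List[Question]],
                       answer_leaf: Callable[[Question], Answer],
                       compose: Callable[[Question, List[Answer]], Answer],
                       depth: int = 2) -> Answer:
    """Answer a question by recursively answering subquestions and composing
    the results. The tree outputs an answer directly, which a distilled model
    would imitate (the imitation-learning-style setup)."""
    if depth == 0:
        return answer_leaf(question)
    subquestions = decompose(question)
    subanswers = [task_decomposition(q, decompose, answer_leaf, compose, depth - 1)
                  for q in subquestions]
    return compose(question, subanswers)


def evaluation_decomposition(question: Question,
                             candidates: List[Answer],
                             score_leaf: Callable[[Question, Answer], float],
                             decompose_eval: Callable[[Question, Answer],
                                                      List[Tuple[Question, Answer]]],
                             depth: int = 2) -> Answer:
    """Pick the best candidate answer by recursively decomposing the task of
    *evaluating* each candidate into sub-evaluations. The tree outputs scores
    rather than answers, which could serve as a reward signal (the RL-style
    setup)."""
    def score(q: Question, a: Answer, d: int) -> float:
        if d == 0:
            return score_leaf(q, a)
        subpairs = decompose_eval(q, a)
        # Aggregate sub-scores; averaging is just one simple choice.
        return sum(score(sq, sa, d - 1) for sq, sa in subpairs) / max(len(subpairs), 1)

    return max(candidates, key=lambda a: score(question, a, depth))
```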
Thanks for the explanations and for pointing me back to dialog markets!