I’m torn between generally being really fucking into improving feedback loops (and thinking they’re a good way to make it easier to make progress on confusing questions), and being sad that so few people are actually trying to directly tackle the hard bits of the alignment challenge.
Some quick thoughts:
automating research that ‘tackles the hard bits’ seems likely to be harder (and to happen chronologically later), and it might be more bottlenecked by good human researcher feedback (e.g. from agent foundations researchers), which does suggest it might be valuable to recruit more agent foundations researchers today
but my impression is that recruiting/developing agent foundations researchers is itself much harder and has much worse feedback loops
I expect that if the feedback loops are good enough, automated safety researchers could make enough prosaic safety progress, feeding into agendas like control and/or conditioning predictive models (plus extensions, e.g. steering using model internals, automated reviewing, etc.), to allow automated agent foundations researchers to be used with more confidence and less human feedback
the hard bits might also become much less hard (or prove to be irrelevant) if, for at least some of them, empirical feedback could be obtained by practicing on much more powerful models
Overall, my impression is that focusing on the feedback loops is probably significantly more scalable today, and the results would probably be enough to let us mostly defer the hard bits to automated research.
(Edit: but also, I think if funders and field builders ‘went hard’, it might be much less necessary to choose.)