Geoffrey Irving, Jan Leike, Paul Christiano, Rohin Shah, and probably others were doing various kinds of empirical work a few years before Redwood (though I would guess Oliver doesn’t like that work and so wouldn’t consider it a counterexample to his view).
Yeah, I think OpenAI tried to do some empirical work, but in my current model of the world it approximately just produced capability progress (though I also think the incentive environment there was particularly bad). I feel confused about the “learning to summarize from human feedback” work, and currently think it was overall bad for the world, but am not super confident (in general I feel very confused about the sign of RLHF research).
I think Rohin Shah doesn’t think of himself as having produced empirical work that helps with AI Alignment, but rather empirical work that might help convince others about AI Alignment. That is still valuable, but I think it should be evaluated on a different dimension.
I haven’t gotten much out of work by Geoffrey Irving or Jan Leike (and I don’t think I know many other people who have; at least, I haven’t heard a good story for how their work actually helps). I would actually be interested if someone could give some examples of how this research helped them.
I’m pretty confused about how to think about the value of various ML alignment papers. But I think even if some piece of empirical ML work on alignment is really valuable for reducing x-risk, I wouldn’t expect its value to take the form of providing insight to readers like you or me. So you as a reader not getting much out of it is compatible with the work being super valuable, and we probably need to assess it on different terms.
The main channel of value that I see for doing work like “learning to summarize” and the critiques project and various interpretability projects is something like “identifying a tech tree that it seems helpful to get as far as possible along by the Singularity, and beginning to climb that tech tree.”
In the case of critiques—ultimately, having AIs red-team each other and point out ways that another AI’s output could be dangerous seems like it will make a quantitative difference. If we had a really well-oiled debate setup, then we would catch issues we wouldn’t have caught with vanilla human feedback, meaning our models could get smarter before they pose an existential threat—and these smarter models can more effectively work on problems like alignment for us.[1]
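To make the shape of that setup concrete, here is a minimal, entirely hypothetical sketch of the loop: one model proposes an answer, a second model critiques it, and the human rater judges the answer with the critique in front of them rather than giving vanilla feedback. The function names and stub "models" are my own illustration, not any lab's actual implementation; only the structure of the loop is the point.

```python
# Hypothetical sketch of critique-augmented human feedback.
# The "models" here are trivial stubs; in the real setting they would be
# calls to large language models.

from dataclasses import dataclass


@dataclass
class Feedback:
    approved: bool
    notes: str


def generator_model(task: str) -> str:
    # Stand-in for the policy model producing an answer.
    return f"Proposed answer to: {task}"


def critic_model(task: str, answer: str) -> str:
    # Stand-in for a second model red-teaming the first one's answer,
    # surfacing problems a rater might otherwise miss.
    return f"Possible issue with '{answer}': contains an unsupported claim."


def human_rater(task: str, answer: str, critique: str) -> Feedback:
    # Stand-in for the human rater, who now judges the answer together
    # with the critique (vs. judging the answer alone).
    looks_ok = "unsupported claim" not in critique
    return Feedback(approved=looks_ok, notes=critique)


def critique_augmented_feedback(task: str) -> Feedback:
    answer = generator_model(task)
    critique = critic_model(task, answer)
    return human_rater(task, answer, critique)


if __name__ == "__main__":
    print(critique_augmented_feedback("summarize this article"))
```

The claimed value is quantitative: the rater plus critic catches some fraction of problems the rater alone would miss, which raises the capability level at which oversight still works.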
It seems good to have that functionality developed as far as it can be developed in as many frontier labs as possible. The first steps of that look kind of boring, and don’t substantially change our view of the problem. But first steps are the foundation for later steps, and the baseline against which you compare later steps. (Also every step can seem boring in the sense of bringing no game-changing insights, while nonetheless helping a lot.)
When the main point of some piece of work is to get good at something that seems valuable to be really good at later, and to build tacit knowledge and various kinds of infrastructure for doing that thing, a paper about it is not going to feel that enlightening to someone who wants high-level insights that change their picture of the overall problem. (Kind of like someone writing a blog post about how they developed effective management and performance evaluation processes at their company isn’t going to provide much insight into the abstract theory of principal-agent problems. The value of that activity was in the company running better, not people learning things from the blog post about it.)
I’m still not sure how valuable I think this work is, because I don’t know how well it’s doing at efficiently climbing tech trees or at picking the right tech trees, but I think that’s how I’d think about evaluating it.
[1] Or do a “pivotal act,” though I think I probably don’t agree with some of the connotations of that term.