I’m pretty confused about how to think about the value of various ML alignment papers. But I think even if some piece of empirical ML work on alignment is really valuable for reducing x-risk, I wouldn’t expect its value to take the form of providing insight to readers like you or me. So you as a reader not getting much out of it is compatible with the work being super valuable, and we probably need to assess it on different terms.
The main channel of value that I see for doing work like “learning to summarize” and the critiques project and various interpretability projects is something like “identifying a tech tree that it seems helpful to get as far along as possible by the Singularity, and beginning to climb that tech tree.”
In the case of critiques—ultimately, having AIs red-team each other and point out ways that another AI’s output could be dangerous seems like it will make a quantitative difference. If we had a really well-oiled debate setup, then we would catch issues we wouldn’t have caught with vanilla human feedback, meaning our models could get smarter before they pose an existential threat—and these smarter models can more effectively work on problems like alignment for us.[1]
It seems good to have that functionality developed as far as it can be developed in as many frontier labs as possible. The first steps of that look kind of boring, and don’t substantially change our view of the problem. But first steps are the foundation for later steps, and the baseline against which you compare later steps. (Also every step can seem boring in the sense of bringing no game-changing insights, while nonetheless helping a lot.)
When the main point of some piece of work is to get good at something that seems valuable to be really good at later, and to build tacit knowledge and various kinds of infrastructure for doing that thing, a paper about it is not going to feel that enlightening to someone who wants high-level insights that change their picture of the overall problem. (Kind of like someone writing a blog post about how they developed effective management and performance evaluation processes at their company isn’t going to provide much insight into the abstract theory of principal-agent problems. The value of that activity was in the company running better, not people learning things from the blog post about it.)
I’m still not sure how valuable I think this work is, because I don’t know how well it’s doing at efficiently climbing tech trees or at picking the right tech trees, but I think that’s how I’d think about evaluating it.
[1] Or do a “pivotal act,” though I think I probably don’t agree with some of the connotations of that term.