I mostly disagreed with bullet point two. The primary result of “empirical AI Alignment research” that I’ve seen in the last 5 years has been a lot of capabilities gain, with approximately zero in terms of progress on any AI Alignment problems. I agree more with the “in the long run there will be a lot of empirical work to be done”, but right now on the margin, we have approximately zero traction on useful empirical work, as far as I can tell (outside of transparency research).
I agree that in an absolute sense there is very little empirical work that I’m excited about going on, but I think there’s even less theoretical work going on that I’m excited about, and when people who share my views on the nature of the problem work on empirical work I feel that it works better than when they do theoretical work.
Hmm, there might be some mismatch of words here. Like, most of the work so far on the problem has been theoretical. I am confused how you could not be excited about the theoretical work that established the whole problem, the arguments for why it’s hard, and that helped us figure out at least some of the basic parameters of the problem. Given that (I think) you currently think AI Alignment is among the global priorities, you presumably think the work that allowed you to come to believe that (and that allowed others to do the same) was very valuable and important.
My guess is you are somehow thinking of work like Superintelligence, or Eliezer’s original work, or Evan’s work on inner optimization as something different than “theoretical work”?
I was mainly talking about the current margin when I talked about how excited I am about the theoretical vs empirical work I see “going on” right now and how excited I tend to be about currently-active researchers who are doing theory vs empirical research. And I was talking about the future when I said that I expect empirical work to end up with the lion’s share of credit for AI risk reduction.
Eliezer, Bostrom, and co certainly made a big impact in raising the problem to people’s awareness and articulating some of its contours. It’s kind of a matter of semantics whether you want to call that “theoretical research” or “problem advocacy” / “cause prioritization” / “community building” / whatever, and no matter which bucket you put it in I agree it’ll probably end up with an outsized impact for x-risk-reduction, by bringing the problem to attention sooner than it would have otherwise been brought to attention and therefore probably allowing more work to happen on it before TAI is developed.
But just like how founding CEOs tend to end up with ~10% equity once their companies have grown large, I don’t think this historical problem-advocacy-slash-theoretical-research work alone will end up with a very large amount of total credit.
On the main thrust of my point, I’m significantly less excited about MIRI-sphere work that is much less like “articulating a problem and advocating for its importance” and much more like “attempting to solve a problem.” E.g. stuff like logical inductors, embedded agency, etc seem a lot less valuable to me than stuff like the orthogonality thesis and so on.
Unfortunately, empirical work is only slowly progressing towards alignment, and truth be told, we might be at a local optimum for alignment chances, barring stuff like outright banning AI work or doing something political. And that route is hopelessly likely to get us mind-killed and probably make things even worse.
Also, the start of a process towards a solution always boots up slowly, with productive mistakes like these. You will never get perfect answers, and thinking that you can get perfect answers is a Nirvana Fallacy. Exponential growth will help us somewhat, but ultimately AI Alignment is probably at a local optimum: the people and companies in the lead to build AGI are sympathetic to Alignment, which is far better than we could reasonably have hoped for, and there are few arms-race dynamics around AGI, which is even better.
We often complain about the AGI situation for real reasons, but we do need to realize how good a position we've likely ended up in. It's still shitty, yet there are far worse points we could have ended up at.
In a post-AI-PONR world, we're lucky if we can solve the problem of AI Alignment well enough that we get through a slow takeoff safely. We may all hate it, yet empirical work will ultimately be necessary, and we undervalue feedback loops: theory can drift wildly away from reality when it isn't in contact with it.
Were any cautious people trying empirical alignment research before Redwood/Conjecture?
Geoffrey Irving, Jan Leike, Paul Christiano, Rohin Shah, and probably others were doing various kinds of empirical work a few years before Redwood (though I would guess Oliver doesn’t like that work and so wouldn’t consider it a counterexample to his view).
Yeah, I think OpenAI tried to do some empirical work, but approximately just produced capability progress, in my current model of the world (though I also think the incentive environment there was particularly bad). I feel confused about the “learning to summarize from human feedback” work, and currently think it was overall bad for the world, but am not super confident (in general I feel very confused about the sign of RLHF research).
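For readers who haven't looked at that line of work, here is a minimal sketch of the three-stage RLHF loop behind papers like “learning to summarize from human feedback”: sample pairs of outputs, collect human comparisons, fit a reward model, and RL-fine-tune the policy against it. Every function below is a hypothetical placeholder standing in for a real component, not any paper's or library's actual API.

```python
# Minimal, hypothetical sketch of the RLHF pipeline. All names are placeholders.
import random

def sample_output(prompt: str, policy: str) -> str:
    # Stand-in for sampling a summary from the current policy.
    return f"{policy}::summary({prompt})::{random.randint(0, 9)}"

def human_compare(a: str, b: str) -> tuple:
    # Stand-in for a human labeler picking the better of two candidates.
    return (a, b) if random.random() < 0.5 else (b, a)

def train_reward_model(comparisons):
    # Stand-in for fitting a reward model on (preferred, rejected) pairs.
    preferred = {winner for winner, _ in comparisons}
    return lambda output: 1.0 if output in preferred else 0.0

def rl_finetune(policy: str, reward_fn, prompts) -> str:
    # Stand-in for RL fine-tuning of the policy against the learned reward.
    mean_reward = sum(reward_fn(sample_output(p, policy)) for p in prompts) / len(prompts)
    return f"{policy}+rl(reward={mean_reward:.2f})"

prompts = ["article-1", "article-2", "article-3"]
policy = "base-policy"

# 1. Collect human comparisons between pairs of sampled outputs.
comparisons = [human_compare(sample_output(p, policy), sample_output(p, policy)) for p in prompts]
# 2. Fit a reward model on those comparisons.
reward_fn = train_reward_model(comparisons)
# 3. Fine-tune the policy against the learned reward.
print(rl_finetune(policy, reward_fn, prompts))
```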
I think Rohin Shah doesn’t think of himself as having produced empirical work that helps with AI Alignment, but only as having produced empirical work that might help convince others of the importance of AI Alignment. That is still valuable, but I think it should be evaluated on a different dimension.
I haven’t gotten much out of work by Geoffrey Irving or Jan Leike (and I don’t think I know many other people who have, or at least I haven’t really heard a good story for how their work actually helps). I would actually be interested if someone could give some examples of how this research helped them.
I’m pretty confused about how to think about the value of various ML alignment papers. But I think even if some piece of empirical ML work on alignment is really valuable for reducing x-risk, I wouldn’t expect its value to take the form of providing insight to readers like you or me. So you as a reader not getting much out of it is compatible with the work being super valuable, and we probably need to assess it on different terms.
The main channel of value that I see for doing work like “learning to summarize” and the critiques project and various interpretability projects is something like “identifying a tech tree that it seems helpful to have climbed as far as possible by the time of the Singularity, and beginning to climb it.”
In the case of critiques: ultimately, it seems like having AIs red-team each other and point out ways that another AI’s output could be dangerous will make a quantitative difference. If we had a really well-oiled debate setup, then we would catch issues we wouldn’t have caught with vanilla human feedback, meaning our models could get smarter before they pose an existential threat, and these smarter models can more effectively work on problems like alignment for us.[1]
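To make the critique setup concrete, here is a rough sketch of the loop being described: one model proposes an answer, a second model red-teams it, and a judge (a human or a trusted weaker model) decides whether the flagged issue is real. Every name here is a hypothetical placeholder rather than any lab's actual pipeline.

```python
# Hypothetical sketch of an AI-critique / red-teaming loop. All functions are placeholders.

def propose(question: str) -> str:
    # Stand-in for the answering model.
    return f"answer({question})"

def red_team(question: str, answer: str) -> str:
    # Stand-in for the critic model pointing at a possible flaw in the answer.
    return f"possible flaw in {answer}: unsupported claim"

def judge(question: str, answer: str, critique: str) -> bool:
    # Stand-in for the judge assessing whether the critique exposes a real
    # problem; here it naively accepts any critique mentioning "flaw".
    return "flaw" in critique

def vet(question: str):
    answer = propose(question)
    critique = red_team(question, answer)
    if judge(question, answer, critique):
        # The critique caught something plain human feedback might have
        # missed, so reject (or revise) instead of accepting the answer.
        return None
    return answer

print(vet("summarize this article"))
```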
It seems good to have that functionality developed as far as it can be developed in as many frontier labs as possible. The first steps of that look kind of boring, and don’t substantially change our view of the problem. But first steps are the foundation for later steps, and the baseline against which you compare later steps. (Also every step can seem boring in the sense of bringing no game-changing insights, while nonetheless helping a lot.)
When the main point of some piece of work is to get good at something that seems valuable to be really good at later, and to build tacit knowledge and various kinds of infrastructure for doing that thing, a paper about it is not going to feel that enlightening to someone who wants high-level insights that change their picture of the overall problem. (Kind of like someone writing a blog post about how they developed effective management and performance evaluation processes at their company isn’t going to provide much insight into the abstract theory of principal-agent problems. The value of that activity was in the company running better, not people learning things from the blog post about it.)
I’m still not sure how valuable I think this work is, because I don’t know how well it’s doing at efficiently climbing tech trees or at picking the right tech trees, but I think that’s how I’d think about evaluating it.
[1] Or do a “pivotal act,” though I think I probably don’t agree with some of the connotations of that term.