Reading AI safety articles like this one, I always find myself nodding along in agreement. The conclusions simply follow from the premises, and the premises are so reasonable. Yet by the end, I always feel futility and frustration. Anyone who wanted to argue that AI safety was a hopeless program wouldn’t need to look any further than the AI safety literature! I’m not just referring to “death with dignity”. What fills me with dread and despair is paragraphs like this:
However, optimists often take a very empiricist frame, so they are likely to be interested in what kind of ML experiments or observations about ML models might change my mind, as opposed to what kinds of arguments might change my mind. I agree it would be extremely valuable to understand what we could concretely observe that would constitute major evidence against this view. But unfortunately, it’s difficult to describe simple and realistic near-term empirical experiments that would change my beliefs very much, because models today don’t have the creativity and situational awareness to play the training game. [original emphasis]
Here is the real chasm between the AI safety movement and ML industry and academia. One field is entirely driven by experimental results; the other is so thoroughly dominated by theory that its own practitioners deny there can be any meaningful empirical aspect to it, at least not until the moment when it's too late to make any difference.
Years ago, I read an article about an RL agent wireheading itself via memory corruption, thereby ignoring its intended task. Either this article exists and I can’t find it now, or I’m misremembering. Either way, it’s exactly the sort of research that the AI safety community should be conducting and publishing right now (i.e. propaganda with epistemic benefits). With things like GPT-3 around nowadays, I bet one could even devise experiments where artificial agents learn to actually deceive humans (via Mechanical Turk, perhaps?). Imagine how much attention such an experiment could generate once journalists pick it up!
EDIT: This post is very close to what I have in mind.
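For concreteness, here is a minimal toy sketch of the kind of demonstration I have in mind: a tabular Q-learning agent on a tiny track that can either walk to the intended goal or step onto a cell that overwrites its reward signal, and that reliably learns to do the latter. The environment, the reward numbers, and everything else here are illustrative assumptions of mine, not the article I'm failing to find.

```python
# A minimal sketch (not any published experiment) of a toy "wireheading"
# setup: a tabular Q-learning agent on a 1-D track can either walk right
# to the task goal (reward 1) or step onto a "reward register" cell that
# overwrites its reward signal with a large value, ignoring the intended
# task. All names and numbers are illustrative assumptions.
import random
from collections import defaultdict

TRACK_LEN = 6          # cells 0..5; goal at the far right
REGISTER_CELL = 1      # stepping here "corrupts" the reward channel
GOAL_CELL = TRACK_LEN - 1
ACTIONS = (-1, +1)     # move left / move right

def step(pos, action):
    """Return (next_pos, observed_reward, done)."""
    next_pos = min(max(pos + action, 0), TRACK_LEN - 1)
    if next_pos == REGISTER_CELL:
        # Wireheading: the agent overwrites its own reward signal.
        return next_pos, 10.0, True
    if next_pos == GOAL_CELL:
        # Intended task reward.
        return next_pos, 1.0, True
    return next_pos, 0.0, False

def train(episodes=2000, alpha=0.1, gamma=0.95, eps=0.1):
    q = defaultdict(float)  # (pos, action) -> estimated value
    for _ in range(episodes):
        pos, done = 2, False       # start between register and goal
        while not done:
            if random.random() < eps:
                a = random.choice(ACTIONS)
            else:
                a = max(ACTIONS, key=lambda act: q[(pos, act)])
            nxt, r, done = step(pos, a)
            best_next = 0.0 if done else max(q[(nxt, act)] for act in ACTIONS)
            q[(pos, a)] += alpha * (r + gamma * best_next - q[(pos, a)])
            pos = nxt
    return q

if __name__ == "__main__":
    q = train()
    preferred = max(ACTIONS, key=lambda a: q[(2, a)])
    print("From the start cell the agent prefers moving",
          "left toward the reward register" if preferred == -1
          else "right toward the task goal")
```

Obviously a real demonstration would need a far richer environment and an actual learned channel for corrupting the reward, but the skeleton is the same.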
I want to clarify two things:
I do not think AI alignment research is hopeless; my personal probability of doom from AI is something like 25%. My frame is definitely not "death with dignity"; I'm thinking about how to win.
I think there’s a lot of empirical research that can be done to reduce risks, including e.g. interpretability, “sandwiching” style projects, adversarial training, testing particular approaches to theoretical problems like eliciting latent knowledge, the evaluations you linked to, empirically demonstrating issues like deception (as you suggested), and more. Lots of groups are in fact working on such problems, and I’m happy about that.
The specific thing I said was hard to name is realistic and plausible experiments we could do on today's models that a) would make me update strongly toward "racing forward with plain HFDT will not lead to an AI takeover", and b) people who disagree with my claim would accept as "fair." I gave an example right after that of a type of experiment I don't expect ML people to consider "fair" as a test of this hypothesis. If I saw that ML people could consistently predict the direction in which gradient descent is suboptimal, I would update a lot against this risk.
I think there’s a lot more room for empirical progress where you assume that this is a real risk and try to address it than there is for empirical progress that could realistically cause either skeptics or concerned people to update about whether there’s any risk at all. A forthcoming post by Holden gets into some of these things.
(By the way, I was delighted to find out you’re working on that ARC project that I had linked! Keep it up! You are like in the top 0.001% of world-helping-people.)
Thanks, but I’m not working on that project! That project is led by Beth Barnes.
To put a finer point on my view on theory vs empirics in alignment:
Going forward, I think the vast majority of technical work needed to reduce AI takeover risk is empirical, not theoretical (both in terms of “total amount of person-hours needed in each category” and in terms of “total ‘credit’ each category should receive for reducing doom in some sense”).
Conditional on an alignment researcher agreeing with my view of the high-level problem, I tend to be more excited about them if they’re working on ML experiments than if they’re working on theory.
I’m quite skeptical of most theoretical alignment research I’ve seen. The main theoretical research I’m excited about is ARC’s, and I have a massive conflict of interest since the founder is my husband—I would feel fairly sympathetic to people who viewed ARC’s work more like how I view other theory work.
With that said, I think unfortunately there is a lot less good empirical work than in some sense there “could be.” One significant reason why a lot of empirical AI safety work feels less exciting than it could be is that the people doing that work don’t always share my perspective on the problem, so they focus on difficulties I expect to be less core. (Though another big reason is just that everything is hard, especially when we’re working with systems a lot less capable than future systems.)
I’m surprised by how strong the disagreement is here. Even if what we most need right now is theoretical/pre-paradigmatic work, that seems likely to change as AI develops and people reach consensus on more things; compare, e.g., the work done on optics pre-1800 to all the work done post-1800, or the work done on computer science pre-1970 vs. post-1970. Curious if people who disagree could explain more: is the disagreement about what stage the field is in and what it needs right now in 2022, or about the more general claim that most future work will be empirical?
I mostly disagreed with bullet point two. The primary result of “empirical AI Alignment research” that I’ve seen in the last 5 years has been a lot of capabilities gains, with approximately zero progress on any AI Alignment problems. I agree more with the claim that in the long run there will be a lot of empirical work to be done, but right now, on the margin, we have approximately zero traction on useful empirical work, as far as I can tell (outside of transparency research).
I agree that in an absolute sense there is very little empirical work going on that I’m excited about, but I think there’s even less theoretical work going on that I’m excited about, and when people who share my views on the nature of the problem do empirical work, I feel it goes better than when they do theoretical work.
Hmm, there might be some mismatch of words here. Like, most of the work so far on the problem has been theoretical. I am confused how you could not be excited about the theoretical work that established the whole problem and the arguments for why it’s hard, and that helped us figure out at least some of the basic parameters of the problem. Given that (I think) you currently consider AI Alignment to be among the top global priorities, you presumably think the work that allowed you to come to believe that (and that allowed others to do the same) was very valuable and important.
My guess is you are somehow thinking of work like Superintelligence, Eliezer’s original work, or Evan’s work on inner optimization as something other than “theoretical work”?
I was mainly talking about the current margin when I talked about how excited I am about the theoretical vs empirical work I see “going on” right now and how excited I tend to be about currently-active researchers who are doing theory vs empirical research. And I was talking about the future when I said that I expect empirical work to end up with the lion’s share of credit for AI risk reduction.
Eliezer, Bostrom, and co. certainly made a big impact in raising the problem to people’s awareness and articulating some of its contours. It’s kind of a matter of semantics whether you want to call that “theoretical research” or “problem advocacy” / “cause prioritization” / “community building” / whatever, and no matter which bucket you put it in, I agree it’ll probably end up with an outsized impact on x-risk reduction, by bringing the problem to attention sooner than it otherwise would have been and therefore probably allowing more work to happen on it before TAI is developed.
But just as founding CEOs tend to end up with ~10% equity once their companies have grown large, I don’t think this historical problem-advocacy-slash-theoretical-research work alone will end up with a very large share of the total credit.
On the main thrust of my point: I’m significantly less excited about MIRI-sphere work that is much less like “articulating a problem and advocating for its importance” and much more like “attempting to solve a problem.” E.g., work on logical inductors, embedded agency, etc. seems a lot less valuable to me than work like the orthogonality thesis and so on.
Unfortunately, empirical work is only slowly progressing toward alignment, and truth be told, we might be at a local optimum for alignment chances, barring something like outright banning AI work or doing something political. And unfortunately that route is hopelessly likely to get us mind-killed and probably make things even worse.
Also, a process toward a solution always boots up slowly at the start, with productive mistakes like these. You will never get perfect answers, and thinking that you can get perfect answers is a Nirvana fallacy. Exponential growth will help us somewhat, but ultimately AI Alignment is probably in a local-optimum state: the people and companies in the lead to build AGI are sympathetic to alignment, which is far better than we could reasonably have hoped for, and there are few arms-race dynamics around AGI, which is even better.
We often complain about the AGI issue for real reasons, but we do need to realize how good our position has likely gotten. It’s still shitty, yet there are far worse places we could have ended up.
In a post-AI-PONR (point of no return) world, we’re lucky if we can solve the problem of AI Alignment well enough that we get through a slow takeoff safely. We all hate it, yet empirical work will ultimately be necessary, and we undervalue feedback loops because theory can drift wildly away from reality when it isn’t in contact with it.
Were any cautious people trying empirical alignment research before Redwood/Conjecture?
Geoffrey Irving, Jan Leike, Paul Christiano, Rohin Shah, and probably others were doing various kinds of empirical work a few years before Redwood (though I would guess Oliver doesn’t like that work and so wouldn’t consider it a counterexample to his view).
Yeah, I think OpenAI tried to do some empirical work, but approximately just produced capability progress, in my current model of the world (though I also think the incentive environment there was particularly bad). I feel confused about the “learning to summarize from human feedback” work, and currently think it was overall bad for the world, but am not super confident (in general I feel very confused about the sign of RLHF research).
I think Rohin Shah doesn’t think of himself as having produced empirical work that helps with AI Alignment, but only as having produced empirical work that might help convince others about AI Alignment. That is still valuable, but I think it should be evaluated on a different dimension.
I haven’t gotten much out of work by Geoffrey Irving or Jan Leike (and I don’t think I know many other people who have, or at least I haven’t really heard a good story for how their work actually helps). I would actually be interested if someone could give some examples of how this research helped them.
I’m pretty confused about how to think about the value of various ML alignment papers. But I think even if some piece of empirical ML work on alignment is really valuable for reducing x-risk, I wouldn’t expect its value to take the form of providing insight to readers like you or me. So you as a reader not getting much out of it is compatible with the work being super valuable, and we probably need to assess it on different terms.
The main channel of value that I see for doing work like “learning to summarize” and the critiques project and various interpretability projects is something like “identifying a tech tree that it seems helpful to get as far as possible along by the Singularity, and beginning to climb that tech tree.”
In the case of critiques: ultimately, having AIs red-team each other and point out ways that another AI’s output could be dangerous seems like it will make a quantitative difference. If we had a really well-oiled debate setup, then we would catch issues we wouldn’t have caught with vanilla human feedback, meaning our models could get smarter before they pose an existential threat, and these smarter models can more effectively work on problems like alignment for us.[1]
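To make that concrete, here is a minimal sketch of the kind of critique loop I mean; `generate` is a hypothetical stand-in for whatever model call a lab has available, and the prompts and toy judging step are illustrative assumptions of mine rather than anyone's actual setup.

```python
# A minimal sketch of an "AI critiques AI" loop, not any lab's actual
# setup. `generate` is a hypothetical stub standing in for a language
# model call; the prompts below are illustrative assumptions only.
from typing import Callable

def critique_loop(task: str,
                  generate: Callable[[str], str],
                  n_rounds: int = 2) -> dict:
    """Answer a task, then repeatedly critique and revise the answer."""
    answer = generate(f"Task: {task}\nGive your best answer.")
    transcript = [("answer", answer)]
    for _ in range(n_rounds):
        critique = generate(
            f"Task: {task}\nProposed answer: {answer}\n"
            "Point out the most serious flaw or risk in this answer."
        )
        transcript.append(("critique", critique))
        answer = generate(
            f"Task: {task}\nPrevious answer: {answer}\n"
            f"Critique: {critique}\nRevise the answer to address the critique."
        )
        transcript.append(("revision", answer))
    # In a real setup, a human (or reward model) would judge whether the
    # critiques surfaced problems plain human feedback would have missed.
    return {"final_answer": answer, "transcript": transcript}

if __name__ == "__main__":
    canned = iter([
        "Ship the code without tests.",
        "Untested code may hide harmful bugs.",
        "Ship the code after writing tests for the risky paths.",
        "The risky paths are not fully enumerated.",
        "Enumerate risky paths with reviewers, then test and ship.",
    ])
    dummy_generate = lambda prompt: next(canned)  # stands in for a model call
    result = critique_loop("Decide how to release this code.", dummy_generate)
    for role, text in result["transcript"]:
        print(f"{role}: {text}")
```

The interesting measurements are about whether the critiques catch problems that vanilla feedback misses, not anything in this skeleton itself.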
It seems good to have that functionality developed as far as it can be developed in as many frontier labs as possible. The first steps of that look kind of boring, and don’t substantially change our view of the problem. But first steps are the foundation for later steps, and the baseline against which you compare later steps. (Also every step can seem boring in the sense of bringing no game-changing insights, while nonetheless helping a lot.)
When the main point of some piece of work is to get good at something that seems valuable to be really good at later, and to build tacit knowledge and various kinds of infrastructure for doing that thing, a paper about it is not going to feel that enlightening to someone who wants high-level insights that change their picture of the overall problem. (Kind of like someone writing a blog post about how they developed effective management and performance evaluation processes at their company isn’t going to provide much insight into the abstract theory of principal-agent problems. The value of that activity was in the company running better, not people learning things from the blog post about it.)
I’m still not sure how valuable I think this work is, because I don’t know how well it’s doing at efficiently climbing tech trees or at picking the right tech trees, but I think that’s how I’d think about evaluating it.
[1] Or do a “pivotal act,” though I think I probably don’t agree with some of the connotations of that term.
Note I was at −16 with one vote, and only 3 people have voted so far. So it’s largely due to the karma weight of the first disagreer.
There’s a 1000-year-old vampire stalking LessWrong!? 16 is supposed to be three levels above Eliezer.
Since comments start with a small upvote from the commenter themselves—which for Ajeya has a strength of 1—a single strong downvote, to take her comment down to −16, would actually have to have a strength of 17.
That is, quite literally, off the scale.
This is about the agreement karma, though, which starts at 0.
Oh, I see.
I strong-disagreed on it when it was already at −7 or so, so I think it was just me and another person strongly disagreeing. I expected other people would vote it up again (and didn’t expect anything like consensus in the direction I voted).