I’m fairly confident that whitelisting contributes meaningfully to short- to mid-term AI safety, although I remain skeptical of its robustness to scale.
What I understand this as saying is that the approach is helpful for aligning housecleaning robots (using near extrapolations of current RL), but not obviously helpful for aligning superintelligence, and likely stops being helpful somewhere between the two.
I think it is likely best to push against including that sort of thing in the Overton window of what’s considered AI safety / AI alignment literature.
The people who really care about the field care about existential risks and superintelligence, and that’s also the sort we want to attract to the field as it grows. It is pretty bad if the field drifts toward safety for self-driving cars and housecleaning robots, particularly if it trades off against research reducing existential risk.
There is a risk that a large body of safety literature which works for preventing today’s systems from breaking vases but which fails badly for very intelligent systems actually worsens the AI safety problem, by lulling safety-concerned people into a false sense of security, thinking that installing those solutions counts as sufficient caution. (Note that I am not complaining about using the vase example as a motivating example—my concern lies with approaches which specifically target “short- to mid-term” without the robustness to scale to tackle far-term.)
There is something to be said for making progress on the problems we can (in the hopes that it will help create progress later where we currently don’t have any traction), but robustness to scale is really essential to the hard/interesting part of the problem here. There are many measures of impact which one can come up with; as you say, all of these create other problems when optimized very hard, because the AI can find clever ways to have a very low impact, and these end up being counter to our intentions. Your whitelisting proposal has the same sorts of problems. The interesting thing is to get a notion of “low impact” exactly right, so that it doesn’t go wrong even in a very intelligent system.
(I am here referring to “robustness to scale” in the direction of robustness to high-capability systems, but I note that the term also refers to robustness in the low-capability direction and to robustness to differences in the relative capability of subcomponents. Those aren’t as relevant to my critique here.)
You already name several failures of the suggested whitelisting approach; of these, I would point to “clinginess” as the most damning. A safeguard would ideally have the property that, when added to an already-aligned AI, it does not misalign that AI. Whitelisting fails badly on that desideratum; it creates an AI which would seek to reduce the impact of everything in the universe, not just its own impact.
I would point out several more potential failure modes. I think some of the following may apply to the approach as stated and others to nearby alternatives. I haven’t yet understood exactly the version which you implemented, but my guess is that you would agree that the key idea is to penalize shifts in probability distributions when those shifts haven’t been whitelisted. Decisions about exactly which probability distributions are being compared, such as how change over time is treated, can reasonably be varied while keeping the spirit of the approach (as I understand it).
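To make sure I’m pointing at the same idea, here is a toy sketch of that kind of penalty (my own reconstruction with made-up names, not the implementation from the post): for each recognized object, compare its class distribution at consecutive steps and charge for probability mass that appears to move along non-whitelisted transitions.

```python
# Toy sketch (hypothetical names): charge for probability mass that appears to
# move between object classes along transitions that are not whitelisted.
def shift_penalty(p_before, p_after, whitelist):
    """p_before, p_after: dicts mapping class -> probability for one recognized
    object at consecutive time steps.
    whitelist: set of (src, dst) class pairs whose transitions incur no penalty."""
    penalty = 0.0
    for src in p_before:
        lost = max(p_before[src] - p_after.get(src, 0.0), 0.0)
        if lost == 0.0:
            continue
        for dst in p_after:
            if dst == src or (src, dst) in whitelist:
                continue
            gained = max(p_after[dst] - p_before.get(dst, 0.0), 0.0)
            # Crude attribution of moved mass: lost-from-src times gained-by-dst
            # (one of many possible choices).
            penalty += lost * gained
    return penalty

# Example: belief about one object drifts from "vase" toward "broken vase",
# and (vase -> broken vase) is not on the whitelist.
before = {"vase": 0.9, "broken vase": 0.1}
after = {"vase": 0.2, "broken vase": 0.8}
print(shift_penalty(before, after, whitelist={("dirty vase", "clean vase")}))  # 0.7 * 0.7 ≈ 0.49
```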
If the penalty is applied to immediate effects of actions, then the AI would not care if all the vases fall over and break after the AI completes the course. In other words, a penalty applied to immediate consequences will fail to penalize predictable long-term effects of actions.
If the penalty is applied in a way which penalizes long-term effects, then the AI might do what it can to hold everything still, or it might try to carry out its task but otherwise make the world as much as possible look like one where its task was not carried out (example: curing a fatal disease but then killing the patient, because a saved patient would have all sorts of non-whitelisted consequences eventually).
In particular, “clinginess” could make the AI want to take over the light cone to install impact-reducing measures everywhere.
Penalizing a shift in probability distributions can incentivize the agent to learn as little as possible, which is a bit weird.
Certain versions will have the property that if the agent is already quite confident in what it will do, then the consequences of those actions do not count as “changes” (there is no shift in probability when we condition on the act). This would create a loophole allowing any action to count as “low impact” under the right conditions.
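To spell out that last loophole, here is the shape of the problem under my assumption (which may not match your implementation) that the penalty compares the predictive distribution before acting with the distribution conditioned on the chosen action:

```latex
P(s_{t+1} \mid h_t) \;=\; \sum_{a} \pi(a \mid h_t)\, P(s_{t+1} \mid h_t, a),
\qquad
\mathrm{penalty}(a_t) \;=\; D\!\left( P(s_{t+1} \mid h_t, a_t) \,\middle\|\, P(s_{t+1} \mid h_t) \right).
```

If the policy is already nearly deterministic (it assigns probability close to 1 to the chosen action), then the two distributions nearly coincide and the penalty goes to zero no matter how large the action’s actual impact is.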
I think it is likely best to push against including that sort of thing in the Overton window of what’s considered AI safety / AI alignment literature.
I’m really sympathetic to these concerns, but I’m worried about the possible unintended consequences of trying to do this. There will inevitably be a large group of people working on short- and medium-term AI safety (due to commercial incentives), and pushing them out of the “AI safety / AI alignment literature” risks antagonizing them and creating an adversarial relationship between the two camps, and/or creating a larger incentive for people to stretch the truth about how robust to scale their ideas are. Is this something you considered?
I’m not sure how to think about this. My intuition is that this doesn’t need to be a problem if people in (my notion of) the AI alignment field just do the best work they can, so as to demonstrate by example what the larger concerns are. In other words, win people over by being sufficiently exciting rather than by being antagonistic/exclusionary. I suppose that’s not very consistent with my comment above.
I think most hard engineering problems are made up of a lot of smaller solutions, and especially of the lessons learned attempting to implement small solutions, so I think it’s a mistake to treat something useful-but-incomplete as competing with the true solution rather than as part of the path to it.
I definitely agree with that. There has to be room to find traction. The concern is about things which specifically push the field toward “near-term” solutions, a push which slides too easily into not-solving-the-same-sorts-of-problems-at-all. I think a somewhat realistic outcome is that the field gets taken over by the standard machine learning methodology of achieving high scores on test cases and benchmarks, to the exclusion of research like logical induction. That scenario isn’t especially realistic, since logical induction is actually not far from the sorts of things done in theoretical machine learning; but it points in the direction of my concern.
I think it is likely best to push against including that sort of thing in the Overton window of what’s considered AI safety / AI alignment literature.
Here’s my understanding of your reasoning: “this kind of work may have the unintended consequence of pushing people who would have otherwise worked on hard core problems of x-risk to more prosaic projects, lulling them into a false sense of security when progress is made.”
I think this is possible, but rather unlikely:
It isn’t clear that work allocation for immediate and long-term safety is zero-sum—Victoria wrote more about why this might not be the case.
The specific approach I took here might be conducive to getting more people currently involved with immediate safety interested in long-term approaches. That is, someone might be nodding along—“hey, this whitelisting thing might need some engineering to implement, but this is solid!” and then I walk them through the mental motions of discovering how it doesn’t work, helping them realize that the problem cuts far deeper than they thought.
In my mental model, this is far more likely than pushing otherwise-promising people to inaction.
I’m actually concerned that a lack of overlap between our communities will insulate immediate safety researchers from long-term considerations, which would have a far greater negative effect. I have weak personal evidence for this being the case.
Why would people (who would otherwise be receptive to rigorous thinking about x-risk) lose sight of the greater problems in alignment? I don’t expect DeepMind to say “hey, we implemented whitelisting, we’re good to go! Hit the switch.” In my model, people who would make a mistake like that probably were never thinking about x-risk to begin with.
my concern lies with approaches which specifically target “short- to mid-term” without the robustness to scale to tackle far-term.
As I understand it, this argument can also be applied to any work that doesn’t plausibly one-shot a significant alignment problem, potentially including research by OpenAI and DeepMind. While obviously we’d all prefer one-shots, sometimes research is more incremental (I’m sure this isn’t news to you!). Here, I set out to make progress on one of the Concrete Problems; after doing so, I thought “does this scale? What insights can we take away?”. I had relaxed the problem by assuming a friendly ontology, and I was curious what difficulties (if any) remained.
We are currently grading this approach by the most rigorous of metrics—I think this is good, as that’s how we will eventually be judged! However, we shouldn’t lose sight of the fact that most safety work won’t be immediately superintelligence-complete. Exploratory work is important. I definitely agree that we should shoot to kill—I’m not advocating an explicit focus on short-term problems. But we shouldn’t screen off the value we can get from sharing imperfect results.
There are many measures of impact which one can come up with; as you say, all of these create other problems when optimized very hard, because the AI can find clever ways to have a very low impact, and these end up being counter to our intentions. Your whitelisting proposal has the same sorts of problems. The interesting thing is to get a notion of “low impact” exactly right, so that it doesn’t go wrong even in a very intelligent system.
I’d also like to push back slightly against an implication here—while it is now clear to me that “the interesting thing” is indeed this clinginess issue, this wasn’t apparent at the outset. Perhaps I missed something in my literature review, but there was no such discussion of the hard core of the impact-measure problem; Eliezer certainly discussed a few naive approaches, but the literature was otherwise rather slim.
Penalizing a shift in probability distributions can incentivize the agent to learn as little as possible, which is a bit weird.
Yeah, I noticed this too, but I put that under “how do we get agents to want to learn about how the world is—i.e., avoid wireheading?”. I also think that function composition with the raw utility would be helpful in avoiding weird interplay.
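For instance, one purely illustrative possibility (I haven’t worked out whether it behaves well) is to combine the penalty with the raw utility through something other than a straight subtraction, e.g.

```latex
u'(s) \;=\; \frac{u(s)}{1 + \lambda \cdot \mathrm{Penalty}(s)},
```

so that the penalty attenuates the raw utility multiplicatively instead of trading off against it additively.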
Certain versions will have the property that if the agent is already quite confident in what it will do, then the consequences of those actions do not count as “changes” (there is no shift in probability when we condition on the act). This would create a loophole allowing any action to count as “low impact” under the right conditions.
I don’t follow—the agent has a distribution for an object at time t, and another at t+1. It penalizes based on changes in its beliefs about the actual world at the time steps—not with respect to its expectation.
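In symbols (my paraphrase of the distinction, writing b_t(o) for the agent’s posterior belief about object o’s class given everything observed up through step t):

```latex
\mathrm{penalty}_t \;=\; \sum_{o} f\big( b_t(o),\, b_{t+1}(o) \big)
\quad\text{rather than}\quad
\sum_{o} f\big( \mathbb{E}_t\!\left[ b_{t+1}(o) \right],\, b_{t+1}(o) \big),
```

where f charges only for mass moving along non-whitelisted class transitions. The baseline is the earlier posterior about the world as it actually is, not what the agent expected its later beliefs to be, so being confident about its own actions doesn’t by itself zero out the penalty.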
“this kind of work may have the unintended consequence of pushing people who would have otherwise worked on hard core problems of x-risk to more prosaic projects, lulling them into a false sense of security when progress is made.”
I think it is more like:
- This kind of work seems likely to one day redirect funding intended for X-risk away from X-risk.
- I know people who would point to this kind of thing to argue that AI can be made safe without the kind of deep decision theory thinking MIRI is interested in. Those people would probably argue against X-risk research regardless, but the more stuff there is that’s difficult for outsiders to distinguish from X-risk relevant research, the more difficulty outsiders have assessing such arguments.
So it isn’t so much that I think people who would work on X-risk would be redirected, as that I think there will be a point where people adjacent to X-risk research will have difficulty telling which people are actually trying to work on X-risk, and also what the state of the X-risk concerns is (I mean to what extent it has been addressed by the research).
As I understand it, this argument can also be applied to any work that doesn’t plausibly one-shot a significant alignment problem, potentially including research by OpenAI and DeepMind. While obviously we’d all prefer one-shots, sometimes research is more incremental (I’m sure this isn’t news to you!). Here, I set out to make progress on one of the Concrete Problems; after doing so, I thought “does this scale? What insights can we take away?”. I had relaxed the problem by assuming a friendly ontology, and I was curious what difficulties (if any) remained.
I agree that research has to be incremental. It should be taken almost for granted that anything currently written about the subject is not anywhere near a real solution even to a sub-problem of safety, unless otherwise stated. If I had to point out one line which caused me to have such a skeptical reaction to your post, it would be:
I’m fairly confident that whitelisting contributes meaningfully to short- to mid-term AI safety,
If instead this had been presented as “here’s something interesting which doesn’t work”, I would not have made the objection I made. IE, what’s important is not any contribution to near- or medium-term AI safety, but rather exploration of the landscape of low-impact RL, which may eventually contribute to reducing X-risk. IE, more the attitude you express here:
We are currently grading this approach by the most rigorous of metrics—I think this is good, as that’s how we will eventually be judged! However, we shouldn’t lose sight of the fact that most safety work won’t be immediately superintelligence-complete. Exploratory work is important.
So, I’m saying that exploratory work should not be justified as “confident that this contributes meaningfully to short-term safety”. Almost everything at this stage is more like “maybe useful for one day having better thoughts about reducing X-risk, maybe not”.
I don’t follow—the agent has a distribution for an object at time t, and another at t+1. It penalizes based on changes in its beliefs about how the world actually is at the time steps—not with respect to its expectation.
So it isn’t so much that I think people who would work on X-risk would be redirected, as that I think there will be a point where people adjacent to X-risk research will have difficulty telling which people are actually trying to work on X-risk, and also what the state of the X-risk concerns is (I mean to what extent it has been addressed by the research).
That makes more sense. I haven’t thought enough about this aspect to have a strong opinion yet. My initial thoughts are that
- this problem can be basically avoided if this kind of work clearly points out where the problems would be if scaled.
- I do think it’s plausible that some less-connected funding sources might get confused (NSF), but I’d be surprised if later FLI funding got diverted because of this. I think this kind of work will be done anyways, and it’s better to have people who think carefully about scale issues doing it.
- your second bullet point reminds me of how some climate change skeptics will point to “evidence” from “scientists”, as if that’s what convinced them. In reality, however, they’re drawing the bottom line first, and then pointing to what they think is the most dignified support for their position. I don’t think that avoiding this kind of work would ameliorate that problem—they’d probably just find other reasons.
- most people on the outside don’t understand x-risk anyways, because it requires thinking rigorously in a lot of ways to not end up a billion miles off of any reasonable conclusion. I don’t think this additional straw will add significant confusion at the margin.
IE, what’s important is not any contribution to near- or medium-term AI safety
I’m confused how “contributes meaningfully to short-term safety” and “maybe useful for having better thoughts” are mutually-exclusive outcomes, or why it’s wrong to say that I think my work contributes to short-term efforts. Sure, that may not be what you care about, but I think it’s still reasonable that I mention it.
I’m saying that exploratory work should not be justified as “confident that this contributes meaningfully to short-term safety”. Almost everything at this stage is more like “maybe useful for one day having better thoughts about reducing X-risk, maybe not”.
I’m confused why that latter statement wasn’t what came across! Later in that sentence, I state that I don’t think it will scale. I also made sure to highlight how it breaks down in a serious way when scaled up, and I don’t think I otherwise implied that it’s presently safe for long-term efforts.
I totally agree that having better thoughts about x-risk is a worthy goal at this point.
I’m confused how “contributes meaningfully to short-term safety” and “maybe useful for having better thoughts” are mutually-exclusive outcomes, or why it’s wrong to say that I think my work contributes to short-term efforts. Sure, that may not be what you care about, but I think it’s still reasonable that I mention it.
In hindsight I am regretting the way my response went. While it was my honest response, antagonizing newcomers to the field for paying any attention to whether their work might be useful for sub-AGI safety doesn’t seem like a good way to create the ideal research atmosphere. Sorry for being a jerk about it.
Although I did flinch a bit, my S2 reaction was “this is Abram, so if it’s criticism, it’s likely very high-quality. I’m glad I’m getting detailed feedback, even if it isn’t all positive”. Apology definitely accepted (although I didn’t view you as being a jerk), and really—thank you for taking the time to critique me a bit. :)