This is probably not the answer you are looking for, but as you are considering putting a lot of work into this...
Does anyone know if this has been done? If not, I might try to make it.
Probably has been done, but depends on what you mean by strongest arguments.
Does strongest mean that the argument has a lot of rhetorical power, so that it will convince people that alignment failure is more plausible than it actually is? Or does strongest mean that it gives the audience the best possible information about the likelihood of various levels of misalignment, where these levels go from ‘annoying but can be fixed’ to ‘kills everybody and converts all matter in its light cone to paperclips’?
Also, the strongest argument when you address an audience of type A, say policy makers, may not be the strongest argument for an audience of type B, say ML researchers.
My main message here, I guess, is that many distilled collections of arguments already exist, even book-length ones like Superintelligence, Human Compatible, and The Alignment Problem. If you are thinking about adding to this mountain of existing work, you need to carefully ask yourself who your target audience is, and what you want to convince them of.
depends on what you mean by strongest arguments.
By strongest I definitely mean the second thing (probably I should have clarified here, thanks for picking up on this).
Also, the strongest argument when you address an audience of type A, say policy makers, may not be the strongest argument for an audience of type B, say ML researchers.
Agree, though I expect it’s more like, the emphasis needs to be different, whilst the underlying argument is similar (conditional on talking about your second definition of “strongest”).
many distilled collections of arguments already exist, even book-length ones like Superintelligence, Human Compatible, and The Alignment Problem.
Probably I should have clarified some more here. By “distilled”, I mean:
a really short summary (e.g. <1 page for each argument, with links to literature which discuss the argument’s premises)
that makes it clear what the epistemic status of the argument is.
Those books aren’t short, and neither do they focus on working out exactly how strong the case for alignment failure is, but rather on drawing attention to the problem and claiming that more work needs to be done on the current margin (which I absolutely agree with).
I also don’t think they focus on surveying the range of arguments for alignment failure, but rather on presenting the author’s particular view.
If there are distilled collections of arguments with these properties, please let me know!
(As some more context for my original question: I’m most interested in arguments for inner alignment failure. I’m pretty confused by the fact that some researchers seem to think inner alignment is the main problem and/or probably extremely difficult, and yet I haven’t really heard a rigorous case made for its plausibility.)
I’m most interested in arguments for inner alignment failure. I’m pretty confused by the fact that some researchers seem to think inner alignment is the main problem and/or probably extremely difficult, and yet I haven’t really heard a rigorous case made for its plausibility.
I have not read all the material about inner alignment that has
appeared on this forum, but I do occasionally read up on it.
There are some posters on this forum who believe that contemplating a
set of problems which are together called ‘inner alignment’ can work
as an intuition pump that would allow us to make needed conceptual
breakthroughs. The breakthroughs sought have mostly to do, I believe,
with analyzing possibilities for post-training treacherous turns which
have so far escaped notice. I am not (no longer) one of the posters
who have high hopes that inner alignment will work as a useful
intuition pump.
The terminology problem I have with the term ‘inner alignment’ is that
many working on it never make the move of defining it in rigorous
mathematics, or with clear toy examples of what are and what are not
inner alignment failures. Absent either a mathematical definition or
some defining examples, I am not able to judge if inner alignment is
either the main alignment problem, or whether it would be a minor one,
but still one that is extremely difficult to solve.
What does not help here is that there are by now several non-mathematical
notions floating around of what an inner alignment failure even is, to
the extent that Evan has felt a need to write an entire clarification
post.
When poster X calls something an example of an inner alignment
failure, poster Y might respond and declare that in their view of
inner alignment failure, it is not actually an example of an inner
alignment failure, or not a very good example of an inner alignment
failure. If we interpret it as a meme, then the meme of inner
alignment has a reproduction strategy where it reproduces by
triggering social media discussions about what it means.
Inner alignment has become what Minsky called a suitcase word:
everybody packs their own meaning into it. This means that for the
purpose of distillation, the word is best avoided. If you want to
distil the discussion, my recommendation is to look for the meanings
that people pack into the word.
I’m broadly sympathetic to your point that there have been an unfortunate number of disagreements about inner alignment terminology, and it has been and remains a source of confusion.
to the extent that Evan has felt a need to write an entire clarification post.
Yeah, and recently there have been even more disagreement/clarification attempts.
I should have specified this on the top level question, but (as mentioned in my own answer) I’m talking about abergal’s suggestion of what inner alignment failure should refer to (basically: a model pursuing a different objective to the one it was trained on, when deployed out-of-distribution, while retaining most or all of the capabilities it had on the training distribution). I agree this isn’t crisp and is far from a mathematical formalism, but note that there are several examples of this kind of failure in current ML systems that help to clarify what the concept is, and people seem to agree on these examples.
If you can think of toy examples that make real trouble for this definition of inner alignment failure, then I’d be curious to hear what they are.
Meta: I usually read these posts via the alignmentforum.org portal, and this portal filters out certain comments, so I missed your mention of abergal’s suggestion, which would have clarified your concerns about inner alignment arguments for me. I have mailed the team that runs the website to ask if they could improve how this filtering works.
Just read the post with the examples you mention, and skimmed the related arxiv paper. I like how the authors develop the metrics of ‘objective robustness’ vs ‘capability robustness’ while avoiding the problem of trying to define a single meaning for the term ‘inner alignment’. Seems like good progress to me.
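To make the two metrics concrete, here is a minimal toy sketch of the distinction. It is not taken from the post or the paper: the corridor environment, the `always_right_policy` stand-in for a trained agent, and all names and constants are hypothetical. The agent is rewarded for ending an episode on the coin; in training the coin always sits in the last cell, so "walk right" and "get the coin" are indistinguishable.

```python
import random

CORRIDOR_LENGTH = 10  # hypothetical toy setting, not taken from the paper


def sample_coin_cell(rng, in_distribution):
    """Training: coin always in the last cell. Deployment: coin in some earlier cell."""
    if in_distribution:
        return CORRIDOR_LENGTH - 1
    return rng.randrange(1, CORRIDOR_LENGTH - 1)


def always_right_policy(position):
    """Stand-in for a learned policy that picked up 'go right' as a proxy objective."""
    return min(position + 1, CORRIDOR_LENGTH - 1)


def run_episode(policy, coin_cell):
    """Return (reached_end, ended_on_coin) for one episode starting at cell 0."""
    position = 0
    for _ in range(CORRIDOR_LENGTH):
        position = policy(position)
    return position == CORRIDOR_LENGTH - 1, position == coin_cell


def evaluate(policy, in_distribution, episodes=1000, seed=0):
    """Fraction of episodes in which the agent navigates competently
    (capability) and in which it ends on the coin (intended objective)."""
    rng = random.Random(seed)
    capable = 0
    on_objective = 0
    for _ in range(episodes):
        coin_cell = sample_coin_cell(rng, in_distribution)
        reached_end, ended_on_coin = run_episode(policy, coin_cell)
        capable += reached_end
        on_objective += ended_on_coin
    return capable / episodes, on_objective / episodes


train_scores = evaluate(always_right_policy, in_distribution=True)
deploy_scores = evaluate(always_right_policy, in_distribution=False)
print("training   (capability, objective):", train_scores)
print("deployment (capability, objective):", deploy_scores)
```

In this sketch the capability score stays at 1.0 under the distribution shift while the objective score drops to 0.0, which is the pattern the definition above picks out: capabilities are retained, but they are aimed at a different objective. A real experiment would of course use a trained policy rather than a hand-written one.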
I also don’t think [these three books] focus on surveying the range of arguments for alignment failure, but rather on presenting the author’s particular view.
I disagree. In my reading, all of these books offer fairly
wide-ranging surveys of alignment failure mechanisms.
A more valid criticism would be that the authors spend most of their
time on showing that all of these failure mechanisms are theoretically
possible, without spending much time discussing how likely each of
them is in practice. Once we take it as axiomatic that some
people are stupid some of the time, presenting a convincing proof that
some AI alignment failure mode is theoretically possible does not
require much heavy lifting at all.
If there are distilled collections of arguments with these properties, please let me know!
The collection of posts under the threat models
tag may be what you
are looking for: many of these posts highlight the particular risk
scenarios the authors feel are most compelling or likely.
The main problem with distilling this work into, say, a top 3 of most
powerful 1-page arguments is that we are not dealing purely with
technology-driven failure modes.
There is a technical failure mode story which says that it is very
difficult to equip a very powerful future AI with an emergency stop button, and that we
have not solved that technical problem yet. In fact, this story is a
somewhat successful meme in its own right: it appears in all 3 books I
mentioned. That story is not very compelling to me. We have plenty
of technical options for building emergency stop buttons, see for
example my post
here.
There have been some arguments that none of the identified technical options for
building AI stop buttons will be useful or used, because they will all
turn out to be incompatible with yet-undiscovered future powerful AI designs. I feel that
these arguments show a theoretical possibility, but I think it is a very low possibility,
so in practice these arguments are not very compelling to me. The more compelling failure
mode argument is that people will refuse to use the emergency AI stop button, even
though it is available.
Many of the posts with the tag above show failure scenarios where the
AI fails to be aligned because of an underlying weakness or structural
problem in society. These are scenarios where society fails to take
the actions needed to keep its AIs aligned.
One can observe that in recent history, society has mostly failed
to take the actions needed to keep major parts of the global economy
aligned with human needs. See for example the oil industry and
climate change. Or the cigarette industry and health.
One can be a pessimist, and use our past performance on climate change
to predict how good we will be in handling the problem of keeping
powerful AI under control. Like oil, AI is a technology that has
compelling short-term economic benefits. This line of thought would
offer a very powerful 1-page AI failure mode argument. To a
pessimist.
Or one can be an optimist, and argue that the case of climate change
is teaching us all very valuable lessons, so we are bound to handle AI
better than oil. So will you be distilling for an audience of
pessimists or optimists?
There is a political line of thought, which I somewhat subscribe to,
that optimism is a moral duty. This has kept me from spending much energy
myself on rationally quantifying the odds of different failure mode
scenarios. I’d rather spend my energy in finding ways to improve the
odds. When it comes to the political sphere, many problems often
seem completely intractable, until suddenly they are not.
A more valid criticism would be that the authors spend most of their time on showing that all of these failure mechanisms are theoretically possible, without spending much time discussing how likely each of them is in practice
Sure, I agree this is a stronger point.
The collection of posts under the threat models tag may be what you are looking for: many of these posts highlight the particular risk scenarios the authors feel are most compelling or likely.
Not really, unfortunately. In those posts, the authors are focusing on painting a plausible picture of what the world looks like if we screw up alignment, rather than analysing the arguments that we should expect alignment failures in the first place—which is what I’m interested in (with the exception of Steven’s scenario, who already answered here).
The main problem with distilling this work into, say, a top 3 of most powerful 1-page arguments is that we are not dealing purely with technology-driven failure modes.
I fully agree that thinking through e.g. incentives that different actors will have in the lead up to TAI, the interaction between AI technology and society, etc. is super important. But we can think through those things as well—e.g. we can look at historical examples of humanity being faced with scenarios where the global economy is (mis)aligned with human needs, and reason about the extent to which AI will be different. I’d count all of that as part of the argument to expect alignment failure. Yes, as soon as you bring societal interactions into the mix, things become a whole lot more complicated. But that isn’t reason not to try.
As it stands, I don’t think there are super clear arguments for alignment failure that take into account interactions between AI tech and society that are ready to be distilled down, though I tried doing some of it here.
Equally, much of the discussion (and predictions of many leading thinkers in this space) is premised on technical alignment failure being the central concern (i.e. if we had better technical alignment solutions, we would manage to avoid existential catastrophe). I don’t want to argue about whether that’s correct here, but just want to point out that at least some people think that at least some of the plausible failure modes are mostly technology-driven.
So will you be distilling for an audience of pessimists or optimists?
Neither—just trying to think clearly through the arguments on both sides.
In the particular case you describe, I find the “pessimist” side more compelling, because I don’t see much evidence that humanity has really learned any lessons from oil and climate change. In particular, we still don’t know how to solve collective action problems.
This has kept me from spending much energy myself on rationally quantifying the odds of different failure mode scenarios. I’d rather spend my energy in finding ways to improve the odds.
Yeah, I’m sympathetic to this line of thought, and I think I personally tend to err on the side of trying to spend too much energy on quantifying odds and not enough on acting.
However, to the extent that you’re impartial between different ways of trying to improve the odds (e.g. working on technical AI alignment vs other technical AI safety vs AI policy vs meta interventions vs other cause areas entirely), then it still pays to work out (e.g.) how plausible AI alignment failure is, in order to inform your decision about what to do if you want to have the best chance of helping.
Not really, unfortunately. In those posts [under the threat models tag], the authors are focusing on painting a plausible picture of what the world looks like if we screw up alignment, rather than analysing the arguments that we should expect alignment failures in the first place.
I feel that Christiano’s post here is pretty good at identifying plausible failure modes inside society that lead to unaligned agents not being corrected. My recollection of that post is partly why I mentioned the posts under that tag.
There is an interesting question of methodology here: if you want to estimate the probability that society will fail in this way in handling the impact of AI, do you send a poll to a bunch of AI technology experts, or should you be polling a bunch of global warming activists or historians of the tobacco industry instead? But I think I am reading in your work that this question is no news to you.
Several of the AI alignment organisations you polled have people in them who produced work like this examination of the nuclear arms race. I wonder what happens in your analysis of your polling data if you single out this type of respondent specifically. In my own experience in analysing polling results with this type of response rate, I would be surprised however if you could find a clear signal above the noise floor.
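As a rough illustration of the noise-floor concern, the sketch below computes the 95% margin of error on the difference between two subgroup proportions, using the usual normal approximation. The sample sizes are made up for illustration and are not the survey's actual numbers, and the normal approximation is itself crude for groups this small.

```python
import math


def difference_margin(p1, n1, p2, n2, z=1.96):
    """Approximate 95% margin of error for the difference between two
    independent sample proportions (normal approximation)."""
    variance = p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2
    return z * math.sqrt(variance)


# Hypothetical numbers: 8 respondents with a background in historical or
# societal analysis versus 60 other respondents, both near 50% agreement.
margin = difference_margin(0.5, 8, 0.5, 60)
print(f"margin of error on the difference: +/- {margin:.2f}")
```

With subgroup sizes like these, the two groups would need to differ by more than about 0.37 in their answer proportions before the difference stands out from sampling noise, which is why singling out a small class of respondents rarely yields a clear signal.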
However [...] it still pays to work out (e.g.) how plausible AI alignment failure is, in order to inform your decision about what to do if you want to have the best chance of helping.
Agree, that is why I am occasionally reading various posts with failure scenarios and polls of experts. To be clear: my personal choice of alignment research subjects is only partially motivated by what I think is the most important work to do, if I want to have the best chance of helping. Another driver is that I want to have some fun with mathematics. I tend to work on problems which lie in the intersection of those two fuzzy sets.