I also don’t think [these three books] focus on surveying the range of arguments for alignment failure, but rather on presenting the author’s particular view.
I disagree. In my reading, all of these books offer fairly wide-ranging surveys of alignment failure mechanisms.
A more valid criticism would be that the authors spend most of their time showing that all of these failure mechanisms are theoretically possible, without spending much time discussing how likely each of them is in practice. Once we take it as axiomatic that some people are stupid some of the time, presenting a convincing proof that some AI alignment failure mode is theoretically possible does not require much heavy lifting at all.
If there are distilled collections of arguments with these properties, please let me know!
The collection of posts under the threat models
tag may be what you
are looking for: many of these posts highlight the particular risk
scenarios the authors feel are most compelling or likely.
The main problem with distilling this work into, say, a top 3 of most
powerful 1-page arguments is that we are not dealing purely with
technology-driven failure modes.
There is a technical failure mode story which says that it is very difficult to equip a very powerful future AI with an emergency stop button, and that we have not solved that technical problem yet. In fact, this story is a somewhat successful meme in its own right: it appears in all 3 books I mentioned. That story is not very compelling to me. We have plenty of technical options for building emergency stop buttons; see for example my post here.
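To make the naive end of that design space concrete, here is a purely illustrative sketch (a hypothetical toy example, not the mechanism from my post above): an agent loop wrapped so that an externally settable stop flag halts action selection. The hard technical questions are about whether a capable agent would acquire an incentive to prevent the flag from ever being set, which a wrapper like this does nothing to address.

```python
import threading

class StopButton:
    """A naive emergency stop: an externally settable flag that the agent loop checks."""
    def __init__(self):
        self._stopped = threading.Event()

    def press(self):
        self._stopped.set()

    def is_pressed(self):
        return self._stopped.is_set()

def run_agent(policy, environment, stop_button, max_steps=1000):
    """Run the agent until the episode ends or the stop button is pressed.

    `policy` maps an observation to an action; `environment` exposes `reset()`
    and `step(action)` in the usual RL style. Both are hypothetical placeholders.
    """
    obs = environment.reset()
    for _ in range(max_steps):
        if stop_button.is_pressed():
            # Halt before taking any further action in the world.
            return "stopped by operator"
        action = policy(obs)
        obs, reward, done = environment.step(action)
        if done:
            return "episode finished"
    return "step limit reached"
```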
There have been some arguments that none of the identified technical options for
building AI stop buttons will be useful or used, because they will all
turn out to be incompatible with yet-undiscovered future powerful AI designs. I feel that these arguments establish a theoretical possibility, but one I consider very unlikely, so in practice these arguments are not very compelling to me. The more compelling failure
mode argument is that people will refuse to use the emergency AI stop button, even
though it is available.
Many of the posts with the tag above show failure scenarios where the
AI fails to be aligned because of an underlying weakness or structural
problem in society. These are scenarios where society fails to take
the actions needed to keep its AIs aligned.
One can observe that in recent history, society has mostly failed
to take the actions needed to keep major parts of the global economy
aligned with human needs. See for example the oil industry and
climate change. Or the cigarette industry and health.
One can be a pessimist, and use our past performance on climate change to predict how well we will handle the problem of keeping powerful AI under control. Like oil, AI is a technology that has
compelling short-term economic benefits. This line of thought would
offer a very powerful 1-page AI failure mode argument. To a
pessimist.
Or one can be an optimist, and argue that the case of climate change
is teaching us all very valuable lessons, so we are bound to handle AI
better than oil. So will you be distilling for an audience of
pessimists or optimists?
There is a political line of thought, which I somewhat subscribe to,
that optimism is a moral duty. This has kept me from spending much energy
myself on rationally quantifying the odds of different failure mode
scenarios. I’d rather spend my energy in finding ways to improve the
odds. When it comes to the political sphere, many problems often seem completely intractable, until suddenly they are not.
A more valid criticism would be that the authors spend most of their time showing that all of these failure mechanisms are theoretically possible, without spending much time discussing how likely each of them is in practice.
Sure, I agree this is a stronger point.
The collection of posts under the threat models tag may be what you are looking for: many of these posts highlight the particular risk scenarios the authors feel are most compelling or likely.
Not really, unfortunately. In those posts, the authors are focusing on painting a plausible picture of what the world looks like if we screw up alignment, rather than analysing the arguments that we should expect alignment failures in the first place—which is what I’m interested in (with the exception of Steven’s scenario; he already answered here).
The main problem with distilling this work into, say, a top 3 of most powerful 1-page arguments is that we are not dealing purely with technology-driven failure modes.
I fully agree that thinking through e.g. incentives that different actors will have in the lead-up to TAI, the interaction between AI technology and society, etc. is super important. But we can think through those things as well—e.g. we can look at historical examples of humanity being faced with scenarios where the global economy is (mis)aligned with human needs, and reason about the extent to which AI will be different. I’d count all of that as part of the argument to expect alignment failure. Yes, as soon as you bring societal interactions into the mix, things become a whole lot more complicated. But that isn’t a reason not to try.
As it stands, I don’t think there are super clear arguments for alignment failure that take into account interactions between AI tech and society that are ready to be distilled down, though I tried doing some of it here.
Equally, much of the discussion (and predictions of many leading thinkers in this space) is premised on technical alignment failure being the central concern (i.e. if we had better technical alignment solutions, we would manage to avoid existential catastrophe). I don’t want to argue about whether that’s correct here, but just want to point out that at least some people think that at least some of the plausible failure modes are mostly technology-driven.
So will you be distilling for an audience of pessimists or optimists?
Neither—just trying to think clearly through the arguments on both sides.
In the particular case you describe, I find the “pessimist” side more compelling, because I don’t see much evidence that humanity has really learned any lessons from oil and climate change. In particular, we still don’t know how to solve collective action problems.
This has kept me from spending much energy myself on rationally quantifying the odds of different failure mode scenarios. I’d rather spend my energy in finding ways to improve the odds.
Yeah, I’m sympathetic to this line of thought, and I think I personally tend to err on the side of trying to spend too much energy on quantifying odds and not enough on acting.
However, to the extent that you’re impartial between different ways of trying to improve the odds (e.g. working on technical AI alignment vs other technical AI safety vs AI policy vs meta interventions vs other cause areas entirely), then it still pays to work out (e.g.) how plausible AI alignment failure is, in order to inform your decision about what to do if you want to have the best chance of helping.
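As a toy illustration of why that estimate matters, with entirely made-up numbers (not anyone’s real figures): the ranking of options can flip depending on how plausible you think alignment failure is.

```python
# Toy prioritisation arithmetic with made-up numbers, purely to illustrate
# why an estimate of P(alignment failure) feeds into the choice of where to work.

def best_option(p_alignment_failure):
    # Hypothetical "expected good done" per career, in arbitrary units:
    # alignment work only helps in worlds where alignment would otherwise fail,
    # while the alternative cause area helps regardless of that probability.
    options = {
        "technical alignment": p_alignment_failure * 100,
        "other cause area": 20,
    }
    return max(options, key=options.get)

print(best_option(0.4))   # -> technical alignment
print(best_option(0.05))  # -> other cause area
```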
Not really, unfortunately. In those posts [under the threat models tag], the authors are focusing on painting a plausible picture of what the world looks like if we screw up alignment, rather than analysing the arguments that we should expect alignment failures in the first place.
I feel that Christiano’s post here is pretty good at identifying plausible failure modes inside society that lead to unaligned agents not being corrected. My recollection of that post is partly why I mentioned the posts under that tag.
There is an interesting question of methodology here: if you want to estimate the probability that society will fail in this way in handling the impact of AI, do you send a poll to a bunch of AI technology experts, or should you be polling a bunch of global warming activists or historians of the tobacco industry instead? But I gather from your work that this question is no news to you.
Several of the AI alignment organisations you polled have people in them who produced work like this examination of the nuclear arms race. I wonder what happens in your analysis of your polling data if you single out this type of respondent specifically. In my own experience analysing polling results with this type of response rate, however, I would be surprised if you could find a clear signal above the noise floor.
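To make that noise-floor worry concrete with purely hypothetical numbers: if only a handful of respondents fall into such a subgroup, the sampling error on any proportion you compute from their answers is wide enough to swamp most plausible differences between groups.

```python
import math

def proportion_ci(successes, n, z=1.96):
    """Approximate 95% confidence interval for a proportion (normal approximation)."""
    p = successes / n
    half_width = z * math.sqrt(p * (1 - p) / n)
    return max(0.0, p - half_width), min(1.0, p + half_width)

# Hypothetical numbers: 40 respondents in total, of whom 6 are in the
# singled-out subgroup. The subgroup interval is much wider.
print(proportion_ci(successes=24, n=40))  # full sample: roughly (0.45, 0.75)
print(proportion_ci(successes=4, n=6))    # subgroup: roughly (0.29, 1.0)
```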
However [...] it still pays to work out (e.g.) how plausible AI alignment failure is, in order to inform your decision about what to do if you want to have the best chance of helping.
Agree, that is why I am occasionally reading various posts with failure scenarios and polls of experts. To be clear: my personal choice of alignment research subjects is only partially motivated by what I think is the most important work to do, if I want to have the best chance of helping. Another driver is that I want to have some fun with mathematics. I tend to work on problems which lie in the intersection of those two fuzzy sets.