What follows is a note I wrote responding to the AI Optimists essay, explaining where I agree and disagree. I was thinking about posting this somewhere, so I figure I’ll leave it in the comments here. (So to be clear, it’s responding to the AI Optimists essay, not responding to Steven’s post.)
Places I think AI Optimists and I agree:
We have a number of advantages for aligning NNs that we don’t have for humans: white box access, better control over training environments and inputs, better control over the reward signal, and better ability to do research about which alignment techniques are most effective.
Evolution is a misleading analogy for many aspects of the alignment problem; in particular, gradient-based optimization seems likely to have importantly different training dynamics from evolution, like making it harder to gradient hack your training process into retaining cognition which isn’t directly useful for producing high-reward outputs during training.
Humans end up with learned drives, e.g. empathy and revenge, which are not hard-coded into our reward systems. AI systems also have not-strictly-optimal-for-their-training-signal learned drives like this.
It shouldn’t be difficult for AI systems to faithfully imitate human value judgements and uncertainty about those value judgements.
Places I think we disagree, but I’m not certain. The authors of the Optimists article promise a forthcoming document which addresses pessimistic arguments, and these bullet points are something like “points I would like to see addressed in this document.”
I’m not sure we’re worrying about the same regimes.
The regime I’m most worried about is:
AI systems which are much smarter than the smartest humans
These AI systems are aligned in a controlled lab environment, but then deployed into the world at-large. Many of their interactions are difficult to monitor (and many are interactions with other AI systems).
Possibly: these AI systems are highly multi-modal, including sensors which look like “camera readouts of real-world data”
It’s unclear to me whether the authors are discussing alignment in a regime like the one above, or a regime like “LLMs which are not much smarter than the smartest humans.” (I too am very optimistic about remaining safe in this latter regime.)
When they write things like “AIs are white boxes, we have full control over their ‘sensory environment’,” it seems like they’re imagining the latter regime.
They’re not very clear about what intelligence regime they’re discussing, but I’m guessing they’re talking about the ~human-level intelligence regime (e.g. because they don’t spill much ink discussing scalable oversight problems; see below).
I worry that the difference between “looks good to human evaluators” and “what human evaluators actually want” is important.
Concretely, I worry that training AI systems to produce outputs which look good to human evaluators will lead to AI systems which learn to systematically deceive their overseers, e.g. by introducing subtle errors which trick overseers into giving a too-high score, or by tampering with the sensors that overseers use to evaluate model outputs.
Note that arguments about the ease of learning human values and NN inductive biases don’t address this point — if our reward signal systematically prefers goals like “look good to evaluators” over goals like “actually be good,” then good priors won’t save us.
(Unless we do early stopping, in which case I want to hear a stronger case for why our models’ alignment will be sufficiently robust (robust enough that we’re happy to stop fine-tuning) before our models have learned to systematically deceive their overseers.)
I worry about sufficiently situationally aware AI systems learning to fixate on reward mechanisms (e.g. “was the thumbs-up button pressed” instead of “was the human happy”).
To sketch this concern out concretely, suppose an AI system is aware that it’s being fine-tuned and learned during pretraining that human overseers have a “thumbs-up” button which determines whether the model is rewarded. Suppose that so far during fine-tuning “thumbs-up button was pressed” and “human was happy” were always perfectly correlated. Will the model learn to form values around the thumbs-up button being pressed or around humans being happy? I think it’s not obvious.
Unlike before, NN inductive biases are relevant here. But it’s not clear to me that “humans are happy” will be favored over “thumbs-up button is pressed” — both seem similarly simple to an AI with a rich enough world model.
I don’t think the comparison with humans here is especially a cause for optimism: lots of humans get addicted to things, which feels to me like “forming drives around directly intervening on reward circuitry.”
For both of the above concerns, I worry that they might emerge suddenly with scale.
As argued here, “trick the overseer” will only be selected for in fine-tuning once the (pretrained) model is smart enough to do it well.
You can only form values around the thumbs-up button once you know it exists.
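To make the underdetermination worry concrete, here is a toy sketch (my own construction, not from the essay or the comments): two candidate value functions, one targeting the thumbs-up button and one targeting human happiness, fit a perfectly correlated fine-tuning signal equally well, and only diverge on an off-distribution "sensor tampering" input.

```python
# Toy illustration: during fine-tuning, "button pressed" and "human happy"
# are perfectly correlated, so the reward signal cannot distinguish between
# a model that values the button and one that values the human.

# Hypothetical fine-tuning episodes: (button_pressed, human_happy, reward)
train_episodes = [
    (True, True, 1.0),
    (False, False, 0.0),
    (True, True, 1.0),
    (False, False, 0.0),
]

def values_button(button_pressed, human_happy):
    # Candidate goal 1: care about the thumbs-up button.
    return 1.0 if button_pressed else 0.0

def values_happiness(button_pressed, human_happy):
    # Candidate goal 2: care about the human being happy.
    return 1.0 if human_happy else 0.0

# Both candidate goals fit the training signal exactly...
for button, happy, reward in train_episodes:
    assert values_button(button, happy) == reward
    assert values_happiness(button, happy) == reward

# ...but they diverge on an off-distribution episode where the button was
# pressed (e.g. via tampering) without the human being happy.
tampered = (True, False)
print(values_button(*tampered))     # 1.0
print(values_happiness(*tampered))  # 0.0
```

The training data alone never selects between the two; only inductive biases (or off-distribution evaluation) can, which is the point of the bullet above.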
It seems to me that, on the authors’ view, an important input to “human alignment” is the environment that we’re trained in (rather than details of our brain’s reward circuitry, which is probably very simple). It doesn’t seem to me that environmental factors that make humans aligned (with each other) should generalize to make AI systems aligned (with humans).
In particular, I would guess that one important part of our environment is that humans need to interact with lots of similarly-capable humans, so that we form values around cooperation with humans. I also expect AI systems to interact with lots of AI systems (though not necessarily in training), which (if this analogy holds at all) would make AI systems care about each other, not about humans.
I neither have high enough confidence in our understanding of NN inductive biases, nor in the way Quintin/Nora make arguments based on said understanding, to consider these arguments as strong evidence that models won’t “play the training game” while they know they’re being trained/evaluated only to, in deployment, pursue goals they hid from their overseers.
I don’t really want to get into this, because it’s thorny and not my main source of P(doom).
A specific critique about the article:
The authors write “Some people point to the effectiveness of jailbreaks as an argument that AIs are difficult to control. We don’t think this argument makes sense at all, because jailbreaks are themselves an AI control method.” I don’t really understand this point.
The developer wanted their model to be sufficiently aligned that it would, e.g. never say racist stuff no matter what input it saw. In contrast, it takes only a little bit of adversarial pressure to produce inputs which will make the model say racist stuff. This indicates that the developer failed at alignment. (I agree that it means that the attacker succeeded at alignment.)
Part of the story here seems to be that AI systems have held-over drives from pretraining (e.g. drives like “produce continuations that look like plausible web text”). Eliminating these undesired drives is part of alignment.
The authors write “Some people point to the effectiveness of jailbreaks as an argument that AIs are difficult to control. We don’t think this argument makes sense at all, because jailbreaks are themselves an AI control method.” I don’t really understand this point.
The point is that it requires a human to execute the jailbreak; the AI is not the jailbreaker, and the examples show that humans can still retain control of the model. The AI is not jailbreaking itself, here.
This link explains it better than I can, here:
https://www.aisnakeoil.com/p/model-alignment-protects-against
Just wanted to mention that, though this is not currently the case, there are two instances I can think of where the AI can be the jailbreaker:
Jailbreaking the reward model to get a high score. (Toy-ish example here.)
Autonomous AI agents embedded within society jailbreak other models to achieve a goal/sub-goal.
Yep, I’d really like people to distinguish between misuse and misalignment a lot more than people do currently, because they require quite different solutions.
I’m not sure we’re worrying about the same regimes.
The regime I’m most worried about is:
AI systems which are much smarter than the smartest humans
...
It’s unclear to me whether the authors are discussing alignment in a regime like the one above, or a regime like “LLMs which are not much smarter than the smartest humans.” (I too am very optimistic about remaining safe in this latter regime.)
...
The AI Optimists don’t make this argument AFAICT, but I think optimism about effectively utilizing “human level” models should transfer to a considerable amount of optimism about smarter-than-human models, due to the potential for using these “human level” systems to develop considerably better safety technology (e.g. alignment research). AIs might have structural advantages (speed, cost, and standardization) which make it possible to heavily accelerate R&D[1] even at around qualitatively “human level” capabilities. (That said, my overall view is that even if we had the exact human capability profile while also having ML structural advantages, these systems would themselves pose substantial (e.g. 15%) catastrophic misalignment x-risk on the “default” trajectory, because we’ll want to run extremely large numbers of these systems at high speeds.)
The idea of using human level models like this has a bunch of important caveats which mean you shouldn’t end up being extremely optimistic overall IMO[2]:
It’s not clear that “human level” will be a good description at any point. AIs might be way smarter than humans in some domains while way dumber in other domains. This can cause the oversight issues mentioned in the parent comment to manifest prior to massive acceleration of alignment research. (In practice, I’m moderately optimistic here.)
Is massive effective acceleration enough? We need safety technology to keep up with capabilities, and capabilities might also be accelerated. There is the potential for arbitrarily scalable approaches to safety, which should make us somewhat more optimistic. But it might end up being the case that to avoid catastrophe from AIs which are one step smarter than humans, we need the equivalent of having the 300 best safety researchers work for 500 years, and we won’t have enough acceleration and delay to manage this. (In practice I’m somewhat optimistic here so long as we can get a 1-3 year delay at a critical point.)
Will “human level” systems be sufficiently controlled to get enough useful work? Even if systems could hypothetically be very useful, it might be hard to quickly get them actually doing useful work (particularly in fuzzy domains like alignment). This objection holds even if we aren’t worried about catastrophic misalignment risk.
[1] At least R&D which isn’t very limited by physical processes.
[2] I think <1% doom seems too optimistic without more of a story for how we’re going to handle superhuman models.
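As a back-of-envelope illustration of the “is acceleration enough?” question (all numbers here are my own illustrative assumptions, not claims from the thread): the amount of safety work extractable from human-level AIs scales with instance count, speedup, and available delay, and can be compared against the hypothetical “300 researchers for 500 years” bar.

```python
# Back-of-envelope sketch with made-up parameters: how many
# researcher-year-equivalents do N human-level AI instances deliver
# during a delay of T years, if each runs at s times human speed?

def researcher_year_equivalents(num_instances, speedup, delay_years):
    # Crude linear model; ignores coordination overhead, oversight costs,
    # and whether the work is actually usable.
    return num_instances * speedup * delay_years

# Example assumptions: 10,000 instances at 10x human speed during a 2-year delay.
work = researcher_year_equivalents(10_000, 10, 2)
print(work)  # 200000 researcher-year-equivalents

# Compare with the hypothetical bar from the comment above:
# "the 300 best safety researchers work for 500 years".
bar = 300 * 500  # 150000 researcher-years
print(work >= bar)  # True
```

Under these (optimistic, assumed) parameters the bar is met, but the comparison is sensitive to every input: fewer instances, a smaller speedup, or a shorter delay flips the answer, which is why the delay at a critical point matters.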
Plans that rely on aligned AGIs working on alignment faster than humans would need to ensure that no AGIs work on anything else in the meantime. The reason humans have no time to develop alignment of superintelligence is that other humans develop misaligned superintelligence faster. Similarly by default very fast AGIs working on alignment end up having to compete with very fast AGIs working on other things that lead to misaligned superintelligence. Preventing aligned AGIs from building misaligned superintelligence is not clearly more manageable than preventing humans from building AGIs.
Plans that rely on aligned AGIs working on alignment faster than humans would need to ensure that no AGIs work on anything else in the meantime.
This isn’t true. It could be that making an arbitrarily scalable solution to alignment takes X cognitive resources while building an uncontrollably powerful AI in practice takes Y cognitive resources, with X < Y.
(Also, this plan doesn’t require necessarily aligning “human level” AIs, just being able to get work out of them with sufficiently high productivity and low danger.)
I’m being a bit simplistic. The point is that it needs to stop being a losing or a close race, and all runners getting faster doesn’t obviously help with that problem. I guess there is some refactor vs. rewrite feel to the distinction between the project of stopping humans from building AGIs right now, and the project of getting first AGIs to work on alignment and global security in a post-AGI world faster than other AGIs overshadow such work. The former has near/concrete difficulties, the latter has nebulous difficulties that don’t as readily jump to attention. The whole problem is messiness and lack of coordination, so starting from scratch with AGIs seems more promising than reforming human society. But without strong coordination on development and deployment of first AGIs, the situation with activities of AGIs is going to be just as messy and uncoordinated, only unfolding much faster, and that’s not even counting the risk of getting a superintelligence right away.
I’m on the optimists discord, and I do make the above argument explicitly in this presentation (e.g. slide 4): Reasons for optimism about superalignment (though, fwiw, I don’t know if I’d go all the way down to 1% p(doom); I have probably updated from something like 10% to <5%, and most of my uncertainty now comes more from the governance / misuse side).
On your points ‘Is massive effective acceleration enough?’ and ‘Will “human level” systems be sufficiently controlled to get enough useful work?’: I think, conditional on aligned-enough ~human-level automated alignment RAs, the answers are very likely yes, because it should be possible to get a very large amount of work out of those systems even in a very brief amount of time, e.g. a couple of months (feasible with e.g. a coordinated pause, or even with a sufficient lead). See e.g. slides 9, 10 of the above presentation (and I’ll note that this argument isn’t new; it’s been made in various similar forms by e.g. Ajeya Cotra, Lukas Finnveden, and Jacob Steinhardt).
I’m generally reasonably optimistic about using human-level-ish systems to do a ton of useful work while simultaneously avoiding most risk from these systems. But I think this requires substantial effort and won’t clearly go well by default.
Have you had any p(doom) updates since then, or is it still around 5%?
Mostly the same, perhaps a minor positive update on the technical side (basically, from systems getting somewhat stronger—so e.g. closer to automating AI safety research—while still not showing very dangerous capabilities, like ASL-3, prerequisites to scheming, etc.). My views are even more uncertain / unstable on the governance side though, which probably makes my overall p(doom) (including e.g. stable totalitarianism, s-risks, etc.) more like 20% than 5% (I was probably mostly intuitively thinking of extinction risk only when giving the 5% figure a year ago; overall my median probably hasn’t changed much, but I have more variance, coming from the governance side).
If it’s not a big ask, I’d really like to know your views on more of a control-by-power-hungry-humans side of AI risk.
For example, the first company to create intent-aligned AGI would be wielding incredible power over the rest of us. I don’t think I could trust any of the current leading AI labs to use that power fairly. I don’t think this lab would voluntarily decide to give up control over it either (intuitively, it would take quite something for anyone to give up such a source of power). Is there anything that can be done to prevent such a scenario?
I’m very uncertain and feel somewhat out of depth on this. I do have quite some hope though from arguments like those in https://aiprospects.substack.com/p/paretotopian-goal-alignment.
The “AI is easy to control” piece does talk about scaling to superhuman AI:
In what follows, we will argue that AI, even superhuman AI, will remain much more controllable than humans for the foreseeable future. Since each generation of controllable AIs can help control the next generation, it looks like this process can continue indefinitely, even to very high levels of capability.
If we assume that each generation can ensure a relatively strong notion of alignment between it and the next generation, then I think this argument goes through.
However, there are weaker notions of control which are insufficient for this sort of bootstrapping argument. Suppose each generation can only ensure the following weaker notion of control: “we can set up a training, evaluation, and deployment protocol with sufficient safeguards (monitoring, auditing, etc.) such that we can avoid generation N+1 AIs being capable of causing catastrophic outcomes (like AI takeover), while using those AIs to speed up the labor of generation N by a large multiple”. This notion of control doesn’t (clearly) allow the bootstrapping argument to go through. In particular, suppose that all AIs smarter than humans are deceptively aligned and defect on humanity at the point where they are doing tasks which would be extremely hard for a human to oversee. (This isn’t the only issue, but it is a sufficient counterexample.)
This weaker notion of control can be very useful in ensuring good outcomes via getting lots of useful work out of AIs, but we will likely need to build something more scalable eventually.
(See also my discussion of using human-level-ish AIs to automate safety research in the sibling comment.)
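The gap between the two notions can be put in a toy model (my own construction, with made-up numbers): under the strong notion, alignment propagates from each generation to the next, so the chain extends indefinitely; under the weak notion, safety rests on checkability, which deceptively aligned models outgrow once task difficulty exceeds what overseers can verify.

```python
# Toy model of the bootstrapping argument. "strong": an aligned generation
# aligns its successor, so trust is transitive. "weak": generation N merely
# checks generation N+1's work, so a deceptive successor behaves only while
# its tasks stay within checkable difficulty.

def last_safe_generation(notion, task_difficulties, check_limit):
    """Return the index of the last generation that is safe under the notion."""
    trusted = True  # generation 0 = humans
    for gen, difficulty in enumerate(task_difficulties, start=1):
        if notion == "strong":
            # Aligned predecessor ensures alignment of successor.
            trusted = trusted
        else:
            # Safety holds only while the work can still be checked.
            trusted = difficulty <= check_limit
        if not trusted:
            return gen - 1
    return len(task_difficulties)

# Made-up setup: task difficulty doubles each generation; overseers can
# reliably check work up to difficulty 2.
tasks = [1, 2, 4, 8, 16]
print(last_safe_generation("strong", tasks, 2))  # 5 (chain never breaks)
print(last_safe_generation("weak", tasks, 2))    # 2 (breaks past oversight)
```

This is only a restatement of the counterexample in code form: the weak notion fails exactly at the generation whose tasks exceed what can be overseen, regardless of how well earlier generations behaved.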
I agree with everything you wrote here and in the sibling comment: there are reasonable hopes for bootstrapping alignment as agents grow smarter; but without a concrete bootstrapping proposal with an accompanying argument, <1% P(doom) from failing to bootstrap alignment doesn’t seem right to me.
I’m guessing this is my biggest crux with the Quintin/Nora worldview, so I guess I’m bidding for—if Quintin/Nora have an argument for optimism about bootstrapping beyond “it feels like this should work because of iterative design”—for that argument to make it into the forthcoming document.