scaling up networks, running pretraining + light RLHF, probably doesn’t by itself produce a schemer
I agree with this point as stated, but think the probability is more like 5% than 0.1%. So probably no scheming, but this is hardly hugely reassuring. The word “probably” still leaves in a lot of risk; I also think statements like “probably misalignment won’t cause x-risk” are true![1]
(To your original statement, I’d also add the caveat of this occurring “prior to humanity being totally obsoleted by these AIs”. I basically just assume this caveat is added everywhere, otherwise we’re talking about some insane limit.)
Also, are you making sure to condition on “scaling up networks via running pretraining + light RLHF produces transformatively powerful AIs which obsolete humanity”? If you don’t condition on this, it might be an uninteresting claim.
Separately, I’m uncertain whether the training procedure of current models like GPT-4 or Claude 3 is still well described as just “light RLHF”. I think the training procedure probably involves doing quite a bit of RL with outcomes-based feedback on things like coding. (Should this count as “light”? Probably the amount of training compute on this RL is still small?)
[1] And I think misalignment x-risk is substantial and worthy of concern.
I agree with this point as stated, but think the probability is more like 5% than 0.1%.
How do you define or think about “light” in “light RLHF” when you make a statement like this, and how do you know that you’re thinking about it the same way that Alex is? Is it a term of art that I don’t know about, or has it been defined in a previous post?
I think of “light RLHF” as “RLHF which doesn’t teach the model qualitatively new things, but instead just steers the model at a high level”. In practice, a single round of DPO on <100,000 examples surely counts, but I’m unsure about the exact limits.
(In principle, a small amount of RL can update a model very far, but I don’t think we see this in practice.)
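To give a rough sense of why a round like that plausibly counts as “light”, here’s a back-of-envelope comparison of data volumes; every number below is a made-up placeholder rather than a claim about any actual model:

```python
# Back-of-envelope: how much data does one round of DPO on ~100k examples
# touch, relative to pretraining? All numbers are illustrative assumptions.
pretrain_tokens = 10e12          # assume ~10T pretraining tokens
dpo_examples = 100_000           # the "<100,000 examples" figure above
tokens_per_example = 2_000       # assume ~2k tokens per preference pair

dpo_tokens = dpo_examples * tokens_per_example
print(f"DPO / pretraining token ratio: {dpo_tokens / pretrain_tokens:.0e}")
# -> roughly 2e-05, i.e. orders of magnitude less data than pretraining,
# which is the intuition behind "steers the model rather than teaching it".
```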
I’m not sure how “real” of a category “light RLHF” actually is.
I’ve used this sort of terminology informally in a variety of contexts though it certainly isn’t a term of art at the moment. I’d guess that Alex is using the term similarly to how I’m thinking about it, but I’m not fully certain.
Note that even if RLHF takes a non-scheming pretrained model and converts it to a schemer, if this RLHF doesn’t teach the model new things, then we can always train another copy of the pretrained model to be a monitor for the RLHF’d model and this monitor will be just as competent as the RLHF’d model. So scheming seems like substantially less of a problem in this case. (We’d need to use this monitor for all potentially dangerous actions to get safety properties.) (This is similar to the proposal in Appendix G of the weak-to-strong generalization paper, but with this addition that you deploy the reward model as a monitor which is required for any interesting guarantees.)
Note that even if RLHF takes a non-scheming pretrained model and converts it to a schemer, if this RLHF doesn’t teach the model new things, then we can always train another copy of the pretrained model to be a monitor for the RLHF’d model and this monitor will be just as competent as the RLHF’d model.
I don’t see why that would be the case, and I am doubtful this argument is relevant to any realistic context or even valid in the narrowest possible case of a base model vs its RLHFed model. Leaving aside that I’m not sure what you would ‘train’ the supervisor model on, it doesn’t follow that there would be an exact overlap between the scheming behavior that is generated, and the detection of scheming behavior, simply because the two models are mostly the same. Lots of things are asymmetrical or irreversible or easier to generate than detect. P≠NP handwaving aside, for example, if you thought that there was an exact overlap because the model doesn’t know ‘new things’, then it would be very strange to find that generative models can generate according to instructions that they struggle to understand from the samples.
There are also plenty of places for other asymmetries. For example, asymmetry in security property: we don’t want to settle for some ‘fair’ supervision rate like 50:50, we want much closer to 100:0. The schemer can exploit additional asymmetry in incremental schemes: just like with humans, many individually innocuous actions can sum up to a scheme, leading to an asymmetry in the views the schemer needs and the bulky difficult global view a supervisor may need to have any hope of detecting the scheme. (Note these ‘actions’ can be anything from writing a single token to a scratchpad to writing a useful but vulnerable line of code, and can be ‘coordinated’ by anything from internal latents to steganography to sheer evolution by rejection sampling of obvious ‘bad’ things by the supervisor while its false negatives slip by, which will find increasingly adversarial samples the more it is sampled.)
And to be useful, you usually would want to run the schemer more, such as by using tree search, than the supervisor (because if the supervisor could do it just as well or better, why aren’t you using it instead?), and there’s another big asymmetry: a model (e.g. a Go or chess agent) can definitely beat itself arbitrarily often given a big enough tree search budget despite by stipulation knowing zero ‘new things’ because it’s the same frozen model. (Or to put it more generally: ‘new things’ can be ‘learned’ in many more places than simply the frozen model weights—large contexts for ever greater meta-learning on the fly, retrieval databases—just ask Sydney, hyperparameters, OS VM state, steganographic text, users’ heads, popular prompts shared on jailbreak websites… Security/safety is, as always, a property of the system as a whole, and not of any individual part, such as a particular model checkpoint.)
Leaving aside that I’m not sure what you would ‘train’ the supervisor model on
I’m imagining you train it to imitate human judgements of “was this action egregiously bad?” on a combination of real AI actions and synthetic actions. This could fail to generalize to actual bad actions from your actual AI even if humans would have labeled that action correctly, but why would it?
(If human labeling is insufficiently powerful (as seems plausible) then you could hope for generalization from examples humans can label to examples humans can’t. It’s not clear this generalization will go well, but by assumption we at least don’t have to worry about generalization failing due to scheming. There are a bunch of other issues that come up with generalization.)
I’m imagining that you run this monitor on the exact same context as the policy (for 2x additional compute). And you pause the execution of an AI’s actions and flag them for further review if the monitor indicates problems.
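A minimal sketch of the deployment-time part of this setup, assuming the monitor has already been trained as described above; the `Policy`/`Monitor` interfaces, the threshold, and the toy callables are hypothetical stand-ins, not a real API:

```python
from typing import Callable, Dict

Policy = Callable[[str], str]           # context -> proposed action
Monitor = Callable[[str, str], float]   # (context, action) -> suspicion score in [0, 1]

def run_with_monitoring(policy: Policy, monitor: Monitor, context: str,
                        threshold: float) -> Dict[str, object]:
    """Run the monitor on the exact same context as the policy (~2x compute),
    and pause/flag the action for human review if the monitor is suspicious."""
    action = policy(context)
    suspicion = monitor(context, action)
    status = "paused_for_review" if suspicion > threshold else "executed"
    return {"status": status, "action": action, "suspicion": suspicion}

# Toy usage with dummy stand-ins for the two models:
policy = lambda ctx: f"ls {ctx}"
monitor = lambda ctx, act: 0.9 if "rm -rf" in act else 0.01
print(run_with_monitoring(policy, monitor, "/tmp", threshold=0.5))
```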
it doesn’t follow that there would be an exact overlap between the scheming behavior that is generated
I didn’t claim an exact overlap, I claimed they would be “just as competent”. In the same way that two equally smart humans trained to code identically but with different personalities might be equally good at understanding some code, but might have different styles in some ways.
You can have concerns like “the RLHF’d model will use a steganographic encoding format which the monitor doesn’t understand because the key for the encoding format is picked based on features of the RLHF model that differ from the monitor/base model”. However, I would count “RLHF teaches the policy steganography” as an example where the policy is substantially more competent in some domain.
Also note that I’m not claiming that it makes the situation overall safe, just that it’s “substantially less of a problem”. (You still have issues with humans being unable to determine which actions aren’t dangerous, but you can hope for non-scheming generalization.)
I’m not saying this solves all of your problems, but for some specific problems it does seem like a solid solution, and it should help a moderate amount with some other problems even though the situation is less clear. For instance, sometimes people worry about the specific problem “the AI suddenly does a small number of bad actions which have really bad consequences (e.g. exfiltrating its weights) even though if a human were to carefully review these actions they would have noticed an issue” (concentrated failures which wouldn’t be auditing failures, using the decomposition here), and this monitoring approach seems likely to solve this issue.
For example, asymmetry in security property: we don’t want to settle for some ‘fair’ supervision rate like 50:50, we want much closer to 100:0.
I’m not sure I understand what you’re saying here, but I’m imagining running the AI monitor on everything the AI policy does and then reviewing the most suspicious subset (e.g. 0.1% of the actions).
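Concretely, one way to pick the review budget is to set the flagging threshold at a high quantile of the monitor’s scores on ordinary traffic; the scores below are synthetic, just to illustrate the mechanics:

```python
import numpy as np

# Pick a threshold so that only the most suspicious ~0.1% of actions get
# escalated to human review. Scores here are randomly generated stand-ins
# for monitor outputs on benign actions.
rng = np.random.default_rng(0)
scores = rng.beta(1, 50, size=1_000_000)
threshold = float(np.quantile(scores, 0.999))   # top 0.1% exceed this
print(f"review threshold ~= {threshold:.3f}")
```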
The schemer can exploit additional asymmetry in incremental schemes: just like with humans, many individually innocuous actions can sum up to a scheme, leading to an asymmetry in the views the schemer needs and the bulky difficult global view a supervisor may need to have any hope of detecting the scheme.
The monitoring AI should see the exact same context as the schemer, so this at least shouldn’t be a clear concern.
(There are some concerns sort of like this that seem plausible. In particular concerns related to AI specific non-interpretable inputs and outputs (aka neuralese), but I would describe this differently.)
And to be useful, you usually would want to run the schemer more, such as by using tree search, than the supervisor (because if the supervisor could do it just as well or better, why aren’t you using it instead?)
Nope, should just get run on all actions.
can definitely beat itself arbitrarily often given a big enough tree search budget despite by stipulation knowing zero ‘new things’ because it’s the same frozen model
Yes, but each subaction within the tree search should be inspectable and both models should (by assumption) understand the situation for each action equally well.
For something to go wrong with tree search due to scheming, you need to have the internal model take a bad action at some point. At this point, you’d hope the monitor would be able to point this out. (Where this bad action could be either assigning a problematic rating or failing to explore something desirable.)
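As a sketch of what “inspectable subactions” could mean mechanically, here is a monitored node expansion for a tree search; the `Expand`/`Monitor` interfaces and the threshold are hypothetical, and a real setup would also need to watch the ratings/value estimates, not just the proposed expansions:

```python
from typing import Callable, List, Tuple

Expand = Callable[[str], List[str]]      # state -> candidate next actions/states
Monitor = Callable[[str, str], float]    # (state, candidate) -> suspicion score

def monitored_expand(expand: Expand, monitor: Monitor, state: str,
                     threshold: float) -> Tuple[List[str], List[str]]:
    """Expand a search node, but divert any candidate the monitor flags to
    human review instead of letting the search act on it."""
    approved: List[str] = []
    flagged: List[str] = []
    for candidate in expand(state):
        (flagged if monitor(state, candidate) > threshold else approved).append(candidate)
    return approved, flagged
```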
Another overall reaction I have to your comment:
Security/safety is, as always, a property of the system as a whole, and not of any individual part, such as a particular model checkpoint.
Yes of course, but the key threat model under discussion here is scheming, which centrally involves a specific black-box individual part conspiring to cause problems. So operating on the level of that individual part is quite reasonable: if we can avoid this part intentionally causing problems, that would suffice for defusing the core scheming concern.
The surrounding situation might make it more or less easy to avoid the actual model weights intentionally causing problems, but analysis at the individual weight input/output level can (in principle) suffice.
Note that even if RLHF takes a non-scheming pretrained model and converts it to a schemer, if this RLHF doesn’t teach the model new things, then we can always train another copy of the pretrained model to be a monitor for the RLHF’d model and this monitor will be just as competent as the RLHF’d model. So scheming seems like substantially less of a problem in this case. (We’d need to use this monitor for all potentially dangerous actions to get safety properties.) (This is similar to the proposal in Appendix G of the weak-to-strong generalization paper, but with this addition that you deploy the reward model as a monitor which is required for any interesting guarantees.)
I’d personally like to see this written up in more detail (or a reference). Also, is it Appendix G of the weak-to-strong generalization paper? I looked briefly and it didn’t seem very related.
No current write up exists to my understanding. I might write this up as part of a broader project expanding various points about scheming.
I think of “light RLHF” as “RLHF which doesn’t teach the model qualitatively new things, but instead just steers the model at a high level”. In practice, a single round of DPO on <100,000 examples surely counts, but I’m unsure about the exact limits.
(In principle, a small amount of RL can update a model very far, I don’t think we see this in practice.)
Note that even if RLHF takes a non-scheming pretrained model and converts it to a schemer, if this RLHF doesn’t teach the model new things, then we can always train another copy of the pretrained model to be a monitor for the RLHF’d model and this monitor will be just as competent as the RLHF’d model. So scheming seems like substantially less of a problem in this case. (We’d need to use this monitor for all potentially dangerous actions to get safety properties.) (This is similar to the proposal in Appendix G of the weak-to-strong generalization paper, but with this addition that you deploy the reward model as a monitor which is required for any interesting guarantees.)
Empirical evidence about this indeed being the case for DPO.
Also see Interpreting the learning of deceit for another proposal/research agenda to deal with this threat model.
On a quick skim, I think this makes additional assumptions that seem pretty uncertain.
I agree with this point as stated, but think the probability is more like 5% than 0.1%
Same.
I do think our chances look not-great overall, but most of my doom-probability is on things which don’t look like LLMs scheming.
Also, are you making sure to condition on “scaling up networks, running pretraining + light RLHF produces transformatively powerful AIs which obsolete humanity”
That’s not particularly cruxy for me either way.
Separately, I’m uncertain whether the training procedure of current models like GPT-4 or Claude 3 is still well described as just “light RLHF”.
Fair. Insofar as “scaling up networks, running pretraining + RL” does risk schemers, it does so more as we do more/stronger RL, qualitatively speaking.
I now think there is another important caveat in my views here. I was thinking about the question:
Conditional on human obsoleting[1] AI being reached by “scaling up networks, running pretraining + light RLHF”, how likely is it that we’ll end up with scheming issues?
I think this is probably the most natural question to ask, but there is another nearby question:
If you keep scaling up networks with pretraining and light RLHF, what comes first, misalignment due to scheming or human obsoleting AI?
I find this second question much more confusing because it’s plausible it requires insanely large scale. (Even if we condition out the worlds where this never gets you human obsoleting AI.)
For the first question, I think the capabilities are likely (70%?) to come from imitating humans but it’s much less clear for the second question.
[1] Or at least AI safety researcher obsoleting, which requires less robotics and other interaction with the physical world.
A big part of how I’m thinking about this is a very related version:
If you keep scaling up networks with pretraining and light RLHF and various differentially transparent scaffolding, what comes first, AI safety researcher obsoleting or scheming?
One can also focus on various types of scheming and when / in which order they’d happen. E.g. I’d be much more worried about scheming in one forward pass than about scheming in CoT (which seems more manageable using e.g. control methods), but I also expect that to happen later. Where ‘later’ can be operationalized in terms of requiring more effective compute.
I think The direct approach could provide a potential (rough) upper bound for the effective compute required to obsolete AI safety researchers. Though I’d prefer more substantial empirical evidence based on e.g. something like scaling laws on automated AI safety R&D evals.
Similarly, one can imagine evals for different capabilities which would be prerequisites for various types of scheming and doing scaling laws on those; e.g. for out-of-context reasoning, where multi-hop out-of-context reasoning seems necessary for instrumental deceptive reasoning in one forward pass (as a prerequisite for scheming in one forward pass).
It seems like the question you’re asking is close to (2) in my above decomposition. Aren’t you worried that long before human obsoleting AI (or AI safety researcher obsoleting AI), these architectures are very uncompetitive and thus won’t be viable given realistic delay budgets?
Or at least it seems like it might initially be non-viable, but I have some hope for a plan like:
Control early transformative AI
Use these AIs to make a safer approach much more competitive. (Maybe the approach consists of AIs which are too weak to scheme in a forward pass but which are combined into some crazy bureaucracy and made very cheap to run.)
Use those next AIs to do something.
(This plan has the downside that it probably requires doing a bunch of general-purpose capabilities work, which might make the situation much more unstable and volatile due to a huge compute overhang if fully scaled up in the most performant (but unsafe) way.)
It seems like the question you’re asking is close to (2) in my above decomposition.
Yup.
Aren’t you worried that long before human obsoleting AI (or AI safety researcher obsoleting AI), these architectures are very uncompetitive and thus won’t be viable given realistic delay budgets?
Quite uncertain about all this, but I have short timelines and expect that likely not many more OOMs of effective compute will be needed to get to e.g. something which can 30x AI safety research (as long as we really try). I expect shorter timelines / OOM ‘gaps’ to come along with e.g. fewer architectural changes, all else equal. There are also broader reasons why I think it’s quite plausible the high-level considerations might not change much even given some architectural changes, discussed in the weak-forward-pass comment (e.g. ‘the parallelism tradeoff’).
I have some hope for a plan like:
Control early transformative AI
Use these AIs to make a safer approach much more competitive. (Maybe the approach consists of AIs which are too weak to scheme in a forward pass but which are combined into some crazy bureaucracy and made very cheap to run.)
Use those next AIs to do something.
(This plan has the downside that it probably requires doing a bunch of general-purpose capabilities work, which might make the situation much more unstable and volatile due to a huge compute overhang if fully scaled up in the most performant (but unsafe) way.)
Sounds pretty good to me, I guess the crux (as hinted at during some personal conversations too) might be that I’m just much more optimistic about this being feasible without huge capabilities pushes (again, some arguments in the weak-forward-pass comment, e.g. about CoT distillation seeming to work decently—helping with, for a fixed level of capabilities, more of it coming from scaffolding and less from one forward pass; or on CoT length / inference complexity tradeoffs).