One key implication of the argument in this post is that non-scheming misalignment issues are pretty easy to notice, study, and evaluate (“non-scheming” = “issues from misalignment other than deceptive alignment”). This argument is strongest for non-scheming issues which would occur relatively frequently when using the AI, but issues that occur only rarely (but which are a serious and “high-stakes” problem despite only occurring rarely) should still be reasonably doable to study. (See the discussion in the appendix for more commentary on rare failures being harder.)
Of course, being relatively easy-to-study doesn’t mean that humanity will necessarily solve these problems effectively! There are many relatively easy-to-study problems which humanity fails to solve despite putting in some effort. Being easy-to-study doesn’t imply easy-to-solve. Further, it might be that the easiest way to temporarily fix non-scheming misalignment issues will be to patch over them in ways which make the problems harder to notice without actually solving them. This might even make these issues more dangerous! For instance, training against actions which were only much later discovered to be egregiously bad might just make the AI more patient, more likely to care about longer time horizons, and less likely to get caught. (Sometimes this concern is called “horizon creep”.)
(Also, note that our mechanisms for evaluating non-scheming misalignment can’t really be trained against: these tests won’t be robust to training, and meta-learning would probably break the analogy by creating a difference between our testbeds and reality.)
Overall, I think non-scheming misalignment issues cause about 50% of the x-risk from misalignment. (These issues seem more likely than scheming, but are relatively easier to solve.)
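As a rough way to see how “more likely but easier to solve” can net out to about half the risk, here is a purely illustrative decomposition (the numbers below are made up for the example, not estimates from this post): a failure class’s share of x-risk is roughly

P(a serious issue of that class arises) × P(it goes unsolved, given that it arises).

With hypothetical values, 0.6 × 0.4 for non-scheming (likelier, but easier to address) equals 0.4 × 0.6 for scheming (less likely, but harder to address), so the two classes contribute roughly equal shares and the split comes out near 50/50.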
Beyond these non-scheming issues being easy to study, I think it’s also quite likely that if there are serious non-scheming problems, then by default, we’ll notice these issues to some extent. However, carefully studying these issues in well-designed testbeds seems considerably better than just hoping we notice these issues when they come up: specific testbeds help with spotting somewhat subtle issues and also help accurately measure whether countermeasures are actually solving the problem or just making the problem harder to notice.
Thus, when we actually have access to transformatively powerful AI, we’ll probably have much more clarity on non-scheming issues. (At least in worlds with relatively continuous takeoff which takes at least a year, and assuming that safety-motivated people are able to run experiments on these AIs.)
The observation that non-scheming misalignment issues are relatively more obvious and much easier to study is a very important part of my views about how the development of powerful AI will go.
Non-deceptive failures are easy to notice, but they’re not necessarily easy to eliminate—and if you don’t eliminate them, they’ll keep happening until some do slip through. I think I take them more seriously than you.
Non-deceptive failures are easy to notice, but they’re not necessarily easy to eliminate
I agree; I was trying to note this in my second paragraph, but I guess it wasn’t sufficiently clear.
I added the sentence “Being easy-to-study doesn’t imply easy-to-solve”.
I think I take them more seriously than you.
Seems too hard to tell based on this limited context. I think non-scheming failures are about 50% of the risk and probably should be about 50% of the effort of the AI safety-from-misalignment community. (I can see some arguments for scheming/deceptive alignment being more important to work on in advance, but it also might be that non-scheming is more tractable and a higher fraction of risk in short timelines, so IDK overall.)
I think this line of reasoning plausibly also suggests trying to differentially advance architectures and scaffolding techniques which are less likely to result in scheming.
I absolutely agree that if we could create AIs which are very capable overall but very unlikely to scheme, that would be very useful.
That said, the core way we’d rule out scheming, namely the AI having insufficient ability to do opaque agency (i.e., opaque goal-directed reasoning), also rules out most other serious misalignment failure modes, because danger from serious misalignment probably requires substantial hidden reasoning.
So while I agree with you about the value of AIs which can’t scheme, I’m less sure that the argument I make here is evidence for this being valuable.
Also, I’m not sure that trying to differentially advance architectures is a good use of resources on current margins for various reasons. (For instance, better scaffolding also drives additional investment.)