There doesn’t seem to be any clear technical obstacle to this plan having a reasonable chance of success given substantial effort.
(That said, a reasonable chance of success might look like 90%, which isn’t amazing. And this will depend on your probabilities for various issues like scheming.)
I think an important crux here is whether you think that we can build institutions which are reasonably good at checking the quality of AI safety work done by humans (at the point where we have powerful AIs). I think this level of checking should be reasonably doable, though it could go poorly (see various fields).
However, note that if you think we would fail to sufficiently check human AI safety work even given substantial time, then we would also fail to solve various issues given a substantial pause (as Eliezer thinks is likely the case).
Why is this an important crux? Is it necessarily the case that if we can reliably check AI safety work done by humans, then we can reliably check AI safety work done by AIs which may be optimising against us?
It’s not necessarily the case. But in practice this tends to be a key line of disagreement.
It seems pretty relevant to note that we haven’t found an easy way of writing software[1] or making hardware that, once written, can easily be evaluated to be bug-free and to work as intended (see the underhanded C contest or cve-rs; the sketch below gives the flavour). The artifacts that AI systems will produce will be of similar or higher complexity.
I think this puts a serious dent in “evaluation is easier than generation”: sure, it’s easier, but how much easier? In practice we can also solve a lot of pretty big SAT instances.
Unless we get AI systems to formally verify the code they’ve written, in which case a lot of the complexity gets pushed into the specification of the formal properties, which presumably would also be written by AI systems.
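To give a flavour of the kind of thing that slips past casual review, here is a minimal toy sketch (not an actual underhanded C contest entry, and the function name is made up): a bounds check that looks fine and passes ordinary unit tests, yet can still be driven past the buffer.

```c
#include <stdio.h>
#include <string.h>

/* Toy example (not a contest entry): the bounds check looks reasonable and
 * every test with sensible lengths passes. But `len` is signed, so a
 * negative value slips past the comparison and is then implicitly converted
 * to a huge size_t inside memcpy, writing far past `buf`. Syntax
 * highlighting and default compiler warnings say nothing about it. */
void log_message(const char *src, int len) {
    char buf[64];
    if (len >= (int)sizeof(buf)) {
        return;               /* "too long" -- only true for positive len */
    }
    memcpy(buf, src, len);    /* negative len wraps to a huge size here */
    buf[len] = '\0';
    printf("%s\n", buf);
}
```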
I think software companies do OK at handling negligent and/or adversarial employees in practice via verification? Perhaps you disagree? And if this were a serious problem (e.g. if you could only trust a known 1/30th of employees but wanted to produce the same output), I think this would be possible to handle in practice as long as we could fire employees who are caught behaving badly? (See the toy sketch below.)
If I recall correctly, underhanded C contest entries are almost all pretty easily caught by decent software engineers with a combination of testing, syntax highlighting, and basic compiler linting.
I don’t think this is easy, but it’s not as though the situation with human coders supervising human coders of comparable abilities looks that discouraging. (It seems basically fine?)
(I have not read your recent work on AI control, so feel free to refer me to any material that answers my questions there.)
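To make the catch-and-fire point concrete, a toy back-of-the-envelope sketch (illustrative only, not taken from the AI control work): if each clearly bad action is independently audited and caught with some probability, and the adversary has to get away with it repeatedly, the chance of never being caught falls off quickly.

```c
#include <math.h>
#include <stdio.h>

/* Toy model (illustrative only): probability that an adversarial
 * employee/AI is never caught, if each bad action is independently audited
 * with probability `audit_rate` and success requires `k` bad actions.
 * The 1/30 echoes the 1/30th-of-employees figure above, purely for
 * illustration. Compile with -lm. */
int main(void) {
    const double audit_rate = 1.0 / 30.0;
    for (int k = 1; k <= 100; k *= 10) {
        double p_never_caught = pow(1.0 - audit_rate, k);
        printf("k = %3d bad actions: P(never caught) = %.3f\n",
               k, p_never_caught);
    }
    return 0;
}
```

Of course, this assumes audits are independent and that bad actions are recognisable when audited, which is itself part of what is contested below.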
At least with C, in my experience these kinds of mistakes are not easily caught by testing, syntax highlighting, and basic compiler linting. (There’s a reason why valgrind exists! See the sketch below for the flavour.) Looking over the winners for 2008, I have no idea what is going on, and I think it’d take me quite a while to figure out what the hell is happening, and whether it’s sketchy.
I’d enjoy reading about experiments where people have to figure out whether a piece of C code is underhanded.
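For the flavour of why valgrind exists, a minimal toy example (not a contest entry): a bug that ordinary unit tests almost always pass silently, but that valgrind’s memcheck reports immediately as an invalid read.

```c
#include <stdlib.h>

/* Toy example: off-by-one search over a heap array. When the key is
 * absent, the loop reads one element past the end of `a`; the stray value
 * almost never matches, so the function still returns -1 and every test
 * passes. Under valgrind the out-of-bounds read is reported right away. */
static int find(const int *a, size_t n, int key) {
    for (size_t i = 0; i <= n; i++) {   /* should be i < n */
        if (a[i] == key) {
            return (int)i;
        }
    }
    return -1;
}

int main(void) {
    int *a = malloc(4 * sizeof(int));
    if (a == NULL) return 1;
    a[0] = 3; a[1] = 1; a[2] = 4; a[3] = 1;
    int missing = find(a, 4, 7);        /* almost always -1, as expected */
    free(a);
    return missing == -1 ? 0 : 1;
}
```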
Another example that comes to mind, though I’m not sure about the exact lesson, is the NSA modifying the S-boxes of DES in order to make them more resistant to differential cryptanalysis (a rough sketch of the property involved appears further below):
The NSA hid a defense against a specific attack in the S-boxes.
People figured this out only when the specific attack was found.
Whether they also hid anything that makes offense easier is unknown; it’s possible, but I don’t know.
Edit: After some searching, I came upon the PRNG Dual_EC_DRBG, which did have a bunch of NSA involvement, and where they were pretty much caught biasing the numbers in a specific direction. So attacks here are definitely possible, though in this case it took more than five years for them to get caught.
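For concreteness about what “resistant to differential cryptanalysis” cashes out to, here is a rough sketch using an arbitrary 4-bit toy S-box (the real DES S-boxes map 6 bits to 4, so this is only illustrative): the relevant property is, roughly, that the largest entry of the difference distribution table is small.

```c
#include <stdio.h>

/* Rough sketch: the difference distribution table (DDT) of a toy 4-bit
 * S-box. "Resistant to differential cryptanalysis" roughly means that no
 * input difference maps to any output difference much more often than
 * chance, i.e. the largest DDT entry (ignoring the trivial dx = 0 row) is
 * small. The S-box values here are arbitrary, purely for illustration. */
int main(void) {
    const int sbox[16] = {0xC, 0x5, 0x6, 0xB, 0x9, 0x0, 0xA, 0xD,
                          0x3, 0xE, 0xF, 0x8, 0x4, 0x7, 0x1, 0x2};
    int ddt[16][16] = {{0}};

    for (int dx = 0; dx < 16; dx++)
        for (int x = 0; x < 16; x++)
            ddt[dx][sbox[x] ^ sbox[x ^ dx]]++;

    int worst = 0;
    for (int dx = 1; dx < 16; dx++)      /* skip the trivial dx = 0 row */
        for (int dy = 0; dy < 16; dy++)
            if (ddt[dx][dy] > worst)
                worst = ddt[dx][dy];

    printf("largest DDT entry (lower is better): %d out of 16\n", worst);
    return 0;
}
```

This fits the point above: the property is easy to check once you know which table to compute, and close to invisible before that.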
As for the rest of your comment, I think we have different models, where you analogize negatively reinforcing AI systems to firing, which would be more applicable if we were training several systems. I’m pretty sure you’ve written an answer to “negatively reinforcing bad behavior can reinforce both good behavior and better hiding”, so I’ll leave it at that.
I’m not analogizing negative reinforcement to firing; I’m analogizing firing to no longer using some AI. See “Catching AIs red-handed” for more discussion.
I agree that mistakes can be very hard to catch, but I still think it’s hard to cause specific outcomes via carefully inserted bugs that aren’t caught by careful auditing and testing (see e.g. the underhanded C contest, where I think testing does pretty well).
This one is trivially resolved with unit testing, I think, though that does break the analogy to the original problem somewhat. But I think my overall point stands: it just actually seems hard to cause specific outcomes, IMO. (A sketch of the kind of testing I have in mind is below.)
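To gesture at the kind of testing meant here, a minimal sketch with hypothetical functions (nothing to do with the actual 2008 entry): compare the implementation under review against an obviously correct reference on many random inputs.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical functions, for illustration only. The reference is the
 * slow, obviously-correct version; the "suspect" one has a subtle
 * off-by-one and drops the last byte. */
static unsigned reference_checksum(const unsigned char *buf, size_t n) {
    unsigned sum = 0;
    for (size_t i = 0; i < n; i++) sum += buf[i];
    return sum;
}

static unsigned suspect_checksum(const unsigned char *buf, size_t n) {
    unsigned sum = 0;
    for (size_t i = 0; i + 1 < n; i++) sum += buf[i];   /* misses buf[n-1] */
    return sum;
}

int main(void) {
    unsigned char buf[256];
    srand(12345);                        /* fixed seed for reproducibility */
    for (int trial = 0; trial < 100000; trial++) {
        size_t n = (size_t)(rand() % (int)sizeof(buf));
        for (size_t i = 0; i < n; i++)
            buf[i] = (unsigned char)(rand() & 0xFF);
        if (suspect_checksum(buf, n) != reference_checksum(buf, n)) {
            printf("mismatch on trial %d (n = %zu)\n", trial, n);
            return 1;
        }
    }
    printf("no mismatches found\n");
    return 0;
}
```

A bug deliberately keyed to a rare trigger can of course survive this kind of blind probing, which is roughly what the disagreement here is about; the sketch only shows the shape of the check.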
This does not seem automatic to me (at least in the hypothetical scenario where the “pause” takes a couple of decades). The reasoning is that there is a difference between [automating the current form of an institution and speed-running 50 years of it in a month] and [an institution as it develops over 50 years].
For example, my crux[1] is that current institutions do not subscribe to the security mindset with respect to AI. But perhaps hypothetical institutions in 50 years might.
[1] For being in favour of slowing things down, if that were possible in a reasonable way (which it might not be).
I said “fail to sufficiently check human AI safety work given substantial time”. This might be considerably easier than ensuring that such institutions exist immediately and can already evaluate things. I was just noting that there is a weaker version of “build institutions which are reasonably good at checking the quality of AI safety work done by humans” which is required for a pause to produce good safety work.
Of course, good AI safety work (in the traditional sense of AI safety work) might not be the best route forward. We could also (e.g.) work on routes other than AI, like emulated minds.