It seems pretty relevant to note that we haven’t found an easy way of writing software[1] or making hardware that, once written, can easily be evaluated to be bug-free and to work as intended (see the underhanded C contest or cve-rs). The artifacts that AI systems will produce will be of similar or higher complexity.
I think this puts a serious dent in “evaluation is easier than generation”: sure, easier, but how much easier? And the theory-level gap doesn’t settle much in practice anyway: we also manage to solve a lot of pretty big SAT instances, even though generating solutions there is supposed to be much harder than checking them.
Unless we get AI systems to formally verify the code they’ve written, in which case a lot of the complexity gets pushed into the specification of the formal properties, which presumably would also be written by AI systems.
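To make that concrete, here’s a toy sketch of how a plausible-looking spec can quietly carry the hard part. It’s a hypothetical ACSL-style contract of the kind Frama-C checks (the `sort` name and the contract are invented for illustration). The contract only demands sortedness and forgets to demand that the output is a permutation of the input, so a destructive implementation satisfies it:

```c
#include <stddef.h>
#include <string.h>

/*@ requires \valid(a + (0 .. n-1));
  @ assigns a[0 .. n-1];
  @ ensures \forall integer i; 0 <= i < n - 1 ==> a[i] <= a[i+1];
  @*/
void sort(int *a, size_t n) {
    /* Satisfies the contract above: the result really is sorted.
     * It just isn't the input anymore. */
    memset(a, 0, n * sizeof *a);
}
```

Deciding whether the spec says the right thing is exactly the judgment that doesn’t get easier by pushing it one level up.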
It seems pretty relevant to note that we haven’t found an easy way of writing software[1] or making hardware that, once written, can be easily evaluated to be bug-free and works as intended
I think software companies do OK at handling negligent and/or adversarial employees in practice via verification? Perhaps you disagree? And if this were a serious problem (e.g. if you could only trust a known 1/30th of employees but wanted to produce the same output), I think it would be possible to handle in practice as long as we could fire employees who are caught behaving badly?
If I recall correctly, underhanded C contest entries are almost all pretty easily caught by decent software engineers with a combination of testing, syntax highlighting, and basic compiler linting.
I don’t think this is easy, but it’s not as though the situation with human coders supervising human coders of comparable abilities looks that discouraging. (It seems basically fine?)
(I have not read your recent work on AI control, so feel free to refer me to any material that answers my questions there.)
At least with C, in my experience these kinds of mistakes are not easily caught by testing, syntax highlighting and basic compiler linting. (There’s a reason why valgrind exists!) Looking over the winners for 2008, I have no idea what is going on, and think it’d take me quite a while to figure out what the hell is happening, and whether it’s sketchy.
I’d enjoy reading about experiments where people have to figure out whether a piece of C code is underhanded.
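As a flavor of what such a piece of code might look like, here’s a hypothetical snippet in the underhanded spirit (made up for illustration, not a contest entry). The happy path works, nothing looks visually suspicious, and on typical default compiler settings nothing is flagged, but a negative `len` slips past the bounds check and `memcpy` then converts it to an enormous `size_t`:

```c
#include <string.h>

int handle_packet(char *out, const char *payload, int len) {
    char buf[64];
    if (len > (int)sizeof buf)    /* signed comparison: a negative len passes */
        return -1;
    memcpy(buf, payload, len);    /* here a negative len wraps to ~SIZE_MAX */
    memcpy(out, buf, len);
    return 0;
}
```

(The names and the surrounding protocol are invented; the point is just that the bug lives in an implicit conversion rather than in anything a reviewer’s eye is drawn to.)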
Another example that comes to mind, though I’m not sure about the exact lesson, is the NSA modifying the S-boxes for DES to make them more resistant to differential cryptanalysis:
The NSA hid a defense from a specific attack in the S-boxes
People figured this out only when the specific attack was found
It is unknown whether they also hid anything that makes offense easier; it’s possible, but I don’t know.
Edit: After some searching, I came upon the PRNG Dual_EC_DRBG, which did have a bunch of NSA involvement, and where they were pretty much caught choosing the constants so that they could predict its output. So attacks here are definitely possible, though in this case it took more than five years to get caught.
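For what it’s worth, the suspected mechanism there (pointed out publicly by Shumow and Ferguson in 2007) is simple to state. Schematically, glossing over details of the standard: the generator keeps a state $s$ and uses two fixed elliptic-curve points $P$ and $Q$, updating and outputting roughly as

$$s \leftarrow x(sP), \qquad \text{output} = x(sQ) \text{ with a handful of top bits dropped.}$$

If whoever chose the constants knows a $d$ with $P = dQ$, then from a single output block they can brute-force the few missing bits to recover the point $sQ$ and compute

$$d\,(sQ) = s\,(dQ) = sP,$$

whose $x$-coordinate is the next state, after which every future output is predictable.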
As for the rest of your comment, I think we have different models, where you analogize negatively reinforcing AI systems to firing, which would be more applicable if we were training several systems. I’m pretty sure you’ve written an answer to “negatively reinforcing bad behavior can reinforce both good behavior and better hiding”, so I’ll leave it at that.
As for the rest of your comment, I think we have different models, where you analogize negatively reinforcing AI systems to firing, which would be more applicable if we were training several systems.
I’m not analogizing negative reinforcement to firing, I’m analogizing firing to no longer using some AI. See catching AIs red-handed for more discussion.
At least with C, in my experience these kinds of mistakes
I agree that mistakes can be very hard to catch, but I still think it’s hard to cause specific outcomes via carefully inserted bugs which aren’t caught by careful auditing and testing. (See e.g. underhanded C where I think testing does pretty well.)
Looking over the winners for 2008
This one is trivially resolved with unit testing, I think, though that makes the problem less analogous. But I think my overall point stands: it just actually seems hard to cause specific outcomes IMO.
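To gesture at what that kind of testing looks like, here’s a minimal sketch reusing the hypothetical `handle_packet` from the snippet above: the happy-path cases pass, and the one adversarial case (a negative length) makes the bug surface immediately.

```c
#include <assert.h>
#include <string.h>

int handle_packet(char *out, const char *payload, int len);  /* from the earlier sketch */

int main(void) {
    char out[64], payload[64];
    memset(payload, 'A', sizeof payload);

    assert(handle_packet(out, payload, 8) == 0);    /* happy path: copies fine */
    assert(handle_packet(out, payload, 65) == -1);  /* oversized input: rejected */
    /* Negative length: instead of returning -1, this call runs memcpy with a
     * wrapped-around size and blows up, which is the point of the test. */
    assert(handle_packet(out, payload, -1) == -1);
    return 0;
}
```

The third case typically crashes before its assert can even fire, and under ASan or valgrind it’s reported as an out-of-bounds access, while the first two pass quietly.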