(I have not read your recent work on AI control, so feel free to refer me to any material that answers my questions there.)
At least with C, in my experience these kinds of mistakes are not easily caught by testing, syntax highlighting, and basic compiler linting. (There's a reason why valgrind exists!) Looking over the winners of the 2008 Underhanded C Contest, I have no idea what is going on, and I think it'd take me quite a while to figure out what the hell is happening and whether it's sketchy.
I’d enjoy reading about experiments where people have to figure out whether a piece of C code is underhanded.
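To illustrate the kind of thing I mean, here's a minimal sketch of my own (not one of the contest entries): the bug below is easy to read past, a test that just checks the returned string will usually pass because the stray byte lands in allocator slack, and it's exactly the kind of thing valgrind reports immediately as an invalid write.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Looks like a routine string copy, but the allocation is one byte
 * short, so memcpy writes the terminating '\0' past the end of the
 * buffer. (Error handling omitted for brevity.) */
char *duplicate(const char *s) {
    size_t n = strlen(s);
    char *copy = malloc(n);      /* should be malloc(n + 1) */
    memcpy(copy, s, n + 1);      /* one-byte out-of-bounds write */
    return copy;
}

int main(void) {
    char *d = duplicate("hello");
    printf("%s\n", d);           /* usually prints "hello" just fine */
    free(d);
    return 0;
}
```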
Another example that comes to mind, though I'm not sure about the exact lesson, is the NSA modifying the S-boxes for DES to make them more resistant to differential cryptanalysis:
The NSA hid a defense from a specific attack in the S-boxes
People figured this out only when the specific attack was found
It is unknown whether they also hid anything that makes offense easier.
Edit: After some searching, I came upon the PRNG Dual_EC_DRBG, which had heavy NSA involvement, and where they were pretty much caught choosing its constants in a way that would let them predict its output. So attacks here are definitely possible, though in this case it took more than five years to get caught.
As for the rest of your comment, I think we have different models, where you analogize negatively reinforcing AI systems to firing, which would be more applicable if we were training several systems. I’m pretty sure you’ve written an answer to “negatively reinforcing bad behavior can reinforce both good behavior and better hiding”, so I’ll leave it at that.
As for the rest of your comment, I think we have different models, where you analogize negatively reinforcing AI systems to firing, which would be more applicable if we were training several systems.
I'm not analogizing negative reinforcement to firing; I'm analogizing firing to no longer using some AI. See "Catching AIs red-handed" for more discussion.
At least with C, in my experience these kinds of mistakes
I agree that mistakes can be very hard to catch, but I still think it's hard to cause specific outcomes via carefully inserted bugs that aren't caught by careful auditing and testing. (See e.g. the Underhanded C Contest, where I think testing does pretty well.)
Looking over the winners of the 2008 Underhanded C Contest
This one is trivially resolved with unit testing, I think (though this does make the problem less analogous). But my overall point stands: it actually just seems hard to cause specific outcomes.
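To make that concrete, here's a made-up example (mine, not the actual 2008 entry) of an "underhanded" redaction bug together with the sort of unit test I have in mind: the code reads plausibly, but a test that checks the output bytes directly fails right away.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Intended behaviour: zero out bytes [start, start + len) of buf.
 * The loop bound is subtly off by one, so the last byte of the
 * region survives and leaks. */
static void redact(unsigned char *buf, size_t start, size_t len) {
    for (size_t i = start; i < start + len - 1; i++)   /* should be i < start + len */
        buf[i] = 0;
}

int main(void) {
    unsigned char doc[8];
    memcpy(doc, "SECRETXX", 8);
    redact(doc, 0, 6);               /* should wipe "SECRET" */
    for (size_t i = 0; i < 6; i++)
        assert(doc[i] == 0);         /* fails at i == 5: 'T' is still there */
    puts("redaction ok");
    return 0;
}
```

Auditing the loop might miss the bound, but the assertion doesn't, which is roughly why I expect testing to do well here.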