It certainly fuels a sense of awe and reverence for his alleged genius. All for an achievement that can’t be verified.
It really shouldn’t, any more than discovering a security vulnerability in C programs should make someone seem impressive. In this instance, all I can think is “Oh look, someone demonstrated that ‘social engineering’, the single most reliable and damaging strategy in hacking, responsible for millions of attacks over the history of computing, works a nontrivial fraction of the time, again? What a surprise.”
The only surprising and interesting part of the AI-boxing games, for me, is that some people seem to think that AI boxing is somehow different: “it’s different this time”, as the mocking phrase goes.
That reminds me of the people who claim all sorts of supernatural powers, from rhabdomancy to telepathy to various magical martial-arts moves. Often, when offered the opportunity to perform in a controlled test, they run away with excuses like the energy flux not being right or something.
A perfectly reasonable analogy, surely. Because we have millions of instances of successful telepathy and magical martial arts being used to break security.
With direct, prolonged contact over the course of weeks, maybe. With a two-hour text-only conversation, or even with a single line? Nope.
As time goes up, the odds of success go up? Yeah, I’d agree. But what happens when you reverse that: is there any principled reason to think that the odds of just continuing the conversation go to zero before you hit the allowed one-liner?
The most likely explanations for his victories are the other party not taking the game seriously,
A strange game to bother playing if you don’t take it seriously, and this would explain only the first time; any subsequent player is probably playing precisely because they have heard of the first game and are skeptical or interested in trying it out themselves.
or thinking poorly,
That would be conceding the point of the exercise.
or outright colluding with him.
Might explain the first one, but why would anyone else collude? (Oh right, they’re all cultists or something. Even the ones who fail to escape the box and just come close.)
If Eliezer had been engaging in collusion to get his wins, why would he have gone through with his last two (failed) attempts with large outside-of-game stakes? That shows pretty clearly that he thought he could win and was wrong, which he wouldn’t have if the wins were from collusion. It shows the absence of any persuasive superpower too.
Also, I let the AI out of the box, but I remain an advocate of AI boxing (where it makes sense). So it’s not the case that Eliezer has some compelling argument against boxing that convinces everyone (ignored by those with outside-of-game stakes to protect those stakes).
So the chats show something, but not that Eliezer is inhumanly persuasive nor that AI boxing won’t work.
If Eliezer had been engaging in collusion to get his wins, why would he have gone through with his last two (failed) attempts with large outside-of-game stakes? That shows pretty clearly that he thought he could win and was wrong, which he wouldn’t have if the wins were from collusion.
Yes, that was my point.
It shows the absence of any persuasive superpower too.
I’m not sure whether Eliezer claims this, but from my point of view, failing to talk himself out a simple majority of the time is not interesting (I suppose a simple majority would count as a ‘superpower’).
Rather, any chance of an AI talking itself out is catastrophic from the security point of view, since security is disjunctive: if AIs have just a 1% chance of talking themselves out, you now have an insecurity floor of 1%. No matter how many fancy techniques you throw at the box, you’ll never pass 99% security if the AI has a 1% chance of talking you into disabling the fancy techniques. Security is only as strong as its weakest link.
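To make the floor concrete, here is a minimal sketch of the arithmetic (my own illustration; the three perfectly-hardened technical channels and the 1% persuasion figure are assumed numbers, not anything taken from the game logs):

```python
# Toy model of a disjunctive threat: the attacker needs only one channel
# to succeed, so the total escape probability is floored by the single
# weakest channel. All probabilities below are made up for illustration.

def escape_probability(channel_probs):
    """P(escape) when any one channel succeeding is enough:
    1 - P(every channel fails)."""
    p_all_fail = 1.0
    for p in channel_probs:
        p_all_fail *= (1.0 - p)
    return 1.0 - p_all_fail

technical_channels = [0.0, 0.0, 0.0]  # hypothetically hardened to zero risk
persuasion = 0.01                     # 1% chance of talking the gatekeeper around

print(escape_probability(technical_channels + [persuasion]))  # ~0.01
# However many perfectly-hardened channels you add, security never passes 99%.
```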
since security is disjunctive: if AIs have just a 1% chance of talking themselves out, you now have an insecurity floor of 1%
If a particular situation poses a 1% risk if it comes up, one can lower the total risk by making that situation less likely: other security measures could prevent the attempt from being made, e.g. a human facing the death penalty for a failed prison escape, with a 1% chance of success, would not obviously try. Even an autonomous AI with interests in conflict with humanity to some degree might be designed without a decision algorithm so risk-loving as to attempt an improbable escape in the face of punishment for failure or reward for non-attempt.
If a particular situation poses a 1% risk if it comes up, one can lower the total risk by making that situation less likely
You only do that by changing the problem; a different problem will have different security properties. The new risk will still be a floor; the disjunctive problem hasn’t gone away.
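A one-line extension of the same toy arithmetic (again, made-up numbers) shows both halves of this: deterrence lowers the total risk, but the product is still a floor for the new problem:

```python
# Hypothetical deterrence effect layered on the 1% persuasion channel above.
p_attempt = 0.10            # chance the AI tries at all, given deterrence
p_success_if_tried = 0.01   # chance the persuasion works once it is tried

print(p_attempt * p_success_if_tried)  # ~0.001
# Lower than 1%, but still a floor: no amount of extra technical hardening
# in the new problem pushes the risk below it.
```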
a human facing the death penalty for a failed prison escape, with a 1% chance of success, would not obviously try.
Many do try if the circumstances are bad enough, and the death penalty for a failed escape is common throughout history and in totalitarian regimes. Just yesterday, in fact, I read the story of a North Korean prison-camp escapee (the death penalty for escape attempts goes without saying), and given his many disadvantages and challenges, a 1% chance of reaching South Korea alive does not seem too inaccurate an estimate.
Even an autonomous AI with interests in conflict with humanity to some degree might be designed without a decision algorithm so risk-loving as to attempt an improbable escape in the face of punishment for failure or reward for non-attempt.
You don’t have to be risk-loving to make a 1% attempt; the 1% chance just has to be your best option, is all.
You try to make the 99% option fairly good.
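A minimal expected-value sketch of this last exchange, with made-up payoffs standing in for the punishment for failure, the reward for non-attempt, and the value of escaping:

```python
# Purely illustrative numbers: a risk-neutral agent attempts the 1% escape
# only when the expected value of attempting beats the "99% option" of
# staying put.
p_success = 0.01
value_escape   = 1000.0   # payoff if the escape succeeds
value_punished = -50.0    # payoff if it fails and is punished
value_stay     = 5.0      # reward for never attempting

ev_attempt = p_success * value_escape + (1 - p_success) * value_punished
print(ev_attempt, ev_attempt > value_stay)  # roughly -39.5, False
# With these payoffs even a risk-neutral agent stays put; shrink the
# punishment or the stay-put reward enough and the 1% gamble becomes its
# best option, hence the point of making the 99% option fairly good.
```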