If Eliezer had been engaging in collusion to get his wins, why would he have gone through with his last two (failed) attempts with large outside-of-game stakes? That shows pretty clearly that he thought he could win and was wrong, which he wouldn’t have thought if his earlier wins had come from collusion.
Yes, that was my point.
It shows the absence of any persuasive superpower too.
I’m not sure whether Eliezer claims this, but from my point of view, failing to talk himself out a simple majority of the time is not interesting (I suppose succeeding a simple majority of the time is what would count as a ‘superpower’).
Rather, any chance of an AI talking itself out is catastrophic from the security point of view, since security is disjunctive and if AIs have just a 1% chance of talking themselves out, you now have an insecurity floor of 1% - no matter how many fancy techniques you throw at the box, you’ll never pass 99% security if the AI has a 1% chance of talking you into disabling the fancy techniques. Security is only as strong as its weakest link.
since security is disjunctive and if AIs have just a 1% chance of talking themselves out, you now have an insecurity floor of 1%
If a particular situation poses a 1% risk if it comes up, one can lower the total risk by making that situation less likely: other security measures could stop the attempt from being made at all, e.g. a human facing the death penalty for a failed prison escape, with only a 1% chance of success, would not obviously try. Even an autonomous AI with interests in conflict with humanity to some degree might be designed without such a risk-loving decision algorithm as to try an improbable escape attempt in the face of punishment for failure or reward for non-attempt.
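A rough way to see this point, with hypothetical numbers and assuming the 1% figure is conditional on an attempt actually being made:

```python
# Sketch of the deterrence argument: if the 1% success figure is conditional on
# an attempt being made, lowering the attempt probability lowers the total risk.
# All numbers here are hypothetical.
p_success_given_attempt = 0.01

for p_attempt in [1.0, 0.5, 0.1, 0.01]:  # how likely the AI (or prisoner) is to try at all
    total_risk = p_attempt * p_success_given_attempt
    print(f"P(attempt)={p_attempt:.2f} -> total escape risk {total_risk:.4f}")
```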
If a particular situation poses a 1% risk if it comes up, one can lower the total risk by making that situation less likely
You only do that by changing the problem; a different problem will have different security properties. The new risk will still be a floor; the disjunctive problem hasn’t gone away.
a human facing the death penalty for a failed prison escape, with only a 1% chance of success, would not obviously try.
Many do try if the circumstances are bad enough, and the death penalty for a failed escape is common throughout history and in totalitarian regimes. Just yesterday, in fact, I read an account of a North Korean prison camp escapee (the death penalty for escape attempts goes without saying) for whom, given his many disadvantages and challenges, a 1% chance of reaching South Korea alive does not seem too inaccurate an estimate.
Even an autonomous AI with interests in conflict with humanity to some degree might be designed without such a risk-loving decision algorithm as to try an improbable escape attempt in the face of punishment for failure or reward for non-attempt.
You don’t have to be risk-loving to make a 1% attempt; the 1% chance just has to be your best option, is all.
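As a sketch of that decision-theoretic point, with payoffs assumed purely for illustration: a risk-neutral agent takes the 1% gamble whenever its expected utility beats staying put.

```python
# Sketch of the expected-utility comparison; the utilities are assumed for illustration.
p_win = 0.01
u_escape, u_caught, u_stay_put = 1000.0, -10.0, 0.0  # hypothetical payoffs

eu_attempt = p_win * u_escape + (1 - p_win) * u_caught  # 0.01*1000 + 0.99*(-10) = 0.1
eu_abstain = u_stay_put                                  # 0.0

# No risk-loving preference is needed: the agent attempts whenever the gamble is
# simply its best option, i.e. eu_attempt > eu_abstain.
print(eu_attempt, eu_abstain, eu_attempt > eu_abstain)
```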
You try to make the 99% option fairly good.