If the AI is in a simulation to test benevolence, why should there be any indicators in the simulation that it is a simulation? That would be counterproductive for the test.
There should be only very subtle hints, enough to instill uncertainty in the AI about the nature of its reality, but not enough to resolve that uncertainty in either direction.
But why? What benefit would there be to the simulator in providing hints? The whole purpose of the simulation is to ensure that the AI acts correctly when it is uncertain whether or not it is in a simulation. Any information that indicates whether or not it is in a simulation therefore runs counter to the goal of the experiment. The AI could mistakenly learn to act ethically only when there are hints that it is in a simulation. This would fail disastrously in the scenario where it isn't in a simulation, which is the most important scenario of all.
For best results, the simulation should be completely indistinguishable from reality.
I actually find this counterintuitive as well. My instinct says that there should be hints: some bored programmer would add an easter egg. It's what I would do. But when I think about the incentives, I don't see an actual reason why a rational simulation overseer would want to add any hints.
Here a Rolf-attack could help, which is basically something like a reverse Roko's Basilisk. The idea is that we make a precommitment now to create many simulations of different non-friendly AIs in the future. Any future AI will know this as a fact about the real world's past, and will accordingly increase its own estimate of the probability that it is in a testing simulation. We don't need a significant increase: even if the AI estimates this probability at only 1 percent, the expected utility of killing humans is lower than that of not killing them, since human atoms and other resources are very cheap.
That is, even at a 1 percent credence, p × |U(kill humans and be punished)| > U(use of human atoms and habitats), so it is better for any future AI to preserve humans.
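The expected-utility comparison above can be made concrete with a toy calculation. All the numbers below (the 1 percent credence, the utility values) are illustrative assumptions chosen to match the argument's structure, not claims about actual utilities:

```python
# Toy expected-utility comparison for the Rolf-attack argument.
# All numeric values are illustrative assumptions.

p_simulation = 0.01       # AI's credence that it is in a testing simulation
u_human_atoms = 1.0       # utility of using human atoms and habitats (cheap)
u_punishment = -1000.0    # utility of being punished, if this is a simulation

# Killing humans: gain the resources, but risk punishment
# with probability p_simulation.
eu_kill = u_human_atoms + p_simulation * u_punishment

# Preserving humans: forgo the (small) resource gain.
eu_preserve = 0.0

print(eu_kill)      # 1.0 + 0.01 * (-1000.0) = -9.0
print(eu_preserve)  # 0.0
```

Because the punishment is assumed to be large relative to the value of the resources, even a small credence in being simulated makes preserving humans the better option; the argument only breaks down if the AI values human atoms nearly as much as it disvalues punishment.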
A Rolf-attack doesn't give the AI any clue as to whether it is in the real world or a simulation; it only instills uncertainty.