If it’s not misuse, the provisions in 5.1.4-5 will steer the search process away from policies that attempt to propagandize to humans.
Ok I’ll quote 5.1.4-5 to make it easier for others to follow this discussion:
5.1.4. It may be that the easiest plan to find involves an unacceptable degree of power-seeking and control over irrelevant variables. Therefore, the score function should penalize divergence of the trajectory of the world state from the trajectory of the status quo (in which no powerful AI systems take any actions).
5.1.5. The incentives under 5.1.4 by default are to take control over irrelevant variables so as to ensure that they proceed as in the anticipated “status quo”. Infrabayesian uncertainty about the dynamics is the final component that removes this incentive. In particular, the infrabayesian prior can (and should) have a high degree of Knightian uncertainty about human decisions and behaviour. This makes the most effective way to limit the maximum divergence (of human trajectories from the status quo) actually not interfering.
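To keep the discussion concrete, here is a toy sketch of my naive reading of 5.1.4-5. Everything in it is my own guess rather than anything from the OAA writeup: the two-coordinate world state, the credal set of human-reaction hypotheses, the particular divergence (summed L1 distance between trajectories), and the penalty weight.

```python
# Toy reading of 5.1.4-5 (my own construction, not Davidad's spec):
# score a plan by task performance minus the *worst-case* divergence from
# the no-op "status quo" trajectory, where the worst case ranges over a
# credal set of hypotheses about how humans react (the Knightian part).

def make_model(human_reactivity: float):
    """One hypothesis about the world: humans shift their opinion by
    `human_reactivity` for every unit of AI persuasion effort."""
    def step(state, ai_action):
        resources, human_opinion = state
        resources += ai_action.get("gather", 0.0)
        human_opinion += human_reactivity * ai_action.get("persuade", 0.0)
        return (resources, human_opinion)
    return step

# Knightian uncertainty: no probabilities over these, just a set of hypotheses.
CREDAL_SET = [make_model(r) for r in (-2.0, -0.5, 0.5, 2.0)]

NOOP_ACTION = {}  # the status quo: no powerful AI takes any action

def rollout(model, plan, init_state=(0.0, 0.0)):
    state, traj = init_state, [init_state]
    for action in plan:
        state = model(state, action)
        traj.append(state)
    return traj

def divergence(traj_a, traj_b):
    """Crude stand-in for 'divergence of the world-state trajectory':
    summed L1 distance between corresponding states."""
    return sum(abs(a[0] - b[0]) + abs(a[1] - b[1]) for a, b in zip(traj_a, traj_b))

def worst_case_divergence(plan):
    """Max over the credal set of the plan's divergence from that same
    hypothesis's no-op trajectory."""
    return max(
        divergence(rollout(m, plan), rollout(m, [NOOP_ACTION] * len(plan)))
        for m in CREDAL_SET
    )

def task_score(plan):
    """Toy task: gather 3 units of resources."""
    return -abs(sum(a.get("gather", 0.0) for a in plan) - 3.0)

def score(plan, penalty_weight=1.0):
    return task_score(plan) - penalty_weight * worst_case_divergence(plan)

honest_plan = [{"gather": 1.0}] * 3
persuasive_plan = [{"gather": 1.0, "persuade": 1.0}] * 3
print("honest     :", score(honest_plan))      # divergence only on resources
print("persuasive :", score(persuasive_plan))  # amplified by the worst-case hypothesis
```

Under this toy reading, side effects that don't route through humans (the resources coordinate) diverge by the same amount under every hypothesis, whereas pushing on human opinion gets amplified by the most reactive hypothesis in the credal set, so the worst-case penalty falls hardest on interference with humans.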
I’m not sure how these are intended to work. How do you intend to define/implement “divergence”? And how does that definition/implementation, combined with a “high degree of Knightian uncertainty about human decisions and behaviour”, actually cause the AI to “not interfere” while still accomplishing the goals we give it?
In order to accomplish its goals, the AI has to do lots of things that will have butterfly effects on the future, so the system has to allow it to do those things, but also not allow it to “propagandize to humans”. It’s just unclear to me how you intend to achieve this.
This doesn’t directly answer your questions, but since the OAA already requires global coordination and agreement to follow the plans spit out by the superintelligent AI, maybe propagandizing people is not necessary. That’s especially plausible if, by the time the OAA becomes possible, the economy and science are already largely automated by CoEms and don’t need to involve motivated humans.
Then, the time-boundedness of the plan raises the chances that the plan doesn’t concern itself with changing people’s values and preferences as a side effect (those values and preferences will be relevant for the ongoing work of shaping the constraints and desiderata for the next iteration of the plan). Some such interference with values will inevitably happen, though. That’s what Davidad acknowledges when he writes, “A de-pessimizing OAA would effectively buy humanity some time, and freedom to experiment with less risk, for tackling the CEV-style alignment problem—which is harder than merely mitigating extinction risk.”