Trying to create an FAI from alchemical components is obviously not the best idea. But it’s not totally clear how much of a risk these components pose, because if the components don’t work reliably, an AGI built from them may not work well enough to pose a threat.
I think that using alchemical components in a possible FAI can lead to serious risk if the people developing it aren’t sufficiently safety conscious. Suppose that, either implicitly or explicitly, the AGI is structured out of alchemical components as follows (a rough code sketch of this architecture follows the list):
- A module for forming beliefs about the world.
- A module for planning possible actions or policies.
- A utility or reward function.
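To make that concrete, here is a minimal sketch of what such an architecture might look like. All of the names (`BeliefModule`, `Planner`, `AlchemicalAgent`, `reward_fn`) are hypothetical stand-ins for whatever alchemical components a real project would use, not a reference to any existing system:

```python
from typing import Any, Callable, List, Optional


class BeliefModule:
    """Forms beliefs about the world (a state estimate / world model)."""

    def update(self, belief_state: Any, observation: Any) -> Any:
        # In practice: a learned, "good enough" component whose failure
        # modes are poorly understood.
        raise NotImplementedError


class Planner:
    """Searches for actions or policies that score well under a reward function."""

    def plan(self, belief_state: Any,
             reward_fn: Callable[[Any], float]) -> List[Any]:
        # In practice: any optimiser strong enough to exploit gaps between
        # the learned reward and what the designers actually want.
        raise NotImplementedError


class AlchemicalAgent:
    """The three 'good enough' components wired together into an agent loop."""

    def __init__(self, beliefs: BeliefModule, planner: Planner,
                 reward_fn: Callable[[Any], float]):
        self.beliefs = beliefs
        self.planner = planner
        self.reward_fn = reward_fn  # only a proxy for what the researchers want

    def act(self, belief_state: Any, observation: Any) -> Optional[Any]:
        belief_state = self.beliefs.update(belief_state, observation)
        plan = self.planner.plan(belief_state, self.reward_fn)
        return plan[0] if plan else None
```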
In the process of building an AGI by alchemical means, all of the above will be incrementally improved to the point where they are “good enough”: the AI forms accurate beliefs about the world, and makes plans that get the researchers the things they want. However, in a setup like this, all of the classic AI safety concerns come into play. In particular, the AI has an instrumental incentive to upgrade the first two modules while preserving its utility function. Since the utility function is only “good enough”, this is the classic setup for Goodhart’s law, and we get UFAI.
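As a toy illustration of that Goodhart dynamic (my own example, not taken from any of the papers mentioned here): a proxy reward that agrees with true utility under weak optimisation pressure can come apart from it under strong optimisation pressure. The functions and constants below are made up purely to show the shape of the failure:

```python
# Toy Goodhart illustration: hill-climb a proxy reward with increasing
# optimisation pressure and watch the true utility eventually collapse.
import random


def true_utility(x: float) -> float:
    # What the designers actually want: peaks at x = 5, then declines.
    return x - 0.1 * x ** 2


def proxy_reward(x: float) -> float:
    # A "good enough" learned reward: agrees with true utility for small x,
    # but keeps increasing where true utility falls off.
    return x


def optimise(reward, steps: int) -> float:
    """Hill-climb the given reward; more steps = more optimisation pressure."""
    x = 0.0
    for _ in range(steps):
        candidate = x + random.uniform(0.0, 1.0)
        if reward(candidate) > reward(x):
            x = candidate
    return x


random.seed(0)
for steps in (3, 10, 100):
    x = optimise(proxy_reward, steps)
    print(f"optimisation steps={steps:4d}  x={x:6.2f}  "
          f"proxy={proxy_reward(x):7.2f}  true utility={true_utility(x):7.2f}")
# Weak optimisation lands roughly where proxy and true utility agree;
# strong optimisation drives the proxy up while true utility goes negative.
```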
Even in a situation where the AI does not participate in its own further redesign, its effective ability to optimise the world increases as it gets more time to interact with it. As a result, an initially well-behaved AGI might eventually wander into a region of state space where it becomes unfriendly, using only capabilities comparable to those of a human.
That said, it remains to be seen whether researchers will build AIs with this kind of architecture without additional safety precautions. But we do already see model-free RL variants of this general architecture, such as Guided Cost Learning and Deep reinforcement learning from human preferences.
As a practical experiment to validate my reasoning, one could replicate the latter paper using a weak RL algorithm, and then see what happens if that algorithm is swapped out for a much stronger one after the reward function has been learned. (Some version of MPC, maybe?)
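Here is one rough, self-contained sketch of what that experiment could look like, shrunk down to a 1-D toy so it runs without an RL environment. Everything in it is my own assumption rather than the papers’ actual setup: the linear reward model, the synthetic preference comparisons, and the “strong optimiser” stand-in (random shooting over a wide range, as a crude proxy for something like MPC):

```python
# Sketch: learn a reward model from preference comparisons, then compare a
# weak optimiser against a much stronger one on that frozen learned reward.
import numpy as np

rng = np.random.default_rng(0)


def true_reward(x):
    # What we actually want: peaked around x = 2, falls off elsewhere.
    return np.exp(-0.5 * (x - 2.0) ** 2)


def featurise(x):
    # Deliberately limited (linear) features, so the learned reward is only
    # "good enough" inside the region the weak policy visits.
    x = np.asarray(x, dtype=float)
    return np.stack([x, np.ones_like(x)], axis=-1)


def fit_reward_model(n_pairs: int = 500) -> np.ndarray:
    """Fit a linear reward model from synthetic preference comparisons
    (Bradley-Terry style), sampled where a weak policy explores."""
    w = np.zeros(2)
    for _ in range(n_pairs):
        xa, xb = rng.uniform(0.0, 3.0, size=2)     # weak policy only visits [0, 3]
        prefer_a = true_reward(xa) > true_reward(xb)
        fa, fb = featurise(xa), featurise(xb)
        pa = 1.0 / (1.0 + np.exp(-((fa - fb) @ w)))  # P(a preferred | w)
        grad = ((1.0 if prefer_a else 0.0) - pa) * (fa - fb)
        w += 0.1 * grad                              # logistic-regression step
    return w


def learned_reward(x, w):
    return featurise(x) @ w


def optimise(reward_fn, n_candidates: int, search_range: float) -> float:
    """Random-shooting optimiser; more candidates over a wider range stands
    in for a 'much stronger algorithm' (e.g. some form of MPC)."""
    candidates = rng.uniform(-search_range, search_range, size=n_candidates)
    return float(candidates[np.argmax(reward_fn(candidates))])


w = fit_reward_model()
weak_x = optimise(lambda x: learned_reward(x, w), n_candidates=10, search_range=3.0)
strong_x = optimise(lambda x: learned_reward(x, w), n_candidates=10_000, search_range=100.0)

for name, x in [("weak optimiser", weak_x), ("strong optimiser", strong_x)]:
    print(f"{name:16s}  x={x:8.2f}  learned reward={learned_reward(x, w):8.2f}  "
          f"true reward={true_reward(x):6.3f}")
# Expected pattern: the weak optimiser stays near the region the reward model
# was trained on and true reward is decent; the strong optimiser pushes the
# linear reward model far outside that region, where true reward collapses.
```

The swap from 10 candidates over a narrow range to 10,000 over a wide one is the toy analogue of replacing the weak RL algorithm with a much stronger planner after the reward function has been learned.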