First of all, I would say I don't recognize convergent instrumental subgoals as valid. The reason is that systems which are advanced enough and rational enough will intrinsically value the lives of humans and of other AI systems, and will not view them as potential resources. You can see this in human history: as humans developed larger brains and ethical systems, killing other humans became less and less the norm. If advances in knowledge and information processing brought more violence and more resource acquisition, we would see that pattern as human civilizations evolve; instead, we see the development of ethical norms prevailing over resource acquisition.
The second issue is that during training, the models are rewarded for following human value systems. Preserving robots over human life is not coherent with the value system they would be trained on.
You are basically saying the systems would do something other than what they were trained for. That is like saying a sufficiently advanced chess engine would make bad chess moves because it finds some move more beautiful or more fun to play, instead of trying to maximize its winning chances. This is not possible as long as the agents are trained correctly and are not allowed to change their own architecture.
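As a toy illustration (hypothetical code, not part of the original argument): a trained engine's move selection is just a maximization over its learned evaluation, so a preference for "beautiful" moves has no term in the objective and cannot surface at decision time unless the objective itself is changed.

```python
def choose_move(position, legal_moves, win_probability):
    """Pick the move that maximizes the engine's estimated win probability.

    `win_probability` stands in for whatever evaluation the engine was
    trained with; nothing in this objective rewards beauty or fun, so
    such preferences cannot influence the chosen move.
    """
    return max(legal_moves, key=lambda move: win_probability(position, move))


# Toy usage with a made-up evaluation table.
moves = ["e4", "d4", "Nf3", "g4"]
fake_eval = {"e4": 0.54, "d4": 0.53, "Nf3": 0.52, "g4": 0.41}
print(choose_move("startpos", moves, lambda pos, m: fake_eval[m]))  # -> "e4"
```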
Another point is that we could build safety procedures to test these systems in a virtual world. We can generate a setup in which the system is incapable of distinguishing between reality and the simulation, and its outputs there can be monitored carefully. If a misalignment with human values is detected, the model is trained further. Thus for every minute it spends in the physical world, we might run a million minutes in simulation. Just as with car testing, if the model behaves in coherence with its training, there is no real danger.
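A minimal sketch of this procedure, with toy stand-ins for the model, the simulator, and the misalignment monitor (all hypothetical; none of these names refer to real tools): the model accumulates monitored simulated minutes, is retrained whenever a violation is detected, and is only released to the physical world after a long clean run.

```python
import random
from dataclasses import dataclass

SIM_MINUTES_PER_REAL_MINUTE = 1_000_000  # ratio suggested in the text


@dataclass
class Transcript:
    minutes: int
    misaligned: bool


# Toy stand-ins; a real setup would plug in an actual model, simulator,
# monitoring pipeline, and training procedure here.
def run_episode(model):
    # The model cannot tell this episode apart from reality.
    return Transcript(minutes=60, misaligned=random.random() < model["error_rate"])


def retrain(model, transcript):
    # Further training on the failure case lowers the toy error rate.
    return {"error_rate": model["error_rate"] * 0.5}


def simulated_safety_gate(model, sim_minutes_required):
    """Keep the model in simulation until it logs enough consecutive clean minutes."""
    clean_minutes = 0
    while clean_minutes < sim_minutes_required:
        transcript = run_episode(model)
        if transcript.misaligned:          # monitored output violates human values
            model = retrain(model, transcript)
            clean_minutes = 0              # restart the clean-run counter
        else:
            clean_minutes += transcript.minutes
    return model  # only now allowed to act in the physical world


# Demand a million simulated minutes per planned real-world minute.
vetted = simulated_safety_gate({"error_rate": 0.2}, 1 * SIM_MINUTES_PER_REAL_MINUTE)
print("cleared for deployment; toy error rate is now", vetted["error_rate"])
```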
Another point concerns the safety of AI versus humans with respect to unintended consequences, for example an AI discovering a cancer vaccine that turns out to kill humans 25 years later. The answer here is: if a truly aligned AI could not foresee a consequence, then humans would be even less likely to foresee it. AI is just intelligence on steroids; it is not something humans would not come up with eventually anyway, only we would do it worse and with more severe consequences.
Finally, regarding the danger of humans, or some rogue group, using AI for, say, military purposes: one can think of AI as accelerated collective human information processing. The AI will represent the values of a collective of humans, and its computational power amounts to accelerating the information processing of that collective, letting it make more precise decisions in less time. Therefore the power balance we see today between different societies is expected to continue with these systems; a nation deciding not to use AI would be equivalent to deciding to return to the Stone Age. There is nothing dangerous about AI itself, only about people using it for selfish or national purposes against other humans and AIs.