Intuition Dump.
Safety through this mechanism seems to me like aiming a rocket at Mars and accidentally hitting the Moon. There might well be a region where P(doom|superintelligence) is lower, but lower only in the sense that 90% is lower than 99.9%.
Suppose we have the clearest, most stereotypical case of wireheading as I understand it: a mesa optimizer with a detailed model of its own workings and the terminal goal of maximizing the flow of current in some wire. (During training, current in the reward-signal wire reliably correlated with reward.)
The plan that maximizes this flow long term is to take over the universe and store all the energy, to gradually turn into electricity. If the agent gains access to its own internals before it really understands that it is an optimizer, or before it is thinking about the long-term future of the universe, it might manage to brick itself. If the agent is sufficiently myopic, is not considering time travel or acausal trade, and has the choice between running high-voltage current through itself now or slowly taking over the world, it might choose the former.
Note that both of these look like hitting a fairly small target in design-and-lab-environment space. The mesa optimizer might have some other terminal goal. Suppose the prototype AI gets the opportunity to write arbitrary code to an external computer, and an understanding of AI design, before it has self-modification access. The AI creates a subagent that cares about the amount of current in a wire inside the first AI; the subagent can optimize this without destroying itself. Even if the first agent then bricks itself, we have an AI that will dig the fried circuit boards out of the trashcan and throw all the cosmic commons into protecting and powering them.
In conclusion, this is not a safe part of agentspace, just a part that’s slightly less guaranteed to kill you. I would say it is of little to no strategic importance, especially if you think that all humans making AI will be reasonably on the same page regarding safety; in that case, scenarios where AI alignment is nearly solved and yet the first people to reach ASI barely know the field exists are unlikely. If the first ASI self-destructs for reasons like this, we have all the pieces for making superintelligence, and people with no sense of safety are trying to make one. I would expect another attempt a few weeks later to doom us. (Unless the first AI bricked itself in a sufficiently spectacular manner, like hacking into nukes to create a giant EMP in its circuits. That might get enough people seeing the danger to have everyone stop.)