One thing you can do to stop a robot from destroying itself is to give it more-or-less any RL reward function whatsoever, and then get better and better at designing the agent to understand the world and itself and to act in the service of getting that reward; self-preservation then falls out of instrumental convergence. For example, each time the robot destroys itself, you build a new one seeded with the old one’s memory and tell it that its actions last time earned a negative reward, and it learns not to do that in the future. Remember, an AGI doesn’t need a robot body: a prototype AGI that accidentally corrupts its own code can be recreated instantly at essentially zero cost. Why, then, build safeguards at all?
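To make the rebuild-and-penalize loop concrete, here is a minimal sketch. Everything about the setup (a five-state chain environment, the penalty size, tabular Q-learning) is my own assumption rather than anything specified above; the Q-table plays the role of the memory carried from the destroyed robot into its replacement, and the fatal action is scored with a large negative reward:

```python
import random
from collections import defaultdict

# Toy setup (assumed, not from the original comment): a 5-state chain where
# action 0 steps right toward a +1 reward at the end and action 1 is
# "self-destruct", which ends the robot's lifetime immediately.
N_STATES, GOAL = 5, 4
DESTRUCT_PENALTY = -10.0
ALPHA, GAMMA, EPSILON = 0.5, 0.9, 0.1

def run(lifetimes=500):
    q = defaultdict(float)       # the memory carried from robot to robot
    destructions = 0
    for _ in range(lifetimes):   # each iteration is one robot's lifetime
        s = 0
        while True:
            # epsilon-greedy choice between "step right" (0) and "self-destruct" (1)
            if random.random() < EPSILON:
                a = random.choice([0, 1])
            else:
                a = max((0, 1), key=lambda x: q[(s, x)])
            if a == 1:
                # the replacement robot is "told" this action was bad:
                q[(s, a)] += ALPHA * (DESTRUCT_PENALTY - q[(s, a)])
                destructions += 1
                break            # next iteration reuses q, i.e. the old memory
            s2 = min(s + 1, GOAL)
            r = 1.0 if s2 == GOAL else 0.0
            target = r if s2 == GOAL else r + GAMMA * max(q[(s2, 0)], q[(s2, 1)])
            q[(s, a)] += ALPHA * (target - q[(s, a)])
            s = s2
            if s == GOAL:
                break
    return destructions

print("self-destructions over 500 lifetimes:", run())
```

Because the penalized transition lives in the shared Q-table, later "lifetimes" stop choosing the self-destruct action except during epsilon-exploration.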
Safeguards would be more likely if the AGI were, say, causing infrastructure damage while learning; I can definitely see someone removing its internet access after a mishap like that. But that still isn’t an adequate safeguard: once the AGI gets intelligent enough, it can hack or social-engineer its way past safeguards that were working before.
I think this scheme doesn’t quite catch the abulia trap (where the AGI discovers a way to administer reward to itself directly and then ceases to interact with the outside world). It’s not clear that the AI would learn the map/territory distinction and locate its goals in the territory (one way to avoid the trap), rather than just a prohibition against many sorts of self-modification or reward tampering (which avoids it only until the AI comes up with a clever new approach).
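To gesture at that distinction, here is a toy sketch (my framing and names, not anything from the comment above): the same world scored two ways, once by reading a reward register the agent can overwrite (goal located in the map) and once by recomputing reward from the world state itself (goal located in the territory). Tampering pays off only in the first case:

```python
class ToyWorld:
    """A world with one real quantity (widgets) and one reward sensor."""

    def __init__(self):
        self.widgets = 0           # the real thing we care about (territory)
        self.reward_sensor = 0.0   # a register reward can be read from (map)

    def act(self, action):
        if action == "make_widget":
            self.widgets += 1
            self.reward_sensor += 1.0   # the sensor normally tracks the world
        elif action == "tamper":
            self.reward_sensor = 1e9    # overwrite the sensor directly
        map_reward = self.reward_sensor          # goal located in the map
        territory_reward = float(self.widgets)   # goal located in the territory
        return map_reward, territory_reward

w = ToyWorld()
print(w.act("tamper"))        # map reward jumps to 1e9, territory reward stays 0
print(w.act("make_widget"))   # only real work moves the territory reward
```

An agent optimizing map_reward learns that tampering dominates everything else and then has no further reason to act, which is the abulia trap; an agent whose goal is territory_reward gains nothing from tampering. A prohibition on the tamper action blocks this particular exploit, but not the next one the agent finds.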