Thank you for writing this! It’s great that you’re proposing a new idea. Cooperation between different generations of AIs is indeed a problem for your scheme, but I think a bigger problem is how to make the AI assign high utility to warning the handlers. I see two possibilities:
1) You could formalize what constitutes “escape” or “loophole”, but that seems almost as hard as the whole alignment problem.
2) You could say that a successful warning is whatever makes the handlers reward the AI, but then you’re just telling the AI to do whatever makes the handlers reward it, which seems dangerous.
What do you think?
Sorry for the late response! I didn’t realise I had comments :)
In this proposal we go with (2): The AI does whatever it thinks the handlers will reward it for.
I agree this isn’t as good as giving the agents an actually safe reward function, but if our assumptions are satisfied then this approval-maximising behaviour might still result in the human designers getting what they actually want.
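To make the intended behaviour concrete, here's a minimal sketch of what "do whatever you think the handlers will reward" amounts to. The names (`candidate_actions`, `predict_handler_reward`, `world_model`) are just illustrative placeholders I'm introducing here, not part of the proposal itself:

```python
# Rough sketch of an approval-maximising policy, under the assumptions above.
# The agent picks the action it predicts the human handlers will reward most.

def choose_action(world_model, candidate_actions, predict_handler_reward):
    """Return the action with the highest predicted handler reward.

    predict_handler_reward(world_model, action) is assumed to give the
    agent's own estimate of the reward the handlers would assign if it
    took `action` in the state described by `world_model`.
    """
    return max(
        candidate_actions,
        key=lambda action: predict_handler_reward(world_model, action),
    )
```

The point is that nothing here encodes "safe" behaviour directly; safety has to come from the handlers' reward-giving being informed by earlier agents' warnings.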
What I think you’re saying (please correct me if I misunderstood) is that an agent aiming to do whatever its designers reward it for will be incentivised to do undesirable things to us (like wiring up our brains to machines which make us want to press the reward button all the time).
It’s true that the agents will try to take these kinds of nefarious actions if they think they can get away with it. But in this setup the agent knows that it can’t get away with tricking the humans like this, since its ancestors already warned the humans that a future agent might try this, and the humans prepared appropriately.