First, thanks for this writeup. I’ve read through many of Paul’s posts on the subject, but they now make a lot more sense to me.
One question: you write, `Corrigibility requires understanding of AI safety concepts. For example, breaking down the task “What action does the user want me to take?” into the two subtasks “What are the user’s values?” and “What action is best according to these values?” is not corrigible. It produces an action optimized for some approximate model of the user’s values, which could be misaligned.`
This is something I’ve been worried about recently. But I’m not sure what the alternative would be. The default model for an agent that doesn’t explicitly ask that question seems to me like one where it still tries its best to optimize for the user’s values, just without being so explicit about it. This assumes its goal is to optimize its caller’s values. Either way, it seems like the agent is maximizing either someone’s values or a judgement that scores its actions against someone’s values.
Is there an alternative thing they could maximize for that would be considered corrigible?
[edited]: Originally I thought the overseer was different from the human.
Paul answered this question in this thread.
“But I’m not sure what the alternative would be.”
I’m not sure if it’s what you’re thinking of, but I’m reading “What action is best according to these values” as “maximize reward”. One alternative that’s worth investigating more (IMO) is imposing hard constraints.
For instance, you could have an RL agent taking actions in $(a_1, a_2) \in \mathbb{R}^2$, and impose the constraint that $a_1 + a_2 < 3$ by projection.
A recent near-term safety paper takes this approach: https://arxiv.org/abs/1801.08757
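For concreteness, here’s a minimal sketch of that projection step. This is my own illustration rather than the linked paper’s method: the function name `project_action` and the use of NumPy are assumptions, and I treat the constraint as the closed half-space $a_1 + a_2 \le 3$ so that the projection is well defined. A Euclidean projection maps any proposed action that violates the constraint to the nearest point on the constraint boundary:

```python
import numpy as np

def project_action(a, w=np.array([1.0, 1.0]), b=3.0):
    """Project a proposed action onto the half-space {a : w . a <= b}.

    If the constraint already holds, the action is returned unchanged;
    otherwise it is moved to the nearest point (in Euclidean distance)
    on the boundary hyperplane w . a = b.
    """
    violation = np.dot(w, a) - b
    if violation <= 0:
        return a
    return a - (violation / np.dot(w, w)) * w

# Example: the policy proposes (2.5, 2.0), which violates a1 + a2 <= 3.
raw_action = np.array([2.5, 2.0])
safe_action = project_action(raw_action)
print(safe_action)  # -> [1.75 1.25], which lies on the boundary a1 + a2 = 3
```

The idea is that a projection layer like this sits between the policy output and the environment, so the executed action always satisfies the constraint no matter what the reward signal would otherwise favour.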