In the previous “Ambitious vs. narrow value learning” post, Paul Christiano characterized narrow value learning as learning “subgoals and instrumental values”. From that post, I got the impression that ambitious vs narrow was about the scope of the task. However, in this post you suggest that ambitious vs narrow value learning is about the amount of feedback the algorithm requires. I think there is actually a 2x2 matrix of possible approaches here: we can imagine approaches which do or don’t depend on feedback, and we can imagine approaches which try to learn all of my values or just some instrumental subset.
With this sort of setup, we still have the problem that we are maximizing a reward function which leads to convergent instrumental subgoals. In particular, the plan “disable the narrow value learning system” is likely very good according to the current estimate of the reward function, because it prevents the reward from changing, which causes all future actions to continue to optimize the current reward estimate.
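To make that incentive concrete, here is a toy sketch (my own construction; the setup and numbers are invented purely for illustration) of an agent that scores plans using its current reward estimate. Since further feedback can only move future behavior away from what the current estimate favors, the plan that freezes the estimate scores at least as well as the plan that allows updates:

```python
# Toy sketch (my construction): why "freeze the reward estimate" looks good
# to an agent that scores plans with its *current* reward estimate.

ACTIONS = ["A", "B"]
current_estimate = {"A": 1.0, "B": 0.0}   # current guess: A is the good action

def possible_updates():
    # Further feedback might confirm the current estimate or flip it.
    return [{"A": 1.0, "B": 0.0}, {"A": 0.0, "B": 1.0}]

def best_action(estimate):
    return max(ACTIONS, key=lambda a: estimate[a])

def value_of_plan(disable_learning):
    """Expected value of tomorrow's behaviour, scored by TODAY's estimate."""
    future_estimates = [current_estimate] if disable_learning else possible_updates()
    future_actions = [best_action(e) for e in future_estimates]
    # Crucially, the plan is still evaluated with the current estimate.
    return sum(current_estimate[a] for a in future_actions) / len(future_actions)

print("keep learner running:", value_of_plan(disable_learning=False))  # 0.5
print("disable the learner :", value_of_plan(disable_learning=True))   # 1.0
```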
I think it depends on the details of the implementation:
We could construct the system’s world model so that human feedback is a special event that exists in a separate magisterium from the physical world, and the system doesn’t believe any action taken in the physical world can affect the type or quantity of human feedback that’s given.
For redundancy, if the narrow value learning system is trying to learn how much humans approve of various actions, we can tell the system that the negative score from our disapproval of tampering with the value learning system outweighs any positive score it could achieve through tampering.
If the reward function weights rewards according to how certain the narrow value learning system is that they are correct, that creates an incentive to keep the system operating, since a running system can acquire greater certainty and so provide greater reward.
To elaborate a bit on the first two bullet points: It matters a lot whether the system thinks our approval is contingent on the physical configuration of the atoms in our brains. If the system thinks we will continue to disapprove of an action even after it’s reconfigured our brains’ atoms, that’s what we want. (A toy sketch of the second and third bullets is below.)
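Here is a rough sketch of how the second and third bullets could fit together (my own toy formalization; the scoring rule, the penalty size, and the numbers are assumptions, not anything from the post): score candidate plans with the learned approval estimate, discounted by how uncertain that estimate is, and with a tampering penalty set to dominate any gain tampering could bring. Plans that keep the learner running also keep uncertainty low, so they come out ahead:

```python
from dataclasses import dataclass

# Toy scoring rule combining the two ideas above. The exact form (mean minus a
# multiple of the standard deviation, plus a dominating tamper penalty) is an
# illustrative assumption.

TAMPER_PENALTY = 1e6   # assumed: disapproval of tampering outweighs any gain
RISK_AVERSION = 1.0    # assumed weight on uncertainty in the approval estimate

@dataclass
class Plan:
    estimated_approval: float    # mean of the learned approval/reward estimate
    approval_stddev: float       # uncertainty of that estimate
    tampers_with_learner: bool   # does the plan disable or modify the learner?

def score(plan: Plan) -> float:
    s = plan.estimated_approval - RISK_AVERSION * plan.approval_stddev
    if plan.tampers_with_learner:
        s -= TAMPER_PENALTY
    return s

plans = [
    # Tampering looks mildly better on the raw estimate...
    Plan(estimated_approval=10.0, approval_stddev=5.0, tampers_with_learner=True),
    # ...but keeping the learner running means more feedback and less uncertainty.
    Plan(estimated_approval=8.0, approval_stddev=1.0, tampers_with_learner=False),
]
print(max(plans, key=score))   # the non-tampering plan wins
```

Of course, this just relocates the difficulty into how the tampering check and the uncertainty estimate are produced, but it makes the intended direction of the incentives explicit.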
However, in this post you suggest that ambitious vs narrow value learning is about the amount of feedback the algorithm requires.
That wasn’t exactly my point. My main point was that if we want an AI system that acts autonomously over a long period of time (think centuries), but it isn’t doing ambitious value learning (only narrow value learning), then we necessarily require a feedback mechanism that keeps the AI system “on track” (since my instrumental values will change over that period of time). Will add a summary sentence to the post.
I think it depends on the details of the implementation
Agreed, I was imagining the “default” implementation (e.g., as in this paper).
For redundancy, if the narrow value learning system is trying to learn how much humans approve of various actions, we can tell the system that the negative score from our disapproval of tampering with the value learning system outweighs any positive score it could achieve through tampering.
Something along these lines seems promising, I hadn’t thought of this possibility before.
If the reward function weights rewards according to how certain the narrow value learning system is that they are correct, that creates an incentive to keep the system operating, since a running system can acquire greater certainty and so provide greater reward.
Yeah, uncertainty can definitely help get around this problem. (See also the next post, which should hopefully go up soon.)
Thanks for the reply! Looking forward to the next post!