This architecture is quite similar to (though much more fleshed out and detailed than) what I’ve been proposing in, e.g., the “one possible alternative” towards the end of this comment.
One potential issue is that the AI needs to choose actions based on
a) the true utility function of humans, which to the best of its current knowledge is represented by distribution X, but which, conditional on some further information A, would be represented by some other distribution Y,
and not based simply on
b) the utility function distribution X, which represents its current best knowledge of the true utility function of humans.
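To state this a bit more precisely (this is just one way to formalize it, and the details aren’t essential to the point): write $X = p(U)$ for the AI’s current distribution over the humans’ true utility function $U$, $p(U \mid A)$ for the updated distribution after observing some information $A$, and $o_{\pi,A}$ for the outcome of plan $\pi$ in the branch where $A$ is observed. Then the two decision rules are roughly

$$\text{(a)}\qquad \max_\pi \; \mathbb{E}_{A}\!\left[\, \mathbb{E}_{U \sim p(U \mid A)}\big[\, U(o_{\pi,A}) \,\big] \right]$$

$$\text{(b)}\qquad \max_\pi \; \mathbb{E}_{A}\!\left[\, \mathbb{E}_{U \sim p(U)}\big[\, U(o_{\pi,A}) \,\big] \right]$$

For plans whose outcome does not depend on $A$, the two rules agree by the law of total expectation; they come apart exactly for information-gathering plans like action 3 in the example below, because (a) scores each branch with the utility distribution as it will stand in that branch, while (b) scores every branch with the frozen $X$.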
For example, imagine that the AI is deciding whether to wirehead all humans, so thoroughly that they will all remain in a comatose state until death. The AI thinks there is a 60% chance humans would want this, that conditional on wanting it they would value it at 100 utilons, and that conditional on not wanting it they would disvalue it at −100 utilons.
The AI considers three actions:
1) wirehead the humans, expected value assessed as 20 utilons under either (a) or (b)
2) not wirehead the humans, expected value assessed as −20 utilons under either (a) or (b)
3) ask the humans at zero cost what they want, and do what they want, expected value assessed as 60 utilons under (a) or 4 utilons under (b) (the arithmetic is sketched just below).
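A quick sketch of that arithmetic (the one number not stated above is the value, after asking, of correctly not wireheading humans who said no; I take it to be 0, which is what gives the 60-utilon figure for action 3 under (a)):

```python
p_yes = 0.6                    # credence that humans want to be wireheaded

# Values of wireheading conditional on the humans' true preference:
u_wirehead_if_wanted   = 100
u_wirehead_if_unwanted = -100

# Actions 1 and 2: evaluated with the current utility distribution X,
# so (a) and (b) agree.
ev_wirehead = p_yes * u_wirehead_if_wanted + (1 - p_yes) * u_wirehead_if_unwanted
ev_dont     = -20              # as stated above
print(ev_wirehead, ev_dont)    # 20.0 -20

# Action 3 under (a): the AI expects to update its utility distribution on the
# answer and then act, so each branch is scored with the updated values.
u_correctly_refrain = 0        # my assumption (see lead-in)
ev_ask_a = p_yes * u_wirehead_if_wanted + (1 - p_yes) * u_correctly_refrain
print(ev_ask_a)                # 60.0

# Action 3 under (b): the AI predicts it will obey the answer, but keeps scoring
# each branch with the frozen distribution X, under which wireheading is still
# worth 20 and refraining -20 no matter what the humans said.
ev_ask_b = p_yes * ev_wirehead + (1 - p_yes) * ev_dont
print(ev_ask_b)                # 4.0
```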
So, under (a), the AI will correctly ask humans what they want, and under (b) will self-modify to disable the utiliscope (or otherwise prevent the utility function from being updated) and then wirehead all humans in the face of our pleas to the contrary.
I think that your architecture, with the utility function separated from the world model, results in (b), unless there’s some mechanism I’m not aware of that achieves (a). In my version, I am hoping that making the utiliscope a pointer to a concept in the world model, identified using interpretability tools, rather than extracting a utility function that is handled separately, might be able to achieve (a).
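To illustrate the structural difference I have in mind, here is a toy sketch (all names and numbers are purely illustrative, not either proposal’s actual design, and I give refraining a flat value of 0, so the exact figures differ slightly from the example above): the first planner scores every imagined branch with a utility distribution handed to it as a separate, frozen object, while the second reads the “what humans want” variable out of each branch’s simulated world-model state, so evidence gathered inside the plan flows into the evaluation automatically.

```python
# Toy contrast between the two architectures; everything here is hypothetical.

def simulate(plan):
    """Imagined branches of a plan: (probability, beliefs in that branch, action taken)."""
    prior = 0.6                                              # current credence that humans want wireheading
    if plan == "ask_then_obey":
        return [(prior,     {"p_want": 1.0}, "wirehead"),    # branch: humans say yes
                (1 - prior, {"p_want": 0.0}, "refrain")]     # branch: humans say no
    return [(1.0, {"p_want": prior}, plan)]                  # act without asking

def utility(action, p_want):
    """Value of an action given a credence that humans want wireheading."""
    if action == "wirehead":
        return p_want * 100 + (1 - p_want) * (-100)
    return 0                                                 # refraining treated as neutral here

def score_separate_utility_module(plan, frozen_p_want=0.6):
    """(b)-style: the utility function is a frozen object, blind to in-plan evidence."""
    return sum(p * utility(act, frozen_p_want) for p, _, act in simulate(plan))

def score_pointer_into_world_model(plan):
    """(a)-style: utility is read off each branch's updated world-model state."""
    return sum(p * utility(act, beliefs["p_want"]) for p, beliefs, act in simulate(plan))

for plan in ("wirehead", "refrain", "ask_then_obey"):
    print(plan, score_separate_utility_module(plan), score_pointer_into_world_model(plan))
# "ask_then_obey" scores below "wirehead" for the first planner (12 vs 20),
# and above it for the second (60 vs 20).
```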
I also agree with Charlie Steiner’s comment.