Agree that it’s useful to disentangle them, but it’s also useful to realise that they can’t be fully disentangled… yet.
I actually don’t understand why you say they can’t be fully disentangled.
IIRC, your main objection during the discussion was about whether (e.g.) “arbitrarily long deliberation (ALD)” was, or even could be, fully specified in a way that properly accounts for things like deception and manipulation. More concretely, I think you mentioned the possibility of an AI affecting the deliberation process in an undesirable way.
But I think it’s reasonable to assume (for the purposes of this discussion) that there is, in principle, a non-terrible way to specify things like “manipulation”. Do you disagree? Or is your objection something else entirely?
Hey there!
I’ve given a longer answer here: https://www.lesswrong.com/posts/Q7WiHdSSShkNsgDpa/how-much-can-value-learning-be-disentangled