There’s a point worth emphasizing here about myopia and the existing work around it, as OP notes. I don’t see myopia as generally promising, because of FDT-style reasoning: a VNM-style agent will continually optimise for consistency over longer time periods.
Therefore it seems as though myopic-reasoning RLHF will tend towards the same failure modes as ordinary RLHF in the limit, as the agent becomes more capable. (I’m happy to be shown I’m wrong here.)
This also depends on questions such as the extent to which the underlying base model will be a maximiser, and on the agency model of the base model (which OP also notes).
If someone were to provide a convincing story that showed:
1. How this method could be used whilst counteracting deception
2. An example of how this would look from the inside of the AI
3. How the model itself doesn’t converge towards ordinary RLHF-style reasoning
4. How this would then play out inside a generally capable AI
Then I might be convinced that it is a good idea.