So after talking with Stuart, I think what he means by "humans learning from the AI's actions" is that what the humans' beliefs about U converge to actually changes (for the better). I'm not sure that's really desirable, at the moment.
On a separate note, my proposal has the practical issue that this agent only views its own potential influence on u* as undesirable (and not other agents'). So I think we ultimately want a richer set of counterfactuals, including, e.g., one in which humans continue to exist indefinitely (otherwise P_Ht becomes undefined once humanity is extinct).
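To make that last point concrete, here is a minimal sketch of the kind of counterfactual conditioning I have in mind; the notation (the survival event S_{>t} and the do-operator) is mine and purely illustrative, and it assumes P_Ht denotes the humans' belief distribution over u* given the history h_t:

$$P^{\mathrm{cf}}_{H_t}(u) \;=\; \Pr\!\left(u^* = u \,\middle|\, h_t,\ \mathrm{do}(S_{>t})\right),$$

where S_{>t} is the counterfactual event that humans continue to exist after time t. Under a definition like this, P^{cf}_{H_t} stays well-defined even on trajectories where humanity goes extinct, because the conditioning is on the intervened continuation rather than the actual one.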