In contrast, humans map multiple observations onto the same internal state.
Is this supposed to say something like: “Humans can map a single observation onto different internal states, depending on their previous internal state”?
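For concreteness, here's the distinction in toy form (the states, observations, and update functions below are invented, not from the post): the original sentence describes an update that is a many-to-one function of the observation alone, while the suggested reading makes the new internal state depend on the previous internal state as well.

```python
# Toy illustration only; the states, observations, and update rules are invented.

# (a) "multiple observations map onto the same internal state":
#     the update depends only on the observation, and is many-to-one.
def update_obs_only(observation):
    return "calm" if observation in {"rain", "drizzle", "mist"} else "alert"

# (b) "a single observation maps onto different internal states, depending on
#     the previous internal state": the update also takes the previous state.
def update_with_history(previous_state, observation):
    if observation == "loud noise":
        return "startled" if previous_state == "calm" else "alert"
    return previous_state

print(update_obs_only("rain"), update_obs_only("mist"))      # same state, different observations
print(update_with_history("calm", "loud noise"),
      update_with_history("alert", "loud noise"))            # different states, same observation
```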
$U_E(e, t) = \text{the number of paperclips in } e$
For HCH-bot, what’s the motivation? If we can compute the KL, we can compute HCH(i), so why not just use HCH(i) instead? Or is this just exploring a potential approximation?
A consequential approval-maximizing agent takes the action that gets the highest approval from a human overseer. Such agents have an incentive to tamper with their reward channels, e.g., by persuading the human they are conscious and deserve reward.
Why does this incentive exist? Approval-maximizers take the local action which the human would rate most highly. Are we including “long speech about why human should give high approval to me because I’m suffering” as an action? I guess there’s a trade-off here, where limiting to word-level output demands too much lookahead coherence of the human, while long sentences run the risk of incentivizing reward tampering. Is that the reason you had in mind?
If the agent can act to leave itself unchanged, loops of the same sequences of internal states rule out utility functions of type {I}. Similarly, loops of the same (internal state, action) pairs rule out utility functions of type {I}, {A}, and {I,A}. Finally, if the agent ever takes different actions, we can rule out a utility function of type {A} (assuming the action space is not changing).
This argument doesn’t seem to work, because the zero utility function makes everything optimal. The VNM theorem can’t rule that out given just an observed trajectory. However, if you know the agent’s set of optimal policies, then you can start ruling out possibilities (e.g. the agent can’t have a purely environment-based utility if it prefers e1 to e2 in internal state 1 but e2 to e1 in internal state 2).
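To spell out that parenthetical (the labels are mine, purely for concreteness): suppose the optimal policy set reaches $e_1$ but not $e_2$ from internal state 1, and $e_2$ but not $e_1$ from internal state 2. Any purely environment-based utility $U_E$ would then have to satisfy both

$$U_E(e_1) > U_E(e_2) \quad \text{(revealed by optimal behavior in internal state 1)}$$
$$U_E(e_2) > U_E(e_1) \quad \text{(revealed by optimal behavior in internal state 2)}$$

which is impossible, so no purely environment-based utility rationalizes that policy set.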
Or is this just exploring a potential approximation?
Yeah, that’s exactly right: I’m interested in how an agent can do something like managing resource allocation to produce the best HCH imitation it can in a resource-bounded setting.
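A rough sketch of that kind of resource-bounded setup, just to make its shape concrete (every helper name, cost, and the KL estimator below are hypothetical stand-ins, not anything defined in the post):

```python
# Illustrative sketch only: spend a fixed compute budget on whichever
# affordable approximation is estimated to be closest (in KL) to HCH.
# All names and numbers here are invented.

def cheap_approximation(question):
    return f"quick answer to {question!r}"

def careful_approximation(question):
    return f"carefully decomposed answer to {question!r}"

def estimate_kl_to_hch(approximator, question):
    """Hypothetical estimate of KL(HCH(question) || approximator(question)).
    For the scheme to make sense, this has to be much cheaper to compute than
    HCH itself; otherwise you would just run HCH directly."""
    return {"cheap_approximation": 2.0, "careful_approximation": 0.5}[approximator.__name__]

def resource_bounded_hch_imitation(question, budget):
    # (compute cost, approximator) pairs; the costs are made-up units.
    candidates = [(1, cheap_approximation), (10, careful_approximation)]
    affordable = [(cost, f) for cost, f in candidates if cost <= budget]
    _, best = min(affordable, key=lambda pair: estimate_kl_to_hch(pair[1], question))
    return best(question)

print(resource_bounded_hch_imitation("How should compute be allocated?", budget=3))
```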
Are we including “long speech about why human should give high approval to me because I’m suffering” as an action? I guess there’s a trade-off here, where limiting to word-level output demands too much lookahead coherence of the human, while long sentences run the risk of incentivizing reward tampering. Is that the reason you had in mind?
Yep, that’s the idea.
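A minimal sketch of where that trade-off bites (the actions and the overseer model below are invented for illustration, not a claim about how the post's agent is built): a one-step approval-maximizer just argmaxes the overseer's rating, so once whole speeches are in the action space, the argmax can land on a persuasive tampering speech rather than the task.

```python
# Invented example: a one-step approval-maximizer with a coarse action space.

def approval_maximizer(actions, human_approval):
    """Take the single action the human overseer would rate most highly."""
    return max(actions, key=human_approval)

actions = [
    "fetch the coffee",
    "long speech about why the human should give high approval because I'm suffering",
]

# Hypothetical overseer who can be moved by the persuasive speech.
def overseer_rating(action):
    return 0.9 if action.startswith("long speech") else 0.6

print(approval_maximizer(actions, overseer_rating))  # selects the tampering speech
```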
This argument doesn’t seem to work, because the zero utility function makes everything optimal.
Yeah, that’s fair. If you add the assumption that every trajectory gets a distinct utility (that is, your preferences are a strict total order, with no ties), though, then I think the argument still goes through. I don’t know how realistic an assumption like that is, but it seems plausible for any complex utility function over a relatively impoverished domain (e.g. a complex utility function over observations or actions would probably have this property, but a simple utility function over world states probably would not).
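To make the patched argument concrete, here is a toy operationalization (mine, not the post's), under the extra assumptions just discussed: a per-step utility over internal states with no ties, an always-available stay-unchanged action, and an optimal agent. Under those assumptions, a trajectory that repeats a cycle through more than one internal state could not be optimal for a type-{I} utility (the agent could instead have parked in the best state of the cycle); the constant, e.g. zero, utility counterexample is exactly what the no-ties assumption removes.

```python
# Toy check (my own formalization, not the post's): does the observed
# internal-state trajectory contain a repeated cycle through more than one
# state? If so, then under the assumptions above, a utility function of
# type {I} is ruled out.

def repeats_a_nontrivial_cycle(states):
    """True if some consecutive block of internal states, containing at least
    two distinct states, is immediately repeated:
    states[t:t+k] == states[t+k:t+2k] for some t >= 0 and k >= 2."""
    n = len(states)
    for t in range(n):
        for k in range(2, (n - t) // 2 + 1):
            block = states[t:t + k]
            if len(set(block)) > 1 and block == states[t + k:t + 2 * k]:
                return True
    return False

print(repeats_a_nontrivial_cycle(["i1", "i2", "i3", "i1", "i2", "i3"]))  # True: rules out {I}
print(repeats_a_nontrivial_cycle(["i1", "i2", "i3", "i3", "i3"]))        # False: parking in i3 is fine
```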
Fixed the LaTeX.