Or is this just exploring a potential approximation?
Yeah, that’s exactly right—I’m interested in how an agent can do something like manage resource allocation to do the best HCH imitation in a resource-bounded setting.
Are we including “long speech about why the human should give high approval to me because I’m suffering” as an action? I guess there’s a trade-off here, where limiting the agent to word-level output demands too much lookahead coherence of the human, while allowing long sentences runs the risk of incentivizing reward tampering. Is that the reason you had in mind?
Yep, that’s the idea.
This argument doesn’t seem to work, because the zero utility function makes everything optimal.
Yeah, that’s fair—if you add the assumption that no two trajectories have the same utility (that is, your preferences are a strict total ordering over trajectories), though, then I think the argument still goes through. I don’t know how realistic an assumption like that is, but it seems plausible for any complex utility function over a relatively impoverished domain (e.g. a complex utility function over observations or actions would probably have this property, but a simple utility function over world states probably would not).
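To spell out why that assumption rules out the degenerate case, here is a minimal sketch in my own (hypothetical) notation, with T the set of trajectories and U a utility function over them:

```latex
% Degenerate case: the zero utility function makes every trajectory optimal,
% so maximizing it says nothing about the agent's behavior.
\forall \tau \in T:\; U_0(\tau) = 0
  \;\Longrightarrow\;
\arg\max_{\tau \in T} U_0(\tau) = T.

% Repaired case: if no two trajectories share a utility (U is injective),
% the induced preference
%   \tau_1 \succ \tau_2 \iff U(\tau_1) > U(\tau_2)
% is a strict total order, so over any finite set of trajectories the
% maximizer is unique and the "everything is optimal" case cannot occur.
U \text{ injective on } T,\; |T| < \infty
  \;\Longrightarrow\;
\bigl|\arg\max_{\tau \in T} U(\tau)\bigr| = 1.
```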