Apologies, I didn’t take the time to understand all of this yet, but I have a basic question you might have an answer to...
We know how to map (deterministic) policies to reward functions using the construction at the bottom of page 6 of the reward modelling agenda (https://arxiv.org/abs/1811.07871v1): the agent is rewarded only if it has so far done exactly what the policy would do. I think of this as a wrapper function (https://en.wikipedia.org/wiki/Wrapper_function).
It seems like this means that, for any policy, we can represent it as optimizing reward with only the minimal overhead in description/computational complexity of the wrapper.
So...
Do you think this analysis is correct? Or what is it missing? (Maybe the assumption that the policy is deterministic is significant? I think this turns out to be the case for Orseau et al.’s “Agents and Devices” approach: https://arxiv.org/abs/1805.12387.)
Are you trying to get around this somehow? Or are you fine with this minimal overhead being used to distinguish goal-directed from non-goal-directed policies?
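To make the wrapper construction in the question concrete, here is a minimal Python sketch. It assumes a history is a list of (observation, action) pairs and that a deterministic policy is a function of the past history and the current observation. The names `wrap_policy_as_reward` and `parity_policy` are hypothetical and chosen only for illustration; this is a toy rendering of the idea, not the construction as written in the cited paper.

```python
from typing import Callable, Hashable, Sequence, Tuple

Observation = Hashable
Action = Hashable
# Assumption: a history is a sequence of (observation, action) pairs.
History = Sequence[Tuple[Observation, Action]]
# Assumption: a deterministic policy maps (past history, current observation) to an action.
Policy = Callable[[History, Observation], Action]


def wrap_policy_as_reward(policy: Policy) -> Callable[[History], float]:
    """Return a reward function that pays 1.0 only while the agent has
    so far done exactly what `policy` would have done, and 0.0 otherwise."""

    def reward(history: History) -> float:
        past = []
        for obs, action in history:
            # Any deviation from the wrapped policy forfeits the reward.
            if policy(tuple(past), obs) != action:
                return 0.0
            past.append((obs, action))
        return 1.0

    return reward


def parity_policy(past: History, obs: int) -> int:
    # Hypothetical toy policy: act 1 on even observations, 0 on odd ones.
    return 1 if obs % 2 == 0 else 0


reward = wrap_policy_as_reward(parity_policy)
on_policy = [(0, 1), (3, 0), (4, 1)]   # follows parity_policy at every step
off_policy = [(0, 1), (3, 1), (4, 1)]  # deviates at the second step

print(reward(on_policy))   # 1.0
print(reward(off_policy))  # 0.0
```

The wrapper itself is a few lines that do not depend on which policy is wrapped, which is the sense in which its overhead is a small constant: essentially all of the description length of the resulting reward function comes from the policy it contains.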
My framework discards such contrived reward functions because it penalizes the complexity of the reward function. In the construction you describe, we have C(U)≈C(π). This corresponds to g≈0 (no/low intelligence). On the other hand, policies with g≫0 (high intelligence) have the property that C(π)≫C(U) for the U which “justifies” this g. In other words, your “minimal” overhead is very large from my point of view: to be acceptable, the “overhead” should be substantially negative.
I think the construction gives us $C(\pi) \leq C(U) + e$ for a small constant $e$ (representing the wrapper). It seems like any compression you can apply to the reward function can be translated to the policy via the wrapper. So then you would never have $C(\pi) \gg C(U)$. What am I missing/misunderstanding?
For the contrived reward function you suggested, we would never have C(π)≫C(U). But for other reward functions, it is possible that C(π)≫C(U). This is exactly why the framework rejects the contrived reward function in favor of those other reward functions. It is also why the framework considers some policies unintelligent (despite the availability of the contrived reward function) and other policies intelligent.
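The two positions in this exchange can be put side by side as inequalities. This is only a sketch of the argument as stated in the thread. Here $U_\pi$ is notation introduced for the wrapper reward built from $\pi$, and $e$, $e'$ are assumed small constants for the cost of wrapping and unwrapping:

$$
C(U_\pi) \leq C(\pi) + e, \qquad C(\pi) \leq C(U_\pi) + e' \quad\Longrightarrow\quad -e \leq C(\pi) - C(U_\pi) \leq e'
$$

So for the wrapper reward the gap $C(\pi) - C(U_\pi)$ is pinned near zero, which is the $C(U) \approx C(\pi)$, $g \approx 0$ case from the reply. A policy counts as intelligent only if some other reward function $U$ gives

$$
C(\pi) - C(U) \gg 0,
$$

and the bound above shows that the wrapper reward can never be that $U$.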