It seems to me that every O-maximizer can be expressed as a reward maximizer. Specifically, comparing equations (2) and (3), given an O-maximizer we can define the reward

$$r_m = \sum_{r \in R} U(r)\, P(r \mid yx_{\le m})$$

and $r_i = 0$ for $i < m$, where the paper sets $m$ to the final time step, following Nick Hay. The reward maximizer so defined will behave identically to the O-maximizer.
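To spell out the intended equivalence, here is a sketch only, using the notation above and assuming, per equation (2), that the reward maximizer ranks action sequences by the expected sum of future rewards:

$$\mathbb{E}\Big[\sum_{i \le m} r_i\Big] = \mathbb{E}\big[r_m\big] = \sum_{yx_{\le m}} P(yx_{\le m}) \sum_{r \in R} U(r)\, P(r \mid yx_{\le m}) = \sum_{r \in R} U(r)\, P(r),$$

where expectations and probabilities are taken over interaction histories given the agent's candidate actions. The final expression is the O-maximizer's expected utility, so, granting that such an $r_m$ could actually be delivered as part of the observations, the two agents would rank actions identically.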
In the reward-maximization framework, rewards are part of observations and come from the environment. You cannot simply define $r_m$ to equal a mathematical expression of your choosing and then call the result a reward maximizer; therefore, Hibbard's formulation of an O-maximizer as a reward maximizer doesn't work.
If this is correct, doesn't the “characteristic behavior pattern” shown for reward maximizers in Appendix B (as stated in Section 3.1) also apply to O-maximizers?
Since the construction was incorrect, this argument does not hold.
You could, in theory, train a Solomonoff Induction-based reinforcement learning agent to produce arbitrary finite sequences of actions (non-self-destructive ones, anyway) in response to specified sets of finite sense data, assuming you are allowed to program its reward function and give it fake memories dating from before it was born.
Basically, Solomonoff Induction is a powerful learning mechanism, and with sufficient time and test cases you could configure an agent based on it to behave in an arbitrary way[*] in response to any finite sense-stream after its “birth”, by giving it sufficient pre-birth training “memories” which laboriously say: “if you see this, do this, and don’t do this or this or this”, for every possible bunch of observations, up to some finite length limit.
I call this sort of thing universal action—and I think reinforcement learning systems are capable of it.
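To make that concrete, here is a toy sketch of the “universal action” idea. It is not a Solomonoff inductor and nothing in it comes from the paper; the names (`ObsPrefix`, `fake_memories`, `UniversalActor`) and the example observations are purely illustrative. The point is just that an arbitrary finite behavior can be written down as a lookup table from observation histories to prescribed actions, and an agent that has fully absorbed that table as “memories” simply replays it.

```python
# Toy illustration of "universal action": a finite policy specified
# exhaustively as (observation-history -> action) training "memories".
# This is NOT Solomonoff induction; it only shows that an arbitrary
# finite behavior is a lookup table, and that an agent which has fully
# learned those "memories" just replays it.

from typing import Dict, Tuple

ObsPrefix = Tuple[str, ...]  # finite observation history since "birth"

# Hypothetical pre-birth "memories": for every observation history up to
# some finite length, the action the designer wants the agent to take.
fake_memories: Dict[ObsPrefix, str] = {
    (): "wait",
    ("light",): "approach",
    ("light", "food"): "eat",
    ("dark",): "retreat",
}

class UniversalActor:
    """Agent whose learned model is exactly the lookup table above."""

    def __init__(self, memories: Dict[ObsPrefix, str], default: str = "wait"):
        self.memories = memories
        self.default = default          # fallback outside the finite table
        self.history: ObsPrefix = ()

    def observe(self, observation: str) -> None:
        # Append the latest observation to the post-"birth" history.
        self.history = self.history + (observation,)

    def act(self) -> str:
        # Behave as dictated for the current observation history.
        return self.memories.get(self.history, self.default)

agent = UniversalActor(fake_memories)
agent.observe("light")
print(agent.act())  # -> "approach"
agent.observe("food")
print(agent.act())  # -> "eat"
```

An actual induction-based learner would have to infer this table from the pre-birth “memories” rather than being handed it directly, but for any finite length limit the target behavior is the same finite object.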
Response to Bill Hibbard: as quoted above, rewards in the reward-maximization framework come from the environment, so the construction is incorrect and the argument does not hold.
My way of putting much the same idea was the “universal action” argument quoted above.
Bill responds here. It is pretty much what I expected him to say.