Sorry for the formatting here; I am still trying to figure out how to use markdown in comments.
I have difficulty understanding the inferences about the parameters $ \alpha,\beta,\gamma $ in the “Example:regret” part.
Take the fully rational planner p, for example.
Since the human will say h following s, the difference between the rewards for h and $ \sim h $ is non-negative, which implies that: $ (\beta R(h)+\gamma R(h|s)) - (\beta R(\sim h)+\gamma R(\sim h|s)) \geq 0 $
It is then concluded that $ \beta R(h-\sim h)+\gamma R(h-\sim h|s)\geq 0 $ (writing $ R(h-\sim h) $ for $ R(h)-R(\sim h) $).
Similarly, since the human will say $ \sim h $ following i, we have $ \beta R(h-\sim h)+\delta R(h-\sim h|i)\leq 0 $
It seems that more information about the reward function is needed in order to arrive at the final model with
$ (p,R(\alpha,\beta,\gamma,\delta)|\gamma\geq-\beta\geq\delta) $
I saw the R’s as normalised to 1 or 0, and the coefficients as giving them weights. So instead of βR(h−∼h)+γR(h−∼h|s)≥0, I’d write β+γ≥0 (given the behaviour and assumptions).
But this is an old post, and it has mostly been superseded by newer ones, so I wouldn’t spend too much time on it.
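To spell out the normalisation step, here is a sketch. It assumes each of the differences $ R(h-\sim h) $, $ R(h-\sim h|s) $ and $ R(h-\sim h|i) $ is normalised to 1; that is an assumption, not something stated in the post:

$ \beta\cdot 1+\gamma\cdot 1\geq 0 \;\Rightarrow\; \gamma\geq-\beta $

$ \beta\cdot 1+\delta\cdot 1\leq 0 \;\Rightarrow\; -\beta\geq\delta $

Chaining the two gives $ \gamma\geq-\beta\geq\delta $, which matches the constraint in the final model.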