>If you already know how the human is going to answer, then it’s not high value of information to ask.
That’s the entire problem: if “ask a human” is programmed as an endorsement of this being the right path to take, rather than as a genuine need for information.
>If the scorekeeper operates by conservation of expected evidence, it can never believe any sequence of questions will modify the score of any particular scenario on average.
That’s precisely my definition of “unriggable” learning processes, in the next post: https://www.lesswrong.com/posts/upLot6eG8cbXdKiFS/reward-function-learning-the-learning-process
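The quoted conservation-of-expected-evidence property can be sketched numerically. This is an illustrative toy (the world states, question, and likelihoods are all made up, not from either post): for a Bayesian scorekeeper, the posterior averaged over possible answers equals the prior, so no question can shift the expected score.

```python
# Toy illustration of conservation of expected evidence (assumed setup,
# not from the linked posts): two hypothetical world states and one
# yes/no question.
prior = {"w1": 0.7, "w2": 0.3}

# Hypothetical likelihoods P(answer | world state).
likelihood = {
    "yes": {"w1": 0.9, "w2": 0.2},
    "no":  {"w1": 0.1, "w2": 0.8},
}

def p_answer(ans):
    """Marginal probability of hearing this answer."""
    return sum(likelihood[ans][w] * prior[w] for w in prior)

def posterior(ans, w):
    """Bayes update on the world state after hearing the answer."""
    return likelihood[ans][w] * prior[w] / p_answer(ans)

# Posterior for each world state, averaged over possible answers:
expected = {
    w: sum(p_answer(a) * posterior(a, w) for a in likelihood)
    for w in prior
}
print(expected)  # equals the prior: asking changes nothing on average
```

Whatever likelihoods you plug in, the average comes back to the prior, which is why a scorekeeper with this property can never believe a sequence of questions will move the score on average.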
That’s a link to this post, right? ;)
Oops, yes! Sorry, for some reason I thought this was the post on the value function.