Like Koen, I’m here to give more detailed feedback on the post, as Victoria Krakovna asked for at WebTaisu.
About the LEGO example: it’s obvious on a second reading of the sentence, but it took me some time to understand that “the bottom” meant the concave face (sort of), not whichever face was at the lowest height. I also wondered why the robot didn’t put the brick upside down ON the other one, which would have maximized the height of the bottom. Is it because that was too costly compared to the local extremum of simply turning the brick over?
I like the two perspectives on specification gaming. One way I like to put it is that “we don’t want to tell the agent how to do their task, but we still want them to accomplish it correctly”.
For the coasters example, I think it would be clearer if the example were explained before the mention of potentials. I would also have liked a sentence or two explaining the potential part.
Lastly, I feel like the transition between the simulator bugs part and the reward tampering part is rough.
With all that being said, I still enjoyed the post, and I think it accomplishes its goal, without any specification gaming. ;)
Thanks Adam for the feedback—glad you enjoyed the post!
For the Lego example, the agent received a fixed shaping reward for grasping the red brick if the bottom face was above a certain height (3cm), rather than being rewarded in proportion to the height of the bottom face. Thus, it found an easy way to collect the shaping reward by flipping the brick, while stacking it upside down on the blue brick would be a more difficult way to get the same shaping reward. The current description of the example in the post does make it sound like the reward is proportional to the height—I’ll make a note to fix this in a later version of the post.
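To make the difference concrete, here is a minimal sketch of the two reward shapes. Only the 3cm threshold comes from the description above. The function names, the bonus and scale values, and the example heights are all hypothetical, and the grasp condition is left out for simplicity, so this is not the reward function actually used in the experiment.

```python
# Illustrative sketch only: the threshold comes from the comment above;
# every other name and number here is a made-up assumption.
HEIGHT_THRESHOLD_M = 0.03  # 3 cm


def threshold_reward(bottom_face_height_m: float, bonus: float = 1.0) -> float:
    """Fixed bonus once the bottom face is above the threshold."""
    return bonus if bottom_face_height_m > HEIGHT_THRESHOLD_M else 0.0


def proportional_reward(bottom_face_height_m: float, scale: float = 1.0) -> float:
    """Reward that keeps growing with the height of the bottom face."""
    return scale * bottom_face_height_m


# Hypothetical heights (in meters) of the red brick's bottom face.
outcomes = {
    "left on table": 0.0,
    "flipped on table": 0.04,
    "stacked upside down on blue brick": 0.08,
}

for name, height in outcomes.items():
    print(
        f"{name}: threshold={threshold_reward(height)}, "
        f"proportional={proportional_reward(height)}"
    )
```

Under the threshold version, flipping and stacking upside down earn the same bonus, so the easier behavior wins. Under the proportional version, stacking upside down would score higher than flipping.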
Ok, that makes much more sense. I was indeed assuming a proportional reward.