I’m sorry, but I don’t follow the explanation of the CoinRun example. I claim that the “reward as incentivization” framing still “explains” the behaviour in this case. As an analogy, go back to training a dog with biscuits: say you write the numbers 1 to 10 on the floor. You ask the dog a simple calculus question (whose answer is between 1 and 10), and each time it puts its paw on the right number it gets a biscuit. Now suppose that, during training, the answer to every question happens to be 6. Would you claim that you taught the dog to answer simple calculus questions, or rather that you taught it to put its paw on 6 whenever you ask it a calculus question? If the latter, then I don’t see why the “reward as incentivization” framing, applied to CoinRun, has to conclude that the model “wants to get the coin”.
Strong agree and upvote. The issue is simply that the training setup did not uniquely pin down the objective the designer intended, and that is independent of whether you view training as incentivisation or as selection.
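To make the dog analogy concrete, here is a minimal toy sketch. Everything in it is an assumption for illustration: the questions are stand-in addition problems, and the names `policy_solve`, `policy_always_six`, and `total_reward` are hypothetical. It only shows that when every training answer is 6, the reward signal cannot tell the two behaviours apart.

```python
# Toy illustration (hypothetical setup, not taken from the CoinRun experiments).
# Each question is a pair (a, b); the intended task is to answer a + b.
train_questions = [(1, 5), (2, 4), (3, 3), (4, 2), (5, 1)]  # every answer is 6
test_questions = [(1, 1), (2, 5), (4, 4)]                   # answers are 2, 7, 8


def policy_solve(a, b):
    """The behaviour the trainer intended: actually answer the question."""
    return a + b


def policy_always_six(a, b):
    """The shortcut: put the paw on 6 whatever the question is."""
    return 6


def total_reward(policy, questions):
    """One biscuit for each question where the policy picks the right number."""
    return sum(1 for a, b in questions if policy(a, b) == a + b)


train_solve = total_reward(policy_solve, train_questions)
train_six = total_reward(policy_always_six, train_questions)
test_solve = total_reward(policy_solve, test_questions)
test_six = total_reward(policy_always_six, test_questions)

print("train:", train_solve, train_six)
print("test: ", test_solve, test_six)
```

Both policies earn the same reward on the training questions and only come apart on questions whose answer is not 6, which is the sense in which training underdetermines the objective regardless of which framing you prefer.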