Person A has mental model X and tries to explain X with explanation Q
Person B doesn’t get model X from Q, thinks a bit, and then writes explanation P, reads P and thinks: P is how it should have been explained all along, and Q didn’t actually contain the insights, but P does.
Person C doesn’t get model X from P, thinks a bit, and then writes explanation R, reads R and thinks: …
It seems to me quite likely that you are person B, thinking they explained something because THEY think their explanation is very good and contains all the insights that the previous ones didn’t. Some of the evidence for this is in fact contained in your very comment:
“1. Pointing out the “reward chisels computation” point. 2. Having some people tell me it’s obvious, or already known, or that they already invented it. 3. Seeing some of the same people continue making similar mistakes (according to me)” So point 3 basically almost definitively proves that your mental model is not conveyed to those people in your post, does it not? I think a similar thing happened where that mental model was not conveyed to you from RFLO, even though we tried to convey it. (btw not saying the models that RFLO tried to explain are the same as this post, but the basic idea of this post definitely is a part of RFLO).
BTW, it could in fact be that person B’s explanation is clearer. (otoh, I think some things are less clear, e.g. you talk about “the” optimization target, which I would say is referring to that of the mesa-optimizer, without clearly assuming there is a mesa-optimizer. We stated the terms mesa- and base-optimizer to clearly make the distinction. There are a bunch of other things that I think are just imprecise, but let’s not get into it).
“Continuing (AFAICT) to correct people on (what I claim to be) mistakes around reward and optimization targets, and (for a while) was ~the only one doing so.”
I have been correcting people for a while on stuff like that (though not on LW, I’m not often on LW), such as that in the generic case we shouldn’t expect wireheading from RL agents unless the option of wireheading is in the training environment, for basically these reasons. I would also have expected people to just get this after reading RFLO, but many didn’t (others did), so your points 1/2/3 also apply to me.
“I do totally buy that you all had good implicit models of the reward-chiseling point”. I don’t think we just “implicitly” modeled it, we very explicitly understood it and it ran throughout our whole thinking about the topic. Again, explaining stuff is hard though, I’m not claiming we conveyed everything well to everyone (clearly you haven’t either).
I want to note that I just reread Utility ≠ Reward and was pleasantly surprised by its treatment, as well as the hedges. I’m making an upwards update on these points having been understood by at least some thinkers, although I’ve also made a lot of downward updates for other reasons.
There is a general phenomenon where:
Person A has mental model X and tries to explain X with explanation Q
Person B doesn’t get model X from Q, thinks a bit, and then writes explanation P, reads P and thinks: P is how it should have been explained all along, and Q didn’t actually contain the insights, but P does.
Person C doesn’t get model X from P, thinks a bit, and then writes explanation R, reads R and thinks: …
It seems to me quite likely that you are person B, thinking they explained something because THEY think their explanation is very good and contains all the insights that the previous ones didn’t. Some of the evidence for this is in fact contained in your very comment:
“1. Pointing out the “reward chisels computation” point. 2. Having some people tell me it’s obvious, or already known, or that they already invented it. 3. Seeing some of the same people continue making similar mistakes (according to me)”
So point 3 basically almost definitively proves that your mental model is not conveyed to those people in your post, does it not? I think a similar thing happened where that mental model was not conveyed to you from RFLO, even though we tried to convey it. (btw not saying the models that RFLO tried to explain are the same as this post, but the basic idea of this post definitely is a part of RFLO).
BTW, it could in fact be that person B’s explanation is clearer. (otoh, I think some things are less clear, e.g. you talk about “the” optimization target, which I would say is referring to that of the mesa-optimizer, without clearly assuming there is a mesa-optimizer. We stated the terms mesa- and base-optimizer to clearly make the distinction. There are a bunch of other things that I think are just imprecise, but let’s not get into it).
“Continuing (AFAICT) to correct people on (what I claim to be) mistakes around reward and optimization targets, and (for a while) was ~the only one doing so.”
I have been correcting people for a while on stuff like that (though not on LW, I’m not often on LW), such as that in the generic case we shouldn’t expect wireheading from RL agents unless the option of wireheading is in the training environment, for basically these reasons. I would also have expected people to just get this after reading RFLO, but many didn’t (others did), so your points 1/2/3 also apply to me.
“I do totally buy that you all had good implicit models of the reward-chiseling point”. I don’t think we just “implicitly” modeled it, we very explicitly understood it and it ran throughout our whole thinking about the topic. Again, explaining stuff is hard though, I’m not claiming we conveyed everything well to everyone (clearly you haven’t either).
I want to note that I just reread Utility ≠ Reward and was pleasantly surprised by its treatment, as well as the hedges. I’m making an upwards update on these points having been understood by at least some thinkers, although I’ve also made a lot of downward updates for other reasons.