What, exactly, is it trying to adjust its model to achieve?
Reduce the error between the utility predicted by the model and the utility given by a measure.
What’s the relationship between its goals and the reward signal it’s receiving?
In a real-world system: Very complicated! I think for a useful system, the relationship will have to be comparable to the relationship between human goals and the reward signals we receive.
I’m trying to say: “A naive measure-optimizing system will do X, which is bad; let’s explore a system that could possibly not do X.” It is a small step, but a worthwhile one, I think.
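A rough sketch of the kind of loop being described here, with everything in it (class name, options, numbers, learning rate) invented purely for illustration: part (1) is the model suggesting a goal from its own predictions, and part (2) is the update that reduces the error between the predicted utility and the utility the measure returns. In this naive form the suggested goal just ends up chasing the measure, which is the worry raised in the next reply.

```python
LEARNING_RATE = 0.1

class WorldModel:
    """Toy model: a table of predicted utility (e.g. expected dopamine) per option."""

    def __init__(self):
        self.predicted_utility = {"carrot": 0.0, "biscuit": 0.0}

    def suggest_goal(self):
        # (1) Goals come from the model's predictions, not directly from the raw measure.
        return max(self.predicted_utility, key=self.predicted_utility.get)

    def update(self, option, measured_utility):
        # (2) Reduce the error between the model's prediction and the measure.
        error = measured_utility - self.predicted_utility[option]
        self.predicted_utility[option] += LEARNING_RATE * error
        return error


def measure(option):
    # Stand-in for the external measure (the "dopamine hit"), with made-up values.
    return {"carrot": 0.2, "biscuit": 1.0}[option]


model = WorldModel()
for _ in range(20):
    for option in ("carrot", "biscuit"):   # sample both options and learn from the measure
        model.update(option, measure(option))

print(model.suggest_goal())        # -> 'biscuit': the naive loop ends up chasing the measure
print(model.predicted_utility)
```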
If your system is (1) trying to achieve goals suggested by its model and (2) trying to reduce the difference between the model’s predictions and some measure, then it is optimizing for that measure, just in a roundabout way, and I don’t see what will make it any less subject to Goodhart’s law than another system trying to optimize for that same measure.
(That doesn’t mean I think the overall structure you describe is a bad one. In fact, I think it’s a more or less inevitable one. But I don’t see that it does what you want it to.)
If your system is (1) trying to achieve goals suggested by its model and (2) trying to reduce the difference between the model’s predictions and some measure, then it is optimizing for that measure,
That’s true if and only if the model creates goals that optimize for the measure, and it doesn’t need to!
Consider a human’s choice between snacks: a carrot, or the sweet, sweet dopamine hit of a sugary, fatty chocolate biscuit. The dopamine here is the measure. If the model can predict that eating the carrot will not feel great, but will be better at hitting the thing the measure is actually pointing at, say survival/health, it might adopt the strategy of picking the carrot. It just has to correctly predict that it won’t get much positive feedback from the dopamine measure for doing so; then it won’t be penalized for it.
I’m not saying that this is a silver bullet or will solve all our problems.
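As a toy illustration of that, with all the numbers invented: the model predicts what the dopamine measure will return for each option, keeps a separate guess about what the measure is actually pointing at (health), chooses by that guess, and is not penalized, because its dopamine prediction for the carrot was accurate.

```python
# Illustrative numbers only.
predicted_dopamine = {"carrot": 0.2, "biscuit": 1.0}   # what the model expects the measure to return
believed_target = {"carrot": 0.9, "biscuit": 0.1}      # the model's guess at what the measure points at (health)
actual_dopamine = {"carrot": 0.2, "biscuit": 1.0}      # what the measure actually returns

# Choose by the believed target, not by the raw measure.
choice = max(believed_target, key=believed_target.get)
assert choice == "carrot"

# Because the model correctly predicted the carrot's low dopamine,
# the prediction error (the thing being minimized) is zero for this choice.
prediction_error = actual_dopamine[choice] - predicted_dopamine[choice]
assert prediction_error == 0.0
```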
So are you assuming that we already know “the thing that the measure is actually pointing at”? Because it seems like that, rather than anything to do with the structure of models and measures and so forth, is what’s helpful here.
So are you assuming that we already know “the thing that the measure is actually pointing at”?
Nope, I’m assuming that you want to be able to know what the measure is actually pointing at. To do that, you need an architecture that can support that type of idea. The idea may be wrong, but I want the chance that it will be correct.
With dopamine for sugary things, we started our lives without knowing what the measure is actually pointing at, yet we manage to get to a state where we think we know what it is pointing at. This would have been impossible if we did not have a system capable of believing it knows better than the measure.
Edit to add: There are other ways we could be wrong about what the dopamine measure is pointing to but still be useful, for example: “Sweet things are of the devil and you should not eat of them; they are delicious but will destroy your immortal soul. Carrots are virtuous but taste bad.” This gives the same predictions and actions but is wrong. The system should be able to support this type of thing as well.
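A toy illustration of this last point, again with invented values: two different stories about what the dopamine measure is pointing at can rank the options identically, so they yield the same predictions and the same choice, even though at most one of them is right.

```python
predicted_dopamine = {"carrot": 0.2, "biscuit": 1.0}   # identical predictions under both stories

stories = {
    "health": {"carrot": 0.9, "biscuit": 0.1},
    "immortal_soul": {"carrot": 0.9, "biscuit": 0.1},  # wrong story, same ranking
}

for name, values in stories.items():
    choice = max(values, key=values.get)
    print(f"{name}: picks the {choice}")   # both stories pick the carrot
```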