I don’t understand how your hypothetical chess-playing agent is supposed to work out that (1) when the model says “maximize value of pieces” but it gets rewarded for something else, that means that the model needs revising, but (2) when the model says “checkmate opponent” but it gets rewarded for something else, that means the rewards are being allocated wrongly.
You are right, it could arrive at the goal “I must play chess as badly as a human”.
Both that goal and “My goal is to checkmate the other player, but the measure that is used is fallible in these ways: it can’t see as far ahead or as well as I can” would minimise the expected error between the model’s predicted utility and the utility given by the measure.
The benefit of what I am suggesting is that the system can entertain the idea that the measure is wrong.
So it sucks a little less than a system that purely tries to optimise the measure.
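To make that concrete, here is a minimal toy sketch in Python (the hypothesis functions, the made-up reward numbers, and the squared-error scoring are all illustrative assumptions, not part of any actual system):

```python
# Two hypotheses about what the chess reward "measure" means. Both predict the
# observed rewards equally well, so prediction error alone cannot separate
# "I must play as badly as a human" from "checkmate is the goal, but the
# measure can't see far ahead".

# Made-up (situation, reward-reported-by-measure) observations.
observations = [
    ("sacrificed the queen for a forced mate in three", 0.2),   # measure dislikes lost material
    ("kept all the pieces but drifted into a lost position", 0.6),
]

def predict_play_badly(situation):
    """Hypothesis A: the real goal is to play as badly as a human."""
    return 0.2 if "sacrificed" in situation else 0.6

def predict_myopic_measure(situation):
    """Hypothesis B: the goal is checkmate, but the measure is short-sighted."""
    return 0.2 if "sacrificed" in situation else 0.6

def squared_error(predict):
    return sum((predict(s) - reward) ** 2 for s, reward in observations)

print(squared_error(predict_play_badly))      # 0.0
print(squared_error(predict_myopic_measure))  # 0.0 -- the data can't distinguish them
```

Both hypotheses score zero prediction error on this toy data even though they point toward very different behaviour, which is exactly why the system needs to be able to entertain the idea that the measure is wrong.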
You can only go so far with incentive structures alone. To improve on that you need communication. Luckily, we can communicate with any systems that we make.
We don’t have that when trying to deal with the hard problems: we can’t talk to our genes and ask them what they were thinking when they made us like freedom or dislike suffering. But it seems like a good guess that it has something to do with survival/genetic propagation.
What, then, is the system trying to do? Not purely trying to optimize the measure, OK. But what instead? I understand that it is (alternately, or even concurrently) trying to optimize goals suggested by its model and adjusting its model—but what, exactly, is it trying to adjust its model to achieve? What’s the relationship between its goals and the reward signal it’s receiving?
It feels like you’re saying “A naive measure-optimizing system will do X, which is bad; let’s make a system that does Y instead” but I don’t see how your very partial description of the system actually leads to it doing Y instead of X, and it seems possible that all the stuff that would have that consequence is in the bits you haven’t described.
What, exactly, is it trying to adjust its model to achieve?
Reduce the error between the utility predicted by the model and the utility given by a measure.
What’s the relationship between its goals and the reward signal it’s receiving?
In a real world system: Very complicated! I think for a useful system the relationship will have to be comparable to the relationship between human goals and the reward signals we receive.
I’m trying to say: “A naive measure-optimizing system will do X, which is bad; let’s explore a system that could possibly not do X.” It is a small step, but a worthwhile one I think.
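As a very rough sketch of the loop I have in mind (the `Model` class, the update rule, and all the names here are illustrative assumptions, not a worked-out design): the system acts on goals suggested by its model, and separately adjusts the model so that its predicted utility better matches the utility the measure reports.

```python
# Sketch of the two-part loop: act on goals the model suggests, then adjust the
# model to shrink the gap between its predicted utility and the utility the
# measure actually reported. The model update targets prediction error, not
# reward maximisation.

class Model:
    def __init__(self):
        self.estimate = 0.0  # stand-in for whatever the model has learned so far

    def suggest_action(self, state):
        # In a real system this would be planning toward the model's current goal.
        return "action chosen in pursuit of the model's goal"

    def predicted_utility(self, state, action):
        return self.estimate  # placeholder prediction

    def update(self, predicted, measured, learning_rate=0.1):
        # Reduce (predicted - measured); this is the model-adjustment half of the loop.
        self.estimate -= learning_rate * (predicted - measured)


def step(model, state, measure):
    action = model.suggest_action(state)
    measured_utility = measure(state, action)           # the external reward signal
    predicted = model.predicted_utility(state, action)
    model.update(predicted, measured_utility)
    return action


model = Model()
step(model, "some board position", lambda s, a: 0.5)    # toy measure for illustration
```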
If your system is (1) trying to achieve goals suggested by its model and (2) trying to reduce the difference between the model’s predictions and some measure, then it is optimizing for that measure, just in a roundabout way, and I don’t see what will make it any less subject to Goodhart’s law than another system trying to optimize for that same measure.
(That doesn’t mean I think the overall structure you describe is a bad one. In fact, I think it’s a more or less inevitable one. But I don’t see that it does what you want it to.)
If your system is (1) trying to achieve goals suggested by its model and (2) trying to reduce the difference between the model’s predictions and some measure, then it is optimizing for that measure,
It is only optimizing for that measure if the model creates goals that optimize for the measure, and the model doesn’t need to do that!
Consider a human’s choice between different snacks: a carrot, or the sweet, sweet dopamine hit of a sugary, fatty chocolate biscuit. The dopamine here is the measure. If the model can predict that eating the carrot will not feel great, but will do a better job of hitting the thing that the measure is actually pointing at, say survival/health, it might decide on the strategy of picking the carrot. It just has to correctly predict that it won’t get much positive feedback from the dopamine measure for it, and that it won’t be penalized for it.
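A toy sketch of that choice, with the caveat that every number and function name here is an illustrative assumption: the model predicts the dopamine measure accurately, but it chooses using its hypothesis about what the measure is actually pointing at, so long as it predicts no penalty for doing so.

```python
# The model predicts the dopamine "measure" honestly, but decides using its
# hypothesis about what the measure is actually pointing at (health), as long
# as it expects no penalty for ignoring the measure.

snacks = ["carrot", "chocolate biscuit"]

def predicted_dopamine(snack):
    return 0.9 if snack == "chocolate biscuit" else 0.2

def predicted_health_value(snack):
    # The model's current guess about what dopamine is a proxy for.
    return 0.8 if snack == "carrot" else 0.1

def predicted_penalty(snack):
    return 0.0  # the model expects no punishment for skipping the biscuit

def choose(options):
    return max(options, key=lambda s: predicted_health_value(s) - predicted_penalty(s))

print(choose(snacks))                # -> carrot
print(predicted_dopamine("carrot"))  # -> 0.2: the model still expects little reward, and that is fine
```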
I’m not saying that this is a silver bullet or will solve all our problems.
So are you assuming that we already know “the thing that the measure is actually pointing at”? Because it seems like that, rather than anything to do with the structure of models and measures and so forth, is what’s helpful here.
So are you assuming that we already know “the thing that the measure is actually pointing at”?
Nope, I’m assuming that you want to be able to know what the measure is actually pointing at. To do so you need an architecture that can support that type of idea. It may be wrong, but I want the chance that it will be correct.
With dopamine for sugary things, we started our lives without knowing what the measure is actually pointing at, and we managed to get to a state where we think we know what the measure is pointing at. This would have been impossible if we did not have a system capable of believing it knew better than the measure.
Edit to add: There are other ways we could be wrong about what the dopamine measure is pointing at but still be useful, things like: sweet things are of the devil, you should not eat of them; they are delicious but will destroy your immortal soul; carrots are virtuous but taste bad. This gives the same predictions and actions but is wrong. The system should be able to support this type of belief as well.
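One last toy sketch of that point (again, everything here is an illustrative assumption): two very different stories about what the dopamine measure points at can make identical predictions and lead to identical actions, so the measure alone can never tell the system which story is true, and the architecture has to be able to hold either one.

```python
# Two different stories about what the dopamine measure points at make the
# same predictions and recommend the same snack, so observing the measure
# alone cannot say which story is right.

def health_story(snack):
    # "Dopamine is a proxy for survival/health value."
    return {"carrot": 0.8, "chocolate biscuit": 0.1}[snack]

def virtue_story(snack):
    # "Sweet things are of the devil; carrots are virtuous but taste bad."
    return {"carrot": 0.8, "chocolate biscuit": 0.1}[snack]

for snack in ("carrot", "chocolate biscuit"):
    assert health_story(snack) == virtue_story(snack)

print("both stories pick:", max(("carrot", "chocolate biscuit"), key=health_story))
```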