So then your process says to adjust the model (it’s a bit unclear how, but let’s say it’s something like “make it more accurate until it accurately predicts the bad consequences we encountered”) and try again. OK, fine, but this isn’t changing X, and it won’t help at all if we are facing the Hard Problem and X doesn’t correctly capture what we really care about.
Sorry I did not see this earlier. My notifications aren’t working.
I think the Model can include X, if you allow the Model to include questions about what goal you should be following.
Let us take a problem whose end goal we know, put an agent in the position of not knowing that end goal and receiving only imperfect feedback, and see how the system I describe would work on it.
So let’s say we have a chess agent that gets rewarded for individual actions based upon what a human judge thinks of each action. The real goal is to win the chess game, not to maximise the reward, but the agent doesn’t know that.
So losing a queen might be a bad action in one context, but a good one if it allows checkmate sooner.
Let’s say the first Model it has is: “My goal is to not lose chess pieces”.
It compares what the Model predicts about the measure with what the measure actually returns; if they are out of whack, it updates the Model.
The first discrepancy it finds is when it gets rewarded well after accidentally making a strategically good sacrifice of a pawn, so the next Model is:
“My goal is to not lose chess pieces, unless sacrificing a low-value piece saves a high-value piece”.
The next update happens when it accidentally sacrifices a high-value piece to save a strategically valuable pawn. It then finds enlightenment:
“My goal is to Checkmate the other player”.
However, this doesn’t allow it to predict the values of the measure perfectly: the human judge is fallible and might assign bad utility to a specific move.
So it needs to refine its Model to:
“My goal is to Checkmate the other player, but the measure that is used is fallible in these ways: it can’t see as far ahead or as well as I can”.
Importantly, this means it would still have the goal to checkmate the other player.
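Here is a minimal sketch of the loop I have in mind, in Python. All of the names, the candidate models, and the error threshold are my own invention for illustration; it only captures the compare-and-update step, not a full agent.

```python
# Sketch only: an agent that treats its goal as a revisable hypothesis.
# The class names, candidate models and threshold are hypothetical.

class GoalModel:
    def __init__(self, description, predict_reward):
        self.description = description        # e.g. "do not lose chess pieces"
        self.predict_reward = predict_reward  # move -> predicted judge reward

def mean_squared_error(model, history):
    """history is a list of (move, reward_from_judge) pairs."""
    return sum((model.predict_reward(move) - reward) ** 2
               for move, reward in history) / len(history)

def update_model(current, candidates, history, tolerance=0.1):
    # Keep the current goal-model while it predicts the measure well enough.
    if mean_squared_error(current, history) <= tolerance:
        return current
    # Otherwise adopt the candidate that best explains the rewards seen so far,
    # e.g. moving from "do not lose pieces" towards "checkmate the opponent,
    # as scored by a short-sighted human judge".
    return min(candidates, key=lambda m: mean_squared_error(m, history))
```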
This can be seen as arguing in a similar vein to the arguments for goal uncertainty.
I don’t understand how your hypothetical chess-playing agent is supposed to work out that (1) when the model says “maximize value of pieces” but it gets rewarded for something else, that means that the model needs revising, but (2) when the model says “checkmate opponent” but it gets rewarded for something else, that means the rewards are being allocated wrongly.
You are right, it could arrive at the goal: “I must play chess as badly as a human”.
Both that goal and “My goal is to Checkmate the other player, but the measure that is used is fallible in these ways: it can’t see as far ahead or as well as I can” would minimise the expected error of the model-predicted utility vs the utility given by the measure.
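To make that concrete with toy numbers (invented here, not part of the chess story): if two goal-models happen to predict the judge’s rewards equally well, the prediction-error signal alone cannot tell them apart.

```python
# Toy numbers, invented for illustration: rewards from a fallible human judge,
# and the predictions of two candidate goal-models that agree on every move.
judge_reward    = [0.9, 0.2, 0.8, 0.1]
play_like_human = [0.9, 0.2, 0.8, 0.1]   # "play chess as badly as a human"
checkmate_model = [0.9, 0.2, 0.8, 0.1]   # "checkmate, but the judge is short-sighted"

def mean_squared_error(predicted, actual):
    return sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(predicted)

# Identical error, so the measure alone cannot choose between the two goals.
print(mean_squared_error(play_like_human, judge_reward))   # 0.0
print(mean_squared_error(checkmate_model, judge_reward))   # 0.0
```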
The benefit of what I am suggesting is that the system can entertain the idea that the measure is wrong.
So it sucks a little less than one that purely tries to optimise the measure.
You can only go so far with incentive structures alone. To improve on that you need communication. Luckily, we can communicate with any systems that we make.
We don’t have that when trying to deal with the hard problems: we can’t talk to our genes and ask them what they were thinking when they made us like freedom or dislike suffering. But it seems like a good guess that it has something to do with survival/genetic propagation.
What, then, is the system trying to do? Not purely trying to optimize the measure, OK. But what instead? I understand that it is (alternately, or even concurrently) trying to optimize goals suggested by its model and adjusting its model—but what, exactly, is it trying to adjust its model to achieve? What’s the relationship between its goals and the reward signal it’s receiving?
It feels like you’re saying “A naive measure-optimizing system will do X, which is bad; let’s make a system that does Y instead” but I don’t see how your very partial description of the system actually leads to it doing Y instead of X, and it seems possible that all the stuff that would have that consequence is in the bits you haven’t described.
What, exactly, is it trying to adjust its model to achieve?
Reduce the error between the utility predicted by the model and the utility given by a measure.
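Roughly, as a formula (my notation, and squared error is just one concrete choice, not something specified above): the adjustment step is trying to pick the model $M$ with the smallest

$$\mathrm{error}(M) = \mathbb{E}\big[\,(U_M(a) - U_{\text{measure}}(a))^2\,\big],$$

where $U_M(a)$ is the utility the model predicts for action $a$ and $U_{\text{measure}}(a)$ is the utility the measure actually hands back.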
What’s the relationship between its goals and the reward signal it’s receiving?
In a real-world system: very complicated! I think for a useful system the relationship will have to be comparable to the relationship between human goals and the reward signals we receive.
I’m trying to say: “A naive measure-optimizing system will do X, which is bad; let’s explore a system that could possibly not do X”. It is a small step, but a worthwhile one, I think.
If your system is (1) trying to achieve goals suggested by its model and (2) trying to reduce the difference between the model’s predictions and some measure, then it is optimizing for that measure, just in a roundabout way, and I don’t see what will make it any less subject to Goodhart’s law than another system trying to optimize for that same measure.
(That doesn’t mean I think the overall structure you describe is a bad one. In fact, I think it’s a more or less inevitable one. But I don’t see that it does what you want it to.)
If your system is (1) trying to achieve goals suggested by its model and (2) trying to reduce the difference between the model’s predictions and some measure, then it is optimizing for that measure,
Only if the model creates goals that optimize for the measure, and it doesn’t need to!
Consider the human’s choice between different snacks: a carrot, or the sweet sweet dopamine hit of a sugary, fatty chocolate biscuit. The dopamine here is the measure. If the model can predict that eating the carrot will not feel great, but will be better at hitting the thing that the measure is actually pointing at, say survival/health, it might decide to have the strategy of picking the carrot. It just has to correctly predict that it won’t get much positive feedback from the dopamine measure for doing so, and that it won’t be penalized for it.
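A toy sketch of that choice (all of the numbers and names are invented for illustration): the agent scores options by its model of what the measure is pointing at, while checking that the low-dopamine choice is not expected to be penalised.

```python
# Toy illustration: the agent acts on its model of what the dopamine measure is
# pointing at (health), not on raw predicted dopamine, provided the model
# predicts no penalty for the low-dopamine choice. All values are made up.
snacks = {
    "chocolate biscuit": {"predicted_dopamine": 0.9, "inferred_health_value": 0.2},
    "carrot":            {"predicted_dopamine": 0.2, "inferred_health_value": 0.8},
}

def choose(options):
    # Rule out anything the measure is expected to actively punish...
    acceptable = {name: v for name, v in options.items()
                  if v["predicted_dopamine"] >= 0.0}
    # ...then pick by the inferred goal rather than by predicted dopamine.
    return max(acceptable, key=lambda name: acceptable[name]["inferred_health_value"])

print(choose(snacks))  # -> "carrot"
```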
I’m not saying that this is a silver bullet or will solve all our problems.
So are you assuming that we already know “the thing that the measure is actually pointing at”? Because it seems like that, rather than anything to do with the structure of models and measures and so forth, is what’s helpful here.
So are you assuming that we already know “the thing that the measure is actually pointing at”?
Nope, I’m assuming that you want to be able to know what the measure is actually pointing at. To do so you need an architecture that can support that type of idea. It may be wrong, but I want the chance that it will be correct.
With dopamine for sugary things, we started our lives without knowing what the measure is actually pointing at, and we managed to get to a state where we think we know what it is pointing at. This would have been impossible if we did not have a system capable of believing it knew better than the measure.
Edit to add: There are other ways of being wrong about what the dopamine measure is pointing to that would still be useful, things like: sweet things are of the devil and you should not eat of them, for they are delicious but will destroy your immortal soul; carrots are virtuous but taste bad. This gives the same predictions and actions but is wrong. The system should be able to support this type of thing as well.