I’m going to argue that ambitious value learning is difficult and probably not what we should be aiming for. (Or rather, I’m going to add to this sequence posts that other people wrote, which argue for that claim or weaker versions of it.)
A conversation that just went down in my head:
Me: “You observe that a bunch of attempts to write down what we want get Goodharted, and so you suggest writing down what we want using data. This seems like it will have all the same problems.”
Straw You: “The reason you fail is that you can’t specify what we really want, because value is complex. Trying to write down human values is qualitatively different from trying to write down human values using a pointer to all the data that happened in the past. That pointer cheats the argument from complexity, since it lets us fit lots of data into a simple instruction.”
Me: “But the instruction is not simple! Pointing at what the “human” is is hard. Dealing with the fact that the human is inconsistent with itself gives more degrees of freedom. If you just look at the human’s actions, and don’t look inside the brain, there are many, many goals consistent with the actions you see. If you do look inside the brain, you need to know how to interpret that data. None of these are objective facts about the universe that you can just learn. You have to specify them, or specify a way to specify them, and when you do that, you do it wrong and you get Goodharted.”
The next four posts basically make exactly these points (except for “pointing at what the human is is hard”). Or actually, they don’t talk about the “look inside the brain” part either, but I agree with your argument there as well.
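To make the “many goals consistent with the actions you see” point concrete, here is a minimal sketch of my own (a toy example, not from any of the posts): two reward functions that assign very different numbers to states, yet provably induce the same optimal behavior, via the potential-based shaping result of Ng, Harada & Russell (1999). The MDP, the potential `phi`, and all names are illustrative.

```python
import numpy as np

N, GAMMA = 5, 0.9     # a tiny deterministic chain MDP: states 0..4
ACTIONS = (-1, +1)    # move left / move right (clipped at the ends)

def step(s, a):
    """Deterministic transition: move along the chain, clipped to [0, N-1]."""
    return min(max(s + a, 0), N - 1)

def greedy_policy(reward):
    """Value iteration for reward(s, s'), then the greedy policy it induces."""
    V = np.zeros(N)
    for _ in range(500):  # plenty of sweeps for convergence at GAMMA = 0.9
        V = np.array([max(reward(s, step(s, a)) + GAMMA * V[step(s, a)]
                          for a in ACTIONS) for s in range(N)])
    return tuple(max(ACTIONS, key=lambda a: reward(s, step(s, a)) + GAMMA * V[step(s, a)])
                 for s in range(N))

def r_a(s, s2):
    """Reward A: payoff only for being at the rightmost state."""
    return 1.0 if s2 == N - 1 else 0.0

# Reward B: reward A plus a potential-based shaping term
# GAMMA * phi(s') - phi(s). Ng, Harada & Russell (1999) prove this
# preserves the optimal policy, even though the per-state numbers
# (phi here is arbitrary) look nothing like reward A.
phi = np.array([3.0, -1.0, 4.0, 1.0, -5.0])

def r_b(s, s2):
    return r_a(s, s2) + GAMMA * phi[s2] - phi[s]

# Identical behavior from very different reward functions: an observer
# who only sees actions cannot tell r_a and r_b apart.
assert greedy_policy(r_a) == greedy_policy(r_b)
print("both rewards induce the policy:", greedy_policy(r_a))
```

Since an observer only ever sees the resulting policy, no amount of action data distinguishes `r_a` from `r_b`; choosing between them takes further assumptions, which is exactly where the mis-specification creeps back in.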