These posts are quite good, thank you for writing them.
I no longer think that the desiderata I listed in Impact Measure Desiderata should be our guiding star (although I think Rohin’s three are about right). Let’s instead look directly at the process of getting a (goal-directed) AI to do what we want, and think about what designs do well.
First, we specify the utility function. Second, the agent computes and follows a high-performing policy. This process repeats, with us refining the goal whenever the agent isn't doing what we want.
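For concreteness, here is a minimal toy sketch of that outer loop. All names in it (optimize, refine, behavior_is_acceptable) are hypothetical placeholders standing in for however the goal actually gets specified, deployed, and corrected:

```python
# Toy sketch of the specify-then-correct loop described above. Every name
# here is a hypothetical placeholder, not anything from the posts.

def specify_then_correct(initial_utility, optimize, behavior_is_acceptable,
                         refine, max_rounds=10):
    """Iteratively specify a goal, let the agent optimize it, and patch it.

    initial_utility: our first attempt at the utility function.
    optimize(utility): returns the high-performing policy the agent follows.
    behavior_is_acceptable(policy): our judgment of what the agent actually does.
    refine(utility, policy): returns a corrected utility function.
    """
    utility = initial_utility
    policy = optimize(utility)              # agent computes and follows a policy
    for _ in range(max_rounds):
        if behavior_is_acceptable(policy):  # we check whether it does what we want
            break
        utility = refine(utility, policy)   # if not, we refine the goal...
        policy = optimize(utility)          # ...and the agent re-optimizes
    return utility, policy
```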
What we want is for the AI to eventually be doing the right thing (even if we have to correct it a few times). The first way this can fail is for the agent to act to make what we want no longer feasible, or at least more expensive. That is, the agent changes the world so that even if it had the goal we wanted to give it, that goal would be significantly harder to accomplish:
The second problem is that the agent can prevent us from being able to correct it properly (by gaining or preserving too much power for itself, generally):
Together, these are catastrophes—we’re no longer able to get what we want in either situation. We should consider what designs preclude these failures naturally.
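One hedged way to write down the first failure (the notation here is my own shorthand, not anything from the posts):

```latex
% Hedged formalization, not from the posts: V^*_U(s) denotes the best value
% an agent could attain for utility function U starting from state s, and
% U_true the goal we actually wanted to give it. The first failure is the
% agent driving the world from s to some s' with
\[
  V^{*}_{U_{\mathrm{true}}}(s') \;\ll\; V^{*}_{U_{\mathrm{true}}}(s),
\]
% i.e. our true goal becomes much more expensive to accomplish even if we
% later manage to install it. The second failure is the analogue for us:
% the agent gains or keeps enough power that correcting it (re-specifying
% the goal at all) is no longer available to us.
```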
When considering debates over desiderata, it seems to me that we’re debating whether the desideratum will lead to good things (and each of us probably secretly had a different goal in mind for what impact measures should do). I’m interested in making the goal of this research explicit and getting it right. My upcoming sequence will cover this at length.