Very quick thoughts from my phone:
I enjoyed this post and recommended someone else in the office read it. You explain your thoughts well.
In general, could someone explain how these alignment approaches do not simply shift the question from “how do we align this one system?” to “how do we align this one system (that consists of two interacting sub-systems)?”
I like assumption 1 and think it’s the correct way to think about things. I personally think finding a true name for human morality is impossible.
Assumption 2 (linear capabilities ordering) strikes me as too strong. I’m not sure this is the correct ontology, since the “gaming” is quite a dynamic, interactive process involving two agents modeling each other. It seems unlikely you will be able to rank performance in this task along a single axis.
Here is a rough sketch of where I think it fails:
Consider a very weak evaluator, e, and a very strong evaluator, E.
Take an agent, A, capable of gaming the very strong evaluator E.
The very weak evaluator e runs a very simple algorithm: it avoids being gamed if and only if the agent it is evaluating has the same source code as A.
The weak evaluator e is then clearly ranked very low, since it is gamed by almost all agents. Yet it avoids being gamed by the very agent that could game E, so robustness to gaming doesn’t fit on a single axis.
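To make that concrete, here is a toy sketch in code. Everything in it is made up for illustration (the predicate names, and the assumption that E resists every agent except A), and it flattens “gaming” into a boolean outcome per (evaluator, agent) pair, which already abstracts away the interactive modeling between the two agents:

```python
# Toy illustration of the counterexample. All names are hypothetical, and
# "gaming" is reduced to a boolean outcome per (evaluator, agent) pair.

A_SOURCE = "source code of the strong agent A"

def strong_evaluator_E(agent_source: str) -> bool:
    """True iff E avoids being gamed. Assumed here: E resists every agent except A."""
    return agent_source != A_SOURCE

def weak_evaluator_e(agent_source: str) -> bool:
    """True iff e avoids being gamed. e only resists agents whose source
    code matches A's, and is gamed by everything else."""
    return agent_source == A_SOURCE

other_agents = [f"source code of agent {i}" for i in range(1000)]

# E resists almost all agents but is gamed by A ...
assert all(strong_evaluator_E(src) for src in other_agents)
assert not strong_evaluator_E(A_SOURCE)

# ... while e is gamed by almost all agents but resists A.
assert not any(weak_evaluator_e(src) for src in other_agents)
assert weak_evaluator_e(A_SOURCE)

# Neither evaluator dominates the other, so no single scalar capability score
# can order e and E consistently with who games whom.
```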
In general, could someone explain how these alignment approaches do not simply shift the question from “how do we align this one system?” to “how do we align this one system (that consists of two interacting sub-systems)?”
Thanks for pointing out another assumption I didn’t even consider articulating. The way this proposal answers the second question is:
1. (Outer) align one subsystem (agent) to the other subsystem (evaluator), which we know how to do because the evaluator runs on a computer.
2. Attempt to (outer) align the other subsystem (evaluator) to the human’s true objective through a fixed set of positive examples (initial behaviors or outcomes specified by humans) and a growing set of increasingly nuanced negative examples (specified by the improving agent).
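Very roughly, that loop looks something like the sketch below. All names and the scoring rule are made up for illustration; this is a minimal sketch of the two steps under those assumptions, not the proposal’s actual implementation:

```python
from typing import List

class Evaluator:
    """Evaluator trained from a fixed positive set and a growing negative set."""

    def __init__(self, positive_examples: List[str]):
        self.positives = list(positive_examples)  # fixed, human-specified
        self.negatives: List[str] = []            # grows as the agent improves

    def score(self, behavior: str) -> float:
        # Stand-in for a learned scoring rule over behaviors.
        if behavior in self.negatives:
            return 0.0
        return 1.0 if behavior in self.positives else 0.5

def agent_step(evaluator: Evaluator, round_idx: int) -> str:
    # Step 1: (outer) align the agent to the evaluator -- the agent searches
    # for a behavior the current evaluator scores well, which we can check
    # exactly because the evaluator runs on a computer. Here it just returns
    # a placeholder behavior the evaluator does not yet penalize.
    candidate = f"nuanced gaming behavior found in round {round_idx}"
    assert evaluator.score(candidate) > 0.0
    return candidate

evaluator = Evaluator(positive_examples=["initial human-specified behavior"])
for t in range(3):
    behavior = agent_step(evaluator, t)
    # Step 2: the agent's newly found behavior becomes an increasingly nuanced
    # negative example, refining the evaluator toward the human's true objective.
    evaluator.negatives.append(behavior)
```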
The very weak evaluator e runs a very simple algorithm: it avoids being gamed if and only if the agent it is evaluating has the same source code as A.
Oh, that’s interesting. I think this is indeed the most fragile assumption of the ones invoked here. Though I’m wondering whether you could actually obtain such an evaluator using the described procedure.