In general, could someone explain how these alignment approaches do not simply shift the question from “how do we align this one system?” to “how do we align this one system (that consists of two interacting sub-systems)?”
Thanks for pointing out another assumption I didn’t even consider articulating. The way this proposal answers the second question is:
1. (Outer) align one subsystem (agent) to the other subsystem (evaluator), which we know how to do because the evaluator runs on a computer.
2. Attempt to (outer) align the other subsystem (evaluator) to the human’s true objective through a fixed set of positive examples (initial behaviors or outcomes specified by humans) and a growing set of increasingly nuanced negative examples (specified by the improving agent).
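To make step 2 a bit more concrete, here is a minimal toy sketch of the loop as I understand it. Everything in it (the class names, the toy similarity function, and especially the `flag_gaming` oracle standing in for a human recognizing a gamed behavior once the agent surfaces it) is my own illustrative assumption, not part of the proposal. The point is only the two-way structure: the agent optimizes against the evaluator, and the gaming behaviors it discovers become negative examples the evaluator is updated on.

```python
import random


def similarity(a, b):
    # Toy character-overlap similarity; a real evaluator would use a learned model.
    common = len(set(a) & set(b))
    return common / max(len(set(a) | set(b)), 1)


class Evaluator:
    """Scores behaviors against a fixed positive set and a growing negative set."""

    def __init__(self, positive_examples):
        self.positives = set(positive_examples)  # fixed, specified by humans up front
        self.negatives = set()                   # grows as gamed behaviors are surfaced

    def score(self, behavior):
        if behavior in self.negatives:
            return 0.0
        return max((similarity(behavior, p) for p in self.positives), default=0.0)

    def add_negative(self, behavior):
        self.negatives.add(behavior)


class Agent:
    """Step 1: the agent is aligned to the evaluator by construction here,
    since it just keeps whichever candidate the evaluator scores highest."""

    def propose(self, candidates, evaluator):
        return max(candidates, key=evaluator.score)


def training_loop(evaluator, agent, candidate_pool, rounds, flag_gaming):
    for _ in range(rounds):
        candidates = random.sample(candidate_pool, k=3)
        best = agent.propose(candidates, evaluator)
        # Step 2: behaviors the improving agent finds that fool the evaluator
        # are labelled as negative examples and fed back into it.
        if flag_gaming(best):
            evaluator.add_negative(best)
    return evaluator


if __name__ == "__main__":
    random.seed(0)
    evaluator = Evaluator(positive_examples=["clean the room"])
    pool = [
        "tidy up the desk",
        "sweep and mop the room",
        "shove everything under the rug",
        "claim the room is already clean",
    ]
    # Stand-in oracle for "a human recognizes this as gaming once it is pointed out".
    gaming = {"shove everything under the rug", "claim the room is already clean"}
    training_loop(evaluator, Agent(), pool, rounds=20, flag_gaming=lambda b: b in gaming)
    print("learned negatives:", evaluator.negatives)
```

In the actual proposal the evaluator would presumably be a learned model rather than a set-membership check, but the shape of the loop stays the same: a fixed positive set from humans, and a negative set that grows as the agent gets better at probing the evaluator.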
The very weak evaluator e runs a very simple algorithm. It avoids being gamed if the agent it is evaluating has the same source code as A.
Oh, that’s interesting. I think this is indeed the most fragile of the assumptions invoked here. Though I’m wondering whether you could actually obtain such an evaluator using the described procedure.