Actual humans aren’t “aligned” with each other, and they may not be consistent enough that you can say they’re always “aligned” with themselves.
Completely agreed; see for example my post 3. Uploading, which makes this exact point at length.
Anyway, even if the approach did work, that would just mean that “its own ideas” were that it had to learn about and implement your (or somebody’s?) values, and also that its ideas about how to do that are sound. You still have to get that right before the first time it becomes uncontrollable. One chance, no matter how you slice it.
The point is that you now get one shot at a far simpler task: defining “your purpose as an AI is to learn about and implement the humans’ collective values” is a lot more compact, and a lot easier to get right first time, than an accurate description of human values in their full large-and-fairly-fragile details. As I demonstrate in the post linked to in that quote, the former, plus its justification as being obvious and stable under reflection, can be described in exhaustive detail on a few pages of text.
As for whether the model’s ideas on how to do that research are sound, that’s a capabilities problem: if the model is incapable of performing a significant research project when at least 80% of the answer is already in human libraries, then it’s not much of an alignment risk.