This is a strong candidate for best of the year. Clarifying the arguments for why alignment is hard seems like one of the two most important things we could be working on. If we could make a very clear argument for alignment being hard, we might actually have a shot at getting a slowdown or meaningful regulations. This post goes a long way toward putting those arguments in plain language. It stands alongside Zvi's On A List of Lethalities, Yudkowsky's original AGI Ruin: A List of Lethalities, Ruthenis' A Case for the Least Forgiving Take On Alignment, and similar. For many novice readers, this would be the best of those summaries: it puts everything into plain language while addressing the biggest problems.
The section on paths forward seems much less useful; that’s fine, that wasn’t the intended focus.
All of these except the original List lean far too heavily on the difficulty of understanding and defining human values. I think this is the biggest crux of disagreement on alignment difficulty. Optimists don't think that's part of the alignment problem, and that's a large part of why they're optimistic.
People who are informed and thoughtful but more optimistic, like Paul Christiano, typically do not envision giving AI values aligned with humans, but rather something like Corrigibility as Singular Target or instruction-following. This alignment target seems to vastly simplify that portion of the challenge; it's the difference between making a machine that wants to figure out what people want and then do it perfectly reliably, and making a machine that just wants to do what this one human meant by what they said to do. The latter is not only much simpler to define and to learn, but it means humans can correct mistakes instead of having to get everything right on the first shot.
That's my two cents on where work should follow up on this set of alignment difficulties. I'd also like to see people continue to refine and clarify the arguments for alignment difficulty, particularly with regard to the specific types of AGI we're actually working on, and I intend to spend part of my own time doing that as well.
I basically agree that this is a good post, mostly because it distills the core argument in a way that lets us make more productive progress on the issue.