I agree that getting an AGI to care about goals that it understands is incredibly difficult and perhaps the most challenging part of alignment. But I think John’s original claim is true: “Many arguments for AGI misalignment depend on our inability to imbue AGI with a sufficiently rich understanding of what individual humans want and how to take actions that respect societal values more broadly.” Here are some prominent articulations of the argument.
Unsolved Problems in ML Safety (Hendrycks et al., 2021) says that “Encoding human goals and intent is challenging.” Section 4.1 is therefore about value learning. (Section 4.2 is about the difficulty of getting a system to internalize those values.)
Rohin Shah’s sequence on Value Learning argues that “Standard AI research will continue to make progress on learning what to do; catastrophe happens when our AI system doesn’t know what not to do. This is the part that we need to make progress on.”
Specification Gaming: The Flip Side of AI Ingenuity (Krakovna et al., 2020) says the following: “Designing task specifications (reward functions, environments, etc.) that accurately reflect the intent of the human designer tends to be difficult. Even for a slight misspecification, a very good RL algorithm might be able to find an intricate solution that is quite different from the intended solution, even if a poorer algorithm would not be able to find this solution and thus yield solutions that are closer to the intended outcome. This means that correctly specifying intent can become more important for achieving the desired outcome as RL algorithms improve. It will therefore be essential that the ability of researchers to correctly specify tasks keeps up with the ability of agents to find novel solutions.”
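To make the specification-gaming point concrete, here is a toy sketch of my own (not an example from the Krakovna et al. post): the designer intends “reach the goal square,” but specifies the reward as “get +1 whenever you are adjacent to the goal.” The reward function, trajectories, and positions below are all hypothetical illustrations. A weak policy that just walks to the goal does fine, while a stronger optimizer of the proxy discovers it can hover next to the goal forever, racking up proxy reward while never achieving the intended outcome.

```python
def misspecified_reward(pos, goal):
    """Proxy reward: +1 whenever the agent is adjacent to the goal."""
    return 1 if abs(pos - goal) == 1 else 0

def intended_return(trajectory, goal):
    """What the designer actually wanted: did the agent end up at the goal?"""
    return 1 if trajectory[-1] == goal else 0

goal = 5
naive_trajectory = [0, 1, 2, 3, 4, 5]            # walks to the goal, then stops
gaming_trajectory = [0, 1, 2, 3, 4] + [4] * 20   # hovers beside the goal indefinitely

for name, traj in [("naive", naive_trajectory), ("gaming", gaming_trajectory)]:
    proxy = sum(misspecified_reward(p, goal) for p in traj)
    print(f"{name}: proxy reward = {proxy}, intended outcome = {intended_return(traj, goal)}")
```

The “gaming” trajectory earns far more proxy reward while scoring zero on the intended objective, which is the sense in which better optimization of a slightly wrong specification can make outcomes worse rather than better.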
John is suggesting one way of working on the outer alignment problem, while Zach is pointing out that inner alignment is arguably more dangerous. These are both fair points IMO. In my experience, people on this website often reject work on specifying human values in favor of problems that are more abstract but seen as more fundamentally difficult. Personally I’m glad there are people working on both.