Overall, I’ve updated from “just aim for ambitious value learning” to “empirically figure out what potential medium-term alignment targets (e.g. human values, corrigibility, Do What I Mean, human mimicry, etc) are naturally expressible in an AGI’s internal concept-language”.
I like this. In fact, I would argue that some of those medium-term alignment targets are actually necessary stepping stones toward ambitious value learning.
Human mimicry, for one, could serve as a good behavioral prior for IRL agents. An AI that can reverse-engineer a human's policy function (e.g., by minimizing the error between the world-state trajectory its own actions produce and the one the human's actions produce) is probably already most of the way toward reverse-engineering the value function that drives that policy (e.g., start by looking for common features among the stable fixed points of the learned policy function). I would argue that the intrinsic drive to mimic other humans is a big part of why humans are so adept at aligning with each other.
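To make that concrete, here is a minimal sketch of mimicry-as-a-prior, assuming we already have (state, action) pairs from human demonstrations and some model of the environment's dynamics. The network shape, the `dynamics` function, and the fixed-point heuristic are all illustrative assumptions on my part, not anything claimed above.

```python
import torch
import torch.nn as nn

STATE_DIM, ACTION_DIM = 16, 4

# Step 1: behavioral cloning -- fit a policy that reproduces the human's actions,
# i.e. minimize the error between the trajectory our actions would produce and the
# one the human's actions produced (approximated here as per-step action error).
policy = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.Tanh(), nn.Linear(64, ACTION_DIM))
opt = torch.optim.Adam(policy.parameters(), lr=1e-3)

def bc_step(states, human_actions):
    """One gradient step of behavioral cloning on a batch of human demonstrations."""
    loss = nn.functional.mse_loss(policy(states), human_actions)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

# Step 2: use the cloned policy as a prior for value inference -- roll it forward in
# an (assumed) world model and collect the states it keeps settling into. Those
# approximate fixed points are candidate high-value states for the human.
def candidate_goal_states(dynamics, start_states, horizon=50, tol=1e-3):
    """Return states where the cloned policy roughly stops changing the world."""
    goals = []
    for s in start_states:
        for _ in range(horizon):
            s_next = dynamics(s, policy(s))   # `dynamics` is a stand-in world model
            if torch.norm(s_next - s) < tol:  # the policy has settled: a fixed point
                goals.append(s_next.detach())
                break
            s = s_next
    return goals
```

The second step is just the point made above: once the mimicking policy exists, the states it settles into are a natural place to start guessing what the human values.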
Do What I Mean (DWIM) would also require modeling humans in a way that would help greatly in modeling human values. A human who gives an AI instructions is mapping some high-dimensional, internally represented goal state onto a linear sequence of symbols (or a 2D diagram, or whatever). DWIM would require the AI to generate its own high-dimensional, internally represented goal states and optimize for the goals that assign a high likelihood to the instructions it received. If achievable, DWIM could also transform the local incentives of general AI capabilities research into something with a better Nash equilibrium: systems capable of predicting what humans intended them to do could prove far more valuable to existing stakeholders in AI research than current DL and RL systems, which tend to be brittle and prone to overfitting to the heuristics we give them.
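As a toy illustration of that inference step, here is a sketch, assuming the AI maintains a discrete set of candidate goal representations, a likelihood model for "how probable is this instruction given that goal," and a prior over goals. The word-overlap likelihood and uniform prior below are placeholder assumptions, not a proposal for how a real system would score instructions.

```python
import math

def dwim_infer(instruction, candidate_goals, instruction_likelihood, goal_prior):
    """Return argmax over goals g of log P(instruction | g) + log P(g)."""
    best_goal, best_score = None, -math.inf
    for g in candidate_goals:
        score = math.log(instruction_likelihood(instruction, g)) + math.log(goal_prior(g))
        if score > best_score:
            best_goal, best_score = g, score
    return best_goal

# Toy usage: goals are strings, and the "likelihood" is a crude word-overlap score.
goals = ["fetch coffee", "clean the desk", "fetch coffee without spilling"]
overlap = lambda instr, g: 1e-6 + len(set(instr.split()) & set(g.split()))
uniform = lambda g: 1.0 / len(goals)
print(dwim_infer("please fetch me some coffee", goals, overlap, uniform))
```

The structure mirrors the argument above: the instruction is treated as noisy evidence about a goal the human holds internally, and the AI's job is to invert that mapping rather than execute the symbols literally.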