The usual framing is to align agent policies in a way that also ensures they don’t have an out-of-distribution phase change. With deceptive alignment, the sharp left turn, and goodharting, the capability to consider complicated plans, or mere optimization carried out in actuality, produces situations that are far out-of-distribution. Behavior on-distribution stops being a good indication of what’s going to happen there, quite apart from modern ML’s difficulties with robustness. In this framing, the training distribution also needs to confer alignment, so it gets anchored to the more usual situations that can currently be captured in practice, while more capable or over-optimizing agents discover situations more unusual than that. This anchoring of the character of the training distribution might itself be a problem.
It seems useful to deconfuse agents that don’t have that sort of phase change in their behavior, without being distracted by the separate problem of their alignment. When not focusing on aligned agents, it becomes more natural to consider arbitrary agents in arbitrary situations, including situations that can’t be obtained from the world as training data and so won’t be useful for ensuring alignment, and to treat such situations as training data for ML purposes anyway.
The simulator framing is an example of this point, except it still pays too much attention to currently observable reality (even as simulators themselves are bad at that task). A simulator is a map of counterfactuals, not an agent, and it’s not about describing particular agents; instead it describes all sorts of agents and their interactions in all sorts of situations simultaneously. Can a simulator be trained to be particularly good at describing agents that make coherent decisions across a wide variety of situations, including situations that can’t be collected as training data from the world, or won’t occur in actuality, and need to instead be generated, with little hope or intent of remaining predictive of the real world? Or maybe such coherence is more naturally found in something smaller than agents: a collective of intents/purposes/concepts/norms, each shaping a scope of situations where it’s relevant, competing/bargaining with the others of its kind to find an equilibrium that doesn’t exhibit the more jarring sorts of behavioral phase changes, especially capability-related ones.
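A minimal toy sketch of the generated-situations idea, with everything hypothetical and not taken from the post: a generator proposes situation encodings that have no real-world counterpart, a simulator predicts an agent’s decision distribution in each, and the training signal rewards coherence of those decisions across nearby situations rather than fidelity to any observed data.

```python
# Toy sketch (hypothetical setup): train a "simulator" to give coherent agent
# decisions across generated situations, with no real-world data involved.
import torch
import torch.nn as nn

SITUATION_DIM = 16   # arbitrary toy dimensionality of a "situation" encoding
NUM_ACTIONS = 4      # arbitrary toy action space

class Simulator(nn.Module):
    """Maps a situation encoding to a distribution over the agent's decisions."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(SITUATION_DIM, 64), nn.ReLU(),
            nn.Linear(64, NUM_ACTIONS),
        )

    def forward(self, situation):
        return torch.log_softmax(self.net(situation), dim=-1)

def generate_situations(batch_size):
    """Stand-in for a generator of situations that never occur in actuality:
    here, random encodings plus small perturbations of each one, treated as
    'the same decision problem seen from nearby angles'."""
    base = torch.randn(batch_size, SITUATION_DIM)
    perturbed = base + 0.1 * torch.randn_like(base)
    return base, perturbed

def coherence_loss(log_p, log_q):
    """Penalize incoherent descriptions: the simulated agent's decision
    distribution should not jump between nearby generated situations
    (a crude proxy for 'no behavioral phase change')."""
    return nn.functional.kl_div(log_q, log_p.exp(), reduction="batchmean")

simulator = Simulator()
optimizer = torch.optim.Adam(simulator.parameters(), lr=1e-3)

for step in range(1000):
    base, perturbed = generate_situations(batch_size=64)
    loss = coherence_loss(simulator(base), simulator(perturbed))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

Of course this toy loss alone is degenerate (a constant simulator is perfectly "coherent"); the interesting question in the framing above is whether coherence can be selected for alongside rich descriptions of many different agents, not instead of them.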