Is it fair to say that you’re trying to:
make a theory of agency that at least somewhat describes what we’ll likely see in the real world, and that also corresponds precisely to some parts of the space of possible agents;
find a way to talk about alignment, and say what it means for agents to be aligned;
find a mathematical structure that corresponds to agents aligned with humans;
produce desiderata that can be used to engineer a training setup that might lead to a coherent, aligned agent?