This explanation seems clear and helpful to me. I wonder if you can extend it to also explain non-agentic approaches to Prosaic AI Alignment (and why some people prefer those). For example you cite Paul Christiano a number of times, but I don’t think he intends to end up with a model “implementing highly effective expected utility maximization for some utility function”.
If we are dealing with a model doing expected utility maximization we can ‘just’ try to understand whether we agree with its goal, and then essentially trust that it will correctly and stably generalize to almost any situation.
I wonder if you can extend it to also explain non-agentic approaches to Prosaic AI Alignment (and why some people prefer those).
I’m quite confused about what a non-agentic approach actually looks like, and I agree that extending this to give a proper account would be really interesting. A possible argument for actively avoiding ‘agentic’ models from this framework is:
1. Models which generalize very competently also seem more likely to have malign failures, so we might want to avoid them.
2. If we believe H then things which generalize very competently are likely to have agent-like internal architecture.
3. Having a selection criterion or model-space/prior which actively pushes away from such agent-like architectures could then help push away from things which generalize too broadly.
I think my main problem with this argument is that step 3 might make step 2 invalid—it might be that if you actively punish agent-like architecture in your search then you will break the conditions that made ‘too broad generalization’ imply ‘agent-like architecture’, and thus end up with things that still generalize very broadly (with all the downsides of this) but just look a lot weirder.
This seems too optimistic/trusting. See Ontology identification problem, Modeling distant superintelligences, and more recently The “Commitment Races” problem.
Thanks for the links. I definitely agree that I was drastically oversimplifying this problem. I still think this task might be much simpler than the task of trying to understand the generalization of some strange model whose internal workings we don’t even have a vocabulary to describe.
(Black) hat tip to johnswentworth for the notion that the choice of boundary for the agent is arbitrary, in the sense that you can think of a thermostat optimizing the environment or think of the environment as optimizing the thermostat. This collapses the sensor/control duality for at least some types of feedback circuits.
These sorts of problems are what caused me to want a presentation which didn’t assume well-defined agents and boundaries in the ontology, but I’m not sure how it applies to the above. I am not looking for optimization as a behavioral pattern but as a concrete type of computation, which involves storing world-models and goals and doing active search for actions which further the goals. Neither a thermostat nor the world outside seems to do this, from what I can see. I think I’m likely missing your point.
Motivating example: consider a primitive bacterium with a thermostat, a light sensor, and a salinity detector, each of which has a functional mapping to some movement pattern. You could say this system has a 3-dimensional map and appears to search over it.
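To make the contrast in this exchange concrete, here is a minimal toy sketch. Everything in it is an assumption made for illustration: the integer 3-D state, the preferred point at the origin, and the names `reflex_policy`, `search_policy`, `predict`, and `utility` are hypothetical and not taken from the post. The first policy is three independent sensor-to-movement mappings. The second stores a world-model and a goal and does an explicit search over actions. In this toy environment the two behave identically, which is roughly why looking at behavior alone would not tell you which kind of computation is running.

```python
from itertools import product


def sign(x):
    """Return -1, 0, or 1 according to the sign of x."""
    return (x > 0) - (x < 0)


def reflex_policy(state):
    """Three independent sensor-to-movement mappings: no model, goal, or search."""
    temp, light, salt = state
    return (-sign(temp), -sign(light), -sign(salt))


def predict(state, action):
    """Toy world-model (assumed): moving shifts each reading by the chosen step."""
    return tuple(s + a for s, a in zip(state, action))


def utility(state):
    """Explicit goal (assumed): stay as close as possible to the point (0, 0, 0)."""
    return -sum(abs(s) for s in state)


def search_policy(state):
    """Enumerate all 27 candidate moves and pick the best predicted outcome."""
    actions = product((-1, 0, 1), repeat=3)
    return max(actions, key=lambda a: utility(predict(state, a)))


def run(policy, state, steps=10):
    """Apply a policy repeatedly in the toy environment and return the final state."""
    for _ in range(steps):
        state = predict(state, policy(state))
    return state
```

Both `run(reflex_policy, (3, -2, 1))` and `run(search_policy, (3, -2, 1))` end at the preferred point, so the difference between the two is entirely in the internal computation, not in what an outside observer sees here.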
In light of this exchange, it seems like it would be interesting to analyze how much the arguments for problematic properties of superintelligent utility-maximizing agents (like instrumental convergence) actually carry over to the broader class of well-generalizing systems.
Strongly agree with this, I think this seems very important.