And this is where the fundamental AGI-doom arguments – all these coherence theorems, utility-maximization frameworks, et cetera – come in. At their core, they’re claims that any “artificial generally intelligent system capable of autonomously optimizing the world the way humans can” would necessarily be well-approximated as a game-theoretic agent. Which, in turn, means that any system with the set of capabilities AI researchers ultimately want their AI models to have would inevitably have a set of potentially omnicidal failure modes.
If you drop the “artificially” from the claim, you are left with a claim that any “generally intelligent system capable of autonomously optimizing the world the way humans can” would necessarily be well-approximated as a game-theoretic agent. Do you endorse that claim, or do you think there is some particular reason that a biological or hybrid generally intelligent system (a human, say, or an organization built out of humans) capable of autonomously optimizing the world the way humans can might not be well-approximated as a game-theoretic agent?
Because humans sure don’t seem like paperclipper-style utility maximizers to me.
Do you endorse [the claim that any “generally intelligent system capable of autonomously optimizing the world the way humans can” would necessarily be well-approximated as a game-theoretic agent]?
Yes.
Because humans sure don’t seem like paperclipper-style utility maximizers to me.
Humans are indeed hybrid systems. But I would say that inasmuch as they act as generally intelligent systems capable of autonomously optimizing the world in scarily powerful ways, they do act as game-theoretic agents. E.g., people who are solely focused on resource accumulation, and don’t have self-destructive vices or any distracting values they’re not willing to sacrifice to Moloch, tend to indeed accumulate power at a steady rate. At a smaller scope, people tend to succeed at those of their long-term goals that they’ve clarified for themselves and doggedly pursue; and not succeed at them if they flip-flop between different passions on a daily basis.
I’ve been meaning to do some sort of literature review solidly backing this claim, actually, but it hasn’t been a priority for me. Hmm, maybe it’d be easy with the current AI tools...
By “hybrid system” I actually meant “system composed of multiple humans plus external structure”, sorry if that was unclear. Concretely I’m thinking of things like “companies” and “countries”.
people who are solely focused on resource accumulation, and don’t have self-destructive vices or any distracting values they’re not willing to sacrifice to Moloch, tend to indeed accumulate power at a steady rate.
I don’t see how one gets from this observation to the conclusion that humans are well-approximated as paperclipper-style agents.
I suppose it may be worth stepping back to clarify that when I say “paperclipper-style agents”, I mean “utility maximizers whose utility function is a function of the configuration of matter at some specific time in the future”. That’s a super-finicky-sounding definition, but my understanding is that you have to have a definition that looks like that if you want to use coherence theorems; otherwise you end up saying that a rock is an agent that maximizes the utility function “behave like a rock”.
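To unpack that finicky definition, here is one way to write it down (my own notation, offered as a sketch of the distinction rather than a standard statement): let a trajectory τ = (s₀, a₀, s₁, a₁, …) be the sequence of world-states and actions, and let T be some fixed future time.

```latex
\[
\textbf{Paperclipper-style:}\quad \mathrm{score}(\tau) = U\!\big(s_T\big),
\qquad U : \mathcal{S} \to \mathbb{R}.
\]
\[
\textbf{Vacuous version:}\quad U_{\mathrm{rock}}(\tau) =
\begin{cases}
1 & \text{if } \tau = \tau_{\mathrm{rock}},\\
0 & \text{otherwise.}
\end{cases}
\]
```

Under the second form every system, rocks included, trivially “maximizes” the indicator function on its own trajectory; restricting the utility’s domain to future world-states is what rules that move out and gives the coherence theorems something to bite on.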
It does not seem to me that very many humans are trying to maximize the resources under their control at the time of their death, nor does it seem like the majority of the resources in the world are under the control of the few people who have decided to do that. It is the case that people who care at all about obtaining resources control a significant fraction of the resources. But I don’t see a trend where the people who care maximally about controlling resources actually control a lot more resources than the people who care somewhat about controlling resources while still making time for a round of golf or whatever else they enjoy.
I like this exchange and the clarifications on both sides. I’ll add my response:
You’re right that coherence arguments work by assuming a goal is about the future. But “preferences over a single future timeslice” is too specific: the arguments still work if the goal concerns multiple timeslices, or an integral over time, or larger time periods that are still in the future. The argument starts breaking down only when the system has strong preferences over immediate actions, and those preferences are stronger than any preferences over the future-that-is-causally-downstream-from-those-actions. But even then it could be reasonable to model the system as a coherent agent during the times when its actions aren’t determined by near-term constraints, when longer-term goals dominate.
(a relevant part of Eliezer’s recent thread is “then probably one of those pieces runs over enough of the world-model (or some piece of reality causally downstream of enough of the world-model) that It can always do a little better by expending one more erg of energy.”, but it should be read in context)
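To sketch the space of goal-shapes being pointed at two paragraphs up (my own notation, a sketch of the claim rather than the theorems’ exact scope):

```latex
\[
U(\tau) = u(s_T) \qquad \text{(single future timeslice)}
\]
\[
U(\tau) = \sum_{t \ge t_0} \gamma^{t}\, u(s_t) \qquad \text{(many timeslices / a discounted integral over the future)}
\]
\[
U(\tau) = \sum_{t} v(a_t) \;+\; \epsilon\, u(s_T), \quad \epsilon \text{ small} \qquad \text{(dominant preferences over immediate actions)}
\]
```

The first two shapes still feed the coherence machinery; the third is the regime where the argument starts breaking down, because the action terms swamp anything about the causally-downstream future.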
Another missing piece here might be: The whole point of building an intelligent agent is that you know more about the future outcomes you want than you do about the process to get there. This is the thing that makes agents useful and valuable. And it’s the main thing that separates agents from most other computer programs.
On the other hand, it does look like the anti-corrigibility results can be overcome by sometimes having strong preferences over intermediate times (i.e. over particular ways the world should go) rather than over final outcomes. This does seem important in terms of alignment solutions. And it takes some steam out of the arguments that go “coherent therefore incorrigible” (or it at least should add some caveats). But this only helps us if we have a lot of control over the preferences and constraints of the agent, and if it has a couple of stability properties.
I like this exchange and the clarifications on both sides.
Yeah, it feels like it’s getting at a crux between the “backchaining / coherence theorems / solve-for-the-equilibrium / law thinking” cluster of world models and the “OODA loop / shard theory / interpolate and extrapolate / toolbox thinking” cluster of world models.
You’re right that coherence arguments work by assuming a goal is about the future. But “preferences over a single future timeslice” is too specific: the arguments still work if the goal concerns multiple timeslices, or an integral over time, or larger time periods that are still in the future. The argument starts breaking down only when the system has strong preferences over immediate actions, and those preferences are stronger than any preferences over the future-that-is-causally-downstream-from-those-actions
Humans do seem to have strong preferences over immediate actions. For example, many people prefer not to lie, even if they think that lying will help them achieve their goals and they are confident that they will not get caught.
I expect that in multi-agent environments, there is significant pressure towards legibly having these kinds of strong preferences over immediate actions. As such, I expect that that kind of structure will show up in future intelligent agents, rather than being a human-specific anomaly.
But even then it could be reasonable to model the system as a coherent agent during the times when its actions aren’t determined by near-term constraints, when longer-term goals dominate. [...] The whole point of building an intelligent agent is that you know more about the future outcomes you want than you do about the process to get there.
I expect that agents which predictably behave in the way EY describes as “going hard” (i.e. attempting to achieve their long-term goal at any cost) will find it harder to find other agents who will cooperate with them. It’s not a binary choice between “care about process” and “care about outcomes”—it is possible and common to care about outcomes, and also to care about the process used to achieve those outcomes.
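As a toy illustration of that last point (everything here is hypothetical and schematic, not anyone’s actual proposal): one scoring rule can care about both, so that a high-value plan reached through an objectionable process (“going hard”) loses to a slightly worse plan with a clean process.

```python
# Toy sketch: one scoring rule that cares about outcomes AND process.
# All names and numbers here are hypothetical illustrations.

def plan_score(outcome_value: float, process_cost: float,
               process_weight: float = 10.0) -> float:
    """Score a plan by its outcome value, penalized by how objectionable
    its process is. A large process_weight means process concerns can
    dominate outcome concerns, i.e. the agent does not "go hard"."""
    return outcome_value - process_weight * process_cost

# (outcome_value, process_cost): lying wins a bit more outcome,
# but at a nonzero process cost; honesty yields slightly less.
plans = {
    "lie_to_counterparty": (1.0, 0.2),
    "be_honest":           (0.8, 0.0),
}

best = max(plans, key=lambda name: plan_score(*plans[name]))
print(best)  # -> be_honest: the process penalty outweighs the outcome gain
```

With process_weight near zero this collapses back into a pure outcome-maximizer; the point is just that “cares about outcomes” and “has strong preferences over immediate actions” fit together in one scoring rule.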
On the other hand, it does look like the anti-corrigibility results can be overcome by sometimes having strong preferences over intermediate times (i.e. over particular ways the world should go) rather than over final outcomes.
Yeah. Or strong preferences over processes (although I suppose you can frame a preference over process as a preference over there not being any intermediate time where the agent is actively executing some specific undesired behavior).
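That reframing can be written out directly (my own notation again, as a sketch): a hard “never do X along the way” process preference becomes a statement about every intermediate time,

```latex
\[
U(\tau) =
\begin{cases}
u(s_T) & \text{if } a_t \notin \mathrm{Forbidden} \text{ for all } t \le T,\\
-\infty & \text{otherwise,}
\end{cases}
\]
```

which is just a maximally strong preference over how the intermediate timeslices go, of the kind the anti-corrigibility point above allows.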
But this only helps us if we have a lot of control over the preferences and constraints of the agent,
It does seem to me that “we have a lot of control over the approaches the agent tends to take” is true and becoming more true over time.
and if it has a couple of stability properties.
I doubt that systems trained with ML techniques have these properties. But I don’t think e.g. humans or organizations built out of humans + scaffolding have these properties either, and I have a sneaking suspicion that the properties in question are incompatible with competitiveness.