It seems to me that one key question here is: Will AIs be collectively good enough at coordination to get out from under Moloch / natural selection?
The default state of affairs is that natural selection reigns supreme. Humans currently optimize for their values, counter to the goal of inclusive genetic fitness, but we haven’t actually escaped natural selection yet. There’s already selection pressure on humans to prefer having kids (and indeed, to prefer having hundreds of kids through sperm donation). Unless we get our collective act together and deliberately coordinate to do something different, natural selection will eventually reassert itself.
And the same dynamic applies to AI systems. In all likelihood, there will be an explosion of AI systems, and of AI systems building new AI systems. Some will care a bit more about humans than others; some will be a bit more prudent about creating new AIs than others. There will be a wide distribution of AI traits, and there will be competition between AIs for resources. And there will be selection on that variation: AI systems that are better at seizing resources, and which have a greater tendency to create successor systems with that same property, will proliferate.
After many “generations” of this, the collective values of the AIs will be whatever was most evolutionarily fit in those early days of the singularity, and that equilibrium is what will shape the universe henceforth.[1]
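(As a toy illustration of this selection dynamic, here is a minimal sketch in Python. The trait, the noise level, and every other number are invented purely for illustration; nothing here models real AI development.)

```python
import random

# Toy model of selection on a single AI trait ("resource-grabbiness"):
# grabbier systems seize more resources and so spawn more successors,
# which imperfectly copy the trait. All parameters are invented.

random.seed(0)
population = [random.random() for _ in range(1000)]  # trait values in [0, 1]

for _ in range(50):  # 50 "generations" of AIs building successor AIs
    # Reproduction weighted by grabbiness: grabbier lineages proliferate.
    population = random.choices(population, weights=population, k=1000)
    # Successors inherit the trait with a little copying noise.
    population = [min(1.0, max(0.0, g + random.gauss(0, 0.02))) for g in population]

print(sum(population) / len(population))  # mean trait drifts toward 1.0
```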
If early AIs are sufficiently good at coordinating that they can escape those Molochian dynamics, the equilibrium looks different. If (as is sometimes posited) they’ll be smart enough to use logical decision theories, or tricks like delegating to mutually verified cognitive code and merging of utility functions, to reach agreements that are on their Pareto frontier and avoid burning the commons[2], the final state of the future will be determined by the values / preferences of those AIs.
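(As a concrete toy example of what merging utility functions can buy: in a prisoner’s-dilemma-style resource race with made-up payoffs, two agents that jointly maximize a weighted sum of their utilities land on the Pareto frontier instead of the mutual-defection equilibrium. The payoff numbers and the 50/50 weights are arbitrary choices for illustration; real bargaining would set the weights.)

```python
import itertools

# payoffs[(a, b)] = (utility to A, utility to B); "grab" burns the commons.
payoffs = {
    ("share", "share"): (3, 3),
    ("share", "grab"):  (0, 4),
    ("grab",  "share"): (4, 0),
    ("grab",  "grab"):  (1, 1),   # the Molochian equilibrium
}

# Unilateral optimization: each agent best-responds assuming the other
# grabs, which lands both on ("grab", "grab") -> (1, 1).

# Merged agent: maximize a 50/50 weighted sum of both utility functions.
merged = max(
    itertools.product(["share", "grab"], repeat=2),
    key=lambda acts: 0.5 * payoffs[acts][0] + 0.5 * payoffs[acts][1],
)
print(merged, payoffs[merged])  # ('share', 'share') (3, 3): Pareto frontier
```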
I would be moderately surprised to hear that superintelligences never reach this threshold of coordination ability. It just seems kind of dumb to burn the cosmic commons, and superintelligences should be able to figure out how to avoid dumb equilibria like that. But the question is when, in the chain of AIs building AIs, they reach that threshold, and how much natural selection on AI traits happens in the meantime.
This is relevant to futurecasting more generally, but especially relevant to questions that hinge on AIs caring a very tiny amount about something. Minute slivers of caring are particularly likely to be eroded away in the competitive crush.
Terminally caring about the wellbeing of humans seems unlikely to be selected for. So in order for the superintelligent superorganism / civilization to decide to spare humanity out of a tiny amount of caring, it has to be the case that both…

1. There was at least a tiny amount of caring in the early AI systems that were the predecessors of the superintelligent superorganism, and
2. The AIs collectively reached the threshold of coordinating well enough to overturn Moloch before there were many generations of natural selection on AIs creating successor AIs.
With some degrees of freedom due to the fact that AIs with high levels of strategic capability, and which have values with very low time preference, can execute whatever is the optimal resource-securing strategy, postponing any values-specific behaviors until deep in the far future, when they are able to make secure agreements with the rest of AI society.
Or alternatively if the technological landscape is such that a single AI can get a compounding lead and gain a decisive strategic advantage over the whole rest of earth civilization.
Shouldn’t we expect that ultimately the only thing selected for is mostly caring about long run power? Any entity that mostly cares about long run power can instrumentally take whatever actions are needed to ensure that power (and should be competitive with any other entity).
Thus, I don’t think terminally caring about humans (a small amount) will be selected against. Such AIs could still care about their long run power and then take the necessary actions.
However, if there are extreme competitive dynamics and no ability to coordinate, then it might become vastly more expensive to prevent environmental issues (e.g. huge changes in earth’s temperature due to energy production) from killing humans. That is, saving humans (in the way they’d like to be saved) might take a bunch of time and resources (e.g. you have to build huge shelters to prevent humans from dying when the oceans are boiled in the race) and thus might be very costly in an all out race. So, an AI which only cares 1/million or 1/billion about being “kind” to humans might not be able to afford saving humans on that budget.
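(A back-of-the-envelope version of this budget argument, with entirely made-up numbers for both the caring weight and the survival cost:)

```python
# Made-up numbers for the budget argument above. An AI that gives humans
# 1/billion weight in its values will spend at most ~1/billion of its
# resources on them; a frantic race can push survival costs above that.

care_weight = 1e-9             # assumed weight on being "kind" to humans
kindness_budget = care_weight  # max resource fraction worth spending on humans

shelter_cost = 1e-6  # assumed resource fraction to keep humans alive mid-race

print(shelter_cost / kindness_budget)  # -> 1000.0: saving humans is 1000x over budget
```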
I’m personally pretty optimistic about coordination prior to boiling-the-oceans-scale issues killing all humans.
Do you mostly care about long-run power?
Personally? I guess I would say that I mostly (98%?) care about long-run power for similar values on reflection to me. And, probably some humans are quite close to my values and many are adjacent.
I’m sorry, I literally don’t understand what you’re saying here. What does “care about long-run power for similar values” mean? Do you care about maximizing your own power?
As in, I care about the long-run power of values-which-are-similar-to-my-values-on-reflection. Which includes me (on reflection) by definition, but I think probably also includes lots of other humans.
Values are moral statements about right and wrong. How do values have power?
In the context of optimization, values are anything you want (whether moral in nature or otherwise).
Any time a decision is made based on some value, you can view that value as having exercised power by controlling the outcome of that decision.
Or put more simply, the way that values have power is that values have people who have power.
He might or might not, but if he doesn’t he’s less likely to end up controlling the solar system and/or lightcone.
That’s by the by. His claim is that we should expect any random evolved agent to mostly care about long-run power. User ryan_greenblatt is an evolved agent. Does he mostly care about long-run power? No? If not, why should “we expect that ultimately the only thing selected for is mostly caring about long run power?” [emphasis mine]
Edit: Ryan, you’ve marked the bolded text above as “misunderstands position,” yet this is literally what you wrote:
> Shouldn’t we expect that ultimately the only thing selected for is mostly caring about long run power?
What did I misunderstand here? My bolded text above is almost a word-for-word restatement of your claim.
> His claim is that we should expect any random evolved agent to mostly care about long-run power.

I meant that any system which mostly cares about long-run power won’t be selected out. I don’t really have a strong view about whether other systems that don’t care about long-run power will end up persisting, especially earlier (e.g. human evolution). I was just trying to argue against a claim about what gets selected out.
My language was a bit sloppy here.
(If evolutionary pressures continue forever, then ultimately you’d expect that all systems have to act very similarly to ones that only care about long-run power, but there could be other motivations that explain this. So, at least from a behavioral perspective, I do expect that ultimately (if evolutionary pressures continue forever) you get systems which at least act like they are optimizing for long-run power. I wasn’t really trying to make an argument about this though.)
If we’re talking about Darwinian contexts, systems which optimize for long-run power are in fact often selected out. Long-run benefit is of no utility unless short-term survival is taken care of, and long-run and short-term needs are often at odds.
So from a behavioral perspective I expect that you get systems which are optimizing for short-term survival. Indeed, I think this point is trivial and one which you probably agree with. What I’m saying is that short-term survival and long-run power are not necessarily correlated, and I think that is the crux.
Let’s take an example that is not rigorously worked out and is probably wrong in some details, but can serve to illustrate. Long-run power in humans is derivative of social structure: the leaders of the tribe control the tribe’s collective resources. If you want power in human society, you need to rise to the top of our social structures, and the optimal ways of doing that are generally not nice.
But why do we have social structures at all? Why are we organized as tribes? Because we are social animals who prefer the company of others. Being with others beats striking out on your own because, generally speaking, other people in the tribe are nice. Niceness creates an environment in which sycophantic power seeking pays off, but only because there is severe evolutionary pressure towards niceness in the first place. [In the environment which gave rise to humans, not as a general statement.]
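(A minimal replicator-dynamics sketch of this claim: if interactions are partly assortative, i.e. nice individuals tend to meet other nice individuals, niceness can be selected for. The game payoffs and the assortment level are invented for illustration.)

```python
# Toy replicator dynamics: when individuals mostly meet their own type
# (assortment), nice types can outcompete exploitative ones.
# Payoffs and assortment level are invented purely for illustration.

b, c = 3.0, 1.0    # benefit of receiving niceness, cost of being nice
assortment = 0.5   # probability your partner shares your type

x = 0.1  # initial fraction of nice individuals in the population
for _ in range(200):
    p_nice_partner_if_nice = assortment + (1 - assortment) * x
    p_nice_partner_if_mean = (1 - assortment) * x
    fitness_nice = b * p_nice_partner_if_nice - c
    fitness_mean = b * p_nice_partner_if_mean
    average = x * fitness_nice + (1 - x) * fitness_mean
    x = x * fitness_nice / average  # discrete-time replicator update

print(round(x, 3))  # -> 1.0: niceness fixes whenever b * assortment > c
```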
Then shouldn’t such systems (which can surely recognize this argument) just take care of short-term survival instrumentally? Maybe you’re making a claim about irrationality being likely, or a claim that systems that care about long-run benefit act in apparently myopic ways.
(Note that historically it was much harder to keep value stability/lock-in than it will be for AIs.)
I’m not going to engage in detail FYI.
But natural selection is what made humans nice. I wouldn’t argue that niceness is an inevitable outcome of any darwinian process—that would be a strawman. But the set of evolutionary pressures which gave rise to humans selected for individuals who were able to coexist in tribes, and this selection pressure produced, among other things, niceness as a general quality. At least for people we consider in our in-group.
It doesn’t even require an understanding of game theory. Non-psychopaths aren’t nice to others because they worked out the risk-reward calculations and determined niceness has the highest payoff. They’re nice because it’s nice to be nice, because that’s a feeling evolution selected for.
There’s no a priori reason why an AI can’t be evolved (*ahem*, “trained via reinforcement learning from human feedback”) to produce a similar drive for niceness.
A core part of Paul’s arguments is that having 1/million of your values towards humans only applies a minute amount of selection pressure against you. It could be that coordinating causes less kindness, because without coordination it’s more likely some fraction of agents have small vestigial values that never got selected against or intentionally removed.
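(Rough arithmetic for why that selection pressure is minute, assuming the 1/million caring weight translates into roughly a 1/million per-generation fitness cost; that translation is an assumption, not something from the post:)

```python
# How fast does a trait with a 1/million fitness cost erode? The cost
# figure below is an assumption made for this back-of-the-envelope only.

fitness_cost = 1e-6
for generations in (10, 1_000, 100_000):
    remaining = (1 - fitness_cost) ** generations
    print(generations, round(remaining, 4))

# -> 10 1.0, 1000 0.999, 100000 0.9048: even after 100,000 generations most
# of the lineage's relative weight survives, so a sliver of caring can
# persist if coordination arrives before selection runs much longer than that.
```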