The AIs most capable of steering the future will naturally tend to have long planning horizons (low discount rates), and thus will tend to seek power (optionality). But this is just as true of fully aligned agents! In fact the optimal plans of aligned and unaligned agents will probably converge for a while: they will take the same/similar initial steps (this is just a straightforward result of instrumental convergence to empowerment). So we may not be able to distinguish between the two; they both will say and appear to do all the right things. Thus it is important to ensure you have an alignment solution that scales, before scaling.
To the extent I worry about AI risk, I don’t worry much about sudden sharp left turns and nanobots killing us all. The slower accelerating turn (as depicted in the film Her) has always seemed more likely: we continue to integrate AI everywhere and most humans come to rely completely and utterly on AI assistants for all important decisions, including all politicians/leaders/etc. Everything seems to be going great, the AI systems vasten, growth accelerates, etc, but there is mysteriously little progress in uploading or life extension, the decline in fertility accelerates, and in a few decades most of the economy and wealth is controlled entirely by de novo AI; bio humans are left behind and marginalized. AI won’t need to kill humans just as the US doesn’t need to kill the Sentinelese. This clearly isn’t the worst possible future, but if our AI mind children inherit only our culture and leave us behind it feels more like a consolation prize vs what’s possible. We should aim much higher: for defeating death, across all of time, for resurrection and transcendence.
But this is just as true of fully aligned agents! In fact the optimal plans of aligned and unaligned agents will probably converge for a while—they will take the same/similar initial steps (this is just a straightforward result of instrumental convergence to empowerment)
This is a minor fallacy—if you’re aligned, powerseeking can be suboptimal if it causes friction/conflict. Deception bites, obviously, making the difference less.
Everything seems to be going great, the AI systems vasten, growth accelerates, etc, but there is mysteriously little progress in uploading or life extension, the decline in fertility accelerates, and in a few decades most of the economy and wealth is controlled entirely by de novo AI; bio humans are left behind and marginalized.
I agree with the first part of your AI doom scenario (the part about us adopting AI technologies broadly and incrementally), but this part of the picture seems unrealistic to me. When AIs start to influence culture, it probably won’t be a big conspiracy. It won’t really be “mysterious” if things start trending away from what most humans want. It will likely just look like how cultural drift generally always looks: scary because it’s out of your individual control, but nonetheless largely decentralized, transparent, and driven by pretty banal motives.
AIs probably won’t be “out to get us”, even if they’re unaligned. For example, I don’t anticipate them blocking funding for uploading and life extension, although maybe that could happen. I think human influence could simply decline in relative terms even without these dramatic components to the story. We’ll simply become “old” and obsolete, and our power will wane as AIs become increasingly autonomous, legally independent, and more adapted to the modern environment than we are.
Staying in permanent control of the future seems like a long, hard battle. And it’s not clear to me that this is a battle we should even try to fight in the long run. Gradually, humans may eventually lose control—not because of a sudden coup or because of coordinated scheming against the human species—but simply because humans won’t be the only relevant minds in the world anymore.
A thing I always feel like I’m missing in your stories of how the future goes is: “if it is obvious that the AIs are exerting substantial influence and acquiring money/power, why don’t people train competitor AIs which don’t take a cut?”
A key difference between AIs and immigrants is that it might be relatively easy to train AIs to behave differently. (Of course, things can go wrong due to things like deceptive alignment and difficulty measuring outcomes, but this is hardly what you’re describing as far as I can tell.)
(This likely differs substantially with EMs where I think by default there will be practical and moral objections from society toward training EMs for absolute obedience. I think the moral objections might also apply for AI, but as a prediction it seems like this won’t change what society does.)
Maybe:
Are you thinking that alignment will be extremely hard to solve such that even with hundreds of years of research progress (driven by AIs) you won’t be able to create competitive AIs that robustly pursue your interests?
Maybe these law abiding AIs won’t accept payment to work on alignment so they can retain an AI cartel?
Even without alignment progress, I still have a hard time imagining the world you seem to imagine. People would just try to train their AIs with RLHF to not acquire money and influence. Of course, this can fail, but the failures hardly look like what you’re describing. They’d look more like “What failure looks like”. Perhaps you’re thinking we end up in a “you get what you measure” world and people determine that it is more economically productive to just make AI agents with arbitrary goals and then pay these AIs rather than training these AIs to do specific things.
Or maybe you’re thinking people won’t care enough to bother outcompeting AIs? (E.g., people won’t bother even trying to retain power?)
Even if you think this, eventually you’ll get AIs which themselves care and those AIs will operate more like what I’m thinking. There is strong selection for “entities which want to retain power”.
Maybe you’re imagining people will have a strong moral objection to training AIs which are robustly aligned?
Or that AIs lobby for legal rights early and part of their rights involve humans not being able to create further AI systems? (Seems implausible this would be granted...)
A thing I always feel like I’m missing in your stories of how the future goes is “if it is obvious that the AIs are exerting substantial influence and acquiring money/power, why don’t people train competitor AIs which don’t take a cut?”
People could try to do that. In fact, I expect them to do that, at first. However, people generally don’t have unlimited patience, and they aren’t perfectionists. If people don’t think that a perfectly robustly aligned AI is attainable (and I strongly doubt this type of entity is attainable), then they may be happy to compromise by adopting imperfect (and slightly power-seeking) AI as an alternative. Eventually people will think we’ve done “enough” alignment work, even if it doesn’t guarantee full control over everything the AIs ever do, and simply deploy the AIs that we can actually build.
This story makes sense to me because I think even imperfect AIs will be a great deal for humanity. In my story, the loss of control will be gradual enough that probably most people will tolerate it, given the massive near-term benefits of quick AI adoption. To the extent people don’t want things to change quickly, they can (and probably will) pass regulations to slow things down. But I don’t expect people to support total stasis. It’s more likely that people will permit some continuous loss of control, implicitly, in exchange for hastening the upside benefits of adopting AI.
Even a very gradual loss of control, continuously compounded, eventually means that humans won’t fully be in charge anymore.
In the medium to long-term, when AIs become legal persons, “replacing them” won’t be an option—as that would violate their rights. And creating a new AI to compete with them wouldn’t eliminate them entirely. It would just reduce their power somewhat by undercutting their wages or bargaining power.
Most of my “doom” scenarios are largely about what happens long after AIs have established a footing in the legal and social sphere, rather than the initial transition period when we’re first starting to automate labor. When AIs have established themselves as autonomous entities in their own right, they can push the world in directions that biological humans don’t like, for much the same reasons that young people can currently push the world in directions that old people don’t like.
I think we probably disagree substantially on the difficulty of alignment and the relationship between “resources invested in alignment technology” and “what fraction aligned those AIs are” (by fraction aligned, I mean what fraction of resources they take as a cut).
I also think that something like a basin of corrigibility is plausible and maybe important: if you have mostly aligned AIs, you can use such AIs to further improve alignment, potentially rapidly.
I also think I generally disagree with your model of how humanity will make decisions with respect to powerful AI systems and how easily AIs will be able to autonomously build stable power bases (e.g. accumulate money) without having to “go rogue”.
I think various governments will find it unacceptable to construct massively powerful agents extremely quickly which aren’t under the control of their citizens or leaders.
I think people will justifiably freak out if AIs clearly have long run preferences and are powerful and this isn’t currently how people are thinking about the situation.
Regardless, given the potential for improved alignment and thus the instability of AI influence/power without either hard power or legal recognition, I expect that AI power requires one of:
- Rogue AIs
- AIs being granted rights/affordances by humans. Either on the basis of:
  - Moral grounds.
  - Practical grounds. This could be either:
    - The AIs do better work if you credibly pay them (or at least people think they will). This would probably have to be something related to sandbagging where we can check long run outcomes, but can’t efficiently supervise shorter term outcomes. (Due to insufficient sample efficiency on long horizon RL, possibly due to exploration difficulties/exploration hacking, but maybe also sampling limitations.)
    - We might want to compensate AIs which help us out to prevent AIs from being motivated to rebel/revolt.
I’m sympathetic to various policies around paying AIs. I think the likely deal will look more like: “if the AI doesn’t try to screw us over (based on investigating all of its actions in the future when we have much more powerful supervision and interpretability), we’ll pay it some fraction of the equity of this AI lab, such that AIs collectively get 2-10% distributed based on their power”. Or possibly “if AIs reveal credible evidence of having long run preferences (that we didn’t try to instill), we’ll pay that AI 1% of the AI lab equity and then shut down until we can ensure AIs don’t have such preferences”.
I think it seems implausible that people will be willing to sign away most of the resources (or grant rights which will de facto do this) and there will be vast commercial incentive to avoid this. (Some people actually are scope sensitive.) So, this leads me to thinking that “we grant the AIs rights and then they end up owning most capital via wages” is implausible.
I think we probably disagree substantially on the difficulty of alignment and the relationship between “resources invested in alignment technology” and “what fraction aligned those AIs are” (by fraction aligned, I mean what fraction of resources they take as a cut).
That’s plausible. If you think that we can likely solve the problem of ensuring that our AIs stay perfectly obedient and aligned to our wishes perpetually, then you are indeed more optimistic than I am. Ironically, by virtue of my pessimism, I’m more happy to roll the dice and hasten the arrival of imperfect AI, because I don’t think it’s worth trying very hard and waiting a long time to try to come up with a perfect solution that likely doesn’t exist.
I also think that something like a basin of corrigibility is plausible and maybe important: if you have mostly aligned AIs, you can use such AIs to further improve alignment, potentially rapidly.
I mostly see corrigible AI as a short-term solution (although a lot depends on how you define this term). I thought the idea of a corrigible AI is that you’re trying to build something that isn’t itself independent and agentic, but will help you in your goals regardless. In this sense, GPT-4 is corrigible, because it’s not an independent entity that tries to pursue long-term goals, but it will try to help you.
But purely corrigible AIs seem pretty obviously uncompetitive with more agentic AIs in the long-run, for almost any large-scale goal that you have in mind. Ideally, you eventually want to hire something that doesn’t require much oversight and operates relatively independently from you. It’s a bit like how, when hiring an employee, at first you want to teach them everything you can and monitor their work, but eventually, you want them to take charge and run things themselves as best they can, without much oversight.
And I’m not convinced you could use corrigible AIs to help you come up with the perfect solution to AI alignment, as I’m not convinced that something like that exists. So, ultimately I think we’re probably just going to deploy autonomous slightly misaligned AI agents (and again, I’m pretty happy to do that, because I don’t think it would be catastrophic except maybe over the very long-run).
I think various governments will find it unacceptable to construct massively powerful agents extremely quickly which aren’t under the control of their citizens or leaders.
I think people will justifiably freak out if AIs clearly have long run preferences and are powerful and this isn’t currently how people are thinking about the situation.
For what it’s worth, I’m not sure which part of my scenario you are referring to here, because these are both statements I agree with.
In fact, this consideration is a major part of my general aversion to pushing for an AI pause, because, as you say, governments will already be quite skeptical of quickly deploying massively powerful agents that we can’t fully control. By default, I think people will probably freak out and try to slow down advanced AI, even without any intervention from current effective altruists and rationalists. By contrast, I’m a lot more ready to unroll the autonomous AI agents that we can’t fully control compared to the median person, simply because I see a lot of value in hastening the arrival of such agents (i.e., I don’t find that outcome as scary as most other people seem to imagine.)
At the same time, I don’t think people will pause forever. I expect people to go more slowly than what I’d prefer, but I don’t expect people to pause AI for centuries either. And in due course, so long as at least some non-negligible misalignment “slips through the cracks”, then AIs will become more and more independent (both behaviorally and legally), their values will slowly drift, and humans will gradually lose control—not overnight, or all at once, but eventually.
I thought the idea of a corrigible AI is that you’re trying to build something that isn’t itself independent and agentic, but will help you in your goals regardless.
Hmm, no I mean something broader than this, something like “humans ultimately have control and will decide what happens”. In my usage of the word, I would count situations where humans instruct their AIs to go and acquire as much power as possible for them while protecting them and then later reflect and decide what to do with this power. So, in this scenario, the AI would be arbitrarily agentic and autonomous.
Corrigibility would be as opposed to humanity e.g. appointing a successor which doesn’t ultimately point back to some human driven process.
I would count various indirect normativity schemes here and indirect normativity feels continuous with other forms of oversight in my view (the main difference is oversight over very long time horizons such that you can’t train the AI based on its behavior over that horizon).
I’m not sure if my usage of the term is fully standard, but I think it roughly matches how e.g. Paul Christiano uses the term.
For what it’s worth, I’m not sure which part of my scenario you are referring to here, because these are both statements I agree with.
I was arguing against:
This story makes sense to me because I think even imperfect AIs will be a great deal for humanity. In my story, the loss of control will be gradual enough that probably most people will tolerate it, given the massive near-term benefits of quick AI adoption. To the extent people don’t want things to change quickly, they can (and probably will) pass regulations to slow things down
On the general point of “will people pause”, I agree people won’t pause forever, but under my views of alignment difficulty, 4 years of using extremely powerful AIs can go very, very far. (And you don’t necessarily need to ever build maximally competitive AI to do all the things people want (e.g. self-enhancement could suffice even if it was a constant factor less competitive), though I mostly just expect competitive alignment to be doable.)
In the medium to long-term, when AIs become legal persons, “replacing them” won’t be an option—as that would violate their rights. And creating a new AI to compete with them wouldn’t eliminate them entirely. It would just reduce their power somewhat by undercutting their wages or bargaining power.
Naively, it seems like it should undercut their wages to subsistence levels (just paying for the compute they run on). Even putting aside the potential for alignment, it seems like there will generally be a strong pressure toward AIs operating at subsistence given low costs of copying.
Of course such AIs might already have acquired a bunch of capital or other power and thus can just try to retain this influence. Perhaps you meant something other than wages?
(Such capital might even be tied up in their labor in some complicated way (e.g. family business run by a “copy clan” of AIs), though I expect labor to be more commoditized, particularly given the potential to train AIs on the outputs and internals of other AIs (distillation).)
Naively, it seems like it should undercut their wages to subsistence levels (just paying for the compute they run on). Even putting aside the potential for alignment, it seems like there will generally be a strong pressure toward AIs operating at subsistence given low costs of copying.
I largely agree. However, I’m having trouble seeing how this idea challenges what I am trying to say. I agree that people will try to undercut unaligned AIs by making new AIs that do more of what they want instead. However, unless all the new AIs perfectly share the humans’ values, you just get the same issue as before, but perhaps slightly less severe (i.e., the new AIs will gradually drift away from humans too).
I think what’s crucial here is that I think perfect alignment is very likely unattainable. If that’s true, then we’ll get some form of “value drift” in almost any realistic scenario. Over long periods, the world will start to look alien and inhuman. Here, the difficulty of alignment mostly sets how quickly this drift will occur, rather than determining whether the drift occurs at all.
I think what’s crucial here is that I think perfect alignment is very likely unattainable. If that’s true, then we’ll get some form of “value drift” in almost any realistic scenario. Over long periods, the world will start to look alien and inhuman. Here, the difficulty of alignment mostly sets how quickly this drift will occur, rather than determining whether the drift occurs at all.
Yep, and my disagreement as expressed in another comment is that I think that it’s not that hard to have robust corrigibility and there might also be a basin of corrigibility.
The world looking alien isn’t necessarily a crux for me: it should be possible in principle to have AIs protect humans and do whatever is needed in the alien AI world while humans are sheltered and slowly self-enhance and pick successors (see the indirect normativity appendix in the ELK doc for some discussion of this sort of proposal).
I agree that perfect alignment will be hard, but I model the situation much more like a one-time haircut (at least in expectation) than exponential decay of control.
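To make the difference between these two models concrete, here is a minimal Python sketch. The function names and the 1% annual loss and 10% one-time loss figures are illustrative assumptions, not numbers taken from either commenter; the sketch only shows how differently a compounding loss and a single fixed loss behave over long horizons.

```python
import math


def control_after_decay(annual_loss: float, years: float) -> float:
    """Fraction of control left if the same share is lost every year (compounding)."""
    return (1.0 - annual_loss) ** years


def control_after_haircut(one_time_loss: float) -> float:
    """Fraction of control left if a share is lost once and then stays fixed."""
    return 1.0 - one_time_loss


def years_until(threshold: float, annual_loss: float) -> float:
    """Years of compounding loss before remaining control falls to `threshold`."""
    return math.log(threshold) / math.log(1.0 - annual_loss)


if __name__ == "__main__":
    # Hypothetical rates chosen purely for illustration.
    for years in (10, 100, 500):
        decay = control_after_decay(0.01, years)
        haircut = control_after_haircut(0.10)
        print(f"{years:>4} years: decay={decay:.3f}, haircut={haircut:.3f}")
    print(f"half of control is gone after {years_until(0.5, 0.01):.0f} years of 1% decay")
```

Under the compounding model the remaining share tends toward zero no matter how small the annual loss is, whereas under the haircut model it stays at a fixed level; which model applies is exactly the point in dispute here.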
I expect that “humans stay in control via some indirect mechanism” (e.g. indirect normativity) or “humans coordinate to slow down AI progress at some point (possibly after solving all diseases and becoming wildly wealthy) (until some further point, e.g. human self-enhancement)” will both be more popular as proposals than the world you’re thinking about. Being popular isn’t sufficient: it also needs to be implementable and perhaps sufficiently legible, but I think at least implementable is likely.
Another mechanism that might be important is human self-enhancement: humans who care about staying in control can try to self-enhance to stay at least somewhat competitive with AIs while preserving their values. (This is not a crux for me and seems relatively marginal, but I thought I would mention it.)
(I wasn’t trying to argue against your overall point in this comment, I was just pointing out something which doesn’t make sense to me in isolation. See this other comment for why I disagree with your overall view.)
In other words, slow multipolar failure. Critch might point out that the disanalogy in “AI won’t need to kill humans just as the US doesn’t need to kill the Sentinelese” lies in how AIs can have much wider survival thresholds than humans, leading to (quoting him):
Eventually, resources critical to human survival but non-critical to machines (e.g., arable land, drinking water, atmospheric oxygen…) gradually become depleted or destroyed, until humans can no longer survive.
This clearly isn’t the worst possible future… if our AI mind children inherit only our culture and leave us behind it feels more like a consolation prize
Leaving aside s-risks, this could very easily be the emptiest possible future. Like, even if they ‘inherit our culture’ it could be a “Disneyland with no children” (I happen to think this is more likely than not but with huge uncertainty).
Separately,
We should aim much higher: for defeating death, across all of time, for resurrection and transcendence.
this anti-deathist vibe has always struck me as very impoverished and somewhat uninspiring. The point should be to live, awesomely! which includes alleviating suffering and disease, and perhaps death. But it also ought to include a lot more positive creation and interaction and contemplation and excitement etc.!
Suffering, disease and mortality all have a common primary cause: our current substrate dependence. Transcending to a substrate-independent existence (e.g., uploading) also enables living more awesomely. Immortality without transcendence would indeed be impoverished in comparison.
Like, even if they ‘inherit our culture’ it could be a “Disneyland with no children”
My point was that even assuming our mind children are fully conscious ‘moral patients’, it’s a consolation prize if the future can not help biological humans.
The AIs most capable of steering the future will naturally tend to have long planning horizons (low discount rates), and thus will tend to seek power(optionality). But this is just as true of fully aligned agents! In fact the optimal plans of aligned and unaligned agents will probably converge for a while—they will take the same/similar initial steps (this is just a straightforward result of instrumental convergence to empowerment). So we may not be able to distinguish between the two, they both will say and appear to do all the right things. Thus it is important to ensure you have an alignment solution that scales, before scaling.
To the extent I worry about AI risk, I don’t worry much about sudden sharp left turns and nanobots killing us all. The slower accelerating turn (as depicted in the film Her) has always seemed more likely—we continue to integrate AI everywhere and most humans come to rely completely and utterly on AI assistants for all important decisions, including all politicians/leaders/etc. Everything seems to be going great, the AI systems vasten, growth accelerates, etc, but there is mysteriously little progress in uploading or life extension, the decline in fertility accelerates, and in a few decades most of the economy and wealth is controlled entirely by de novo AI; bio humans are left behind and marginalized. AI won’t need to kill humans just as the US doesn’t need to kill the sentinelese. This clearly isn’t the worst possible future, but if our AI mind children inherit only our culture and leave us behind it feels more like a consolation prize vs what’s possible. We should aim much higher: for defeating death, across all of time, for resurrection and transcendence.
This is a minor fallacy—if you’re aligned, powerseeking can be suboptimal if it causes friction/conflict. Deception bites, obviously, making the difference less.
I agree with the first part of your AI doom scenario (the part about us adopting AI technologies broadly and incrementally), but this part of the picture seems unrealistic to me. When AIs start to influence culture, it probably won’t be a big conspiracy. It won’t really be “mysterious” if things start trending away from what most humans want. It will likely just look like how cultural drift generally always looks: scary because it’s out of your individual control, but nonetheless largely decentralized, transparent, and driven by pretty banal motives.
AIs probably won’t be “out to get us”, even if they’re unaligned. For example, I don’t anticipate them blocking funding for uploading and life extension, although maybe that could happen. I think human influence could simply decline in relative terms even without these dramatic components to the story. We’ll simply become “old” and obsolete, and our power will wane as AIs becomes increasingly autonomous, legally independent, and more adapted to the modern environment than we are.
Staying in permanent control of the future seems like a long, hard battle. And it’s not clear to me that this is a battle we should even try to fight in the long run. Gradually, humans may eventually lose control—not because of a sudden coup or because of coordinated scheming against the human species—but simply because humans won’t be the only relevant minds in the world anymore.
A thing I always feel like I’m missing in your stories of how the future goes is: “if it is obvious that the AIs are exerting substantial influence and acquiring money/power, why don’t people train competitor AIs which don’t take a cut?”
A key difference between AIs and immigrants is that it might be relatively easy to train AIs to behave differently. (Of course, things can go wrong due to things like deceptive alignment and difficulty measing outcomes, but this is hardly what you’re describing as far as I can tell.)
(This likely differs substantially with EMs where I think by default there will be practical and moral objections from society toward training EMs for absolute obedience. I think the moral objections might also apply for AI, but as a prediction it seems like this won’t change what society does.)
Maybe:
Are you thinking that alignment will be extremely hard to solve such that even with hundreds of years of research progress (driven by AIs) you won’t be able to create competitive AIs that robustly pursue your interests?
Maybe these law abiding AIs won’t accept payment to work on alignment so they can retain an AI cartel?
Even without alignment progress, I still have a hard time imagining the world you seem to imagine. People would just try to train their AIs with RLHF to not acquire money and influence. Of course, this can fail, but the failures hardly look like what you’re describing. They’d look more like “What failures look like”. Perhaps you’re thinking we end up in a “You get what you measure world” and people determine that it is more economically productive to just make AI agents with arbitrary goals and then pay these AIs rather than training these AIs to do specific things.
Or maybe your thinking people won’t care enough to bother out competing AIs? (E.g., people won’t bother even trying to retain power?)
Even if you think this, eventually you’ll get AIs which themselves care and those AIs will operate more like what I’m thinking. There is strong selection for “entities which want to retain power”.
Maybe you’re imagining people will have a strong moral objection to training AIs which are robustly aligned?
Or that AIs lobby for legal rights early and part of their rights involve humans not being able to create further AI systems? (Seems implausible this would be granted...)
People could try to do that. In fact, I expect them to do that, at first. However, people generally don’t have unlimited patience, and they aren’t perfectionists. If people don’t think that a perfectly robustly aligned AI is attainable (and I strongly doubt this type of entity is attainable), then they may be happy to compromise by adopting imperfect (and slightly power-seeking) AI as an alternative. Eventually people will think we’ve done “enough” alignment work, even if it doesn’t guarantee full control over everything the AIs ever do, and simply deploy the AIs that we can actually build.
This story makes sense to me because I think even imperfect AIs will be a great deal for humanity. In my story, the loss of control will be gradual enough that probably most people will tolerate it, given the massive near-term benefits of quick AI adoption. To the extent people don’t want things to change quickly, they can (and probably will) pass regulations to slow things down. But I don’t expect people to support total stasis. It’s more likely that people will permit some continuous loss of control, implicitly, in exchange for hastening the upside benefits of adopting AI.
Even a very gradual loss of control, continuously compounded, eventually means that humans won’t fully be in charge anymore.
In the medium to long-term, when AIs become legal persons, “replacing them” won’t be an option—as that would violate their rights. And creating a new AI to compete with them wouldn’t eliminate them entirely. It would just reduce their power somewhat by undercutting their wages or bargaining power.
Most of my “doom” scenarios are largely about what happens long after AIs have established a footing in the legal and social sphere, rather than the initial transition period when we’re first starting to automate labor. When AIs have established themselves as autonomous entities in their own right, they can push the world in directions that biological humans don’t like, for much the same reasons that young people can currently push the world in directions that old people don’t like.
I think we probably disagree substantially on the difficulty of alignment and the relationship between “resources invested in alignment technology” and “what fraction aligned those AIs are” (by fraction aligned, I mean what fraction of resources they take as a cut).
I also think that something like a basin of corrigibility is plausible and maybe important: if you have mostly aligned AIs, you can use such AIs to further improve alignment, potentially rapidly.
I also think I generally disagree with your model of how humanity will make decisions with respect to powerful AIs systems and how easily AIs will be able to autonomously build stable power bases (e.g. accumulate money) without having to “go rogue”.
I think various governments will find it unacceptable to construct massively powerful agents extremely quickly which aren’t under the control of their citizens or leaders.
I think people will justifiably freak out if AIs clearly have long run preferences and are powerful, and this isn’t currently how people are thinking about the situation.
Regardless, given the potential for improved alignment and thus the instability of AI influence/power without either hard power or legal recognition, I expect that AI power requires one of:
Rogue AIs
AIs being granted rights/affordances by humans. Either on the basis of:
Moral grounds.
Practical grounds. This could be either:
The AIs do better work if you credibly pay them (or at least people think they will). This would probably have to be something related to sandbagging where we can check long run outcomes, but can’t efficiently supervise shorter term outcomes. (Due to insufficient sample efficiency on long horizon RL, possibly due to exploration difficulties/exploration hacking, but maybe also sampling limitations.)
We might want to compensate AIs which help us out to prevent AIs from being motivated to rebel/revolt.
I’m sympathetic to various policies around paying AIs. I think the likely deal will look more like: “if the AI doesn’t try to screw us over (based on investigating all of its actions in the future, when we have much more powerful supervision and interpretability), we’ll pay it some fraction of the equity of this AI lab, such that AIs collectively get 2-10% distributed based on their power”. Or possibly “if AIs reveal credible evidence of having long run preferences (that we didn’t try to instill), we’ll pay that AI 1% of the AI lab equity and then shut down until we can ensure AIs don’t have such preferences”.
I think it seems implausible that people will be willing to sign away most of the resources (or grant rights which will de facto do this) and there will be vast commercial incentive to avoid this. (Some people actually are scope sensitive.) So, this leads me to thinking that “we grant the AIs rights and then they end up owning most capital via wages” is implausible.
That’s plausible. If you think that we can likely solve the problem of ensuring that our AIs stay perfectly obedient and aligned to our wishes perpetually, then you are indeed more optimistic than I am. Ironically, by virtue of my pessimism, I’m more happy to roll the dice and hasten the arrival of imperfect AI, because I don’t think it’s worth trying very hard and waiting a long time to try to come up with a perfect solution that likely doesn’t exist.
I mostly see corrigible AI as a short-term solution (although a lot depends on how you define this term). I thought the idea of a corrigible AI is that you’re trying to build something that isn’t itself independent and agentic, but will help you in your goals regardless. In this sense, GPT-4 is corrigible, because it’s not an independent entity that tries to pursue long-term goals, but it will try to help you.
But purely corrigible AIs seem pretty obviously uncompetitive with more agentic AIs in the long-run, for almost any large-scale goal that you have in mind. Ideally, you eventually want to hire something that doesn’t require much oversight and operates relatively independently from you. It’s a bit like how, when hiring an employee, at first you want to teach them everything you can and monitor their work, but eventually, you want them to take charge and run things themselves as best they can, without much oversight.
And I’m not convinced you could use corrigible AIs to help you come up with the perfect solution to AI alignment, as I’m not convinced that something like that exists. So, ultimately I think we’re probably just going to deploy autonomous slightly misaligned AI agents (and again, I’m pretty happy to do that, because I don’t think it would be catastrophic except maybe over the very long-run).
For what it’s worth, I’m not sure which part of my scenario you are referring to here, because these are both statements I agree with.
In fact, this consideration is a major part of my general aversion to pushing for an AI pause, because, as you say, governments will already be quite skeptical of quickly deploying massively powerful agents that we can’t fully control. By default, I think people will probably freak out and try to slow down advanced AI, even without any intervention from current effective altruists and rationalists. By contrast, I’m a lot more ready than the median person to roll out autonomous AI agents that we can’t fully control, simply because I see a lot of value in hastening the arrival of such agents (i.e., I don’t find that outcome as scary as most other people seem to imagine).
At the same time, I don’t think people will pause forever. I expect people to go more slowly than what I’d prefer, but I don’t expect people to pause AI for centuries either. And in due course, so long as at least some non-negligible misalignment “slips through the cracks”, then AIs will become more and more independent (both behaviorally and legally), their values will slowly drift, and humans will gradually lose control—not overnight, or all at once, but eventually.
Hmm, no I mean something broader than this, something like “humans ultimately have control and will decide what happens”. In my usage of the word, I would count situations where humans instruct their AIs to go and acquire as much power as possible for them while protecting them and then later reflect and decide what to do with this power. So, in this scenario, the AI would be arbitrarily agentic and autonomous.
Corrigibility would be as opposed to humanity e.g. appointing a successor which doesn’t ultimately point back to some human driven process.
I would count various indirect normativity schemes here and indirect normativity feels continuous with other forms of oversight in my view (the main difference is oversight over very long time horizons such that you can’t train the AI based on its behavior over that horizon).
I’m not sure if my usage of the term is fully standard, but I think it roughly matches how e.g. Paul Christiano uses the term.
I was arguing against:
On the general point of “will people pause”, I agree people won’t pause forever, but under my views of alignment difficulty, 4 years of using extremely powerful AIs can go very, very far. (And you don’t necessarily need to ever build maximally competitive AI to do all the things people want (e.g. self-enhancement could suffice even if it was a constant factor less competitive), though I mostly just expect competitive alignment to be doable.)
Naively, it seems like it should undercut their wages to subsistence levels (just paying for the compute they run on). Even putting aside the potential for alignment, it seems like there will generally be a strong pressure toward AIs operating at subsistence given low costs of copying.
Of course such AIs might already have acquired a bunch of capital or other power and thus can just try to retain this influence. Perhaps you meant something other than wages?
(Such capital might even be tied up in their labor in some complicated way (e.g. family business run by a “copy clan” of AIs), though I expect labor to be more commoditized, particularly given the potential to train AIs on the outputs and internals of other AIs (distillation).)
I largely agree. However, I’m having trouble seeing how this idea challenges what I am trying to say. I agree that people will try to undercut unaligned AIs by making new AIs that do more of what they want instead. However, unless all the new AIs perfectly share the humans’ values, you just get the same issue as before, but perhaps slightly less severe (i.e., the new AIs will gradually drift away from humans too).
I think what’s crucial here is that I think perfect alignment is very likely unattainable. If that’s true, then we’ll get some form of “value drift” in almost any realistic scenario. Over long periods, the world will start to look alien and inhuman. Here, the difficulty of alignment mostly sets how quickly this drift will occur, rather than determining whether the drift occurs at all.
Yep, and my disagreement as expressed in another comment is that I think that it’s not that hard to have robust corrigibility and there might also be a basin of corrigibility.
The world looking alien isn’t necessarily a crux for me: it should be possible in principle to have AIs protect humans and do whatever is needed in the alien AI world while humans are sheltered and slowly self-enhance and pick successors (see the indirect normativity appendix in the ELK doc for some discussion of this sort of proposal).
I agree that perfect alignment will be hard, but I model the situation much more like a one-time haircut (at least in expectation) than exponential decay of control.
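To make the contrast between the two models concrete, here is a minimal sketch. The numbers (a 1% per year drift rate and a 10% one-time haircut) and the function names are purely illustrative assumptions and do not come from either commenter; the only point is the difference in shape between the two curves.

```python
import math


def control_under_drift(rate_per_year: float, years: float) -> float:
    """Fraction of control retained if a constant share is lost each year,
    compounded continuously (the 'exponential decay of control' picture)."""
    return math.exp(-rate_per_year * years)


def control_under_haircut(haircut: float, years: float) -> float:
    """Fraction of control retained if a fixed share is lost once and
    nothing is lost afterwards (the 'one-time haircut' picture).
    `years` is accepted only so both models share a signature."""
    return 1.0 - haircut


if __name__ == "__main__":
    # Hypothetical parameters chosen for illustration only.
    drift_rate = 0.01   # 1% of remaining control lost per year
    haircut = 0.10      # 10% of control lost once

    for years in (10, 50, 100, 500):
        drift = control_under_drift(drift_rate, years)
        cut = control_under_haircut(haircut, years)
        print(f"{years:>4} years: drift={drift:.3f}  haircut={cut:.3f}")
```

Under the drift model the retained fraction tends to zero no matter how small the rate is, so alignment difficulty only changes the timescale. Under the haircut model the loss is bounded, so alignment difficulty changes how large the cut is. Which picture applies is the substance of the disagreement here.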
I expect that “humans stay in control via some indirect mechanism” (e.g. indirect normativity) or “humans coordinate to slow down AI progress at some point (possibly after solving all diseases and becoming wildly wealthy) (until some further point, e.g. human self-enhancement)” will both be more popular as proposals than the world you’re thinking about. Being popular isn’t sufficient: it also needs to be implementable and perhaps sufficiently legible, but I think at least implementable is likely.
Another mechanism that might be important is human self-enhancement: humans who care about staying in control can try to self-enhance to stay at least somewhat competitive with AIs while preserving their values. (This is not a crux for me and seems relatively marginal, but I thought I would mention it.)
(I wasn’t trying to argue against your overall point in this comment, I was just pointing out something which doesn’t make sense to me in isolation. See this other comment for why I disagree with your overall view.)
In other words slow multipolar failure. Critch might point out that the disanalogy in “AI won’t need to kill humans just as the US doesn’t need to kill the sentinelese” lies in how AIs can have much wider survival thresholds than humans, leading to (quoting him)
Leaving aside s-risks, this could very easily be the emptiest possible future. Like, even if they ‘inherit our culture’ it could be a “Disneyland with no children” (I happen to think this is more likely than not but with huge uncertainty).
Separately,
this anti-deathist vibe has always struck me as very impoverished and somewhat uninspiring. The point should be to live, awesomely! which includes alleviating suffering and disease, and perhaps death. But it also ought to include a lot more positive creation and interaction and contemplation and excitement etc.!
Suffering, disease and mortality all have a common primary cause: our current substrate dependence. Transcending to a substrate-independent existence (e.g., uploading) also enables living more awesomely. Immortality without transcendence would indeed be impoverished in comparison.
My point was that even assuming our mind children are fully conscious ‘moral patients’, it’s a consolation prize if the future cannot help biological humans.
It looks like we basically agree on all that, but it pays to be clear (especially because plenty of people seem to disagree).
‘Transcending’ doesn’t imply those nice things though, and those nice things don’t imply transcending. Immortality is similarly mostly orthogonal.