I think we probably disagree substantially on the difficulty of alignment and the relationship between “resources invested in alignment technology” and “what fraction aligned those AIs are” (by fraction aligned, I mean what fraction of resources they take as a cut).
I also think that something like a basin of corrigibility is plausible and maybe important: if you have mostly aligned AIs, you can use such AIs to further improve alignment, potentially rapidly.
I also think I generally disagree with your model of how humanity will make decisions with respect to powerful AI systems and how easily AIs will be able to autonomously build stable power bases (e.g. accumulate money) without having to “go rogue”.
I think various governments will find it unacceptable to construct massively powerful agents extremely quickly which aren’t under the control of their citizens or leaders.
I think people will justifiably freak out if AIs clearly have long run preferences and are powerful and this isn’t currently how people are thinking about the situation.
Regardless, given the potential for improved alignment and thus the instability of AI influence/power without either hard power or legal recognition, I expect that AI power requires one of:
Rogue AIs
AIs being granted rights/affordances by humans, either on the basis of:
Moral grounds.
Practical grounds. This could be either:
The AIs do better work if you credibly pay them (or at least people think they will). This would probably have to be something related to sandbagging where we can check long run outcomes, but can’t efficiently supervise shorter term outcomes. (Due to insufficient sample efficiency for long-horizon RL, possibly due to exploration difficulties/exploration hacking, but maybe also sampling limitations.)
We might want to compensate AIs which help us out to prevent AIs from being motivated to rebel/revolt.
I’m sympathetic to various policies around paying AIs. I think the likely deal will look more like: “if the AI doesn’t try to screw us over (based on investigating all of its actions in the future when we have much more powerful supervision and interpretability), we’ll pay it some fraction of the equity of this AI lab, such that AIs collectively get 2-10% distributed based on their power”. Or possibly “if AIs reveal credible evidence of having long run preferences (that we didn’t try to instill), we’ll pay that AI 1% of the AI lab equity and then shut down until we can ensure AIs don’t have such preferences”.
It seems implausible to me that people will be willing to sign away most of the resources (or grant rights which would de facto do this), and there will be vast commercial incentives to avoid this. (Some people actually are scope sensitive.) So, this leads me to think that “we grant the AIs rights and then they end up owning most capital via wages” is implausible.
I think we probably disagree substantially on the difficulty of alignment and the relationship between “resources invested in alignment technology” and “what fraction aligned those AIs are” (by fraction aligned, I mean what fraction of resources they take as a cut).
That’s plausible. If you think that we can likely solve the problem of ensuring that our AIs stay perfectly obedient and aligned to our wishes perpetually, then you are indeed more optimistic than I am. Ironically, by virtue of my pessimism, I’m more happy to roll the dice and hasten the arrival of imperfect AI, because I don’t think it’s worth trying very hard and waiting a long time to try to come up with a perfect solution that likely doesn’t exist.
I also think that something like a basin of corrigibility is plausible and maybe important: if you have mostly aligned AIs, you can use such AIs to further improve alignment, potentially rapidly.
I mostly see corrigible AI as a short-term solution (although a lot depends on how you define this term). I thought the idea of a corrigible AI is that you’re trying to build something that isn’t itself independent and agentic, but will help you in your goals regardless. In this sense, GPT-4 is corrigible, because it’s not an independent entity that tries to pursue long-term goals, but it will try to help you.
But purely corrigible AIs seem pretty obviously uncompetitive with more agentic AIs in the long run, for almost any large-scale goal that you have in mind. Ideally, you eventually want to hire something that doesn’t require much oversight and operates relatively independently from you. It’s a bit like how, when hiring an employee, at first you want to teach them everything you can and monitor their work, but eventually, you want them to take charge and run things themselves as best they can, without much oversight.
And I’m not convinced you could use corrigible AIs to help you come up with the perfect solution to AI alignment, as I’m not convinced that something like that exists. So, ultimately I think we’re probably just going to deploy autonomous, slightly misaligned AI agents (and again, I’m pretty happy to do that, because I don’t think it would be catastrophic except maybe over the very long run).
I think various governments will find it unacceptable to construct massively powerful agents extremely quickly which aren’t under the control of their citizens or leaders.
I think people will justifiably freak out if AIs clearly have long run preferences and are powerful and this isn’t currently how people are thinking about the situation.
For what it’s worth, I’m not sure which part of my scenario you are referring to here, because these are both statements I agree with.
In fact, this consideration is a major part of my general aversion to pushing for an AI pause, because, as you say, governments will already be quite skeptical of quickly deploying massively powerful agents that we can’t fully control. By default, I think people will probably freak out and try to slow down advanced AI, even without any intervention from current effective altruists and rationalists. By contrast, I’m a lot more willing than the median person to roll out autonomous AI agents that we can’t fully control, simply because I see a lot of value in hastening the arrival of such agents (i.e., I don’t find that outcome as scary as most other people seem to imagine).
At the same time, I don’t think people will pause forever. I expect people to go more slowly than what I’d prefer, but I don’t expect people to pause AI for centuries either. And in due course, so long as at least some non-negligible misalignment “slips through the cracks”, then AIs will become more and more independent (both behaviorally and legally), their values will slowly drift, and humans will gradually lose control—not overnight, or all at once, but eventually.
I thought the idea of a corrigible AI is that you’re trying to build something that isn’t itself independent and agentic, but will help you in your goals regardless.
Hmm, no I mean something broader than this, something like “humans ultimately have control and will decide what happens”. In my usage of the word, I would count situations where humans instruct their AIs to go and acquire as much power as possible for them while protecting them and then later reflect and decide what to do with this power. So, in this scenario, the AI would be arbitrarily agentic and autonomous.
Corrigibility would be as opposed to humanity e.g. appointing a successor which doesn’t ultimately point back to some human-driven process.
I would count various indirect normativity schemes here, and indirect normativity feels continuous with other forms of oversight in my view (the main difference being oversight over very long time horizons, such that you can’t train the AI based on its behavior over that horizon).
I’m not sure if my usage of the term is fully standard, but I think it roughly matches how e.g. Paul Christiano uses the term.
For what it’s worth, I’m not sure which part of my scenario you are referring to here, because these are both statements I agree with.
I was arguing against:
This story makes sense to me because I think even imperfect AIs will be a great deal for humanity. In my story, the loss of control will be gradual enough that probably most people will tolerate it, given the massive near-term benefits of quick AI adoption. To the extent people don’t want things to change quickly, they can (and probably will) pass regulations to slow things down.
On the general point of “will people pause”, I agree people won’t pause forever, but under my views of alignment difficulty, 4 years of using extremely powerful AIs can go very, very far. (And you don’t necessarily need to ever build maximally competitive AI to do all the things people want (e.g. self-enhancement could suffice even if it was a constant factor less competitive), though I mostly just expect competitive alignment to be doable.)