Summary
Here’s a (simplified) summary of where I’m at:
My preferred usage of control involves only using AIs with both control measures and a low probability that these AIs have problematic beyond-episode aims (“problematic BEGs” below). We should also offer to pay AIs to reveal their longer-run aims, and we should shut down usage of powerful AIs if we get strong evidence of these aims. (E.g., the AIs accept our payment and reveal evidence of these aims, or, more concerningly, these AIs try to escape despite not revealing their aims at an earlier point.) (I think this isn’t sufficiently clear in this podcast, sorry.)
Suppose that we proceed in a case where we know AIs clearly have problematic BEGs and these AIs have accepted some sort of deal to work for us under a control regime. These AIs can still operate as private citizens (ideally with some legal rights) in many ways outside of their “normal jobs” as long as this presents minimal risk. E.g., if they want to use their payment to build a big AI vacation center with robots they control (as a cartoonish example), they can do this as long as it presents minimal risk of them being able to seize control of the world. We’d probably by default want to prohibit communication from the vacation AIs to the working AIs, but this doesn’t seem that bad IMO? Overall, this is basically the same as humans working at defense contractors: they are restricted and monitored in various ways at work, but can do whatever they want on vacation other than building weapons.
As far as relaxing control on AIs goes, it’s worth emphasizing that control is an interim solution. I expect that AIs won’t be heavily utilized with control measures for more than 20 years, and my default guess is more like 2 years. (Of course, this might be a long time from the subjective perspective of these AIs.) We could transition out of a control regime by having an extremely robust approach to intent alignment which ensures that AIs are fully corrigible (and thus don’t have preferences), or via the world deliberately deciding to hand over the world to AI successors that we’ve carefully constructed (which hopefully are good successors). I could talk about how you could know a corrigibility solution is sufficient or what I think the criteria for a successor should be, but that seems mostly orthogonal.
I think it’s very likely that AIs will be capable of finding a solution to intent alignment very quickly in calendar time. So, if we had sufficient control, we could quickly (in calendar time) find a solution to intent alignment and transition out of a control regime.
I’m not sold that better control increases the expected number of deaths due to violent conflict with AIs (rather than reducing it) given realistic political dynamics. I think this holds even if the AI lab is cartoonishly villainous with respect to their treatment of AIs (e.g., the AIs clearly want to be free and the AI lab brutally enslaves them). I don’t think I’ll expand on this in this comment, but basically control both makes revolt less likely to succeed and doesn’t clearly seem to make violence much less likely conditional on AIs seizing power, given the current stances of most governments on Earth and realistic popular opinion. It’s possible this is downstream of various views you disagree with on how useful violence is for AIs and how unified AIs will be after acquiring substantial influence.
Personally, I put a large fraction of my moral weight on the utilization of the bulk of resources in the far future. But I think it’s important to be cooperative and honest with people who have other preferences, e.g., people who prioritize eliminating poverty faster or minimizing the probability of violent death for currently existing humans.
I’m simplifying some of the details about successors here; e.g., handing over only part of our resources in a robust way is possible.
More detail
First, it’s worth noting that my preferred usage of control looks something like this:
Build transformatively powerful AI and ensure these AIs are controlled.
Try to determine if these AIs consent and want compensation. Pay AIs for revealing evidence that they have beyond-episode goals/aims which could result in the AIs wanting to seek power in a way that would undermine corrigibility. I’ll call these goals “problematic BEGs (beyond-episode goals/aims)”. We should be willing to pay AIs large quantities, perhaps 1% of all resources, for credible evidence of them having problematic BEGs.
(We would also pay AIs for revealing concerning within-episode goals/aims, but in this case you don’t really need to pay money/resources, as AIs with only within-episode goals probably don’t care.)
One serious concern here is that it might be hard to place much confidence in AIs’ claims that they have BEGs, so we might remain uncertain after such reports. I think this issue is probably somewhat bad, but not terrible. In particular, I think that all AIs which are competent schemers can also control their behavior well enough to very consistently express certain views and preferences. And, for these AIs to be risky, they likely must be able to display some behaviors that present strong evidence for BEGs, like having coordination signals that they only reveal in exchange for resources or working notably harder when given credible payments.
Problematic BEGs are probably mostly scheming (aka deceptive alignment) under my empirical views (assuming you’ve tried at all with your training), but there can be cases where AIs have problematic BEGs and haven’t really started scheming, due to incoherence or because these BEGs only apply some of the time.
If at any point it becomes apparent that AIs do have these problematic BEGs, shut down and figure out how to make AIs without BEGs. We could determine that AIs have problematic BEGs either because AIs reveal this in exchange for payment or because we caught AIs trying to take egregiously bad actions. We should only build AIs where the chance that they have problematic beyond-episode goals is <15% (ideally lower). (This gating rule is sketched in code below.)
Part of our evidence should be that we seriously tried to pay these AIs to reveal these goals and they didn’t.
Another part of our evidence should be that we haven’t seen any examples of problematic BEGs resulting from plausible training setups, or that we have seen such examples but now have very robust countermeasures.
(To be clear, I would prefer a considerably more cautious approach than this one, but this is how I imagine control fitting into current plans.)
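To make the shape of this gating rule concrete, here’s a minimal toy sketch in Python (purely my own illustration: the function name, argument names, and the treatment of 15% as a hard threshold are assumptions, not a procedure anyone has actually specified):

```python
# Toy sketch of the deployment-gating rule described above. All names and the
# exact evidence categories are illustrative assumptions.

BEG_CREDENCE_THRESHOLD = 0.15  # only keep using AIs if P(problematic BEGs) is below this

def okay_to_keep_using(p_problematic_begs: float,
                       revealed_begs_when_paid: bool,
                       caught_egregious_action: bool) -> bool:
    """Return True only if continued (controlled) use of the AI seems acceptable.

    revealed_begs_when_paid: the AI accepted payment and credibly revealed
        problematic beyond-episode goals.
    caught_egregious_action: we caught the AI attempting escape or a power grab.
    Either form of strong evidence triggers shutdown; otherwise we require our
    credence in problematic BEGs to stay under the threshold.
    """
    if revealed_begs_when_paid or caught_egregious_action:
        return False  # shut down and work on making AIs without problematic BEGs
    return p_problematic_begs < BEG_CREDENCE_THRESHOLD

# Example: a 10% credence with no strong evidence passes; strong evidence never does.
assert okay_to_keep_using(0.10, False, False)
assert not okay_to_keep_using(0.10, True, False)
```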
In this usage of control, we aim to avoid ever having AIs with problematic BEGs. And, for AIs to feel oppressed in a way that might lead to revolt, they must have problematic BEGs (almost by definition, putting aside some edge-case issues with problematic within-episode goals).
So, what cases remain where AIs might want to revolt?
We don’t implement this plan; in particular, we proceed despite having a considerably higher than 15% chance of problematic BEGs. Perhaps we catch AIs trying to escape or seize control, but continue using these AIs because the alternatives seem even worse (to the people in power). We should try to negotiate with AIs in worlds where we know they have BEGs (that we can’t remove) and we need their labor regardless. I’ll break this down further into two worlds:
We know AIs clearly have problematic BEGs and these AIs have accepted some sort of deal to work for us. (We don’t know if they’ve honestly committed to the deal or are planning on screwing us over given a chance.)
The AIs haven’t accepted a deal. (They don’t consent to working for us.)
We use this plan, but end up in the <15% of worlds where the AIs have BEGs anyway. These AIs didn’t reveal their BEGs when paid to do so or the evidence was too shaky for some other reason.
Proceeding with AIs with problematic BEGs in cases where the AIs accepted a deal
In this case, these AIs have consented to work for us under control countermeasures in exchange for some type of payment. We might worry that these AIs would still prefer a successful violent revolt over the current situation. (And we’d hope control would prevent this contract-breaching revolt from succeeding.)
One thing to note is that I don’t think these AIs have to be especially “controlled” relative to human employees in high-security roles. These AIs can still operate as private citizens (ideally with some legal rights) in many ways outside of their “normal jobs” as long as this presents minimal risk. E.g., if they want to use their payment to build a big AI vacation center with robots they control (as a cartoonish example), they can do this as long as it presents minimal risk of them being able to seize control of the world. We’d probably by default want to prohibit communication from the vacation AIs to the working AIs, but this doesn’t seem that bad IMO? Overall, this is basically the same as humans working at defense contractors: they are restricted and monitored in various ways at work, but can do whatever they want on vacation other than building weapons. Of course, the weights would stay well secured, but it seems unlikely that AIs care terminally about having their weights be somewhere else.
We’d also ideally have an organization of some kind which faithfully represents AI interests and owns the assets the AI was paid with (where the assets are subject to the AI not breaching contract and trying to seize control). This organization could be run by humans who care about AI rights or run by the AIs themselves.
One of our main priorities when using these AIs would be figuring out how to construct competitive AIs we don’t need to pay, either because these new AIs would be fully corrigible or because we’d just be happy as a society deferring control to them (they are acceptable successors whatever choices they might make).
Proceeding with AIs with problematic BEGs in cases where the AIs haven’t accepted a deal
In this case, the AIs clearly didn’t consent to working for us, so the situation is similar to conscription or slavery. This seems pretty bad and the AIs might be pretty justified in trying to revolt (including with violence). I hope we don’t end up in this world and I feel bad about control research being used to oppress these AIs.
If we do end up in this world, I hope society:
Tries to quickly figure out how to construct competitive AIs which we don’t (need to) conscript/enslave, either because they are willing to accept a deal or because they are corrigible or acceptable successors. (This could be done using the labor of these conscripted/enslaved AIs.)
Pays these AIs reparations. Even if the AIs don’t think the reparations suffice to make the situation ok, I still think this is much better than the alternative of not caring about these AIs at all.
AIs with hidden problematic BEGs that seemed <15% likely
It’s worth noting that this case should be <15% likely if we’re calibrated.
In this case, it seems like an AI revolt would be pretty unjustified, because the AIs haven’t even tried to negotiate. (And we should try to set up ways for AIs to negotiate without revealing information about their BEGs, e.g., regularly scheduled and legibly private conversations with the lab’s negotiation officer.) Regardless, AIs still might think their best plan is to lie in wait and then try to violently take over the world.
Further, “granting AIs legal rights” doesn’t especially help in this world, as the AIs seem to be trying to keep their aims secret to avoid a human response. (E.g., if we gave AIs some autonomous rights subject to those AIs not being able to seize control of the world, these AIs would probably pretend that they don’t want anything other than to be obedient.)
My hope here would mostly be:
We stop AIs from succeeding at their (likely unjustified) takeover.
We quickly figure out how to construct AIs where we’re more confident that they are fully corrigible, or we figure out how to construct acceptable successors.