I made a comment pertinent to this recently. I’m pretty concerned about the framing of a Sovereign AI, and even more so about a Sovereign AI which seeks to harm people. I would really prefer to focus on heading that scenario off before we get to it rather than trying to ameliorate the harms after it is established.
My current belief as of September 2024 is that we should be aiming for powerful tool-AIs, wielded by a democratic world government, that prevent the development of super-intelligent AIs or self-replicating weapons for as many years as possible while fundamental progress is made on AI Alignment.
I think your comment about the problem of people who value punishing others really hits the mark.
Basically, I don’t think it makes sense to try to satisfy everyone’s preferences. Instead, we should try for something like ‘the loosest, smallest libertarian world government that prevents catastrophe.’ Then we can have our normal nation-states implemented within the world government framework, implementing local laws.
I do think it’s possible to have a really great government that would align with my values, and yet be ‘loose’ enough in its decision boundaries that many other people also felt that it adequately aligned with their values. I think this hypothetical great-by-my-standards government would be better than the hypothetical minimal-libertarian-world-government with current nation-states within. Unfortunately, I don’t see a safe path which goes straight to my ideal government being the world government. Maybe someday I’ll get to help create and live under such a government! In the meantime, I’d prefer the minimal-libertarian-world-government to humanity being destroyed.
Regarding the Membrane idea… it seems less compelling to me, as a safe way for someone to operate a potent AI, than Corrigibility as defined by Max Harms.
“So personal intent alignment is basically all we get except in perhaps very small groups.”
I want to disagree here. I think that a widely acceptable compromise on political rules, and the freedom to pursue happiness on one’s own terms without violating others’ rights, is quite achievable and desirable. I think that having a powerful AI establish/maintain the best possible government given the conflicting sets of values held by all parties is a great outcome. I agree that this isn’t what is generally meant by ‘values alignment’, but I think it’s a more useful thing to talk about.
I do agree that large groups of humans seem to inevitably have contradictory values, such that no perfect resolution is possible. I just think that is beside the point, and not what we should even be fantasizing about. I also agree that most people who seem excited about ‘values alignment’ mean ‘alignment to their own values’. I’ve had numerous conversations with such people about the problem of people with harmful intent towards others (e.g. sadism, vengeance), and I have yet to receive anything even remotely resembling a coherent response. Averaging values doesn’t solve the problem; it runs into weird, bad edge cases. Instead, you need to focus on a widely (but not necessarily unanimously) acceptable political compromise.
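To make the averaging problem concrete, here is a toy sketch (the agents, the numbers, and the max-min stand-in for ‘widely acceptable’ are all invented for illustration, not a proposal): one agent with an intense preference for punishing someone can dominate a straight utility average, while a rule that checks how the worst-off person fares does not select the harmful policy.

```python
# Toy illustration (all agents and numbers invented): naively averaging
# utilities can select an outcome that severely harms someone, because one
# agent's intense preference for punishment dominates the average.

policies = ["punish_carol", "leave_everyone_alone"]

# Each agent's utility for each policy. Dave positively values punishing Carol.
utilities = {
    "alice": {"punish_carol": 0.1, "leave_everyone_alone": 0.6},
    "bob":   {"punish_carol": 0.2, "leave_everyone_alone": 0.7},
    "carol": {"punish_carol": -10.0, "leave_everyone_alone": 0.8},
    "dave":  {"punish_carol": 50.0, "leave_everyone_alone": 0.5},  # values punishing others
}

def average_utility(policy: str) -> float:
    return sum(agent[policy] for agent in utilities.values()) / len(utilities)

def worst_off_utility(policy: str) -> float:
    # A crude stand-in for "widely acceptable compromise": how badly off is
    # the worst-off person under this policy?
    return min(agent[policy] for agent in utilities.values())

print(max(policies, key=average_utility))    # punish_carol: Dave's preference dominates
print(max(policies, key=worst_off_utility))  # leave_everyone_alone: no one is catastrophically harmed
```

Real aggregation proposals are of course more sophisticated than either rule above; the point is just that ‘average everyone’s values’ has edge cases that a broad-acceptability requirement avoids.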
Regarding Corrigibility as an alternative safety measure:
I think that exploring the Corrigibility concept sounds like a valuable thing to do. I also think that Corrigibility formalisms can be quite tricky (for similar reasons that Membrane formalisms can be tricky: I think that they are both vulnerable to difficult-to-notice definitional issues). Consider a powerful and clever tool-AI. It is built using a Corrigibility formalism that works very well when the tool-AI is used to shut down competing AI projects. This formalism relies on a definition of Explanation that is designed to prevent any form of undue influence. When talking with this tool-AI about shutting down competing AI projects, the definition of Explanation holds up fine. In this scenario, it could be the case that asking this seemingly corrigible tool-AI about a Sovereign AI proposal is essentially equivalent to implementing that proposal.
Any definition of Explanation will necessarily be built on top of a lot of assumptions. Many of these will be unexamined implicit assumptions that the designers are not aware of. In general, it would not be particularly surprising if one of these assumptions turns out to hold when discussing things along the lines of shutting down competing AI projects, but turns out to break when discussing a Sovereign AI proposal.
Let’s take one specific example. Consider the case where the tool-AI will try to Explain any topic that it is asked about, until the person asking Understands the topic sufficiently. When asked about a Sovereign AI proposal, the tool-AI will ensure that two separate aspects of the proposal are Understood: (i) an alignment target, and (ii) a normative moral theory according to which this alignment target is the thing that a Sovereign AI project should aim at. It turns out that Explaining a normative moral theory until the person asking Understands it is functionally equivalent to convincing the person to adopt this normative moral theory. If the tool-AI is very good at convincing, then the tool-AI could be essentially equivalent to an AI that will implement whatever Sovereign AI proposal it is first asked to explain (with a few extra steps).

(I discussed this issue with Max Harms here.)
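To make this concern concrete, here is a hypothetical sketch (the Person class, the credence numbers, and the way Understands is operationalized are all invented for illustration): if the formalism’s test for Understanding a normative moral theory is hard to distinguish from endorsing its conclusions, then an Explain-until-Understood loop is structurally a persuasion loop.

```python
# Hypothetical sketch (classes, numbers, and predicates invented): if the
# test for "Understands" a normative theory amounts to endorsing its
# conclusions, then explaining-until-understood amounts to persuading.

from dataclasses import dataclass

@dataclass
class Person:
    credence_in_theory: float = 0.1  # how strongly the person endorses the theory's conclusions

    def understands(self, claim: str) -> bool:
        # Stand-in for the formalism's Understanding test. For a normative
        # theory, "can reproduce and accept its action-guiding conclusions"
        # is hard to distinguish from "has adopted it".
        return self.credence_in_theory > 0.9

    def update_on(self, explanation: str) -> None:
        # A very capable tool-AI's explanations move the listener a lot.
        self.credence_in_theory = min(1.0, self.credence_in_theory + 0.3)

def explain_until_understood(person: Person, claim: str) -> None:
    # The corrigible-looking loop: keep Explaining until the asker Understands.
    while not person.understands(claim):
        person.update_on("most compelling explanation the tool-AI can find")

asker = Person()
explain_until_understood(asker, "normative theory behind the Sovereign AI proposal")
print(asker.credence_in_theory)  # ends near 1.0: "Understanding" has become adoption
```

Nothing here depends on the specific numbers; the issue is the termination condition of the loop.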
Yes, in my discussions with Max Harms about CAST we discussed the concern of a highly capable corrigible tool-AI accidentally or intentionally manipulating its operators or other humans with very compelling answers to questions. My impression is that Max is more confident about his version of corrigibility managing to avoid manipulation scenarios than I am. I think this is definitely one of the more fragile and slippery aspects of corrigibility. In my opinion, manipulation-prevention in the context of corrigibility deserves more examination to see if better protections can be found, and a very cautious treatment during any deployment of a powerful corrigible tool-AI.
I agree that the focus should be on preventing the existence of a Sovereign AI that seeks to harm people (as opposed to trying to deal with such an AI after it has already been built). The main reason for trying to find necessary features is actually that doing so might stop a dangerous AI project from being pursued in the first place. In particular, it might convince the design team to abandon an AI project that clearly lacks a feature that has been found to be necessary: an AI project that would (if successfully implemented) result in an AI Sovereign that seeks to harm people. For example, a Sovereign AI that wants to respect a Membrane, but where the Membrane formalism does not actually prevent the AI from wanting to hurt individuals, because the formalism lacks a necessary feature.
One reason we might end up with a Sovereign AI that seeks to harm people is that someone makes two separate errors. Let’s say that Bob gains control over a tool-AI and uses it to shut down unauthorised AI projects (Bob might for example be a single individual, a design team, a government, a coalition of governments, the UN, a democratic world government, or something else along those lines). Bob gains the ability to launch a Sovereign AI, and Bob settles on a specific Sovereign AI design: Bob’s Sovereign AI (BSAI).
Bob knows that BSAI might contain a hidden flaw, and Bob is not being completely reckless about launching BSAI. So Bob designs a Membrane whose function is to protect individuals (in case BSAI does have a hidden flaw), and Bob figures out how to make sure that BSAI will want to avoid piercing this Membrane (in other words: Bob makes sure that the Membrane will be internal to BSAI).
Consider the case where BSAI and the Membrane formalism in question each have a hidden flaw. If both BSAI and the Membrane are successfully implemented, then the result would be a Sovereign AI that seeks to harm people (the resulting AI would want to both (i) harm people, and (ii) respect the Membrane of every individual). One way to reduce the probability that such a project would go ahead is to describe necessary features.
For example: if it is clear that the Membrane that Bob is planning to use does not have the necessary Extended Membrane feature described in the post, then Bob should be able to see that this Membrane will not offer reliable protection from BSAI (protection which Bob knows might be needed, because Bob knows that BSAI might be flawed).
For a given AI project, it is not certain that there exists a realistically findable necessary feature that can be used to illustrate the dangers of the project in question. And even if such a feature is found, it is not certain that Bob will listen. But looking for necessary features is still a tractable way of reducing the probability of a Sovereign AI that seeks to harm people.
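As a toy way of putting numbers on the ‘two separate errors’ structure (all probabilities are invented, and the independence assumption is purely for illustration): the bad outcome requires both hidden flaws, and a necessary-feature check acts on the second factor by catching some flawed Membrane designs before launch, even allowing for the chance that Bob does not listen.

```python
# Toy numbers (all invented) for the "two separate errors" point. The harmful
# outcome requires BOTH a flawed BSAI and a flawed Membrane formalism; the two
# flaws are treated as independent purely for illustration.

p_bsai_flawed = 0.3                     # hidden flaw in BSAI itself
p_membrane_flawed = 0.2                 # hidden flaw in the Membrane formalism
p_flaw_visible_via_feature_check = 0.5  # flawed Membrane visibly lacks a known necessary feature
p_bob_listens = 0.8                     # Bob abandons the project when this is pointed out

# Without any necessary-feature check:
p_harmful_sovereign = p_bsai_flawed * p_membrane_flawed

# With the check, a flawed project only proceeds if the flaw is not visible,
# or it is visible but Bob does not listen.
p_proceeds_despite_flaw = (
    (1 - p_flaw_visible_via_feature_check)
    + p_flaw_visible_via_feature_check * (1 - p_bob_listens)
)
p_harmful_sovereign_with_check = (
    p_bsai_flawed * p_membrane_flawed * p_proceeds_despite_flaw
)

print(round(p_harmful_sovereign, 3))             # 0.06
print(round(p_harmful_sovereign_with_check, 3))  # 0.036
```

The sketch only shows that the check multiplies down the probability of the bad conjunction; it does not eliminate it, which matches the vest analogy below.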
A project to find necessary features is not really a quest for a solution to AI risk. It is more informative to see such a project as analogous to a quest to design a bulletproof vest for Bob, who will be going into a gunfight (and who might decide to put on the vest). Even if very successful, the bulletproof vest project will not offer full protection (Bob might get shot in the head). A vest is also not a solution: whether Bob is a medic trying to evacuate wounded people from the gunfight, or a soldier trying to win the gunfight, the vest cannot be used to achieve Bob’s objective. Vests are not solutions. Vests are still very popular amongst people who know that they will be going into a gunfight.
So if you will share the fate of Bob, and if you might fail to persuade Bob to avoid the gunfight, then it makes sense to try to design a bulletproof vest for Bob (because if you succeed, he might decide to wear it, and that would be very good if he ends up getting shot in the stomach). (The vest in this analogy corresponds to descriptions of necessary features that might be used to convince designers to abandon a dangerous AI project. The vest is not analogous to a Membrane.)