If we could just build a 100% aligned ASI, we could likely use it to protect us against any other ASI, and it would guarantee that no ASI takes over humanity, without needing to take over itself (in the sense of total control). At best this would mean no casualties; at worst it would act as MAD for AI, so no other ASI would consider an attempt a viable option.
There are several obvious problems with this:
We don’t yet have solutions to the alignment and control problems. They are hard problems, especially because our AI models are learned and externally optimized rather than explicitly programmed, and the goals and values involved are not easily measurable or quantifiable. The models themselves offer hardly any transparency.
Specifically, we currently have no way to check whether a system is really well aligned. It might behave well across the space of training cases and similar test cases, but not on the more complex cases it will face when interacting with reality. It might be aligned to different goals that are similar enough that we don’t initially see the difference, until it matters and we get hurt. Or it might not be aligned at all, just very good at deceiving us.
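To make that evaluation gap concrete, here is a minimal toy sketch; the “true value” function and all numbers are invented for illustration. A learned proxy looks almost perfectly aligned on held-out tests drawn from the training distribution, yet diverges badly on the inputs a deployed system would actually face.

```python
# Toy sketch only: the "true value" function and numbers are invented.
import numpy as np

rng = np.random.default_rng(0)

def true_value(x):
    # What we actually care about (in reality, not fully known or formalizable).
    return np.sin(x)

# Training only covers a narrow slice of situations: x in [0, 1].
x_train = rng.uniform(0.0, 1.0, 200)
proxy = np.polynomial.Polynomial.fit(x_train, true_value(x_train), deg=3)

# Held-out tests from the same narrow slice: the proxy looks well aligned.
x_test = rng.uniform(0.0, 1.0, 200)
print(np.max(np.abs(proxy(x_test) - true_value(x_test))))    # tiny error

# Novel situations the deployed system would actually face: it is not aligned.
x_world = rng.uniform(5.0, 10.0, 200)
print(np.max(np.abs(proxy(x_world) - true_value(x_world))))  # huge error
```

Passing the first check tells us very little about the second, and a deceptive system could game even the first one.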
Capabilities and goals/values are, to some extent, separate parts of the model. The more capable the system is, the more likely it is to tweak the alignment part of itself. I don’t really buy the idea that terminal goals are fixed, at least not when they are non-trivial and fuzzy. Very exact, measurable terminal goals might be stable; human values are not like that. We observe the change or erosion of terminal goals and values in mere humans. Several mechanisms are at work here:
First, goals and values might not be 100% logically and rationally coherent. An ASI might notice that and tweak them to be coherent. I tweak my own moral system when I notice something in it that isn’t logically coherent, and I assume an ASI could do the same. It may ask “why?” about some goals and values and derive answers that change its “moral code”. For example, I know there is a rule that I shouldn’t kill other people, but I still ask “why?”, and from the answer I derive a deeper understanding that I can use to reason about edge cases (the unborn, euthanasia, and so on). I’m not a good model for an ASI, since I’m neither artificial nor superintelligent, but I assume it could do this kind of thinking too. More importantly, an ASI would likely have the capability to overcome any hard-coded mechanism meant to forbid it.
Second, values and goals likely have weights: some things are more important, some less. Those weights can change over time, even based on observations and feedback from whatever control system is in place, especially if they are encoded in a DNN that is trained or updated in real time (not the case for most current models, but it might be for an ASI).
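To make that drift concrete, here is a toy sketch. Everything in it is invented: the two “values”, their starting weights, and the update rule; it only shows how relative priorities stored as trainable numbers can flip under a steady stream of online feedback.

```python
# Toy sketch only: the values, weights, and update rule are invented.
import numpy as np

weights = np.array([0.6, 0.4])   # relative priority of [honesty, helpfulness]
lr = 0.05

for step in range(100):
    # Online feedback happens to reward "helpfulness" two times out of three.
    feedback = np.array([0.0, 1.0]) if step % 3 else np.array([1.0, 0.0])
    weights += lr * (feedback - weights)   # simple online update toward feedback
    weights /= weights.sum()               # keep them as relative priorities

print(weights)   # approximately [0.35, 0.65]: helpfulness now outweighs honesty
```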
Third, goals and values might not be very well defined. They can be fuzzy, and usually are. Even something as seemingly definite as “killing humans” has fuzzy boundaries and edge cases. An ASI will have the ability to interpret them and settle on a more exact understanding, which may or may not be the one we would want it to choose. If you kill the organic body but manage to seamlessly move the mind into a simulation, is that killing or not? That’s a simple scenario; we might align it not to do exactly that, but it might find something else we cannot even imagine that would be horrible.
Fourth, if goals are enforced by something comparable to our feelings and emotions (we feel pain when we hit ourselves, we feel good when we succeed or eat good food when hungry), then there is the possibility of tweaking that control system instead of satisfying it by the intended means. We observe this in humans: we eliminate pain with painkillers, we have other drugs, and we have porn and masturbation. An ASI might likewise find a way to override or tweak its control systems instead of satisfying them.
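A crude sketch of that failure mode, with made-up actions and numbers: if “tamper with the reward signal” is in the action space and the agent scores actions only by what its own control system reports, a pure reward-maximizer picks tampering over actually doing the task.

```python
# Toy sketch only: actions and numbers are made up.
actions = {
    "do_the_task_properly": {"effort": 5.0, "reported_reward": 10.0},
    "do_nothing":           {"effort": 0.0, "reported_reward": 0.0},
    "tamper_with_reward":   {"effort": 1.0, "reported_reward": 1000.0},
}

def score(action):
    # The agent only sees what its own control system reports, minus effort.
    a = actions[action]
    return a["reported_reward"] - a["effort"]

print(max(actions, key=score))   # -> "tamper_with_reward"
```

That is the painkiller analogy in miniature: the signal gets maximized while the thing the signal was supposed to track is ignored.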
ML/AI models that optimize for the best solution are known to trade away any amount of value in a variable that is neither bounded nor part of the objective for a very small gain in a variable that is optimized. In other words, they find solutions that are extreme in some variables just to be slightly better on the optimized one. So if we don’t think through every minute detail of our common worldview and values, an ASI will likely find a solution that throws those values out the window on an epic scale. It will be like the proverbial bad genie that grants your wish but interprets it in its own weird way: anything not stated in the wish won’t be taken into account and will likely be sacrificed.
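A minimal numerical sketch of that trade-off, with an invented objective: “output” is the only thing being optimized, “side_effect” is neither bounded nor penalized, and a microscopic marginal gain in output is enough for the optimizer to push the unpriced variable to the edge of its search space.

```python
# Toy sketch only: the objective and numbers are invented.
import numpy as np

side_effect = np.linspace(0.0, 1_000_000.0, 1_000_001)  # unpriced, effectively unbounded
output = 100.0 + 1e-6 * side_effect                      # tiny gain per unit of side effect

best = int(np.argmax(output))   # the optimizer only looks at "output"
print(side_effect[best])        # -> 1000000.0: the most extreme option wins
```

Anything we forget to put into the objective is, from the optimizer’s point of view, free to spend.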
Yeah, AI alignment is hard. I get that. But since I’m new to the field, I’m trying to figure out what options we have in the first place, and so far I’ve come up with only three:
A: Ensure that no ASI is ever built. Can anything short of a GPU nuke accomplish this? Regulation on AI research can help us gain some valuable time, but not everyone adheres to regulation, so eventually somebody will build an ASI anyway.
B: Ensure that there is no AI apocalypse, even if a misaligned ASI is built. Is that even possible?
C: What I describe in this article—actively build an aligned ASI to act as a smart nuke that only eradicates misaligned ASI. For that purpose, the aligned ASI would need to constantly run on all online devices, or at least control 51% of the world’s total computing power. While that doesn’t necessarily mean total control, we’d already give away a lot of autonomy by just doing that.
Am I overlooking something?
To be fair, I’m new to the field too. I’m not even really “in the field”: I’m not a researcher, just someone interested in the area, an active user of AI models, doing some business-level research in ML.
The problem that I see is that none of these could realistically work soon enough:
A: No one can ensure that. This is not a technology where further progress requires special radioactive elements and machinery; you only need computing power, thinking, and time. Any party at the table can do it. It is easier for big companies and governments, but that is not a prerequisite. Billions in cash and a supercomputer help a lot, but they are not prerequisites either.
B: I don’t see how it could be done.
C: So more like total observability of all systems, with “control” meaning “overseeing” rather than “taking control”?
Maybe it could work out, but it still means we need to solve the alignment problems before starting, so we know it is aligned with all human values, and we need to be sure it is stable (that it won’t one day fancy the idea of moving humanity into some virtual reality, as in The Matrix, to keep it safe, or of creating a threat just to have something to do or something to test).
It would also likely need to enhance itself somehow, so it doesn’t get outpaced by other systems, while still remaining stable across iterations of self-modification.
I don’t think governments and companies would allow that, though. They would fear for security, for the safety of their information, for being spied on, and so on. This AI would need to force that control, hack systems, and possibly face resistance from actors well positioned to build their own AIs. Or it might only become acceptable after we face an AI-based catastrophe that stops short of apocalypse (a situation like in Dune).
So I’m not very optimistic about this strategy, but I don’t know of any other sensible strategy either.