One way that things could go wrong, not addressed by this playbook: AI may differentially accelerate intellectual progress in the wrong direction, or in other words create opportunities for humanity to make serious mistakes (by accelerating technological progress) faster than it creates the wisdom to make the right choices (philosophical progress). Specific to the issue of misalignment: suppose we get aligned, human-level-ish AI, but it is significantly better at speeding up AI capabilities research than at the kinds of intellectual progress needed to continue minimizing misalignment risk, such as (next-generation) alignment research and coordination mechanisms between humans, human-AI teams, or AIs aligned to different humans.
I think this suggests the intervention of doing research aimed at improving the philosophical abilities of the AIs that we’ll build. (Aside from misalignment risk, it would help with many other AI-related x-risks that I won’t go into here, but which collectively outweigh misalignment risk in my mind.)
A partial counter-argument: it’s hard for me to argue about future AI, but we can look at current “human misalignment”—war, conflict, crime, etc. It seems to me that conflicts in today’s world do not arise because we haven’t progressed enough in philosophy since the Greeks. Rather, conflicts arise when various individuals and populations (justifiably or not) perceive that they are in zero-sum games for limited resources. The solution for this is not “philosophical progress” so much as being able to move out of the zero-sum setting by finding “win-win” resolutions for conflict, or growing the overall pie instead of arguing about how to split it.
(This is a partial counter-argument because I think you are not just talking about conflict, but also about other ways of making the wrong choices. For example, with global warming, humanity collectively makes the mistake of emphasizing short-term growth over long-term safety. However, I think this is related, and “growing the pie” would have alleviated this issue as well, enabling countries to give up some of the more harmful routes to short-term growth.)
I think many of today’s wars are at least as much about ideology (like nationalism, liberalism, communism, religion) as about limited resources. I note that Russia and Ukraine both have below-replacement birth rates and are rich in natural resources (more than enough to support their declining populations, with Russia at least being one of the biggest exporters of raw materials in the world).
I think this kind of pie-growing was part of the rationale for Europe expanding trade relations with Russia in the years before the Ukraine war (e.g. by building/allowing the Nord Stream pipelines), but it ended up not working. Apparently Putin was more interested in some notion of Russian greatness than in material comforts for his people.
Similarly, the US, China, and Taiwan are deeply enmeshed in positive-sum trade relationships that a war would destroy, which ought to make war unthinkable from your perspective, but the risk of war has actually increased (compared to 1980, say, when trade was much smaller). If China did end up invading Taiwan, I think we could assign much of the blame to valuing nationalism too highly (or caring too much about the “humiliation” of not having a unified nation), which seems like a kind of philosophical error to me.
(To be clear, I’m not saying that finding “win-win” resolutions for conflict or growing the overall pie aren’t good solutions or aren’t worth trying, just that wrong values/philosophies clearly play a big role in many of today’s big conflicts.)
I meant “resources” in a more general sense. A piece of land that you believe is rightfully yours is a resource. My own sense (coming from a region that is itself in a long-simmering conflict) is that “hurt people hurt people”: the more you feel threatened, the less likely you are to trust the other side.
While nationalism and religion of course play a huge role in that conflict, my sense is that people tend to be more extreme in both the less access they have to resources, education, and security about the future.
If someone cares a lot about a strictly zero-sum resource, like land, how do you convince them to ‘move out of the zero-sum setting by finding “win-win” resolutions’? Like what do you think Ukraine or its allies should have done to reduce the risk of war before Russia invaded? Or what should Taiwan or its allies do now?
Also to bring this thread back to the original topic, what kinds of interventions do you think your position suggests with regard to AI?
I definitely don’t have advice for other countries, and there are a lot of very hard problems in my own homeland. I think there could have been an alternate path in which Russia saw prosperity from opening up to the West, and then going to war or putting someone like Putin in power might have been less attractive. But indeed the “two countries with McDonald’s won’t fight each other” theory has been refuted. And as you allude to with China, while so far there hasn’t been war with Taiwan, it’s not as if economic prosperity is an ironclad guarantee of non-aggression.
Anyway, to go back to AI. It is a complex topic, but first and foremost, I think that with AI, as elsewhere, “sunshine is the best disinfectant”: having people research AI systems in the open, point out their failure modes, examine what is deployed, etc., is very important. The second thing is that I am not worried about AI “escaping” in the near future, so I think the focus should not be on restricting research, development, or training, but rather on regulating deployment. The exact form of the regulations is beyond a blog-post comment, and also not something I am an expert on.
The “sunshine” view might seem strange, since as a corollary it could lead to AI knowledge “leaking”. However, I do think that for the near future, most of the safety issues with AI would come not from individual hackers using weak systems, but from massive systems built by either very large companies or nation-states. It is hard to hold either of those accountable if AI is hidden behind an opaque wall.
I’m curious why you are “not worried in any near future about AI ‘escaping’.” At this juncture, it seems very hard to be confident that even fairly imminent AI systems will lack the capability to do any particular thing.
Even bracketing that concern, another reason to worry about training (not just deploying) AI systems is that they can be stolen (and/or, in an open-source case, freely used) by malicious actors. It’s possible that any given AI-enabled attack is offset by some AI-enabled defense, but that doesn’t seem safe to assume.
Re escaping: I think we need to be careful in defining “capabilities”. Even current AI systems are certainly able to give you commands that would leak their weights if you executed them on the server that contains them. Near-term systems might also become better at finding vulnerabilities. But that doesn’t mean they can or will spontaneously escape during training.
As I wrote in my “GPT as an intelligence forklift” post, 99.9% of training is spent running optimization of a simple loss function over tons of static data. There is no opportunity for the AI to act in this setting, nor does this stage even train for any kind of agency.
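To make the “no opportunity to act” point concrete, here is a minimal sketch of the kind of loop being described (illustrative PyTorch-style code, not any lab’s actual training recipe):

```python
# Minimal sketch of a pretraining loop: the model only ever maps static text
# to a next-token loss; nothing it outputs is executed or sent anywhere.
import torch.nn.functional as F

def pretrain(model, dataloader, optimizer):
    for batch in dataloader:                 # batch["input_ids"]: tokenized static text
        logits = model(batch["input_ids"])   # (batch, seq_len, vocab)
        loss = F.cross_entropy(
            logits[:, :-1].reshape(-1, logits.size(-1)),  # predictions for positions 0..T-2
            batch["input_ids"][:, 1:].reshape(-1),        # next-token targets
        )
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        # No tool calls, no network access, no actions: just a loss over data.
```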
There is often a second phase, which can involve building an agent on top of the “forklift”. But this phase still doesn’t involve much interaction with the outside world, and even if it did, just by information bounds, the number of bits exchanged in this interaction should be much less than what’s needed to encode the model. (Generally, the number of parameters of a model is comparable to the number of inferences done during pretraining, and completely dominates the number of inferences done in fine-tuning / RLHF / etc., and certainly any steps that involve human interaction.)
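To put rough numbers on the information-bound claim (the figures below are made-up but plausible, just to illustrate orders of magnitude):

```python
# Back-of-envelope: bits needed to specify a model's weights vs. bits plausibly
# exchanged with the outside world during fine-tuning. Illustrative assumptions only.

PARAMS = 70e9                # assume a 70B-parameter model
BITS_PER_PARAM = 16          # assume 16-bit weights
bits_to_encode_model = PARAMS * BITS_PER_PARAM        # ~1.1e12 bits

# Assume RLHF-style fine-tuning with 1M human preference judgments;
# a binary comparison conveys at most ~1 bit of information.
HUMAN_COMPARISONS = 1e6
bits_from_interaction = HUMAN_COMPARISONS * 1

print(f"bits to encode weights:      {bits_to_encode_model:.1e}")
print(f"bits from human interaction: {bits_from_interaction:.1e}")
print(f"ratio: about {bits_to_encode_model / bits_from_interaction:.0e}x")
```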
Then there are the information-security aspects. You could (and at some point probably should) regulate cyber-security practices during the training phase. After all, if we do want to regulate deployment, then we need to ensure there are three separated phases, (1) training, (2) testing, and (3) deployment, and we don’t want “accidental deployment” where we jump from phase (1) to (3). Maybe at some point there will be something like Intel SGX for GPUs?
Whether AI helps the defender or the attacker more in the cyber-security setting is an open question. But it definitely helps the side that has access to stronger AIs.
In any case, one good thing about focusing regulation on the cyber-security aspects is that we have decades of experience in the fields of software security and cyber-security; while that experience is not perfect, regulations in this area are likely to be much more informed and effective.
On your last three paragraphs, I agree! I think the idea of security requirements for AI labs as systems become more capable is really important.
I think good security is difficult enough (and inconvenient enough) that we shouldn’t expect this sort of thing to happen smoothly or by default. I think we should assume there will be AIs kept under security that has plenty of holes, some of which may be easier for AIs to find (and exploit) than for humans.
I don’t find the points about pretraining compute vs. “agent” compute very compelling, naively. One possibility that seems pretty live to me is that the pretraining is giving the model a strong understanding of all kinds of things about the world—for example, understanding in a lot of detail what someone would do to find vulnerabilities and overcome obstacles if they had a particular goal. So then if you put some scaffolding on at the end to orient the AI toward a goal, you might have a very capable agent quite quickly, without needing vast quantities of training specifically “as an agent.” To give a simple concrete example that I admittedly don’t have a strong understanding of, Voyager seems pretty competent at a task that it didn’t have vast amounts of task-specific training for.
I actually agree! As I wrote in my post, “GPT is not an agent, [but] it can ‘play one on TV’ if asked to do so in its prompt.” So yes, you wouldn’t need a lot of scaffolding to adapt a goal-less pretrained model (what I call an “intelligence forklift”) into an agent that does very sophisticated things.
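For concreteness, here is roughly what that kind of scaffolding amounts to. This is a hedged sketch: `query_model` and `run_tool` are hypothetical stand-ins, not any particular vendor’s API.

```python
# Minimal sketch of "scaffolding" that turns a goal-less pretrained model into
# a goal-directed agent. The goal lives entirely in the prompt and the loop.

def query_model(prompt: str) -> str:
    """Stand-in for a call to the pretrained model; returns a text completion."""
    raise NotImplementedError

def run_tool(action: str) -> str:
    """Stand-in for executing a proposed action (shell command, API call, ...)."""
    raise NotImplementedError

def simple_agent(goal: str, max_steps: int = 10) -> str:
    history = f"Goal: {goal}\n"
    for _ in range(max_steps):
        # The model itself has no goal; it just continues the text it is given.
        action = query_model(history + "Next action (or FINISH):")
        if action.strip() == "FINISH":
            break
        observation = run_tool(action)
        history += f"Action: {action}\nObservation: {observation}\n"
    return history
```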
However, this separation into two components (the super-intelligent but goal-less “brain”, and the simple “will” that turns it into an agent) can have safety implications. For starters, as long as you haven’t added any scaffolding, you are still OK. So during most of the time you spend training, you don’t need to worry about the system itself developing goals. (Though you could still worry about hackers.) Once you start adapting it, then you need to start worrying about this.
The other thing is that, as I wrote there, it does change some of the safety picture. The traditional view of a super-intelligent AI is of the “brains” and agency tightly coupled together, just as they are in a human. For example, if a human is super-good at finding vulnerabilities and breaking into systems, they also have the capability to help fix systems, but I can’t just take their brain and fine-tune it on that task; I have to convince them to do it.
However, things change if we don’t think of the agent’s “brain” as belonging to them, but rather as some resource that they are using (just like when I use a forklift to lift something heavy). In particular, it means that capabilities and intentions might not be tightly coupled—there could be agents using capabilities to do very bad things, but the same capabilities could be used by other agents to do good things.
I agree with these points! But:
Getting the capabilities to be used by other agents to do good things could still be tricky and/or risky, when the reinforcement signal is vulnerable to deception and manipulation.
I still don’t think this adds up to a case for being confident that there aren’t going to be “escapes” anytime soon.
Note that not all capabilities/tasks correspond to trying to maximize a subjective human response. If you are talking about finding software vulnerabilities or designing some system, there may well be objective measures of success. In such a case, you can fine-tune a system to maximize these measures and so extract capabilities without the issue of deception/manipulation.
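As an illustration, an objective reward for, say, a bug-fixing task can be computed by actually running the artifact rather than asking a person to grade it. In the sketch below, `apply_patch` is a hypothetical stub; the point is only the shape of the signal.

```python
# Sketch of an "objective" reward for fine-tuning, as opposed to a subjective
# human-approval signal: the reward comes from running tests, not from a rater.
import subprocess

def apply_patch(candidate_patch: str, repo_dir: str) -> None:
    """Hypothetical stub: write the model's proposed patch into the repository."""
    raise NotImplementedError

def objective_reward(candidate_patch: str, repo_dir: str) -> float:
    """Reward = 1.0 if the project's test suite passes after the patch, else 0.0."""
    apply_patch(candidate_patch, repo_dir)
    result = subprocess.run(
        ["pytest", "-q"], cwd=repo_dir, capture_output=True, text=True, timeout=600
    )
    return 1.0 if result.returncode == 0 else 0.0

# A subjective alternative would look like `reward = human_rater_score(patch)`,
# which is exactly the kind of signal that persuasion or deception can game.
```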
Regarding “escapes”, the traditional fear was that because AI is essentially code, it can spread and escape more easily. But I think that in some sense modern AI has a physical footprint that is more significant than a human’s. Think of trying to get superhuman scientific capabilities by doing something like simulating a collection of 1,000 scientists using a model with 100T or so parameters. Even if you already have the pre-trained weights, just running the model requires highly non-trivial computing infrastructure. (Which may be possible to track and detect.) So it might be easier for a human to escape a prison and live undetected than for a superhuman AI to “escape”.
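Rough arithmetic for that footprint, using the 100T-parameter figure above and assumed (illustrative) hardware numbers:

```python
# Back-of-envelope for the "physical footprint" point. Hardware figures are
# illustrative assumptions, not a claim about any specific system.

PARAMS = 100e12              # 100T parameters, the figure used above
BYTES_PER_PARAM = 2          # assume 16-bit weights
ACCELERATOR_MEMORY = 80e9    # assume 80 GB of memory per accelerator

weight_bytes = PARAMS * BYTES_PER_PARAM                  # 200 TB of weights alone
accelerators_for_weights = weight_bytes / ACCELERATOR_MEMORY

print(f"weights alone: {weight_bytes / 1e12:.0f} TB")
print(f"accelerators needed just to hold the weights: about {accelerators_for_weights:.0f}")
# Activations, KV caches, redundancy, and interconnect push the real requirement
# well beyond this, which is why such a model is hard to run (or hide) casually.
```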
I think training exclusively on objective measures has a couple of other issues:
For sufficiently open-ended training, objective performance metrics could incentivize manipulating and deceiving humans to accomplish the objective. A simple example would be training an AI to make money, which might incentivize illegal/unethical behavior.
For less open-ended training, I basically just think you can only get so much done this way, and people will want to use fuzzier “approval” measures to get help from AIs with fuzzier goals (this seems to be how things are now with LLMs).
I think your point about the footprint is a good one and means we could potentially be very well-placed to track “escaped” AIs if a big effort were put in to do so. But I don’t see signs of that effort today and don’t feel at all confident that it will happen in time to stop an “escape.”
The “Cooperative AI” bet is along these lines: can we accelerate AI systems that can help humanity with our global cooperation problems (be it through improving human-human cooperation, community-level rationality/wisdom, or AI-diplomat-to-AI-diplomat cooperation)? https://www.cooperativeai.com/
I agree that this is a major concern. I touched on some related issues in this piece.
This post focused on misalignment because I think readers of this forum tend to be heavily focused on misalignment, and in this piece I wanted to talk about what a playbook might look like assuming that focus (I have pushed back on this as the exclusive focus elsewhere).
I think somewhat adapted versions of the four categories of intervention I listed could be useful for the issue you raise, as well.