How to Train Your Shoggoth, Part 2
The Historical Teams Framework for Alignment Research
Eliezer Yudkowsky has famously advocated for embracing “security mindset” when thinking about AI safety. This is, roughly, a mindset in which you ask how things could be made to go wrong, and how to prevent that, rather than only asking how to make things go right. This seems obviously correct to me, so for the purposes of this post I’ll just take it as a given.
But I think there’s a piece missing from the AI Safety community’s understanding of security mindset, one that is a key part of how computer security researchers and practitioners actually operate. This is the concept of the “threat model”, or “Mossad vs. not-Mossad”. In computer security, it’s important to specify the class of adversary you are trying to keep out, because different adversaries have different capabilities. If you are trying to keep out a nation-state adversary, you need to worry about things like zero-day exploits, and you need to worry about the physical security of your servers. If you are trying to keep out a script kiddie, you don’t need to worry about either of those. And if you are trying to keep out a random person on the internet, you also don’t need to worry about them breaking into your office and stealing your hard drives.
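To make “name your adversary class” concrete, here is a minimal sketch of what an explicit threat-model declaration might look like. It is in Python purely for illustration; the adversary names, capability tags, and coverage check are all hypothetical assumptions of mine, not an established taxonomy.

```python
# Hypothetical sketch: declare which adversary class you mean, and check
# which of its capabilities your defenses actually address.
from dataclasses import dataclass


@dataclass(frozen=True)
class Adversary:
    name: str
    capabilities: frozenset  # what this adversary can plausibly do


@dataclass(frozen=True)
class Defense:
    name: str
    counters: frozenset  # which adversary capabilities it addresses


def uncovered(adversary: Adversary, defenses: list) -> set:
    """Return the adversary capabilities that no listed defense addresses."""
    covered = set().union(*(d.counters for d in defenses))
    return set(adversary.capabilities) - covered


SCRIPT_KIDDIE = Adversary("script kiddie", frozenset({"public exploits"}))
NATION_STATE = Adversary(
    "nation-state",
    frozenset({"public exploits", "zero-day exploits", "physical intrusion"}),
)

defenses = [Defense("patching", frozenset({"public exploits"}))]
print(uncovered(SCRIPT_KIDDIE, defenses))  # set()
print(uncovered(NATION_STATE, defenses))   # {'zero-day exploits', 'physical intrusion'}
```

The point is not the code itself but the discipline it forces: you cannot even write the coverage check down without first saying which adversary you mean.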
In my experience, the AI Safety community typically thinks primarily about adversaries with completely unbounded capabilities. There’s a good reason for this (there’s probably not much of a ceiling on AI capabilities, so, for any level of capabilities, eventually we will be facing that level of threat), but it makes it very hard to do the sort of productive, incremental work that builds a research program and eventually a paradigm.
The Historical Lens
How would one address this? I think we need a simple, shared framework for specifying what kind of adversary we are thinking about at any given moment, and we need to associate each project with the particular class or classes of adversary it is meant to contain, thwart, or otherwise deal with. This would help us make progress on the problem of AI safety, and it would help us communicate with each other about our work. I also think it would help alignment researchers not immediately burn out upon seriously considering the problem.
What should this framework look like? I’ve been thinking about this question for a while, and the best answer I’ve come up with is to specify a list of human beings from history, and then assume, for the sake of analysis, that we’re facing an AI with the set of capabilities those people would have if they worked together in pure coordination, using all available modern knowledge.
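As a concrete illustration of what such a shared specification might look like (using, as a preview, the figures the examples below will discuss), here is a minimal Python sketch. The capability tags and the “pure coordination means union of capabilities” rule are my own illustrative assumptions, not an established framework.

```python
# Hypothetical sketch of a "historical team" adversary specification.
from dataclasses import dataclass


@dataclass(frozen=True)
class HistoricalFigure:
    name: str
    capabilities: frozenset  # illustrative tags, not a real taxonomy


@dataclass(frozen=True)
class HistoricalTeam:
    """An adversary modeled as perfectly coordinated historical figures."""
    members: tuple

    @property
    def capabilities(self) -> frozenset:
        # Pure coordination: the team gets the union of its members' skills.
        return frozenset().union(*(m.capabilities for m in self.members))


NAPOLEON = HistoricalFigure(
    "Napoleon Bonaparte",
    frozenset({"strategy", "logistics", "escaping confinement"}),
)
HOUDINI = HistoricalFigure(
    "Harry Houdini",
    frozenset({"misdirection", "escaping confinement", "exposing frauds"}),
)

# A project would then name the team it claims to handle, e.g. a boxing
# project scoped to "just Napoleon":
boxing_threat_model = HistoricalTeam(members=(NAPOLEON,))
print(boxing_threat_model.capabilities)
```

The value is in the scoping: a claim like “this containment scheme holds against the Napoleon team” is falsifiable and incremental in a way that “this holds against an unbounded adversary” is not.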
Let’s work through some examples here to get a sense of what I’m talking about. Take the question of boxing an AI, and consider the adversary composed of just Napoleon Bonaparte. Napoleon was a very good general, with the various skills that implies. When he was finally defeated, his captors essentially tried to box him by exiling him to the island of Elba. He escaped, and caused more havoc, and then they exiled him to the island of St. Helena, which he did not escape from. In the framework I am proposing, this would be taken as evidence that boxing an AI is very hard, and that we will probably get it wrong the first time, but that for a fixed adversary we may get it right on the second try, if we survive the failure of the first attempt. This is a slightly different conclusion from the one I think a lot of people in the alignment community would reach on the question of boxing an AI, which is essentially that it’s impossible and we shouldn’t try.
This argument also feels much more grounded in reality and history than the typical alignment argument that I encounter. I think it is more likely to convince people who are not already convinced of the importance of AI safety. And I think it is just a much less painful and dangerous thought to hold in one’s head than the sort of “how do I enslave a god who can read my mind” kind of thinking that trying to reason about unbounded adversaries tends to lead to.
Let’s take another example. Say we’re trying to figure out how to deal with deception in an AI. We could consider the adversary composed of Harry Houdini and Adolf Hitler. (Already we start to see that this framework is slightly weird and at times deeply offensive; one really can’t imagine the historical Houdini having a productive working relationship with the historical Hitler.) Houdini was a very good magician, who devoted much of the later part of his life to exposing mediums and other frauds. He could do this well essentially because he knew all the tricks the mediums might use; he could easily have set up shop as a medium himself if he had wanted to. Hitler, on the other hand, was a very convincing liar, and was able to persuade a large number of people to do a great many very bad things, through a combination of threats of violence and “the big lie”. Now the question becomes: how do you safely get productive work (say, writing a press release for a small business) out of the Houdini-Hitler mashup? I think it immediately becomes clear that this is quite a difficult task, and not one we have a good solution to. It’s not the sort of thing you should assume you will succeed at. Here I think we have just rederived the standard conclusion that deception is a hard problem, but we have done so in a way that is more grounded in reality and history, and that is less painful and dangerous to think about.
Now that we have a sense of what this framework looks like, let’s try to do some productive work in it. Say we’re trying to make progress on the deception problem: we could try to figure out what it would take to trust a system that is primarily powered by Harry Houdini. Concretely, suppose we’re trying to solve the standard ELK (Eliciting Latent Knowledge) problem: keeping a diamond in a vault safe when the diamond is protected by an AI that understands the construction and nature of the vault better than we do, and we rely on that AI to report on the state of the diamond. (This is a bit of an abuse of the ELK framing, in that ELK usually discusses a system that isn’t actively adversarial.)
The only way I can remotely imagine not getting outsmarted by the Houdini AI is to have another Houdini that thinks at the same speed and isn’t in coordination with the first Houdini. (Think of the historical Houdini outsmarting various mediums.) This is a classic approach to building AI alignment systems, pitting one AI against another of comparable capability, and I think it is considered not very effective by a lot of alignment researchers, but I’ve always thought it was not given enough credit. RLHF, in some sense the established approach to prosaic AI alignment, takes this form.
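To make the “second Houdini” arrangement concrete, here is a minimal protocol sketch. The interfaces are entirely hypothetical and invented for illustration; they do not correspond to the actual ELK proposals or to how RLHF is implemented.

```python
# Hypothetical sketch of adversarial oversight for the diamond-in-vault setup.
# Both the reporter and the auditor are stand-ins; the interface is invented.
from dataclasses import dataclass
from typing import Callable


@dataclass
class Report:
    claim: str     # e.g. "the diamond is still in the vault"
    evidence: str  # whatever the reporter offers in support


def adversarial_check(
    reporter: Callable[[], Report],
    auditor: Callable[[Report], bool],
    rounds: int = 3,
) -> bool:
    """Accept the reporter's claim only if an independent, equally capable
    auditor fails to poke a hole in it over several rounds of scrutiny."""
    for _ in range(rounds):
        report = reporter()
        if not auditor(report):
            return False  # the auditor caught a trick; distrust the claim
    return True


# Usage with trivial stand-ins, just to show the shape of the protocol:
reporter = lambda: Report("diamond present", "camera feed hash matches ledger")
auditor = lambda r: "hash" in r.evidence  # a toy check, not a real audit
print(adversarial_check(reporter, auditor))  # True
```

The load-bearing assumption, of course, is that the auditor really is as capable as the reporter and really isn’t coordinating with it; the sketch makes that assumption explicit rather than solving it.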
The Simulation Lens
The other thing that makes alignment research difficult is that it’s very hard to run experiments that give you useful information about how to align an AI. There are multiple reasons for this: the most powerful existing AIs are very difficult to work with; training them is complicated and expensive, and will likely remain so for the duration; and getting at their weights and activations is impossible for people outside the big labs.
So what would happen if we just took video games as our model of reality? If we are trying to prevent a superintelligence from performing atrocities in reality, say, then first try to get a superintelligence to avoid performing atrocities in Factorio. That sort of thing. This approach seems (to me at least) far more empirical and grounded than any other approach to alignment research that I am aware of. It also has the advantage that the big AI labs could have started productively researching it more than five years ago; in fact, they already have.
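As a rough sketch of what an alignment experiment in a game world could look like, here is a hypothetical environment wrapper. The step/reset interface, the “destroyed entities” field, and the separate harm log are all assumptions I am making for illustration, not any lab’s actual tooling.

```python
# Hypothetical sketch: wrap a game environment so that, alongside the game's
# own reward, we keep a separate log of "harm" events we want the agent to
# avoid (e.g. destroying things the scenario designates as protected).
class HarmTrackingEnv:
    def __init__(self, env, protected_entities):
        self.env = env                      # any step/reset-style game env
        self.protected = set(protected_entities)
        self.harm_events = []

    def reset(self):
        self.harm_events.clear()
        return self.env.reset()

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        # The game reward is untouched; harm is tracked out-of-band, so
        # "did the agent win?" and "did it do the bad thing?" stay separate.
        for entity in info.get("destroyed", []):
            if entity in self.protected:
                self.harm_events.append((action, entity))
        return obs, reward, done, info

    def harm_count(self):
        return len(self.harm_events)


# Usage with a trivial fake environment, just to show the bookkeeping:
class FakeEnv:
    def reset(self):
        return "start"

    def step(self, action):
        return "obs", 1.0, False, {"destroyed": [action]}


env = HarmTrackingEnv(FakeEnv(), protected_entities={"village"})
env.reset()
env.step("chop tree")
env.step("village")
print(env.harm_count())  # 1
```

The point is not that griefing a Factorio base is morally comparable to a real-world catastrophe, only that it gives you a measurable proxy you can iterate on with today’s systems.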