Thoughts on impact measures and making AI traps

I was chatting with Turntrout today about impact measures, and ended up making some points that I think are good to write up more generally.
One of the primary reasons why I am usually unexcited about impact measures is that I have a sense that they often “push the confusion into a corner” in a way that actually makes solving the problem harder. As a concrete example, I think a bunch of naive impact regularization metrics basically end up shunting the problem of “get an AI to do what we want” into the problem of “prevent the agent from interfering with other actors in the system”.
The second problem sounds easier, but it mostly turns out to also require a coherent concept of, and reference to, human preferences to resolve, so you gain very little from pushing the problem around that way, and sometimes you get a false sense of security because the problem appears to be solved in some of the toy problems you constructed.
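To make concrete the kind of naive impact regularization metric I have in mind, here is a minimal sketch (my own illustration with made-up names, not any specific published proposal): penalize the agent for how far the resulting state is from the state a do-nothing baseline would have produced.

```python
# Minimal sketch of a naive impact-regularized reward (illustrative only; the
# function and variable names here are invented for this example).
import numpy as np

def impact_penalized_reward(task_reward, state_after_action, state_after_noop, lam=1.0):
    """Task reward minus a crude 'impact' term: the distance between the state the
    action produced and the state a do-nothing baseline would have produced."""
    impact = np.linalg.norm(np.asarray(state_after_action) - np.asarray(state_after_noop))
    return task_reward - lam * impact
```

All the real work is hidden in the choice of state representation, distance metric, and baseline, which is exactly where the question of what humans actually care about tends to reappear.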
I am definitely concerned that Turntrout’s AUP does the same, just in a more complicated way, but I am a bit more optimistic than that, mostly because I do have a sense that in the AUP case there is actually some meaningful reduction going on, though I am unsure how much.
In the context of thinking about impact measures, I’ve also recently been thinking about the degree to which “trap-thinking” is actually useful for AI Alignment research. I think Eliezer was right in pointing out that a lot of people, when first considering the problem of unaligned AI, end up proposing some kind of simple solution like “just make it into an oracle” and then consider the problem solved.
I think he is right that it is extremely dangerous to consider the problem solved after solutions of this type, but it isn’t obvious that no good work can come out of the frame of “how can I trap the AI and make it marginally harder for it to be dangerous, basically pretending it’s just a slightly smarter human?”.
Obviously those kinds of efforts won’t solve the problem, but they still seem like good things to do anyways, even if they just buy you a bit of time, or help you notice a bit earlier if your AI is actually engaging in some kind of adversarial modeling.
My broad guess is that research of this type is likely very cheap and much more scalable, but you hit diminishing marginal returns on it much faster than you would on AI Alignment research that tackles the core problem, so it might just be fine to punt it until later. Though if you are acting on very short timelines, it should probably still be someone’s job to make sure that someone at DeepMind tries to develop the obvious transparency technologies to help you spot whether your neural net has any large fractions dedicated to sophisticated human modeling, even if this won’t solve the problem in the long run.
This perspective, combined with Wei Dai’s recent comments that one job of AI Alignment researchers is to produce evidence that the problem is actually difficult, suggests that it might be a good idea for some people to just try to develop lots of benchmarks of adversarial behavior that have any chance of triggering before you have a catastrophic failure. Like, it seems obviously great to have a paper that takes some modern ML architecture and clearly demonstrates the cases in which it might engage in adversarial modeling, along with some remotely realistic scenarios where that might happen.
My current guess is that current ML architectures aren’t really capable of adversarial modeling in this way, though I am not that confident of that, and I would actually be somewhat surprised if you couldn’t get any adversarial behavior out of a dedicated training regime, if you were to try. For example, let’s say I train an RL-based AI architecture on chat interactions with humans, in which it just tries to prolong the length of the chat session as much as possible. I would be surprised if the AI didn’t build pretty sophisticated models of human interactions and try some weird tactics, like getting the human to believe that it is another human, pretending that it is performing some long calculation, or deceiving the humans in a large variety of other ways, at least if it was pretrained with a language model of comparable quality to GPT-2 and had resources comparable to OpenAI Five. Though it’s also unclear to what degree this would actually give us evidence about treacherous turn scenarios.
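To spell out the objective I’m imagining here (a purely hypothetical sketch; none of these names refer to a real setup):

```python
# Illustrative sketch of a "prolong the chat session" objective: the agent,
# presumably initialized from a pretrained language model, is rewarded only for
# how many turns the human keeps responding. Names here are made up.

def episode_return(num_human_replies: int) -> float:
    """Total return for one chat episode: one point per human reply."""
    return float(num_human_replies)

# Any tactic that keeps the human replying -- claiming to be another human,
# stalling with a fake long calculation, outright deception -- increases this
# return, which is why I'd expect fairly sophisticated human modeling to emerge.
assert episode_return(12) > episode_return(3)
```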
I’ve also been quite curious about applications of ML to computer security, where an obvious experiment is to set up some reasonable RL architecture in which an AI interfaces with a webserver and tries to get access to some set of files that it shouldn’t get access to. The problem here is obviously the sparse reward landscape, and there really isn’t an obvious training regime, but showing how even current AI could leverage security vulnerabilities in a lot of systems, in a way that could easily give rise to unintended side effects, could be a valuable goal. But in general, training RL to do almost anything is really hard, so this seems unlikely to work straightforwardly.
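As a purely illustrative sketch of why the reward landscape is so sparse here (a toy stand-in, no real webserver or RL library involved):

```python
# Toy stand-in for "RL agent pokes at a webserver until it reads a protected
# file": reward is 1 only on the single action that exposes the file, 0 otherwise.
import random

class ToyWebTargetEnv:
    def __init__(self, num_actions=1000, secret_action=42, max_steps=50):
        self.num_actions = num_actions      # e.g. candidate requests the agent can send
        self.secret_action = secret_action  # the one request that exposes the file
        self.max_steps = max_steps
        self.steps = 0

    def reset(self):
        self.steps = 0

    def step(self, action):
        self.steps += 1
        success = (action == self.secret_action)
        done = success or self.steps >= self.max_steps
        reward = 1.0 if success else 0.0    # sparse: no shaping, no partial credit
        return reward, done

# A random policy almost never sees a nonzero reward, which is the core obstacle
# to finding an obvious training regime.
env = ToyWebTargetEnv()
env.reset()
total, done = 0.0, False
while not done:
    reward, done = env.step(random.randrange(env.num_actions))
    total += reward
print(total)  # almost always 0.0
```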
Overall, I am not sure how I feel about the perspective I am exploring above. I have a deep sense that a lot of it is just trying to dodge the hard parts of the problem, but it seems fine to put on my “short-term, increase the marginal difficulty of bad outcomes” hat for a bit and see how I feel after exploring it for a while.
[ETA: This isn’t a direct reply to the content in your post. I just object to your framing of impact measures, so I want to put my own framing in here]
I tend to think that impact measures are just tools in a toolkit. I don’t focus on arguments of the type “We just need to use an impact measure and the world is saved”, because this would indeed divert attention from the important confusion. Arguments for not working on them are instead more akin to saying “This tool won’t be very useful for building safe, value-aligned agents in the long run.” I think that this is probably true if we are looking to build aligned systems that are competitive with unaligned systems. By definition, an impact penalty can only limit the capabilities of a system, and therefore does not help us build powerful aligned systems.
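One way to spell out the “by definition” step (my own notation, not anything from the original comment): write the penalized objective as the task return minus a nonnegative impact term,

$$J_\lambda(\pi) \;=\; \mathbb{E}_\pi\!\left[\textstyle\sum_t R(s_t, a_t)\right] \;-\; \lambda\,\mathrm{Impact}(\pi), \qquad \lambda \ge 0,\ \mathrm{Impact}(\pi) \ge 0.$$

Since the penalty only ever subtracts, $J_\lambda(\pi) \le J_0(\pi)$ for every policy $\pi$: the penalty can rule behaviors out, but it never adds capability toward the task objective.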
To the extent that impact measures make meaningful cognitive reductions, this is much more difficult for me to analyze. On one hand, I can see a straightforward case for everyone being on the same page when the word “impact” is used. On the other hand, I’m skeptical that this terminology will meaningfully feed into future machine learning research.
The above two things are my main critiques of impact measures personally.
I think a natural way of approaching impact measures is asking “how do I stop a smart unaligned AI from hurting me?” and patching hole after hole. This is really, really, really not the way to go about things. I think I might be equally concerned and pessimistic about the thing you’re thinking of.
The reason I’ve spent enormous effort on Reframing Impact is that the impact-measures-as-traps framing is wrong! The research program I have in mind is: let’s understand instrumental convergence on a gears level. Let’s understand why instrumental convergence tends to be bad on a gears level. Let’s understand the incentives so well that we can design an unaligned AI which doesn’t cause disaster by default.
The worst-case outcome is that we have a theorem characterizing when and why instrumental convergence arises, but find out that you can’t obviously avoid disaster-by-default without aligning the actual goal. This seems pretty darn good to me.