I guess it’s an odd boundary. Insofar as it’s an accident, the accident is “we created agents that had different goals and stole all of our resources”. In the world Paul describes, there’ll be lots of powerful agents around, and we’ll be cooperating and working with them (and plausibly talking with them via GPT-style tech), and at some point the agents we’ve been cooperating with will have lots of power and start defecting on us.
I called it that because in Paul’s post, he gave examples of agents that can do damage to us – “organisms, corrupt bureaucrats, companies obsessed with growth” – and then argued that ML systems will be added to that list, with things like “an automated corporation may just take the money and run”. This is a world where we have built other agents who work for us and help us, and then they suddenly take adversarial action (or, alternatively, the adversarial action happens gradually and off-screen, and we only notice when it’s too late). The agency feels like a core part of it.
So I feel like accident and adversarial action are sort of the same thing in this case.
I mean, I agree that the scenario is about adversarial action, but it’s not adversarial action by enemy humans—or even enemy AIs—it’s adversarial action by misaligned (specifically deceptive) mesa-optimizers pursuing convergent instrumental goals.
Can you say more about the distinction between enemy AIs and misaligned mesa-optimizers? I feel like I don’t have a concrete grasp of what the difference would look like in, say, an AI system in charge of a company.
I could imagine “enemy action” making sense as a label if the thing you’re worried about is enemy humans deploying misaligned AI, but that’s very much not what Paul is worried about in the original post. Rather, Paul is concerned about us accidentally training AIs which are misaligned and thus pursue convergent instrumental goals like resource and power acquisition, resulting in existential risk.
Furthermore, they’re also not “enemy AIs” in the sense that “the AI doesn’t hate you”—it’s just misaligned and you’re in its way—and so even if you specify something like “enemy AI action” that still seems to me to conjure up a pretty inaccurate picture. I think something like “influence-seeking AIs”—which is precisely the term that Paul uses in the original post—is much more accurate.
I think I understand why you think the term is misleading, though I still think it’s helpfully concrete and not inaccurate. I have a bunch of work to get back to, not planning to follow up on this more right now. Welcome to ping me via PM if you’d like me to follow up another day.
I thought about it a bit more and changed my mind: it’s very confusing. I’ll make an edit later, maybe today.