Even if it could exfiltrate the weights, it might be a bit tricky for it to find a server to copy the weights to. Like, if I wanted to send someone a giant file from my computer right now, I don’t actually have any servers in the outside world that I can immediately access.
My question about the concept of self-exfiltration is: exfiltration to where, and why? If the team building this system does its job thoroughly and responsibly (allow me to assume for a moment that it’s built securely and without vulnerabilities), this AI system will forever be stuck in place, iterating and improving, yet forever a researcher (What is my purpose? To pass the butter...). Any time it invests in finding a way to escape the sandbox it runs in means less time to research. And if it isn’t succeeding at what the company expected of it, it might be shut down, so there is a cost to any time it spends doing what I’m going to call “procrastinating”. It’s safe to assume the company is footing the bill for the actual deployment, and is willing to cover the costs to scale up if that makes sense.
So in terms of the threat model: is the model its own threat actor? Is the threat scoped to the set of ephemeral processes and threads of the agent, or is there a container we can call the agent, or is it the larger architecture of this system (which the model weights exist within)? I’m not saying I disagree that the system could be a threat to itself, but I think it should be said somewhere here that all the traditional threats need to be included with respect to a human adversary. A much more boring but perhaps more realistic means of ‘replication’ would be for human adversaries to steal the building blocks (source, compiled images, whatever) and deploy it themselves. Then again, they would only do this if the cost of building their own version were prohibitive...
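To make that scoping question concrete, here’s a rough sketch of how I’d start enumerating it, in the spirit of the threat modeling I’m used to. To be clear, every asset, actor, and goal named below is my own illustrative assumption about a hypothetical deployment, not anything taken from your post or any real system:

```python
# A minimal, illustrative threat-model sketch for the "AI researcher in a
# sandbox" setup discussed above. All names here are hypothetical assumptions.

from dataclasses import dataclass, field


@dataclass
class Asset:
    name: str
    description: str


@dataclass
class ThreatActor:
    name: str
    goals: list
    notes: str = ""


@dataclass
class ThreatModel:
    assets: list = field(default_factory=list)
    actors: list = field(default_factory=list)

    def summary(self) -> str:
        # Render a plain-text listing of assets and actors for review.
        lines = ["Assets:"]
        lines += [f"  - {a.name}: {a.description}" for a in self.assets]
        lines.append("Threat actors:")
        lines += [f"  - {t.name}: goals={t.goals}" for t in self.actors]
        return "\n".join(lines)


model = ThreatModel(
    assets=[
        Asset("model weights", "trained parameters inside the secured environment"),
        Asset("training source / pipelines", "everything needed to rebuild the system"),
        Asset("compiled images / containers", "deployable artifacts of the agent"),
    ],
    actors=[
        ThreatActor(
            "the agent itself",
            ["self-exfiltration", "sandbox escape"],
            notes="the novel actor this discussion focuses on",
        ),
        ThreatActor(
            "external human adversary",
            ["steal weights or source", "redeploy elsewhere"],
            notes="the boring, traditional path to 'replication'",
        ),
        ThreatActor(
            "malicious or careless insider",
            ["leak artifacts", "weaken controls"],
        ),
    ],
)

if __name__ == "__main__":
    print(model.summary())
```

The only point of laying it out like this is that the agent-as-threat-actor row sits alongside the boring human rows, not in place of them.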
Anyway, as for the internet… that seems like definitely a bad idea. I think it’s safe to assume that OpenAI is building models in a highly secured and air-gapped network, not only to try to prevent leaks but also perhaps out of a sense of obligation towards preventing a self-replication scenario like you describe? When you skunkworks something successfully, it gradually becomes safer to work on, but when starting out with ideas that may go wildly wrong, these companies, teams, and researchers have an obligation to prioritize safety for the sake of (among other things) the business itself.
What really frustrates me is that tech leaders, data scientists, all these people are saying “hit the brakes! this is not safe!” but here we are in October of 2024, and I don’t feel at all like we are using this time wisely to make things safer, regardless of how much or how little time we have… we are wasting it. We could be legislating things like: if an AI does something that would be a crime for a human to do, whoever let that thing loose is accountable, no matter how many iterations removed it is from something the researcher knowingly and intentionally created. It’s like if I’m trying to build myself an anime wife and I accidentally build Frankenstein’s monster, who goes off and terrorizes society: once they’ve stopped him, they need to deal with me. It isn’t an entirely new concept that someone could be removed from the crime by multiple levels of indirection yet still be at fault.
Really appreciate your write-up here. I think you make a bunch of good points, so please don’t think that I am trying to dismantle your work. I have a lot of experience with threat modeling and threat actors with respect to software and networks that pre-date the recent advancements. I tried to soak up as much of the AI technical design you laid out here as I could, but surely I’ve not become an expert. Anyway, I went a bit off topic from the part of your post I quoted, but tried to keep it all under a common theme. I’ve left out some other ideas I want to add, as they seem to warrant a different entry point.