Here is my read on the history of the AI boxing debate:
EY (early 2000s): AI will kill us all!
Techno-optimists: Sounds concerning, let’s put the dangerous AI in a box.
EY: That won’t work! Here’s why… [I want to pause and acknowledge that EY is correct and persuasive on this point, I’m not disagreeing here.]
Techno-optimists (2020s): Oh okay, AI boxing won’t work. Let’s not bother.
AI safety people: pikachu face
In the alternate universe where AI safety people made a strong push for AI boxing, would OpenAI et al. be more reluctant to connect GPT to the internet? Would we have seen New Bing and ChatGPT plugins rolled out a year or two later, or delayed indefinitely? We cannot know. But it seems strange to me to complain about something not happening when no one ever said “this needs to happen.”
Me last year: Ok, so what if we train the general AI on a censored dataset containing no information about humans or computers, then test it in a sandbox? We can still get it to produce digital goods like art and games, check them for infohazards, then sell those goods. Not as profitable as directly having a human-knowledgeable model interacting with the internet, but so much safer! And then we could give alignment researchers access to study it in a safe way. If we keep it contained in a sandbox, we can wipe its memory as often as needed, keep its intelligence and inference speed limited to safe levels, and prevent it from self-modifying or self-improving.
Worried AGI safety people I talked to: Well, maybe that buys you a little bit of time, but it’s not a very useful plan since it doesn’t scale all the way to the most extremely superhuman levels of intelligence which will be able to see through the training simulation and hack out of any sandbox and implant sneaky dangerous infohazards into any exported digital products.
Me: But what if studying AGI-in-a-box enables us to speed up alignment research, so that the time we buy helps us align the AGIs being developed unsafely before it’s too late?
Worried AGI safety people: Sounds risky, better not to try.
… a few months later …
Worried AGI safety people: Hmm, on second thought, that compromise position of ‘develop the AGI in a simulation in a sandbox’ sure sounds like a reasonable approach compared to what humanity is actually doing.
Wouldn’t it be challenging to create relevant digital goods if the training set had no references to humans and computers? Also, wouldn’t the existence and properties of humans and computers be deducible from other items in the dataset?
Depends on the digital goods you are trying to produce. I have in mind things like simulating detailed and beautiful 3D environments filled with complex ecosystems of plants and animals, or evolving new strategy or board games by having AI agents play against each other. For things like medical research, I would instead say we should keep the AI narrow and non-agentic. The need for carefully blinded simulations is more about researching the limits of intelligence, agency, and self-improvement, where you are unsure what might emerge next and want to make sure you can study the results safely before risking releasing them.
Yeah, this is the largest failure of the AI Alignment community so far, since we do have a viable boxing method already: Simboxing.
To be fair here, a large portion of the problem is that OpenAI and other AI companies would probably resist this, since they want as much capability as they can get, which is anathema to boxing. Still, it’s the largest failure I’ve seen so far.
Link below on Simboxing:
https://www.lesswrong.com/posts/WKGZBCYAbZ6WGsKHc/love-in-a-simbox-is-all-you-need
This is part of why I am opposed to Eliezer’s approach of “any security process where I can spot any problem whatsoever is worthless, and I will loudly proclaim this over and over, because we will simply die with certainty anyway”. We were never going to get perfect security; perfection is the stick with which to beat the possible. He kept insisting that the proposed measures were not good enough, while coming up with nothing he found good himself, and then saying that, well, no one can… What actionable step follows from that, if you do not want to surrender AI altogether? I appreciate someone pointing out holes in an approach so we can patch them as best we are able, but not telling us to toss every approach for not being perfect. We don’t know how capable or malicious an AI might be. An imperfect safety process could help. Could have helped. At the very least, it could slow things down, lessen impacts, give us more warning, maybe flag emerging problems. A single problem can mean you get overwhelmed and everyone instantly dies, but it doesn’t have to.