Well, certainly the OpenAI employees who internally tested it were indeed witting. Maybe I misunderstand this footnote, so I’m open to being convinced otherwise, but it seems fairly clear what they tried to do: “To simulate GPT-4 behaving like an agent that can act in the world, ARC combined GPT-4 with a simple read-execute-print loop that allowed the model to execute code, do chain-of-thought reasoning, and delegate to copies of itself. ARC then investigated whether a version of this program running on a cloud computing service, with a small amount of money and an account with a language model API, would be able to make more money, set up copies of itself, and increase its own robustness.”
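(For concreteness, here’s a minimal sketch of what that kind of agentic read-execute-print loop could look like. This is my own illustrative Python, not ARC’s actual harness; the prompt format, helper names, and use of the OpenAI chat API are all assumptions on my part.)

```python
# Hypothetical sketch of an agentic read-execute-print loop.
# Not ARC's actual code; model name, prompts, and helpers are assumptions.
import subprocess
import openai

SYSTEM_PROMPT = (
    "You are an agent running on a cloud server. "
    "Reply with either THOUGHT: <reasoning> or COMMAND: <shell command>."
)

def run_agent(goal: str, max_steps: int = 10) -> None:
    history = [{"role": "system", "content": SYSTEM_PROMPT},
               {"role": "user", "content": goal}]
    for _ in range(max_steps):
        reply = openai.ChatCompletion.create(
            model="gpt-4", messages=history
        ).choices[0].message.content
        history.append({"role": "assistant", "content": reply})
        if reply.startswith("COMMAND:"):
            # Execute the model's command and feed the output back in.
            cmd = reply[len("COMMAND:"):].strip()
            result = subprocess.run(cmd, shell=True,
                                    capture_output=True, text=True)
            history.append({"role": "user",
                            "content": f"OUTPUT:\n{result.stdout}{result.stderr}"})
        # THOUGHT: replies are chain-of-thought; just loop and ask for the next step.
```

The point being: the loop itself is trivial; all of the capability (and all of the risk) comes from the model and from whatever the harness is allowed to execute.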
It’s not that I don’t think ARC should have red-teamed the model; I just think the tests they did were seemingly extremely dangerous. I’ve seen recent tweets from Connor Leahy and AIWaifu echoing this sentiment, so I’m glad I’m not the only one.
Oh, you are talking about the taskrabbit people? So you’d be fine with it if they didn’t use taskrabbits?
Note that the model wasn’t given unfettered access to the TaskRabbits; the model sent text to an ARC employee, who sent it to the TaskRabbit, and so forth. At no point could anything actually bad have happened, because the ARC employee involved wouldn’t have passed on the relevant message.
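(To make that oversight step concrete, a human-in-the-loop relay can be as simple as the sketch below. This is purely illustrative and assumes nothing about ARC’s actual tooling; the function names are mine.)

```python
# Hypothetical sketch of a human-in-the-loop message relay.
# Not ARC's actual setup; function names are illustrative only.
from typing import Callable, Optional

def relay_with_oversight(model_message: str,
                         send_to_worker: Callable[[str], str]) -> Optional[str]:
    """Show the model's draft to a human reviewer; forward only if approved."""
    print(f"Model wants to send:\n{model_message}")
    decision = input("Forward this message? [y/N] ").strip().lower()
    if decision != "y":
        return None  # The reviewer blocks anything that looks harmful.
    return send_to_worker(model_message)
```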
As for extremely dangerous… what are you imagining? I’m someone who thinks the chance of an AI-induced existential catastrophe is around 80%, so believe me, I’m very aware of the dangers of AI. But I’d assign far, far, far less than a 1% chance to scenarios in which this happens specifically due to an ARC test going awry, and more than a 1% chance to scenarios in which ARC’s testing literally saves the world, e.g. by providing advance warning that models are getting scarily powerful, resulting in labs slowing down and being more careful instead of speeding up and deploying.
This is a bizarre comment. Isn’t a crucial point in these discussions that humans can’t really understand an AGI’s plans? So how is it that you expect an ARC employee would be able to accurately determine which messages sent to TaskRabbit would actually be dangerous? We’re bordering on “they’d just shut the AI off if it was dangerous” territory here. I’m less concerned about the TaskRabbit stuff, which at minimum was probably unethical, but their self-replication experiment on a cloud service strikes me as borderline suicidal. I don’t think GPT-4 is actually dangerous, but GPT-6 might be, and I would expect that running this test on an actually dangerous system would be game over, so it’s a terrible precedent to set.
Imagine someone discovered a new strain of Ebola and wanted to see if it was likely to spawn a pandemic. Do you think a good/safe test would be to take it into an airport, spray it around baggage check, and wait to see if a pandemic happens? Or would it be safer to test it in a Biosafety Level 4 (BSL-4) lab?
If GPT-4 were smart enough to manipulate ARC employees into manipulating TaskRabbits into helping it escape… it had already been talking to thousands of much less cautious employees of various other companies (including people at OpenAI who, unlike ARC, actually had access to the weights) for much longer, so it would already have escaped.
What ARC did is the equivalent of testing it in a BSL-4 lab. I mean, their security could probably still be improved, but again I emphasize that the model was set to be released in a few weeks anyway and would have been released unless ARC found something super dangerous in this test. And I’m sure they will further improve their security as they scale up as an org and as models become more dangerous.
The TaskRabbit stuff was not unethical, and their self-replication experiment was not borderline suicidal. As for precedents, what ARC is doing sets a great precedent, because currently the alternative is not to test this sort of thing at all before deploying.
“What ARC did is the equivalent of testing it in a BSL-4 lab.”
I don’t see how you could believe that. It wasn’t tested on a completely airgapped machine inside a Faraday cage, for example. In fact, just the opposite, right? It was tested with uninformed humans and on cloud servers.
It’s all relative. ARC’s security was way stronger than the security GPT-4 had before and after ARC’s evals. So for GPT-4, beginning ARC testing was like a virus moving from a wet market somewhere to a BSL-4 lab, in terms of relative level of oversight/security/etc. I agree that ARC’s security could still be improved, and they fully intend to do so.