If there is indeed systematic scheming by the model, and the lab has caught it red handed, the lab should be able to produce highly scientifically credible evidence of that. They could deeply understand the situations in which there’s a treacherous turn, how the model decides whether to openly defect, and publish. ML academics are deeply empirical and open minded, so it seems like the lab could win this empirical debate if they’ve indeed caught a systematic schemer.
How much scientific juice has, say, Facebook gotten out of CICERO? Have they deeply understood the situations in which CICERO begins planning to manipulate possible allies? Have they mechanistically interpretably understood how CICERO decides how long to cooperate and play nice, and when to openly defect and attack an ally? Is not CICERO a deeply empirical system based on observations and logs from many real-world games with actual human players rather than mere theoretical arguments? Has CICERO ended the empirical debate about whether LLMs can systematically scheme? Has it been shown what training techniques lead to scheming or why off-the-shelf normally-trained frozen LLMs were so useful for the planning and psychological manipulation compared to no-press Diplomacy?
Or has everyone pretty much forgotten about CICERO, handwaved away a few excuses about “well maybe it wasn’t really deception” and “didn’t it just learn to imitate humans why are you surprised”, and the entire line of work apparently dead as a doornail as FB pivots to Llama-everything and core authors left for places like OA?
If the incentives for scientific research don’t work there where the opposing commercial incentives are so very weak (borderline non-existent, even), why would they be highly likely to work elsewhere in scenarios with vastly more powerful opposing commercial incentives?
I think people did point out that CICERO lies, and that was a useful update about how shallow attempts to prevent AI deception can fail. I think it could be referenced, and has been referenced, in relevant discussions. I don’t think CICERO provides much or any evidence that we’ll get the kind of scheming that could lead to AI takeover, so it’s not at all surprising that the empirical ML community hasn’t done a massive update. I think the situation will be very different if we do find an AI system that is systematically scheming enough to pose non-negligible takeover risk and ‘catch it red handed’.
I think people did point out that CICERO lies, and that was a useful update about how shallow attempts to prevent AI deception can fail. I think it could be referenced, and has been referenced, in relevant discussions
None of which comes anywhere close to your claims about what labs would do if they caught systematic scheming to deceive and conquer humans in systems trained normally. CICERO schemes very systematically, in a way which depends crucially on the LLM which was not trained to deceive or scheme. It does stuff that a while ago would have been considered a red line. And what analysis does it get? Some cursory ‘pointing out’. Some ‘referencing in relevant discussions’. (Hasn’t even been replicated AFAIK.)
any evidence that we’ll get the kind of scheming that could lead to AI takeover,
See, that’s exactly the problem with this argument: the goalposts will keep moving. The red line will always be a little further beyond. You’re making the ‘warning shot’ argument. CICERO presents every element except immediate blatant risk of AI takeover, which makes it a good place to start squeezing that scientific juice, and yet, it’s still not enough. Because your argument is circular. You can only be convinced of ‘systematic scheming to pose non-negligible takeover risk’ if you’ve already been convinced that it’s ‘systematic scheming to pose non-negligible takeover risk’. You present it as if there were some clear, objective bright line, but there is not and will not be, because each time it’ll be like Sydney or CICERO or …: “oh, it didn’t take over, and therefore doesn’t present a takeover risk” and therefore no update happens. So all your assertion boils down to is the tautology that labs will deeply examine the risky agents they choose to deeply examine.
It seems like you think CICERO and Sydney are bigger updates than I do. Yes, there’s a continuum of cases of catching deception where it’s reasonable for the ML community to update on the plausibility of AI takeover. Yes, it’s important that the ML community updates before AI systems pose significant risk, and there’s a chance that they won’t do so. But I don’t see the lack of strong update towards p(doom) from CICERO as good evidence that the ML community won’t update if we get evidence of systematic scheming (including trying to break out of the lab when there was never any training signal incentivising that behaviour). I think that kind of evidence would be much more relevant to AI takeover risk than CICERO.
To clarify my position in case I’ve been misunderstood: I’m not saying the ML community will definitely update in time. I’m saying that if there is systematic scheming and we catch it red-handed (as I took Buck to be describing) then there will likely be a very significant update. And CICERO seems like a weak counterexample (but not zero evidence).