I think the concern here is twofold:
1. Once a model is deceptive at one point, even if this happens stochastically, it may continue in its deception deterministically.
2. We can't rely on future models being as stochastic w.r.t. the things we care about, e.g. scheming behaviour.
Regarding 2, consider the trend towards determinism we see in the probability that GPT-N will output a grammatically correct sentence. For GPT-1 this probability was low, and it has trended upwards towards determinism with newer releases. We're seeing a similar trend for scheming behaviour (though hopefully we can buck this trend with alignment techniques).
Yep, LLMs are stochastic in the sense that there isn't literally a 100% probability of their outputs having any given property. But they could very well be effectively deterministic (e.g. there's plausibly a >99.999% probability that GPT-4's response to "What's the capital of France?" includes the string "Paris").
Yes, of course. Despite its stochastic nature, it is extraordinarily unlikely for an advanced LLM to respond with anything other than "2 + 2 = 4", or anything other than "Paris" for the capital of France. A stochastic phenomenon can, in practice, tend toward deterministic behavior. However, deception in a context such as the one discussed in Apollo Research's article is not really comparable to answering "2 + 2 = ?". What the article demonstrates is that we are dealing with tendencies, accompanied by considerable randomness, including in the intensity of the deception.
Assuming a more sophisticated model deceives at roughly double the rate of o1, it would be enough to increase the sample size of responses for the anomaly to become glaringly obvious. One could also imagine a more rigorous test involving even more complex situations. It does not seem inconceivable that such a procedure could, for years to come, and perhaps even at the stage of the first generations of AGI, identify deceptive behaviors and establish an RL procedure based on this test.
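To make the sampling argument concrete, here is a minimal sketch of the underlying statistics, assuming (hypothetically) that each sampled response is independently deceptive with some fixed probability p. The specific rates below are illustrative placeholders, not measured values from the Apollo Research evaluations:

```python
# Hypothetical sketch: how many sampled responses are needed to observe
# at least one deceptive output, if each response is independently
# deceptive with probability p. The rates used are assumptions for
# illustration, not measured deception rates of any real model.
import math

def samples_needed(p: float, detect_prob: float = 0.99) -> int:
    """Smallest n such that P(at least one deceptive sample) >= detect_prob,
    i.e. 1 - (1 - p)**n >= detect_prob."""
    return math.ceil(math.log(1 - detect_prob) / math.log(1 - p))

# If a model deceives in ~1% of trials and a successor in ~2%
# (assumed numbers), the required sample size roughly halves:
print(samples_needed(0.01))  # 459
print(samples_needed(0.02))  # 228
```

The point matches the comment above: if deception remains a stochastic tendency rather than an effectively deterministic behavior, doubling its per-sample rate makes it detectable with about half as many sampled responses, so a sampling-based test scales favorably as the tendency strengthens.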