Hiding the CoT from users hides it from the people who most need to know about the deceptive cognition.
I agree that hiding the CoT is not good for alignment.
They already show some willingness to dismiss evidence of deceptive cognition which they gain this way, in the o1 report. This calls into question the canary-in-coalmine benefit.
I definitely agree that OpenAI would dismiss good evidence of deceptive cognition, though I personally don’t find the o1 report damning: the explanation that it confabulates links in the CoT because there is a gap between its ability to know links and its ability to retrieve them is pretty convincing (combined with links being a case where perfect linking is far more useful than approximate linking).
See this post for why even extreme evidence may not get them to undeploy:
https://www.lesswrong.com/posts/YTZAmJKydD5hdRSeG/would-catching-your-ais-trying-to-escape-convince-ai
At this point, the system becomes quite difficult to “deeply” correct. Deceptive behavior is hard to remove once it creeps in. Attempting to train against deceptive behavior instead steers the deception to be better. I would expect alignment training to similarly fail, so that you get a clever schemer by default.
While I do think aligning a deceptively aligned model is far harder due to adversarial dynamics, I want to note that the paper is not very much evidence for it, so you should still mostly rely on priors/other evidence:
https://www.lesswrong.com/posts/YsFZF3K9tuzbfrLxo/#tchmrbND2cNYui6aM
So, as a consequence of this line of thinking, it seems like an important long-term strategy with LLMs (and other AI technologies) is to keep as far away from deceptive behaviors as you can. You want to minimize deceptive behaviors (and its precursor capabilities) throughout all of training, if you can, because it is difficult to get out once it creeps in. You want to try to create and maintain a truth-telling equilibrium, where small moves towards deception are too clumsy to be rewarded.
(Edited out the paragraph that users need to know, since Daniel Kokotajlo convinced me that hiding the CoT is bad, actually.)
While I don’t think hiding the CoT is good for alignment, I’d say that in a lot of risk scenarios, the people who would most need to know about the deceptive cognition includes government officials and the lab, not regulators, since users likely have little control over the AI.
I think you mean “not users?”
I agree, but I think government officials and company employees might not find out about the deceptive cognition unless there is general transparency about it. Because very often, curious incidents are noticed by users and then put up on Twitter, for example, and only then eventually rise to the attention of the company employees. Moreover, the consensus-forming process happens largely outside the government, in public discourse, so it’s important for the public to be aware of e.g. concerning or interesting behaviors. Finally, and most importantly, the basic alignment science advancements that need to happen and will happen from lots of people studying real-world examples of hidden/reasoning/etc. CoT… well, they sure won’t happen inside the heads of government officials. And the number of people working on this inside the corporations is pretty small. Exposing the CoT to the public increases the quality-adjusted amount of scientific work on them by orders of magnitude.
That’s a pretty good argument, and I now basically agree that hiding the CoT from users is a bad choice from an alignment perspective.
What if the CoT was hidden by default, but ‘power users’ could get access to it? That might get you some protection from busybodies complaining about factually-accurate-but-rude content in the CoT, while still giving the benefits of having thoughtful critics examining the CoT for systematic flaws.
This might work. But let’s remember the financial incentives: exposing a non-aligned CoT to all users is pretty likely to generate lots of articles about how your AI is super creepy, creating a public perception that your AI in particular is less trustworthy than the competition.
I agree that it would be better to expose from an alignment perspective, I’m just noting the incentives on AI companies.
My version of this would be something like: Users can pay $$ to view CoT (but no more than e.g. 10 per day), and everyone gets one free view-the-CoT coupon per day.
Also, ‘approved users’ such as external evals organizations/auditors should of course be allowed to see all the CoTs.
The main problem a scheme like this needs to solve is the threat of bad actors scraping a bunch of CoT data and then using it to train their own powerful models. The daily caps in my proposal would make that difficult.
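The access scheme described above (one free CoT view per user per day, paid views capped at e.g. 10 per day, uncapped access for approved auditors) can be sketched as a small policy object. This is just an illustrative sketch of the proposal's logic; all names and the cap value are assumptions, not any real API.

```python
from dataclasses import dataclass, field
from datetime import date

# Illustrative cap from the proposal: "no more than e.g. 10 per day".
PAID_DAILY_CAP = 10

@dataclass
class CoTViewer:
    """Tracks one user's CoT-viewing allowance (hypothetical names)."""
    approved: bool = False                 # external evals org / auditor
    _day: date = field(default_factory=date.today)
    _free_used: bool = False               # daily free view-the-CoT coupon
    _paid_views: int = 0                   # paid views consumed today

    def _roll_day(self) -> None:
        # Reset the per-day counters when the date changes.
        today = date.today()
        if today != self._day:
            self._day = today
            self._free_used = False
            self._paid_views = 0

    def can_view_cot(self, paying: bool) -> bool:
        self._roll_day()
        if self.approved:
            return True                    # auditors see all CoTs
        if not self._free_used:
            return True                    # free daily coupon still available
        return paying and self._paid_views < PAID_DAILY_CAP

    def record_view(self) -> None:
        self._roll_day()
        if self.approved:
            return                         # uncapped, nothing to count
        if not self._free_used:
            self._free_used = True         # spend the free coupon first
        else:
            self._paid_views += 1          # then count against the paid cap
```

One design note: counting the free coupon separately from the paid cap means even a non-paying user gets exactly one look per day, which is what makes the scheme useful for surfacing curious incidents while still throttling bulk scraping.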