That’s actually a pretty good argument, and I actually basically agree that hiding CoT from the users is a bad choice from an alignment perspective now.
Noosphere89
The first concern is absolutely critical, and one way to break the circularity issue is to rely on AI control, while another way is to place incentives that favor alignment as an equilibrium and make dishonesty/misalignment unfavorable, in the sense that you can’t have a continuously rewarding path to misalignment.
The second issue is less critical, assuming that AGI #21 hasn’t itself become deceptively aligned, because at that point, we can throw away #22 and restart from a fresh training run.
If that’s no longer an option, we can go to war against the misaligned AGI with our own AGI forces.
In particular, you can still do a whole lot of automated research once you break labor bottlenecks, and while this is a slowdown, this isn’t fatal, so we can work around it.
The third issue is if we have achieved aligned ASI, than we have at that point achieved our goal, and once humans are obsolete in making alignment advances, that’s when we can say the end goal has been achieved.
Some thoughts on this post:
Hiding the CoT from users hides it from the people who most need to know about the deceptive cognition.
I agree in not thinking hiding the CoT is good for alignment.
They already show some willingness to dismiss evidence of deceptive cognition which they gain this way, in the o1 report. This calls into question the canary-in-coalmine benefit.
I definitely agree that OpenAI would dismiss good evidence of deceptive cognition, though I personally don’t find the o1 report damning, because I find the explanation that it confabulates links in the CoT because there is a difference between it’s capability to retrieve links and it’s ability to know links to be pretty convincing (combined with links being a case where perfect linking is far more useful than approximately linking.)
See this post for why even extreme evidence may not get them to undeploy:
At this point, the system becomes quite difficult to “deeply” correct. Deceptive behavior is hard to remove once it creeps in. Attempting to train against deceptive behavior instead steers the deception to be better. I would expect alignment training to similarly fail, so that you get a clever schemer by default.
While I do think aligning a deceptively aligned model is far harder due to adversarial dynamics, I want to note that the paper is not very much evidence for it, so you should still mostly rely on priors/other evidence:
https://www.lesswrong.com/posts/YsFZF3K9tuzbfrLxo/#tchmrbND2cNYui6aM
I definitely agree with this claim in general:
So, as a consequence of this line of thinking, it seems like an important long-term strategy with LLMs (and other AI technologies) is to keep as far away from deceptive behaviors as you can. You want to minimize deceptive behaviors (and its precursor capabilities) throughout all of training, if you can, because it is difficult to get out once it creeps in. You want to try to create and maintain a truth-telling equilibrium, where small moves towards deception are too clumsy to be rewarded.
(Edited out the paragraph that users need to know, since Daniel Kokotajlo convinced me that hiding the CoT is bad, actually.)
I am not exactly sure I understand your proposal. Are you proposing that we radically gut our leading future model by restricting it severely? I don’t think any AI labs will agree to do so, because such future AI is much less useful than probably even current AIs.
Or are you proposing that we use AI monitors our leading future AI models and then we heavily restrict only the monitors?
My proposal is to restrain the AI monitor’s domain only.
I agree this is a reduction in capability from unconstrained AI, but at least in the internal use setting rather than deploying the AI, you probably don’t need, and maybe don’t want it to be able to write fictional stories or telling calming stories, but rather using the AI for specific employment tasks.
Re jailbreaks, I think this is not an example of alignment not being solved, but rather an example of how easy it is to misuse/control LLMs.
Also, a lot of the jailbreak successes rely on the fact that it’s been trained to accept a very wide range of requests for deployment reasons, which suggests narrowing the domain of acceptable questions for internal use could reduce the space of jailbreaks dramatically:
Current jailbreaks of chatbots often work by exploiting the fact that chatbots are trained to indulge a bewilderingly broad variety of requests—interpreting programs, telling you a calming story in character as your grandma, writing fiction, you name it. But a model that’s just monitoring for suspicious behavior can be trained to be much less cooperative with the user—no roleplaying, just analyzing code according to a preset set of questions. This might substantially reduce the attack surface.
To answer this specific question
Do you mean we’re waiting tiil 2026⁄27 for results of thee next scaleup? If this round (GPT5, Claude 4, Gemini 2.0) show diminishing returns, wouldn’t we expect that the next will too?
Yes, assuming Claude 4/Gemini 2.0/GPT-5 don’t release or are disappointing in 2025-2026, this is definitely evidence that things are slowing down.
It doesn’t conclusively disprove it, but it does make progress shakier.
Agree with the rest of the comment.
I think where I get off the train personally probably comes down to the instrumental goals leading to misaligned goals section, combined with me being more skeptical of instrumental goals leading to unbounded power-seeking.
I agree there are definitely zero-sum parts of the science loop, but my worldview is that the parts where the goals are zero sum/competitive receive less weight than the alignment attempts.
I’d say the biggest area of how I’m skeptical so far is that I think there’s a real difference between the useful idea that power is useful for the science loop and the idea that the AI will seize power by any means necessary to advance it’s goals.
I think instrumental convergence will look more like local power-seeking that is more related to the task at hand, and not to serve some of it’s other goals, primarily due to denser feedback constraining the solution space and instrumental convergence more than humans.
That said, this is a very good post, and I’m certainly happier that this much more rigorous post was written than a lot of other takes on scheming.
I’d argue one of the issues with a lot of early social media moderation policies was treating ironic beliefs that were usually banned as not ban-worthy, because as it turned out, ironic belief in some extremism turned out to either have been fake, or turned into the real versions over time.
To be clear, I don’t yet believe that the rumors are true, or that if they are, that they matter.
We will have to wait until 2026-2027 to get real evidence on large training run progress.
One potential answer to how we might break the circularity is the AI control agenda that works in a specific useful capability range, but fail if we assume arbitrarily/infinitely capable AIs.
This might already be enough to do so given somewhat favorable assumptions.
But there is a point here in that absent AI control strategies, we do need a baseline of alignment in general.
Thankfully, I believe this is likely to be the case by default.
See Seth Herd’s comment below for a perspective:
I think this is in fact the crux, in that I don’t think they can do this in the general case, no matter how much compute is used, and even in the more specific cases, I still expect it to be extremely hard verging on impossible to actually get the distribution, primarily because you get equal evidence for almost every value, for the same reasons as why getting more compute is an instrumental convergent goal, so you cannot infer the values of basically anyone solely on the fact that you live in a simulation.
In the general case, the distribution/probability isn’t even well defined at all.
The boring answer to Solomonoff’s malignness is that the simulation hypothesis is true, but we can infer nothing about our universe through it, since the simulation hypothesis predicts everything, and thus is too general a theory.
The affect is attenutated greatly provided that we assume the ability to arbitrarily copy the Solomonoff inductor/Halting Oracle, as then we can drive the complexity of picking out the universe arbitrarily close to picking out the specific user in the universe, and in the limit of infinite Solomonoff induction uses, are exactly equal:
IMO, the psychological unity of humankind thesis is a case of typical minding/overgeneralizing, combined with overestimating the role of genetics/algorithms and underestimating the role of data in what makes us human.
I basically agree with the game-theoretic perspective, combined with another perspective which suggests that as long as humans are relevant in the economy, you kind of have to help those humans if you want to profit, and merely an AI that automates a lot of work could disrupt it very heavily if a CEO could have perfectly loyal AI workers that never demanded anything in the broader economy.
The best argument here probably comes from Paul Christiano, but to summarize the argument, it’s because even in a situation where we messed up pretty badly in aligning the AI, so long as the failure mode isn’t deceptive alignment but instead misgeneralization of human preferences/non-deceptive alignment failures, it’s pretty likely that there will be at least some human-regarding preferences, and that means the AI will do some acts of niceness if it is cheap to them, and preserving humans is very cheap for superintelligent AI.
More answers can be found here:
This is also my interpretation of the rumors, assuming they are true, which I don’t put much probability on.
I’d say the main things that made my own p(Doom) went down this year are the following:
-
I’ve come to believe that data was both a major factor in capabilities and alignment, and I also believe that careful interventions on that data could be really helpful for alignment.
-
I’ve come to think that instrumental convergence is closer to a scalar quantity than a boolean, and while I don’t think 0 instrumental convergence is incentivized for capabilities and domain reasons, I do think that restraining instrumental convergence/putting useful constraints on instrumental convergence like world models is helpful for capabilities to the extent that I think that power-seeking will likely be a lot more local than what humans do.
-
I’ve overall shifted towards a worldview where the common thought experiment of the second-species argument, where humans have killed over 90%+ of chimpanzees and gorillas due to them running away with intelligence and being misaligned neglects very crucial differences between the human and the AI case that makes my p(Doom) lower.
(Maybe another way to say it is I think the outcome of humans just completely running roughshod on every other species due to instrumental convergence is not the median outcome of AI development, but a deep outiler that is very uninformative to how AI outcomes will look like.)
I’ve come to believe that human values, or at least the generator of values, are actually simpler than a lot of people think, and that a lot of the complexity that appears to be there is because we generally don’t like admitting that very simple rules can generate very complex outcomes.
-
While I’m not a believer in the scaling has died meme yet, I’m glad you do have a plan for what happens if AI scaling does stop.
I’m less confident in this position since I put on a disagree emoji, but my reason is because it’s much easier to control an AIs data sources for training than it is for humans, which means it’s quite easy in theory (but might be difficult in practice, which worries me) to censor just enough data such that the model doesn’t even think that it’s likely in a simulation that doesn’t add up to normality.
I definitely agree that conditioning on AI catastrophe, I think the 4 step chaotic catastrophe is the most likely way an AI catastrophe leads to us being extinct or at least in a very bad position.
I admit the big difference is that I do think that 2 is probably incorrect, as we have some useful knowledge of how models form goals, and I expect this to continue.