Since o1 I’ve been thinking that faithful chain-of-thought is waaaay underinvested in as a research direction.
If we get models such that a forward pass is kinda dumb, CoT is superhuman, and CoT is faithful and legible, then we can all go home, right? Loss of control is not gonna be a problem.
And it feels plausibly tractable.
I might go so far as to say it Pareto dominates most people’s agendas on importance and tractability. While being pretty neglected.
I’ve made some attempts over the past year to (at least somewhat) raise the profile of this kind of approach and its potential, especially when applied to automated safety research (often in conversation with / prompted by Ryan Greenblatt; and a lot of it stemming from my winter ’24 Astra Fellowship with @evhub):
I might write a high-level post at some point (which should hopefully help with the visibility of this kind of agenda more than the separate comments on various other [not-necessarily-that-related] posts).
As someone who has advocated for my own simple enough scheme for alignment (though it would be complicated to actually do it in practice, but I absolutely think it could be done.), I absolutely agree with this, and it does IMO look a lot better of an option than most schemes for safety.
I also agree re tractability claims, and I do think there’s a reasonably high chance that the first AIs that automate all AI research like scaling and robotics will have quite weak forward passes and quite strong COTs, more like in the 50-75% IMO, and this is actually quite a high value activity to do.
What feels underexplored to me is: If we can control roughly human-level AI systems, what do we DO with them?
Automated/strongly-augmented AI risk mitigation research, among various other options that Redwood discusses in some of their posts/public appearances.
There’s no such thing as “faithful CoT” because even if what you are doing is locally interpretable in terms of CoT, the way it adds up over the world still depends on the exogenous disequilibria that you are acting on as well as the ways your divertions of these disequilibria support themselves. At best you can get interpretability for infinitesimal perturbations, but why care about infinitesimal perturbations?
CoT is generally done for one level of data processing, e.g. responding to a user in a chat, or performing one step of a task for an autonomous LLM agent. However usually when AIs are deployed, they are asked to do many levels of data processing, e.g. respond to many users or perform many steps sequentially. It doesn’t matter if the chains of thoughts are highly faithful for pretty much all of these levels individually, what matters is if they are faithful in aggregate, i.e. for the few most important levels of data processing as well as in how the overall wave of smaller interactions add up.
I’d like to see how this would compare to a human organization. Suppose individual workers or individual worker-interactions are all highly faithful in a tech company. Naturally, though, the entire tech company will begin exhibiting misalignment, tend towards blind profit seeking, etc. Despite the faithfulness of its individual parts.
Is that the kind of situation you’re thinking of here? Is that why having mind-reading equipment that forced all the workers to dump their inner monologue wouldn’t actually be of much use towards aligning the overall system, because the real problem is something like the aggregate or “emergent” behavior of the system, rather than the faithfulness of the individual parts?
My threat model is entirely different: Even if human organizations are misaligned today, the human organizations rely primarily on human work, and so they pass tons of resources and discretion on to humans, and try to ensure that the humans are vigorous.
Meanwhile, under @mattmacdermott ’s alignment proposal, one would get rid of the humans and pass tons of resources and discretion on to LLMs. Whether one values the LLMs is up to you, but if human judgement and resource ownership is removed, obviously that means humans lose control, unless one can change organizations to give control to non-employees instead of employees (and it’s questionable how meaningful that is).
Since o1 I’ve been thinking that faithful chain-of-thought is waaaay underinvested in as a research direction.
If we get models such that a forward pass is kinda dumb, CoT is superhuman, and CoT is faithful and legible, then we can all go home, right? Loss of control is not gonna be a problem.
And it feels plausibly tractable.
I might go so far as to say it Pareto dominates most people’s agendas on importance and tractability. While being pretty neglected.
(Others had already written on these/related topics previously, e.g. https://www.lesswrong.com/posts/r3xwHzMmMf25peeHE/the-translucent-thoughts-hypotheses-and-their-implications, https://www.lesswrong.com/posts/FRRb6Gqem8k69ocbi/externalized-reasoning-oversight-a-research-direction-for, https://www.lesswrong.com/posts/fRSj2W4Fjje8rQWm9/thoughts-on-sharing-information-about-language-model#Accelerating_LM_agents_seems_neutral__or_maybe_positive_, https://www.lesswrong.com/posts/dcoxvEhAfYcov2LA6/agentized-llms-will-change-the-alignment-landscape, https://www.lesswrong.com/posts/ogHr8SvGqg9pW5wsT/capabilities-and-alignment-of-llm-cognitive-architectures, https://intelligence.org/visible/.)
I’ve made some attempts over the past year to (at least somewhat) raise the profile of this kind of approach and its potential, especially when applied to automated safety research (often in conversation with / prompted by Ryan Greenblatt; and a lot of it stemming from my winter ’24 Astra Fellowship with @evhub):
https://www.lesswrong.com/posts/HmQGHGCnvmpCNDBjc/current-ais-provide-nearly-no-data-relevant-to-agi-alignment#mcA57W6YK6a2TGaE2
https://www.lesswrong.com/posts/yQSmcfN4kA7rATHGK/many-arguments-for-ai-x-risk-are-wrong?commentId=HiHSizJB7eDN9CRFw
https://www.lesswrong.com/posts/yQSmcfN4kA7rATHGK/many-arguments-for-ai-x-risk-are-wrong?commentId=KGPExCE8mvmZNQE8E
https://www.lesswrong.com/posts/yQSmcfN4kA7rATHGK/many-arguments-for-ai-x-risk-are-wrong?commentId=zm4zQYLBf9m8zNbnx
https://www.lesswrong.com/posts/yQSmcfN4kA7rATHGK/many-arguments-for-ai-x-risk-are-wrong?commentId=TDJFetyynQzDoykEL
I might write a high-level post at some point (which should hopefully help with the visibility of this kind of agenda more than the separate comments on various other [not-necessarily-that-related] posts).
As someone who has advocated for my own simple enough scheme for alignment (though it would be complicated to actually do it in practice, but I absolutely think it could be done.), I absolutely agree with this, and it does IMO look a lot better of an option than most schemes for safety.
I also agree re tractability claims, and I do think there’s a reasonably high chance that the first AIs that automate all AI research like scaling and robotics will have quite weak forward passes and quite strong COTs, more like in the 50-75% IMO, and this is actually quite a high value activity to do.
Link below:
https://www.lesswrong.com/posts/HmQGHGCnvmpCNDBjc/current-ais-provide-nearly-no-data-relevant-to-agi-alignment#mcA57W6YK6a2TGaE2
Doesn’t this just shift what we worry about? If control of roughly human level and slightly superhuman systems is easy, that still leaves:
Human institutions using AI to centralize power
Conflict between human-controlled AI systems
Going out with a whimper scenarios (or other multi-agent problems)
Not understanding the reasoning of vastly superhuman AI (even with COT)
What feels underexplored to me is: If we can control roughly human-level AI systems, what do we DO with them?
Automated/strongly-augmented AI risk mitigation research, among various other options that Redwood discusses in some of their posts/public appearances.
There’s no such thing as “faithful CoT” because even if what you are doing is locally interpretable in terms of CoT, the way it adds up over the world still depends on the exogenous disequilibria that you are acting on as well as the ways your divertions of these disequilibria support themselves. At best you can get interpretability for infinitesimal perturbations, but why care about infinitesimal perturbations?
What do you mean by “over the world”? Are you including human coordination problems in this?
CoT is generally done for one level of data processing, e.g. responding to a user in a chat, or performing one step of a task for an autonomous LLM agent. However usually when AIs are deployed, they are asked to do many levels of data processing, e.g. respond to many users or perform many steps sequentially. It doesn’t matter if the chains of thoughts are highly faithful for pretty much all of these levels individually, what matters is if they are faithful in aggregate, i.e. for the few most important levels of data processing as well as in how the overall wave of smaller interactions add up.
I’d like to see how this would compare to a human organization. Suppose individual workers or individual worker-interactions are all highly faithful in a tech company. Naturally, though, the entire tech company will begin exhibiting misalignment, tend towards blind profit seeking, etc. Despite the faithfulness of its individual parts.
Is that the kind of situation you’re thinking of here? Is that why having mind-reading equipment that forced all the workers to dump their inner monologue wouldn’t actually be of much use towards aligning the overall system, because the real problem is something like the aggregate or “emergent” behavior of the system, rather than the faithfulness of the individual parts?
My threat model is entirely different: Even if human organizations are misaligned today, the human organizations rely primarily on human work, and so they pass tons of resources and discretion on to humans, and try to ensure that the humans are vigorous.
Meanwhile, under @mattmacdermott ’s alignment proposal, one would get rid of the humans and pass tons of resources and discretion on to LLMs. Whether one values the LLMs is up to you, but if human judgement and resource ownership is removed, obviously that means humans lose control, unless one can change organizations to give control to non-employees instead of employees (and it’s questionable how meaningful that is).