Completed an undergrad in CS and Math at Columbia, where I helped run Columbia Effective Altruism and Columbia AI Alignment Club (CAIAC). I’m pursuing a career in technical AI alignment research (probably).
RohanS
I’ve built very basic agents where (if I’m understanding correctly) my laptop is the Scaffold Server and there is no separate Execution Server; the agent executes Python code and bash commands locally. You mention that it seems bizarre not to set up a separate Execution Server (at least for more sophisticated agents) because the agent can break things on the Scaffold Server. But I’m inclined to think this setup may also have capabilities advantages: an agent can discover while performing a task that it would benefit from tools it doesn’t yet have, then edit its own scaffold (or write a new one) to give itself those tools. (Similarly for other edits to the scaffold.) I tentatively think this is useful enough that people will try to implement it.
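For concreteness, here’s a minimal sketch of the kind of single-machine setup I mean. Everything here is illustrative rather than any particular framework: call_llm is a placeholder stub for whatever model API the scaffold actually uses.

```python
import subprocess
from pathlib import Path

SCAFFOLD_PATH = Path(__file__)  # the agent's own scaffold code lives on this same machine


def call_llm(transcript: str) -> str:
    """Placeholder for a real model API call; should return the next bash command."""
    raise NotImplementedError("plug in a model API here")


def run_agent(task: str, max_steps: int = 10) -> str:
    transcript = f"Task: {task}\nReply with one bash command per turn, or DONE when finished."
    for _ in range(max_steps):
        command = call_llm(transcript).strip()
        if command == "DONE":
            break
        # No separate Execution Server: commands run on the same host as the scaffold,
        # so something like `cat new_tool.py >> {SCAFFOLD_PATH}` would let the agent
        # extend its own scaffold with new tools.
        result = subprocess.run(command, shell=True, capture_output=True, text=True, timeout=60)
        transcript += f"\n$ {command}\n{result.stdout}{result.stderr}"
    return transcript
```

The comment in the loop is the point: with no Execution Server, “edit your own scaffold” is just another shell command.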
Allowing an agent to edit its own scaffold and otherwise interact with the Scaffold Server probably does make control harder, but I’m not sure how bad this is. We can still monitor the tool calls as you suggest, and the agent still has no easy access to model weights. Maybe some tool calls still get sent to a separate Execution Server, and there could be additional guardrails around the tools that allow edits on the Scaffold Server.
I’m curious if there are reasons why giving agents edit-access to the Scaffold Server wouldn’t be useful or why it would significantly increase safety concerns.
Thanks for writing this up! Sad to have missed this sprint. This comment mainly has pushback against things you’ve said, but I agreed with a lot of the things I’m not responding to here.
Second, there is evidence that CoT does not help the largest LLMs much.
I think this is clearly wrong, or at least way too strong. The clearest demonstration I’ve seen is the cipher example in the o1 blog post, in the section titled “Chain of Thought.” If you click where it says “Thought for 5 seconds” for o1, it reveals the whole chain of thought. It’s pretty long, maybe takes 5 minutes to skim, but imo it’s well worth the time for building intuition about how the most cutting-edge model thinks. The model uses CoT to figure out a cipher and decode it. I think it’s intuitively obvious that the model could not have solved this problem without CoT.
Additionally, when trying to search for this paper, I found this paper on arXiv which finds situations where the CoT is just rationalizing the decision made by the LLM. If you look at papers which cite this paper, you will find other research in this vein.
True. I trust post-hoc explanations much less than pre-answer reasoning for problems that seem to require a lot of serial reasoning, like the o1 cipher problem. This post and this comment on it discuss different types of CoT unfaithfulness in a way similar to how I’m thinking about it; highly recommended.
But why are the Aether team organising these mini-sprints? The short summary is that deception is a big risk in future AI systems, and they believe that nailing down what it means for LLMs and LLM agents to believe something is an important step to detecting and intervening on deceptive systems.
Fwiw only that one sprint was specifically on beliefs. I think I’m more interested in what the agents believe, and less in figuring out exactly what it means to believe things (although the latter might be necessary in some confusing cases). I’d say the sprints are more generally aimed at analyzing classic AI risk concepts in the context of foundation model agents, and getting people outside the core team to contribute to that effort.
Max Nadeau recently made a comment on another post that gave an opinionated summary of a lot of existing CoT faithfulness work, including steganography. I’d recommend reading that. I’m not aware of very much relevant literature here; it’s possible it exists and I haven’t heard about it, but I think it’s also possible that this is a new conversation that exists in tweets more than papers so far.
Paper on inducing steganography and combatting it with rephrasing: Preventing Language Models From Hiding Their Reasoning
Noting a difference between steganography and linguistic drift: I think rephrasing doesn’t make much sense as a defense against strong linguistic drift. If your AI is solving hard sequential reasoning problems with CoT that looks like “Blix farnozzium grellik thopswen…,” what is rephrasing going to do for you?
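As a concrete gloss on why that is, here’s my sketch of the rephrasing idea (not the paper’s actual pipeline; paraphrase_model and answer_model are placeholder stubs I’m introducing for illustration):

```python
def paraphrase_model(text: str) -> str:
    """Placeholder: a trusted model that rewrites text while preserving its meaning."""
    raise NotImplementedError


def answer_model(prompt: str) -> str:
    """Placeholder: the (untrusted) model whose CoT we're worried about."""
    raise NotImplementedError


def answer_with_paraphrased_cot(question: str) -> str:
    cot = answer_model(f"Think step by step, then answer:\n{question}")
    # Rewriting the CoT plausibly destroys information hidden in word choice,
    # punctuation, etc. (steganography). But if the CoT has drifted into
    # "Blix farnozzium grellik thopswen...", there is no natural-language meaning
    # for the paraphraser to preserve, so rephrasing buys you little.
    clean_cot = paraphrase_model(cot)
    return answer_model(f"{question}\nReasoning: {clean_cot}\nFinal answer:")
```

The defense only works to the extent that the final answer depends solely on the paraphrased CoT; strong linguistic drift breaks the assumption that there is a faithful paraphrase to begin with.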
Countering Language Drift via Visual Grounding (Meta, 2019)
I haven’t looked at this closely enough to see if it’s really relevant, but it does say in the abstract “We find that agents that were initially pretrained to produce natural language can also experience detrimental language drift: when a non-linguistic reward is used in a goal-based task, e.g. some scalar success metric, the communication protocol may easily and radically diverge from natural language.” That sounds relevant.
Andrej Karpathy suggesting that pushing o1-style RL further is likely to lead to linguistic drift: https://x.com/karpathy/status/1835561952258723930?s=46&t=foMweExRiWvAyWixlTSaFA
In the o1 blog post, OpenAI said (under one interpretation) that they didn’t want to just penalize the model for expressing harmful or deceptive plans in the CoT, because that might lead it to keep having those plans while no longer writing them in the CoT.
“Assuming it is faithful and legible, the hidden chain of thought allows us to “read the mind” of the model and understand its thought process. For example, in the future we may wish to monitor the chain of thought for signs of manipulating the user. However, for this to work the model must have freedom to express its thoughts in unaltered form, so we cannot train any policy compliance or user preferences onto the chain of thought.”
Could you please point out the work you have in mind here?
Here’s our current best guess at how the type signature of subproblems differs from e.g. an outermost objective. You know how, when you say your goal is to “buy some yoghurt”, there’s a bunch of implicit additional objectives like “don’t spend all your savings”, “don’t turn Japan into computronium”, “don’t die”, etc? Those implicit objectives are about respecting modularity; they’re a defining part of a “gap in a partial plan”. An “outermost objective” doesn’t have those implicit extra constraints, and is therefore of a fundamentally different type from subproblems.
Most of the things you think of day-to-day as “problems” are, cognitively, subproblems.
Do you have a starting point for formalizing this? It sounds like subproblems are roughly proxies that could be Goodharted if (common-sense) background goals aren’t respected. Maybe a candidate starting point for formalizing subproblems, relative to an outermost objective, is “utility functions that closely match the outermost objective in a narrow domain”?
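To gesture at what that candidate could look like formally (entirely my own notation, introduced here as an assumption):

```latex
% Hypothetical notation: U is the outermost objective on option space X,
% D a narrow subset of X, and u a candidate "subproblem" utility.
% Candidate definition: u is a subproblem of U on D iff
\[
  \forall x \in D:\quad \bigl| u(x) - U(x) \bigr| \le \epsilon ,
\]
% with u left unconstrained outside D. The Goodhart worry is then that an
% optimizer pushed hard on u leaves D, which is exactly where the implicit
% background goals ("don't spend all your savings", "don't die") bind.
```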
Lots of interesting thoughts, thanks for sharing!
You seem to have an unconventional view about death informed by your metaphysics (suggested by your responses to 56, 89, and 96), but I don’t fully see what it is. Can you elaborate?
Basic idea of 85 is that we generally agree there have been moral catastrophes in the past, such as widespread slavery. Are there ongoing moral catastrophes? I think factory farming is a pretty obvious one. There’s a philosophy paper called “The Possibility of an Ongoing Moral Catastrophe” that gives more context.
How is there more than one solution manifold? If a solution manifold is a behavior manifold that corresponds to a global minimum of the training loss, and we’re looking at an overparameterized regime, then isn’t there only one solution manifold, the one corresponding to zero training loss?
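To spell out what I’m picturing (my notation, not the post’s), the solution set I have in mind is

```latex
% P = number of parameters, L_train = training loss.
\[
  \mathcal{S} \;=\; \bigl\{\, \theta \in \mathbb{R}^{P} \;:\; L_{\mathrm{train}}(\theta) = 0 \,\bigr\},
\]
```

and my question is why that one set would decompose into more than one solution manifold rather than forming a single one.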
My best guess is that there was process supervision for capabilities but not for safety, i.e. training to make the CoT useful for solving problems, but not training it for “policy compliance or user preferences.” This way they make it useful, and they don’t incentivize it to hide dangerous thoughts. I’m not confident about this, though.