Thanks for writing this up! Sad to have missed this sprint. This comment is mostly pushback against things you've said, but I agreed with a lot of the things I'm not responding to here.
Second, there is evidence that CoT does not help the largest LLMs much.
I think this is clearly wrong, or at least way too strong. The most intuitive demonstration I've seen is the cipher example in the o1 blog post, in the section titled "Chain of Thought." If you click where it says "Thought for 5 seconds" for o1, it reveals the whole chain of thought. It's pretty long, maybe 5 minutes to skim, but it's well worth the time for building intuition about how the most cutting-edge model thinks imo. The model uses CoT to figure out the cipher and decode the message. I think it's intuitively obvious that the model could not have solved this problem without CoT.
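(For anyone who wants to poke at this themselves, here's a minimal sketch of the kind of comparison I have in mind, assuming the openai Python client and an API key; the model name, the toy cipher, and the prompts are all made up for illustration, and the o1 cipher is of course far harder than a Caesar shift.)

```python
# Rough sketch: compare a model's output on a decoding task when it is
# forced to answer immediately vs. when it is allowed to reason step by step.
# Assumes the `openai` Python client is installed and OPENAI_API_KEY is set.
from openai import OpenAI

client = OpenAI()

# Toy stand-in for the o1 cipher example: a Caesar shift of "chain of thought".
PUZZLE = "Decode this Caesar-shifted (shift 3) text: fkdlq ri wkrxjkw"

def ask(system_prompt: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",  # hypothetical choice; swap in whatever model you're testing
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": PUZZLE},
        ],
    )
    return response.choices[0].message.content

# No-CoT condition: the model must answer in one shot, with no visible reasoning.
direct = ask("Reply with only the decoded text. Do not show any working.")

# CoT condition: the model is told to reason before answering.
with_cot = ask("Think through the decoding step by step, then give the decoded text.")

print("Direct answer:", direct)
print("With CoT:", with_cot)
```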
Additionally, when trying to search for this paper, I found this paper on arXiv which finds situations where the CoT is just rationalizing the decision made by the LLM. If you look at papers which cite this paper, you will find other research in this vein.
True. I trust post-hoc explanations much less than pre-answer reasoning for problems that seem to require a lot of serial reasoning, like the o1 cipher problem. This post and this comment on it discuss different types of CoT unfaithfulness in a way similar to how I'm thinking about it; highly recommended.
But why are the Aether team organising these mini-sprints? The short summary is that deception is a big risk in future AI systems, and they believe that nailing down what it means for LLMs and LLM agents to believe something is an important step to detecting and intervening on deceptive systems.
Fwiw, only that one sprint was specifically on beliefs. I think I'm more interested in what the agents believe, and less in figuring out exactly what it means to believe things (although the latter might be necessary in some confusing cases). I'd say the sprints are more generally aimed at analyzing classic AI risk concepts in the context of foundation model agents, and at getting people outside the core team to contribute to that effort.
Thanks for the feedback! Have edited the post to include your remarks.