I think I mostly understand what you’re saying. Essentially, MONA training is:
1. strong model comes up with a step
2. weak model outputs a score
3. the step gets executed
4. repeat, starting from the new state
Your idea adds “1.5. weak model optionally asks the strong model follow-up questions” and “2.5. weak model optionally changes the step.”
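To make sure I’m picturing the same loop, here’s a rough sketch of it in Python. All of the interfaces here (`propose_step`, `maybe_ask`, `score`, `maybe_revise`, `execute`, `update`) are names I’m making up for this comment, not anything from the MONA paper:

```python
def extended_mona_episode(strong_model, weak_model, env, max_steps=10):
    """One episode of the MONA loop, with the proposed 1.5 and 2.5 additions."""
    state = env.reset()
    for _ in range(max_steps):
        # 1. strong model comes up with a step
        step = strong_model.propose_step(state)

        # 1.5 (your addition): weak model optionally asks follow-up questions
        question = weak_model.maybe_ask(state, step)
        while question is not None:
            answer = strong_model.answer(state, step, question)
            step = step.with_clarification(question, answer)
            question = weak_model.maybe_ask(state, step)

        # 2. weak model outputs a score; under MONA this per-step score is
        #    the only reward signal the strong model is trained on
        score = weak_model.score(state, step)

        # 2.5 (your addition): weak model optionally changes the step
        step = weak_model.maybe_revise(state, step)

        # 3. the step gets executed
        state = env.execute(step)

        # Only the strong model is updated, and only on the weak model's
        # myopic score (not on any downstream outcome)
        strong_model.update(state, step, reward=score)

    # 4. repeat, starting from the new state -- handled by the loop above
```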
If you don’t update the weak model at all, this seems fine. However, the strong model will eventually learn to come up with a step that explains itself well enough that the weak model’s questions are all preemptively answered, and that anticipates the weak model’s preferences well enough that the weak model never feels the need to change it. So this is essentially equivalent to MONA.
However, if you do update the weak model, this seems problematic. The equilibrium is that the strong model suggests exactly the actions that maximize long-term reward, and the weak model carries out exactly the actions the strong model suggests, so jointly the two models are just an ordinary agent optimizing long-term reward end to end. (Technically the weak model isn’t even incentivized to score the strong model any particular way, but I’m just assuming it gives high scores to suggestions it decides to carry out.)
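Concretely, the difference I’m worried about is which reward each model’s update sees; something like this (same caveat: the names are made up, and I’m assuming “update the weak model” means training it on the final outcome reward):

```python
def apply_updates(strong_model, weak_model, trajectory, final_reward,
                  train_weak_model=False):
    """Sketch of the two regimes; trajectory is a list of (state, step, score)."""
    for state, step, score in trajectory:
        # MONA-style: the strong model only ever sees the weak model's
        # immediate score for the step it proposed.
        strong_model.update(state, step, reward=score)

    if train_weak_model:
        # The regime I find problematic: the weak model is trained on the
        # final outcome reward, so "strong proposes, weak executes" jointly
        # becomes an ordinary end-to-end optimizer of long-term reward.
        for state, step, _ in trajectory:
            weak_model.update(state, step, reward=final_reward)
```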
I guess if you only train the weak model a small amount, it will end up somewhere in the middle, partly motivated by long-term reward and partly by its initial values, with no guarantee on exactly what it does. I think the stipulation that the weak model remains safe “due to its architecture or whatever reason we trust it more” is doing a lot of work here; I’m not sure exactly what that would mean.
Yeah, if you’re keeping the evaluator aligned solely by not RLing it very much, I feel safer than if you RLed it a lot, but not a lot safer. Ideally there would be a more theoretically justified reason to expect it to be safe. Hopefully MONA just works pretty well without any RL training of the evaluator, but if not, RLing the evaluator a little bit is something we could try.
If you’re using additional techniques to make the evaluator act more human-like despite its RL, perhaps you could just apply those techniques to the agent model instead.