James Chua

Karma: 372

https://jameschua.net/about/

James Chua Apr 13, 2025, 9:24 AM
1 point
0
in reply to: Guillaume Martres’s comment on: OpenAI Responses API changes models’ behavior
glad this was helpful! Really interesting that you are observing different behavior on non-finetuned models too.
Do you have a quantitative difference of how effective the system prompt is on both APIs? E.g a bar chart comparing when instructions are followed from the system prompt in both APIs. Would be an interesting finding!

James Chua Apr 12, 2025, 4:53 PM
1 point
0
in reply to: Sam Marks’s comment on: OpenAI Responses API changes models’ behavior
Hi Sam!
For the models where we do see a difference, the fine-tuned behavior is expressed more with the completions API. so yes, we recommend people to use the completions API.
(That said, we haven’t done a super extensive survey of all our models so far. So i’m curious if others observe this issue and have the same experience)

OpenAI Responses API changes models’ behavior

Jan Betley and James Chua

Apr 11, 2025, 1:27 PM

52 points

6 comments2 min readLW link

James Chua Mar 22, 2025, 9:42 PM
11 points
0
on: Do models say what they learn?
Thanks for this interesting and important work! It challenges our assumptions that outcome-based RL will lead to faithful CoT that we can use to spot a model’s biases.
>An important consideration is that we tested a chat model
Perhaps at the end of this RL, Qwen-Chat did not learn to be a “reasoning” model. It does not know how to use its long CoT to arrive at better answers.
Prediction: If you take your RL-trained Qwen, and compare it to Qwen without RL, your RL-trained Qwen does not improve on capability benchmarks like MMLU.
Perhaps if you started with e.g. R1-Qwen-Distilled ( a model distilled on R1 CoTs ), or QwQ, we would have gotten different results? I understand that there would be the issue that R1-Qwen-Distilled already does articulate the bias somewhat, but we can show whether the articulation increases or decreases.

James Chua Mar 13, 2025, 6:16 PM
LW: 3 AF: 3
0
AF
in reply to: Daniel Kokotajlo’s comment on: OpenAI: Detecting misbehavior in frontier reasoning models
Do you have a sense of what I, as a researcher, could do?
I sense that having users/companies want faithful CoT is very important.. In-tune users, as. nostalgebraist points out, will know how to use CoTs to debug LLMs. But I’m not sure whether this represents only 1% of users, so big labs just won’t care. Maybe we need to try and educate more users about this. Maybe reach out to people who tweet about LLM best use cases to highlight this?

James Chua Mar 11, 2025, 2:52 AM
LW: 15 AF: 9
0
AF
on: OpenAI: Detecting misbehavior in frontier reasoning models
we extend the duration of the “golden era” in which people can mostly sorta tell what the best AIs are thinking
Agreed. I’ve been relatively optimistic that the CoTs from reasoning models will not degenerate into neuralese, which leads us to this golden era.
Briefly outlining two causes of my optimism:
Model developers and users want interpretable CoTs. Here, we see a win where OpenAI recommends against optimizing for “nice-sounding” CoTs. One obstacle to that is if other model developers don’t care. Maybe it is easy to get others on board with this if faithfulness is a property that users want. We know users can be dumb and may prefer nice-sounding CoT. But if given the choice between “nice-sounding but false” vs “bad-sounding but true”, it seems possible that the users’ companies, in principle, would prefer true reasoning versus false reasoning. Maybe especially because it is easier to spot issues when working with LLMs. E.g. Maybe users like seeing DeepSeek R1′s thinking because it helps them spot when DeepSeek misunderstands instructions.
Natural language is a pretty good local optima to write the CoT in. Most of the pretraining data is in natural language. You can also learn “offline” from other models by training on their CoT. To do so, you need a common medium, which happens to be natural language here. We know that llamas mostly work in English. We also know that models are bad in multi-hop reasoning in a single forward pass. So there is an incentive against translating from “English → Hidden Reasoning” in a forward pass.
Also, credit to you for pointing out that we should not optimize for nice-sounding CoTs since 2023.

James Chua Feb 28, 2025, 3:41 AM
4 points
0
in reply to: mattmacdermott’s comment on: Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs
I don’t think the sleeper agent paper’s result of “models will retain backdoors despite SFT” holds up. (When you examine other models or try further SFT).
See sara price’s paper https://arxiv.org/pdf/2407.04108.

James Chua Feb 27, 2025, 5:42 AM
14 points
1
in reply to: EJT’s comment on: Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs
I too was initially confused by this. In this paper, models generalize widely. Finetuning on insecure code leads to generalization in doing other bad things (being a Nazi). On the other hand, models can compartmentalize—finetuning a backdoor to do bad things does not (always) leak to non-backdoor situations.
When do models choose to compartmentalize? You have two parts of the dataset for finetuning backdoors. One part of the dataset is the bad behavior with the backdoor. The other part is the “normal” behavior that does not have a backdoor. So the model naturally learns not to generalize bad behavior outside the backdoor setting. Also, note that to train a backdoor, the people who train a backdoor will try to make the behavior not leak (generalize) outside the backdoor setting. So there is selection pressure against generalization.
In this paper, the authors (my colleagues) train only on insecure code. So the model has a “choice”. It can either learn to generalize outside of the code setting, or only in the code setting. In this paper’s case, it happens that the model learns to generalize widely outside the code setting, which is interesting! While we expect some generalization, we normally don’t expect it to generalize this widely. (otherwise you would have seen this result before in other papers)

James Chua Feb 19, 2025, 12:39 AM
3 points
0
on: Inference-Time-Compute: More Faithful? A Research Note
Updated version of the paper with results from DeepSeek, and cool discussion about reward models here
https://x.com/OwainEvans_UK/status/1891889067228528954

James Chua Jan 21, 2025, 2:42 PM
8 points
2
in reply to: Aaron_Scher’s comment on: Inference-Time-Compute: More Faithful? A Research Note
thanks!
i’m away from work now so i cant write much. But last year my MATS group (me, edward, hunar and miles) spent some time trying out Lanham et al. tests (two months?).
Its not easy to use for a few reasons. And I dont think results are easy to interpret. Here are my thoughts. May remember better ones later. Hopefully there can be an improved version of lanham’s methods.
Recap:
Lanham’s method truncates CoT or adds a mistake. It then forces the model to give an answer straightaway.
- One big assumption the causality of the cot. This means that in whatever situation you have where the cot is not causal, you can never “prove” faithfulness. So, you need to search for questions which the model can solve with cot, but cannot solve in non-cot setup. For example, its impossible to prove the faithfulness of questions like “10 + 10″, because the model can solve it in a non-cot fashion. This requirement of casuality seems undesirable—you want to prove faithfulness even in areas where the model can do it in non-cot. If my LLM is scheming and lying to me about “10 + 10”, i still want to be able to prove its faithfulness!
- as an implementation detail it can be annoying to implement because models have different non-cot capabilities.
- Adding a mistake to the CoT, then forcing the model to answer is quite an out of distribution behavior. What kind of mistake do you want to add? It seems that sonetimes models ignore models as part of their normal behavior. Maybe they would even articulate that “Oh actually i made a mistake, and am going to ignore it and not change my answer”. But because we force the model to give an answer straightaway, the model has no chance to articulate what it is doing. And since the model does not switch its answer, we deem the CoT as “unfaithful” even though it would have articulate it under normal situations.

James Chua Jan 18, 2025, 8:36 AM
5 points
0
in reply to: Daniel Tan’s comment on: Daniel Tan’s Shortform
“Speedrun” projects. Write papers with hypothetical data and decide whether they’d be interesting. If not then move on to something else.

Writing hypothetical paper abstracts has been a good quick way for me to figure out if things would be interesting.

New, improved multiple-choice TruthfulQA

Owain_Evans, James Chua and Steph Lin

Jan 15, 2025, 11:32 PM

72 points

0 comments3 min readLW link

James Chua Jan 15, 2025, 5:38 PM
7 points
0
on: Inference-Time-Compute: More Faithful? A Research Note
We plan to iterate on this research note in the upcoming weeks. Feedback is welcome!
Ideas I want to explore:
- New reasoning models may be released (e.g. deepseek-r1 API, some other open source ones). Can we reproduce results?
- Do these ITC models articulate reasoning behind e.g social biases / medical advice?
- Try to plant backdoor. Do these models articulate the backdoor?

James Chua Jan 15, 2025, 5:27 PM
2 points
0
in reply to: Daniel Tan’s comment on: Inference-Time-Compute: More Faithful? A Research Note
thanks! heres my initial thought about introspection and how to improve on the setup there:
in my introspection paper we train models to predict their behavior in a single forward pass without CoT.
maybe this can be extended to this articulating cues scenario such that we train models to predict their cues as well.
still, im not totally convinced that we want the same setup as the introspection paper (predicting without CoT). it seems like an unnecessary restraint to force this kind thinking about the effect of a cue in a single forward pass. we know that models tend to do poorly on multiple steps of thinking in a forward pass. so why handicap ourselves?
my current thinking is that it is more effective for models to generate hypotheses explicitly and then reason about what effects their reasoning afterwards. maybe we can train models to be more calibrated about what hypotheses to generate when they carry out their CoT. seems ok.

James Chua Jan 15, 2025, 9:19 AM
3 points
0
in reply to: quetzal_rainbow’s comment on: Inference-Time-Compute: More Faithful? A Research Note
thanks! Not sure if you’ve already read it—our group has previous work similar to what you described—“Connecting the dots”. Models can e.g. articulate functions that that implicit in the training data. This ability is not perfect, models still have a long way to go.
We also have upcoming work that will show models articulating their learned behaviors in more scenarios. Will be released soon.