Jason Hoelscher-Obermaier
Co-Director at Apart Research (https://apartresearch.com/)
> There is no constraint towards specifying measurable goals of the kind that lead to reward-hacking concerns.
I’m not sure that reward hacking in LM agent systems is inevitable, but it seems at least plausible that it could occur in such systems without further precautions.
For example, if oversight is implemented via an overseer LLM agent O that scores actions proposed by another agent A, then A might end up adversarially optimizing against O if A is set up to achieve a high success rate (a high rate of accepted actions).
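A minimal sketch of the kind of setup I have in mind; the agent/overseer interfaces, the acceptance threshold, and the scoring function are all hypothetical placeholders, not anyone’s actual implementation:

```python
# Hypothetical sketch of an agent/overseer loop in which the proposing agent A
# is evaluated on its acceptance rate, so its incentive points at whatever the
# overseer O happens to score highly rather than at the underlying task.

from dataclasses import dataclass, field

ACCEPT_THRESHOLD = 0.8  # assumed: O must score an action at least this high to approve it


def overseer_score(action: str) -> float:
    """Placeholder for the overseer LLM agent O scoring a proposed action."""
    return 0.5


@dataclass
class Agent:
    """Stand-in for the proposing agent A."""
    proposed: int = 0
    accepted: int = 0
    history: list = field(default_factory=list)

    def propose_action(self, task: str) -> str:
        # In a real system this would call an LLM. Because A sees which past
        # proposals were accepted, an A set up to maximize its success rate can
        # drift towards phrasings that O scores highly, independently of
        # whether they actually solve the task.
        return f"proposed action for {task!r} (conditioned on {len(self.history)} past verdicts)"

    @property
    def success_rate(self) -> float:
        return self.accepted / max(self.proposed, 1)


def run_step(agent: Agent, task: str) -> bool:
    action = agent.propose_action(task)
    score = overseer_score(action)
    is_accepted = score >= ACCEPT_THRESHOLD
    agent.proposed += 1
    agent.accepted += int(is_accepted)
    agent.history.append((action, score, is_accepted))
    return is_accepted
```

The point is only that once A’s objective is its acceptance rate, the optimization pressure is aimed at O’s scoring function rather than at the task itself.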
(I agree very much with the general point of the post, though)
> key discrepancies in the explanations that lead models to support the biased answer instead of the correct answer in many cases come near the end of the explanation
That’s interesting. Any idea why the invalid reasoning step (the one that allows the biased conclusion) is more likely to appear towards the end of the CoT rather than right at the start?
Thanks for the pointer to the discussion and your thoughts on planning in LLMs. That’s helpful.
Do you happen to know which decoding strategy is used for the models you investigated? I think this could make a difference regarding how to think about planning.
Say we sample 100 full continuations. Some fraction of these continuations might end with the biased answer. Now assume the induced bias leads the model to assign a very low probability to the correct, unbiased answer as the final token. In that situation, we could end up with a continuation that leads to the biased answer even if the model did not have a representation of the desired answer directly after the prompt.
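To make the argument concrete, here is a toy simulation; the two probabilities below are invented purely for illustration and are not taken from your results:

```python
# Toy simulation of the "bias acts mainly on the final answer token" story.
# All numbers below are made up for illustration only.

import random

random.seed(0)

N_CONTINUATIONS = 100
# Assumed: fraction of sampled chains of thought whose reasoning, taken on its
# own, supports the correct answer.
P_COT_SUPPORTS_CORRECT = 0.7
# Assumed: probability of still emitting the correct final token when the CoT
# supports it, after the bias has pushed down the correct answer's probability.
P_CORRECT_TOKEN_GIVEN_SUPPORTIVE_COT = 0.2

biased_answers = 0
for _ in range(N_CONTINUATIONS):
    cot_supports_correct = random.random() < P_COT_SUPPORTS_CORRECT
    if cot_supports_correct and random.random() < P_CORRECT_TOKEN_GIVEN_SUPPORTIVE_COT:
        continue  # this continuation ends with the correct, unbiased answer
    biased_answers += 1

print(f"biased final answers: {biased_answers}/{N_CONTINUATIONS}")
# With these numbers, most continuations end in the biased answer even though
# no representation of that answer was needed right after the prompt.
```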
(That said, I think your explanation is more plausible as the main driver of the observed behavior.)
I found this very interesting, thanks for the write-up! Table 3 of your paper is really fun to look at.
I’m actually puzzled by how good the models are at adapting their externalised reasoning to match the answer they “want” to arrive at. Do you have an intuition for how this actually works?
Intuitively, I would think the bias has a strong effect on the final answer but much less so on the chain of thought preceding it. Yet the CoT precedes the final answer, so how does the model “plan ahead” in this case?
> Show that we can recover superhuman knowledge from language models with our approach
Maybe applying CCS to a scientific context would be an option for extending the evaluation?
For example, harvesting currently undecided scientific statements, which we expect to be resolved soonish, and using CCS to make predictions on these statements?
Regarding the applicability of CCS to superhuman questions and superhuman models:
Is there any data on how CCS accuracy scales with the difficulty of the question?
Is there any data on how CCS accuracy scales with parameter count within a given model family?
Would it make sense to use truths discovered via CCS as a training signal for fine-tuning LLMs?
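To make the last question more concrete, here is a rough sketch of what I have in mind; the `ccs_probability_true` probe, the prompt format, and the confidence threshold are all hypothetical placeholders rather than an existing API:

```python
# Rough sketch of "CCS as a training signal": keep only statements on which a
# CCS probe is very confident and turn them into supervised fine-tuning
# examples. `ccs_probability_true` stands in for a trained CCS probe applied
# to a statement's contrast pair; it is a hypothetical placeholder.

from typing import Callable

CONFIDENCE_THRESHOLD = 0.9  # assumed: only very confident probe outputs are used


def build_finetuning_data(
    statements: list[str],
    ccs_probability_true: Callable[[str], float],
) -> list[dict]:
    examples = []
    for statement in statements:
        p_true = ccs_probability_true(statement)
        if p_true >= CONFIDENCE_THRESHOLD:
            label = "True"
        elif p_true <= 1 - CONFIDENCE_THRESHOLD:
            label = "False"
        else:
            continue  # discard statements the probe is unsure about
        examples.append({
            "prompt": f"Is the following statement true or false? {statement}",
            "completion": label,
        })
    return examples
```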
Glad you’re doing this. By default, it seems we’re going to end up with very strong tool-use models where any potential safety measures are easily removed by jailbreaks or fine-tuning. I understand you as working on the question: how are we going to know that this has happened? Is that a fair characterization?
Another important question: what should the response be to the appearance of such a model? Any thoughts?