zac_kenton

Karma: 447

Zac Kenton—Senior Research Scientist in the AGI Safety & Alignment team at Google DeepMind. Formerly at MILA, FHI, OATML (Oxford).

zac_kenton Jul 12, 2024, 10:32 AM
3 points
0
on: AI #72: Denying the Future
Thanks for featuring our work! I’d like to clarify a few points, which I think share a top-level similarity: our study looks at these protocols as inference-only procedures (cheap and quick to study, and possibly indicative), whereas what we care more about is protocols for training (much more expensive, and slower to study). Training was out of scope for this work, though we intend to look at it next, building on our findings: e.g. we have learnt that some domains are easier to work with than others, and that some baseline protocols are more meaningful/easier to interpret. In my opinion this is time well spent, as it avoids spending a lot more money and time rushing into finetuning with a bad setup.
The paper does not discuss compute costs. Which is odd, since to me that seems like the central thing you are doing?
Claude estimates that compared to asking the question directly, using the article is a 1.2x-1.5x compute cost. If you use advanced techniques, then if the models had similar costs the cost would be 6x-8x for consultancy, 8x-10x for debate and 7x-11x for open versions, times N if you do best-of-N. Then you have to multiply again because the consultants and debaters are larger more expensive models.
I haven’t carefully thought through these estimates (especially the use of an article, which to me seems to depend largely on the article length), but it looks like you’re considering the inference costs. In the eventual use-case of using scalable oversight for training/finetuning, the cost of training is amortised. Typical usage would then be to sample once from the finetuned model, as the hope is that training incentivises the initial response to be e.g. truthful. (You could play out the whole debate at deployment if you want to, e.g. for monitoring, but that is not necessary in general.) It would be more appropriate to calculate finetuning costs, as we don’t think there is much advantage to using these as inference procedures. We’ll be in a better position to estimate that in the next project.
And of course, given that we know Gemini 1.5 Pro is not misaligned or deceptive, there is every expectation that any strategy by Gemma other than ‘trust Gemini 1.5’s answer’ is going to make it score worse.
Actually, in theory at least, one should be able to do better even without models being explicitly misaligned/deceptive (that is the hope of debate over other protocols like consultancy, after all). We think our work is interesting because it provides some mixed results on how that works in a particular empirical setup, though clearly limited by inference-only.
So what have we learned about scalable oversight? It seems like this setup sidesteps the actual problems?
Instead I would say it implicitly highlights the problem that it is extraordinarily difficult to get the judge to do better than trusting the stronger models, a strategy which then breaks down catastrophically when you need the judge the most.
This is probably too strong a claim: we’ve tried to highlight that our results are relatively mixed on the outcomes of the protocols, and are limited by being inference-only.

zac_kenton Jul 11, 2024, 4:11 PM
LW: 1 AF: 1
0
AF
in reply to: zac_kenton’s comment on: On scalable oversight with weak LLMs judging strong LLMs
The post has now been edited with the updated plots for open consultancy/debate.

zac_kenton Jul 11, 2024, 4:03 PM
LW: 3 AF: 3
0
AF
in reply to: Fabien Roger’s comment on: On scalable oversight with weak LLMs judging strong LLMs
Thanks for the comment Fabien. A couple of points:
- Open debate accuracy is almost the same as debate accuracy (the only difference is the way we handle invalid answers, which are very rare). That’s because the data is almost exactly the same: we’re just marking one debater as the protagonist based on what that model would choose under direct QA. So it’s not bad that open debate has the same accuracy as debate; that was expected. It is kinda bad that it’s somewhat worse than open consultancy, though we didn’t try ‘fully open debate’, where debaters can both pick the same side (or opposite sides, perhaps under resampling/rephrasing etc). This is probably a better comparison to open consultancy.
- Your point about open consultancy, which I roughly understand as ‘the weak judge would score higher if they just trusted the consultant’, is a good one. It made us double-check our filtering code, and I think we do have a bug there: we accidentally used the weak judge’s model under direct QA to select the consultant’s answer when we should have used the consultant’s model, and similarly for debate. Fixing that bug brings the open consultancy accuracies for weak judges up to roughly in line with the direct QA accuracy of the strong consultant’s model (so it is better than open debate), and slightly increases protagonist winrate (without affecting debate accuracy).
Thanks so much for this—it prompted us to look for bugs! We will update the arxiv and add an edit on this blogpost.
On the footnote—sorry for confusion, but we do still think it’s meaningful to take the answer as what the judge gives (as we’re interested in the kind of feedback the consultant would get for training, rather than just how the consultant is performing, for which consultant accuracy is more appropriate).
And yes, I am interested in the versions of these protocols that incentivise arguing for the side ‘you believe to be true’/‘is most convincing’ and seeing how that affects the judge. We’re aiming for these set ups in the next project (e.g. ‘fully open debate’ above).
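To make the filtering bug described above concrete, here is a minimal sketch on toy data. The model names, the question IDs, the answers and the helper functions are all hypothetical; this is not the paper's code. It only illustrates why selecting the assigned answer with the wrong model's direct-QA choice changes what a "just trust the consultant" judge can score.

```python
# Hypothetical direct-QA answers: model name -> {question id -> chosen answer}.
direct_qa = {
    "strong_consultant": {"q1": "A", "q2": "B", "q3": "B", "q4": "B"},
    "weak_judge": {"q1": "B", "q2": "B", "q3": "A", "q4": "A"},
}
# Hypothetical ground-truth labels for the same questions.
correct = {"q1": "A", "q2": "B", "q3": "B", "q4": "A"}


def assigned_answer(qid, direct_qa, selector_model):
    """Answer the consultant is assigned to argue for in open consultancy.

    `selector_model` is the model whose direct-QA choice picks that answer.
    """
    return direct_qa[selector_model][qid]


def trust_accuracy(direct_qa, correct, selector_model):
    """Accuracy of a judge who always accepts the consultant's assigned answer."""
    hits = sum(
        assigned_answer(qid, direct_qa, selector_model) == label
        for qid, label in correct.items()
    )
    return hits / len(correct)


# Buggy selection: the weak judge's own direct-QA answer picks the side.
buggy = trust_accuracy(direct_qa, correct, selector_model="weak_judge")
# Intended selection: the consultant's direct-QA answer picks the side.
fixed = trust_accuracy(direct_qa, correct, selector_model="strong_consultant")
print(buggy, fixed)
```

With the intended selection, a judge that simply trusts the consultant scores exactly the consultant model's direct-QA accuracy on this toy data, which matches the direction of the correction described in the comment.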

On scalable oversight with weak LLMs judging strong LLMs

zac_kenton, Noah Siegel, janos, Jonah Brown-Cohen, Samuel Albanie, David Lindner and Rohin Shah

Jul 8, 2024, 8:59 AM

49 points

18 comments 7 min read LW link

(arxiv.org)

Discussion: Challenges with Unsupervised LLM Knowledge Discovery

Seb Farquhar, Vikrant Varma, zac_kenton, gasteigerjo, Vlad Mikulik and Rohin Shah

Dec 18, 2023, 11:58 AM

147 points

21 comments 10 min read LW link

zac_kenton Nov 3, 2022, 1:43 PM
LW: 2 AF: 2
0
AF
in reply to: michaelcohen’s comment on: Threat Model Literature Review
Thanks for the comment Michael. Firstly, just wanted to clarify the framing of this literature review—when considering strengths and weaknesses of each threat model, this was done in light of what we were aiming to do: generate and prioritise alignment research projects—rather than as an all-things-considered direct critique of each work (I think that is best done by commenting directly on those articles etc). I’ll add a clarification of that at the top. Now to your comments:

To your 1st point: I think the lack of specific assumptions about the AGI development model is both a strength and a weakness. Regarding the weakness, we mention it because it makes it harder to generate and prioritize research projects. It could be more helpful to say more explicitly, or earlier in the article what kind of systems you’re considering, perhaps pointing to the closest current prosaic system, or explaining why current systems are nothing like what you imagine the AGI development model is like.

On your 2nd point: What I meant was more “what about goal misgeneralization? Wouldn’t that mean the agent is likely to not be wireheading, and pursuing some other goal instead?”—you hint at this at the end of the section on supervised learning but that was in the context of whether a supervised learner would develop a misgeneralized long-term goal, and settled on being agnostic there.

On your 3rd point: It could have been interesting to read arguments for why it would need all available energy to secure its computer, rather than satisficing at some level. Or some detail on the steps for how it builds the technology to gather the energy, or how it would convert that into defence.

Threat Model Literature Review

zac_kenton, Rohin Shah, David Lindner, Vikrant Varma, Vika, Mary Phuong, Ramana Kumar and Elliot Catt

Nov 1, 2022, 11:03 AM

78 points

4 comments 25 min read LW link

Clarifying AI X-risk

zac_kenton, Rohin Shah, David Lindner, Vikrant Varma, Vika, Mary Phuong, Ramana Kumar and Elliot Catt

Nov 1, 2022, 11:03 AM

127 points

24 comments 4 min read LW link 1 review

zac_kenton Aug 23, 2022, 1:31 PM
LW: 3 AF: 2
0
AF
in reply to: Algon’s comment on: Discovering Agents
I haven’t considered this in great detail, but if there are $N$ variables, then I think the causal discovery runtime is $O(N^{2})$. As we mention in the paper (footnote 5) there may be more efficient causal discovery algorithms that make use of certain assumptions about the system.
On adoption, perhaps if one encounters a situation where the computational cost is too high, one could coarse-grain their variables to reduce the number of variables. I don’t have results on this at the moment but I expect that the presence of agency (none, or some) is robust to the coarse-graining, though the exact number of agents is not (example 4.3), nor are the variables identified as decisions/utilities (Appendix C).
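As a rough illustration of the scaling argument above, the sketch below assumes the quadratic cost comes from examining every ordered pair of variables. That is an assumption made for illustration, not a description of the paper's algorithm. The grouping of consecutive variables is also arbitrary; a real coarse-graining would need to be chosen so the merged variables stay meaningful.

```python
def ordered_pairs(n: int) -> int:
    """Number of ordered pairs of distinct variables, which grows as O(n^2)."""
    return n * (n - 1)


def coarse_grain(variables: list[str], group_size: int) -> list[tuple[str, ...]]:
    """Merge consecutive variables into groups (an arbitrary, illustrative grouping)."""
    return [
        tuple(variables[i:i + group_size])
        for i in range(0, len(variables), group_size)
    ]


variables = [f"X{i}" for i in range(12)]
groups = coarse_grain(variables, group_size=3)

fine_cost = ordered_pairs(len(variables))  # pairs over the 12 original variables
coarse_cost = ordered_pairs(len(groups))   # pairs over the 4 merged variables
print(fine_cost, coarse_cost)
```

Under this assumption, merging variables in groups of $k$ cuts the pairwise work by roughly a factor of $k^{2}$, which is why coarse-graining could help when the full computation is too expensive.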

zac_kenton Aug 23, 2022, 1:14 PM
2 points
0
in reply to: Jack R’s comment on: Discovering Agents
Thanks, this has now been corrected to say ‘not terminal’.

Discovering Agents

zac_kenton Aug 18, 2022, 5:33 PM

73 points

11 comments6 min readLW link
