Extremely reasonable strategic pivot. How would you explain AI risk to a TikTok audience?
Political influence seems like a very different skill to me? Lots of very influential politicians have been very incompetent in other real-world ways
Alternatively, they are linked to some major idea in governance or technical safety, often by spotting something missing years before it became relevant.
This is just a special case (and an unusually important one) of a good forecasting record, right?
I think the correct question is how much of an update you should make in an absolute sense rather than a relative sense. Many people in this community are overconfident, and if you decide that every person is less worth listening to than you thought, this doesn’t change who you listen to, but it should make you a lot more uncertain in your beliefs
Interesting, thanks for the list. That seems like a pretty reasonable breakdown to me. I think mechanistic interpretability does train some of them, in particular two, three, and maybe six. But I agree that things that involve thinking about society, politics, power, economics, etc. as a whole do seem clearly more relevant.
One major concern I have is that it’s hard to judge skill in domains with worse feedback loops, because there is no feedback on who is correct. I’m curious how confident you are in your assessment of who has good takes or is good in these fields, and how you determine this?
Thanks!
okay, but, how actually DO we evaluate strategic takes?
Yeah, I don’t have a great answer to this one. I’m mostly trying to convey the spirit of: we’re all quite confused, and the people who seem competent disagree a lot, so they can’t actually be that correct. And given that the ground truth is confusion, it is epistemically healthier to be aware of this.
Actually solving these problems is way harder! I haven’t found a much better substitute than looking at people who have a good, non-trivial track record of predictions, and people who have what seem to me like coherent models of the world that make legitimate and correct-seeming predictions. Though the latter is fuzzier and has a lot more false positives. A particularly salient form of a good track record is people who held positions in domains I know well (eg interpretability) that I previously thought were wrong/ridiculous, but that I later decided were right (eg I give Buck decent points here, and also a fair amount of points to Chris Olah)
I think it’s pretty plausible that something pathological like that is happening. We’re releasing this as an interesting idea that others might find useful for their use case, not as something we’re confident is a superior method. If we were continuing with SAE work, we would likely sanity check it more but we thought it better to release it than not
Negative Results for SAEs On Downstream Tasks and Deprioritising SAE Research (GDM Mech Interp Team Progress Update #2)
Yes, I agree. It’s very annoying for general epistemics (though obviously pragmatically useful to me in various ways if people respect my opinion)
Though, to be clear, my main goal in writing this post was not to request that people defer less to me specifically, but more to make the general point that people should defer more intelligently, using myself as an example, and to avoid calling out any specific person
I agree that I’d be shocked if GDM was training on eval sets. But I do think hill-climbing on benchmarks is also very bad for those benchmarks remaining an accurate metric of progress, and I don’t trust any AI lab not to hill-climb on particularly flashy metrics
I’m not trying to agree with that one. I think that if someone has thought a bunch about the general topic of AI and has a bunch of useful takes, they can probably convert this on the fly into something somewhat useful, even if it’s not as reliable as it would be if they had spent a long time thinking about it. For example, I think I can give useful technical mechanistic interpretability takes even if the question is about topics I’ve not spent much time thinking about before
Are the joined names separated by spaces? If not, the tokenization is going to be totally broken. More generally, I would be interested to see this tried with a code that eg maps familiar tokens to obscure ones, or something like mapping the token with id k to id (max − k). Tokens feel like the natural way an LLM would represent its processing, and thus encode its processing. Doing things with individual letters is kind of hard
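As a minimal sketch of the id-mirroring idea (my own illustration, not anything from the original experiment, and assuming a Hugging Face GPT-2 tokenizer):

```python
# Sketch: remap each token id k to (vocab_size - 1 - k), so familiar tokens
# become obscure ones while the mapping stays trivially invertible.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # assumed tokenizer, for illustration only
VOCAB_SIZE = tokenizer.vocab_size

def encode_mirrored(text: str) -> list[int]:
    # Tokenize normally, then flip each id to its "mirror" id.
    return [VOCAB_SIZE - 1 - k for k in tokenizer.encode(text)]

def decode_mirrored(ids: list[int]) -> str:
    # The mapping is its own inverse, so flip again and decode normally.
    return tokenizer.decode([VOCAB_SIZE - 1 - k for k in ids])

text = "The quick brown fox"
mirrored = encode_mirrored(text)
print(tokenizer.decode(mirrored))   # obscure-looking token soup
print(decode_mirrored(mirrored))    # round-trips back to the original text
```

The point is just that such a code operates at the token level rather than the letter level, which is the granularity I’d expect an LLM to find natural.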
I had not noticed that part. Thanks for flagging
Good Research Takes are Not Sufficient for Good Strategic Takes
Thanks a lot for doing this. This is substantially more evaluation awareness than I would have predicted. I’m not super convinced by the classifying transcript purpose experiments, since the evaluator model is plausibly primed to think about this stuff, but the monitoring results seem compelling and very concerning. Thanks a lot for doing this work. I guess we really get concerned when it stops showing up in the chain of thought...
I would disagree that either one or four was achieved because, to my knowledge, the auditing game focused on finding, not fixing. It’s also a bit ambiguous whether the prediction involves fixing with interpretability or fixing with unrelated means. It wouldn’t surprise me if you could use the SAE-derived insights to filter out the relevant data if you really wanted to, but I’d guess an LLM classifier is more effective. Did they do that in the paper?
(To be clear, I think it’s a great paper, and to the degree that there’s a disagreement here it’s that I think your predictions weren’t covering the right comparative advantages of interp)
This is fantastic work, I’d be very excited to see more work in the vein of auditing games. It seems like one of the best ways so far to test how useful different techniques for understanding models are
I found it helpful because it put me in the frame of an alien biological intelligence rather than an AI: I have lots of preconceptions about AIs, and it’s easy to implicitly think in terms of expected utility maximizers or tools or whatever. Whereas if I’m imagining an octopus, I’m kind of imagining humans, but a bit weirder and more alien, and I would not trust humans
Oh sure, an executive assistant (i.e. a personal assistant in a work context) can be super valuable just from an impact-maximisation perspective, but generally they need to be hired by your employer, not by you in your personal capacity (unless you have a much more permissive/low-security employer than Google)
Agreed. If I’m talking to someone who I expect to be able to recalibrate, I just explain that I think the standard norms are dumb, explain the norms I actually follow, and then give an honest and balanced assessment. If I’m talking to someone I don’t really know, I generally give a positive but not very detailed reference or don’t reply, depending on context.
I’m confused. It feels like you’re basically saying that reality is uncertain, that prediction markets reflect this, but that in order to be useful for affecting the minds of ordinary voters (a specific use case, and not the main one I think prediction markets matter for) they must be certain or near-certain