scasper
My bad. Didn’t mean to imply you thought it was desirable.
There’s a crux here somewhere related to the idea that, with high probability, AI will be powerful enough and integrated into the world in such a way that it will be inevitable or desirable for normal human institutions to eventually lose control and for some small regime to take over the world. I don’t think this is very likely for reasons discussed in the post, and it’s also easy to use this kind of view to justify some pretty harmful types of actions.
Thx!
I disagree that people working on the technical alignment problem generally believe that solving that technical problem is sufficient to get to Safe & Beneficial AGI.
I won’t put words in people’s mouths, and my point isn’t about what people say they believe; it’s about how they act. I think that large portions of the AI safety community act this way, including most people working on scalable alignment, interp, and deception.
If we don’t solve the technical alignment problem, then we’ll eventually wind up with a recipe for summoning more and more powerful demons with callous lack of interest in whether humans live or die
Yeah, I don’t really agree with the idea that getting better at alignment is necessary for safety. I think that it’s more likely than not that we’re already sufficiently good at it. The paragraph titled “If AI causes a catastrophe, what are the chances that it will be triggered by the choices of people who were exercising what would be considered to be ‘best safety practices’ at the time?” gives my thoughts on this.
Reframing AI Safety as a Neverending Institutional Challenge
Yeah, thanks, good point. On one hand, I am assuming that, after identifying the SAE neurons of interest, they could be used for steering (this is related to section 5.3.2 of the paper). On the other hand, I am also assuming that in this case, identifying the problem is 80% of the challenge. IRL, I would assume that this kind of problem could and would be addressed by adversarially fine-tuning the model some more.
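In case it’s useful, here’s a minimal sketch of the kind of steering I have in mind (this is not the paper’s code; `model`, `LAYER`, and `sae_decoder_direction` are placeholder names): add a scaled SAE decoder direction to the residual stream via a forward hook.

```python
# Minimal sketch (illustrative, not from the paper): steer a transformer by
# adding a scaled SAE decoder direction to the residual stream at one layer.
# Assumes you already have `model` (a HuggingFace-style PyTorch model), a
# target block index `LAYER`, and `sae_decoder_direction` (a 1-D tensor for
# the SAE feature of interest).
import torch

def make_steering_hook(direction: torch.Tensor, coeff: float):
    """Return a forward hook that adds coeff * (unit-norm direction) to the block's output."""
    direction = direction / direction.norm()

    def hook(module, inputs, output):
        # Transformer blocks often return a tuple whose first element is the hidden states.
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + coeff * direction.to(hidden.device, hidden.dtype)
        return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

    return hook

# Usage (illustrative): steer against the feature during generation.
# handle = model.transformer.h[LAYER].register_forward_hook(
#     make_steering_hook(sae_decoder_direction, coeff=-8.0)
# )
# ... generate text ...
# handle.remove()
```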
EIS XV: A New Proof of Concept for Useful Interpretability
Thanks for the comment. My probabilities sum to 165%, which means I expect the next paper to do, on average, 1.65 things from the list; that doesn’t seem too crazy to me. So I think this DOES match my expectations.
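(To spell out the arithmetic, with notation I’m introducing here: if $X_i$ is the indicator that the next paper does item $i$ from the list, then by linearity of expectation

$$\mathbb{E}\Big[\sum_i X_i\Big] = \sum_i \Pr(X_i = 1) = 1.65.)$$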
But I also think you make a good point. If the next paper comes out, does only one thing, and does it really well, I commit to not complaining too much.
EIS XIV: Is mechanistic interpretability about to be practically useful?
Some relevant papers for anyone spelunking around this post years later:
https://arxiv.org/abs/2403.05030
https://arxiv.org/abs/2407.15549
Can Generalized Adversarial Testing Enable More Rigorous LLM Safety Evals?
Thanks, I think these points are helpful and basically fair. Here is one thought, though I don’t have any real disagreements.
Olah et al. 100% do a good job of noting what remains to be accomplished and that there is a lot more to do. But when people in the public or government get the misconception that mechanistic interpretability has been (or definitely will be) solved, we have to ask where this misconception came from. And I expect that claims like “Sparse autoencoders produce interpretable features for large models” contribute to this.
Thanks for the comment. I think the experiments you mention are good (why I think the paper met 3), but I don’t think that its competitiveness has been demonstrated (why I think the paper did not meet 6 or 10). I think there are two problems.
First, it’s under a streetlight. Ideally, there would be an experiment that began with a predetermined set of edits (e.g., one from Meng et al., 2022) and then used SAEs to perform them.
Second, there’s no baseline that the SAE edits are compared to. There are lots of techniques from the editing, fine-tuning, steering, rep-E, data curation, etc. literatures that people use to make specific changes to models’ behaviors. Ideally, we’d want SAEs to be competitive with them. Unfortunately, good comparisons would be hard because using SAEs for editing models is a pretty unique method that requires a lot of compute upfront. This makes it non-straightforward to compare the difficulty of making different changes with different methods, but it does not obviate the need for baselines.
EIS XIII: Reflections on Anthropic’s SAE Research Circa May 2024
Thanks for the useful post. There are a lot of things to like about this. Here are a few questions/comments.
First, I appreciate the transparency around this point: “we advocate for what we see as the ‘minimal viable policy’ for creating a good AI ecosystem, and we will be open to feedback.”
Second, I have a small question. Why the use of the word “testing” instead of “auditing”? In my experience, most of the conversations lately about this kind of thing revolve around the term “auditing.”
Third, I wanted to note that this post does not talk about access levels for testers, and to ask why this is the case. My thoughts here relate to this recent work. I would have been excited to see (1) a commitment to providing auditors with methodological details, documentation, and data, or (2) a commitment to working on APIs and SREs that can allow auditors to be securely given grey- and white-box access.
I think one thing that’s pretty cool is “home teaching.” Mormon congregation members who are able are assigned various other members of the congregation to check in on. This often involves a monthly visit to their house, talking for a bit, sharing some spiritual thoughts, etc. The nice thing about it is that home teaching sometimes really benefits people who need it. Older or disabled people, especially, get nice visits, and home teachers often help them with stuff. In my experience, Mormons within a congregation are pretty good at helping each other with misc. things in life (e.g., fixing a broken water heater), and this is largely done through home teaching.
Thanks. I agree that the points apply to individual researchers. But I don’t think they apply in a comparably worrisome way, because individual researchers do not have intelligence, money, and power comparable to the labs’. This is me stressing the “when put under great optimization pressure” part of Goodhart’s Law. Subtle misalignments are much less dangerous when there is a weak optimization force behind the proxy than when there is a strong one.
See also this much older and closely related post by Thomas Woodside: Is EA an advanced, planning, strategically-aware power-seeking misaligned mesa-optimizer?
Analogies between scaling labs and misaligned superintelligent AI
Thanks for the reply. I think this resolves the thread and puts us on the same page :)
I’m glad you think that the post has a small audience and may not be needed. I suppose that’s a good sign.
—
In the post, I said it’s good that nukes don’t blow up by accident, and similarly, it’s good that BSL-4 protocols and tech exist. I’m not saying that alignment solutions shouldn’t exist. I am speaking to a specific audience (e.g., the frontier companies and their allies) to say that their focus on alignment isn’t commensurate with its usefulness. Also, don’t forget the dual nature of alignment progress: I mentioned in the post that frontier alignment progress hastens timelines and makes misuse risk more acute.