scasper
EIS VI: Critiques of Mechanistic Interpretability Work in AI Safety
This seems interesting. I do not know of steelmen for isolation, renaming, reinventing, etc. What is yours?
EIS V: Blind Spots In AI Safety Interpretability Research
EIS IV: A Spotlight on Feature Attribution/Saliency
Thanks. I’ll talk in some depth about causal scrubbing in two of the upcoming posts, which narrow the discussion down specifically to AI safety work. I think it’s a highly valuable way of measuring how well a hypothesis seems to explain a network, but there are some pitfalls with it to be aware of.
EIS III: Broad Critiques of Interpretability Research
Thanks, but I’m asking more about why you chose to study this particular thing instead of something else entirely. For example, why not study “this” versus “that” completions or any number of other simple things in the language model?
How was the ‘ a’ vs. ‘ an’ selection task selected? It seems quite convenient to probe for, and also the kind of thing that could result from p-hacking over a set of similar simple tasks.
Correct. I intended the 3 paragraphs in that comment to be separate thoughts. Sorry.
There are not that many that I don’t think are fungible with interpretability work :)
But I would describe most outer alignment work as sufficiently different...
Interesting to know that about the plan. I had assumed that REMIX was in large part about getting more people into this type of work. But I’m interested in the conclusions and current views on it. Is there a post reflecting on how it went and what lessons were learned from it?
I think that my personal thoughts on capabilities externalities are reflected well in this post.
I’d also note that this concern isn’t unique to interpretability work but applies to alignment work in general. And in comparison to other alignment techniques, I think that the downside risks of interpretability tools are most likely lower than those of stuff like RLHF. Most theories of change for interpretability helping with AI safety involve engineering work at some point, so I would expect that most interpretability researchers have similar attitudes toward dual-use concerns.
In general, a tool being engineering-relevant does not imply that it will be competitive for setting a new SOTA on something risky. So when I talk about engineering relevance in this sequence, I don’t have big advancements in mind so much as stuff like fairly simple debugging work.
I think that (1) is interesting. This sounds plausible, but I do not know of any examples of this perspective being fleshed out. Do you know of any posts on this?
Thanks! I discuss in the second post of the sequence why I lump ARC’s work in with human-centered interpretability.
EIS II: What is “Interpretability”?
The Engineer’s Interpretability Sequence (EIS) I: Intro
There seems to be high variance in the scope of the challenges that Katja has been tackling recently.
I have four points of disagreement with this post.
First, I think it’s fundamentally based on a strawperson.
my fundamental objection is that their specific strategy for delaying AI is not well targeted.
This post provides an argument for not adopting the “neo-luddite” agenda and not directly empowering neo-luddites. That is not an argument against allying with neo-luddites for specific purposes. I don’t know of anyone who has actually advocated for the former, and it is not how I would characterize Katja’s post.
Second, I think there is an inner strawperson in the example about text-to-image models. From a bird’s-eye view, I agree with caring very little about these models mimicking humans’ artistic styles. But this is not where the vast majority of tangible harm from text-to-image models is likely to come from. I think that it most likely comes from non-consensual deepfakes being easy to use for targeted harassment, humiliation, and blackmail. I know you’ve seen the EA Forum post about this because you commented on it. But I’d be interested in seeing a reply to my reply to your comment on that post.
Third, I think that this post fails to consider how certain (most?) regulations that neo-luddites would support could meaningfully slow risky things down. In general, any type of regulation that makes research and development for risky AI technologies harder or less incentivized will in fact slow risky AI progress down. I think that the one example you bring up, text-to-image models, is a counterexample to your point. Suppose we pass a bunch of restrictive IP laws that make it more painful to research, develop, and deploy text-to-image models. That would slow down a branch of research which could conceivably be useful for making riskier AI in the future (e.g. multimodal media generators), hinder revenue opportunities for companies who are speeding up risky AI progress, close off this revenue option to possible future companies who might do the same, and establish law/case law/precedent around generative models that could set precedents or be repurposed for other types of AI later.
Fourth, I am also not convinced by the specific argument about how indiscriminate regulation could make alignment harder.
Suppose the neo-luddites succeed, and the US congress overhauls copyright law. A plausible consequence is that commercial AI models will only be allowed to be trained on data that was licensed very permissively, such as data that’s in the public domain... Right now, if an AI org needs some data that they think will help with alignment, they can generally obtain it, unless that data is private.
This is a nitpick, but I don’t actually predict this scenario would pan out. I don’t think we’d realistically overhaul copyright law and have the kind of regime with datasets that you describe. But this is probably a question for policy people. There are also neo-luddite solutions that your argument would not apply to—like having legal requirements for companies to make their models “forget” certain content upon request. This would only be a hindrance to the deployer.
Ultimately though, what matters is not whether something makes certain alignment research harder. It matters how much something makes alignment research harder relative to how much it makes risky research harder. Alignment researchers are definitely the ones that are differentially data-hungry. What’s a concrete, conceivable story in which something like the hypothetical law you described makes things differentially harder for alignment researchers compared to capabilities researchers?
This is an interesting point. But I’m not convinced, at least immediately, that this isn’t likely to be largely a matter of AI governance.
There is a long list of governance strategies that aren’t specific to AI but can help us handle perpetual risk. There is also a long list of strategies that are. I think that all of the things I mentioned under strategy 2 have AI-specific examples:
establishing regulatory agencies, auditing companies, auditing models, creating painful bureaucracy around building risky AI systems, influencing hardware supply chains to slow things down, and avoiding arms races.
And I think that some of the things I mentioned for strategy 3 do too:
giving governments powers to rapidly detect and respond to firms doing risky things with TAI, hitting killswitches involving global finance or the internet, cybersecurity, and generally being more resilient to catastrophes as a global community.
So ultimately, I won’t make claims about whether avoiding perpetual risk is mostly an AI governance problem or mostly a more general governance problem, but there are certainly a bunch of AI-specific things in this domain. I also think they might be a bit neglected relative to some of the strategy 1 stuff.
I see the point of this post. No arguments with the existence of productive reframing. But I do not think this post makes a good case for reframing being robustly good. Obviously, it can be bad too. And for the specific cases discussed in the post, the post you linked doesn’t make me think “Oh, these are reframed ideas, so good—glad we are doing redundant work in isolation.”
For example, with polysemanticity/superposition, I think that TAISIC’s work has created generational confusion and insularity that are harmful. And I think TAISIC’s failure to understand that MI means doing program synthesis/induction/language translation has led to a lot of unproductive work on toy problems using methods that are unlikely to scale.