Redwood Research and Constellation
Nate Thomas
Karma: 513
Thanks, Neel! It should be fixed now.
Apply to the Constellation Visiting Researcher Program and Astra Fellowship, in Berkeley this Winter
Causal scrubbing: results on induction heads
Causal scrubbing: results on a paren balance checker
Causal scrubbing: Appendix
Causal Scrubbing: a method for rigorously testing interpretability hypotheses [Redwood Research]
Apply to the Redwood Research Mechanistic Interpretability Experiment (REMIX), a research program in Berkeley
Note that it’s unsurprising that a different model categorizes this correctly, because the failure was generated by attacking the particular model we were working with. The relevant question is “given a model, how easy is it to find a failure by attacking that model using our rewriting tools?”
To anyone reading this who wants to do or discuss FHI-flavored work: consider applying to Constellation’s programs (the deadline for some of them is today!), which include salaried positions for researchers.