Joseph Bloom comments on Neuroscience and Alignment

Joseph Bloom 24 Mar 2024 8:49 UTC
1 point
with the mech interp people where they think we can identify values or other high-level concepts like deception simply by looking at the model’s linear representations bottom-up, where I think that’ll be a highly non-trivial problem.
I’m not sure anyone I know in mech interp is claiming this is a trivial problem.
- Jozdien 3 Apr 2024 19:39 UTC
  2 points
  Yeah sorry I should have been more precise. I think it’s so non-trivial that it plausibly contains most of the difficulty in the overall problem—which is a statement I think many people working on mechanistic interpretability would disagree with.