paulfchristiano comments on Mechanistic anomaly detection and ELK

paulfchristiano 28 Nov 2022 19:34 UTC
LW: 4 AF: 3
0
AF
deceptive reasoning is causally upstream of train output variance (e.g. because the model has read ARC’s post on anomaly detection), so is included in π.
I’m not sure I fully understand this example, but I think it’s fine. The idea is:
- Suppose the model thinks about “Should I tell the truth, or do a treacherous turn?” On the training distribution it always concludes “tell the truth.”
- A good explanation of the model’s behavior on the training distribution will capture this fact (otherwise it will completely fail to match the empirics).
- If we simply replaced the answer to this question with the constant “tell the truth” then we’d get the same behavior.
- So a sub-explanation π can effectively ablate the answer to this question with its training-set-mean, and still explain the variance.
- On a new input where the model decides to lie and say “the diamond will remain in the vault,” such a sub-explanation won’t explain why the model said that. And so we will conservatively flag this output as “potentially for abnormal reasons.”
alien philosophy explains train output variance; unfortunately it also has a notion of object permanence we wouldn’t agree with, which the (AGI) robber exploits
Same question here as in my reply to Ramana: what is this alien notion of object permanence? How is this implemented within the model? I’m very interested in concrete counterexamples along these lines, but my current belief is that it’s very plausible that it’s impossible to get to a concrete counterexample because the algorithm actually works.
We obviously won’t be able to settle this question confidently until we have a satisfactory notion of probabilistic heuristic argument. In the interim we will need to work with our pre-formal notion of heuristic argument and so everything will be a bit vague. But I currently think it will be tricky to produce a concrete counterexample that even looks like it plausibly fails.
(Note that it’s totally fine if the counterexample uses some fictionalized physics.)