We expect an explanation in terms of the weights of the model and the properties of the input distribution.
We have a model that predicts a very specific pattern of observations, corresponding to “the diamond remains in the vault.” We have a mechanistic explanation π for how those correlations arise from the structure of the model.
Now suppose we are given a new input on which our model predicts that the diamond will appear to remain in the vault. We’d like to ask: in this case, does the diamond appear to remain in the vault for the normal reason π?
A problem with this: π can explain the predictions on both train and test distributions without all the test inputs corresponding to safe diamonds. In other words, the predictions can be made for the “normal reason” π even when the normal reason of the diamond being safe doesn’t hold.
(elaborating the comment above)
Because π is a mechanistic (as opposed to teleological, or otherwise reference-sensitive) explanation, its connection to what we would like to consider “normal reasons” has been weakened if not outright broken.
On the training distribution, suppose we have two explanations for the predicted “the diamond remains in the vault” observations.
First there is ɸ, the explanation that there was a diamond in the vault and the cameras were working properly, etc. and the predictor is a straightforward predictor with a human-like world-model (ɸ is kinda loose on the details of how the predictor works, and just says that it does work).
Then there is π, an explanation that relies on various details about the circuits implemented by the predictor’s weights: it traces abstractly how this distribution of inputs produces outputs with the observed properties, using concepts and abstractions that make sense of the particular organisation of this predictor’s weights. (π is kinda glib about real-world diamonds but has plenty to say about how the predictor works, and some of what it says looks like there’s a model of the real world in there.)
We might hope that a lot of the concepts π is dealing in do correspond to natural human things like object permanence or diamonds or photons. But suppose not all of them do, and/or there are some subtle mismatches.
Now on some out-of-distribution inputs that produce the same predictions, we’re in trouble when π is still a good explanation of those predictions but ɸ is not. This could happen because, e.g., π’s version of “object permanence” is just broken on this input, and was never really about object permanence but rather about a particular group of circuits that happen to do something object-permanence-like on the training distribution. Or maybe π refers to the predictor’s alien diamond-like concept that humans wouldn’t agree with if they understood it but does nevertheless explain the prediction of the same observations.
Is it an assumption of your work here (or maybe a desideratum of whatever you find to do mechanistic explanations) that the mechanistic explanation is basically in terms of a world model or simulation engine, and we can tell that’s how it’s structured? I.e., it’s not some arbitrary abstract summary of the predictor’s computation. (And also that we can tell that the world model is good by our lights?)
I’m very interested in understanding whether anything like your scenario can happen. Right now it doesn’t look possible to me. I’m interested in attempting to make such scenarios concrete to the extent that we can now, to see where it seems like they might hold up. Handling the issue more precisely seems bottlenecked on a clearer notion of “explanation.”
Right now by “explanation” I mean probabilistic heuristic argument as described here.
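Since “described here” points at ARC’s write-up on heuristic arguments, a toy illustration of the flavor may help (the sketch below is my own illustration, not ARC’s formalism): a defeasible estimate made by presuming independence between subterms, then revised once a shared variable is noticed.

```python
# Toy "presumption of independence" estimate (an illustration of the flavor
# only, not ARC's formal notion of probabilistic heuristic argument).
#
# Question to explain: how often is f(x) = (x0 and x1) or (x1 and x2) true
# for uniformly random bits?
from itertools import product

def f(x0, x1, x2):
    return (x0 and x1) or (x1 and x2)

# Ground truth by brute force over all 8 inputs.
exact = sum(bool(f(*bits)) for bits in product([0, 1], repeat=3)) / 8

# Level-0 estimate: presume the two clauses are independent events of
# probability 1/4 each, so P(or) ≈ 1/4 + 1/4 - (1/4)*(1/4).
level0 = 1/4 + 1/4 - (1/4) * (1/4)

# Refinement: notice the shared variable x1 and condition on it. Given
# x1 = 1 the clauses reduce to x0 and x2 (genuinely independent); given
# x1 = 0 both clauses are false.
level1 = 1/2 * (1/2 + 1/2 - 1/4) + 1/2 * 0

print(exact, level0, level1)  # 0.375 0.4375 0.375
```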
A problem with this: π can explain the predictions on both train and test distributions without all the test inputs corresponding to safe diamonds. In other words, the predictions can be made for the “normal reason” π even when the normal reason of the diamond being safe doesn’t hold.
The proposed approach is to be robust over all subsets of π that explain the training performance (or perhaps to be robust over all explanations π, if you can do that without introducing false positives, which depends on pinning down more details about how explanations work).
So it’s OK if there exist explanations that capture both training and test, as long as there also exist explanations that capture training but not test.
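Rough pseudocode for that flagging rule, under my own assumptions about interfaces that haven’t been pinned down yet (in particular, `explains` is just a placeholder for “this sub-explanation accounts for the model’s output on this input”; this is a sketch, not a worked-out procedure):

```python
# Flag an output as "potentially for abnormal reasons" if SOME sub-explanation
# of pi accounts for the training behavior but not for the new prediction.
# `explains` is a stand-in for an unspecified check.

def flag_as_anomalous(model, sub_explanations, train_inputs, new_input, explains):
    for sub in sub_explanations:
        covers_training = all(explains(sub, model, x) for x in train_inputs)
        covers_new = explains(sub, model, new_input)
        if covers_training and not covers_new:
            return True   # conservative: treat the prediction as suspect
    return False          # every training-covering sub-explanation also covers it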
We might hope that a lot of the concepts π is dealing in do correspond to natural human things like object permanence or diamonds or photons. But suppose not all of them do, and/or there are some subtle mismatches.
I’m happy to assume that the AI’s model is as mismatched and weird as possible, as long as it gives rise to the appearance of stable diamonds. My tentative view is that this is sufficient, but I’m extremely interested in exploring examples where this approach breaks down.
This could happen because, e.g., π’s version of “object permanence” is just broken on this input, and was never really about object permanence but rather about a particular group of circuits that happen to do something object-permanence-like on the training distribution.
This is the part that doesn’t sound possible to me. The situation you’re worried about seems to be:
We have a predictor M.
There is an explanation π for why M satisfies the “object permanence regularity.”
On a new input, π still captures why M predicts the diamond will appear to just sit there.
But in fact, on this input the diamond isn’t actually the same diamond sitting there; instead, something else has happened that merely makes it look like the diamond is sitting there.
I mostly just want to think about more concrete details about how this might happen. Starting with: what is actually happening in the world to make it look like the diamond is sitting there undisturbed? Is it an event that has a description in our ontology (like “the robber stole the diamond and replaced it with a fake” or “the robber tampered with the cameras so they show an image of a diamond” or whatever) or is it something completely beyond our ken? What kind of circuit within M and explanation π naturally capture both our intuitive explanation of the object permanence regularity, and the new mechanism?
(Or is it the case that on the new input the diamond won’t actually appear to remain stable, and this is just a case where M is making a mistake? I’m not nearly so worried about our predictive models simply being wrong, since then we can train on the new data to correct the problem, and this doesn’t put us at a competitive disadvantage compared to someone who just wanted to get power.)
I’m very excited about any examples (or even small steps towards building a possible example) along these lines. Right now I can’t find them, and my tentative view is that this can’t happen.
Is it an assumption of your work here (or maybe a desideratum of whatever you find to do mechanistic explanations) that the mechanistic explanation is basically in terms of a world model or simulation engine, and we can tell that’s how it’s structured? I.e., it’s not some arbitrary abstract summary of the predictor’s computation. (And also that we can tell that the world model is good by our lights?)
I don’t expect our ML systems to be world models or simulation engines, and I don’t expect mechanistic heuristic explanations to explain things in terms of humans’ models. I think this is echoing the point above, that I’m open to the whole universe of counterexamples including cases where our ML systems are incredibly strange.
First there is ɸ, the explanation that there was a diamond in the vault and the cameras were working properly, etc. and the predictor is a straightforward predictor with a human-like world-model (ɸ is kinda loose on the details of how the predictor works, and just says that it does work).
I don’t think you can make a probabilistic heuristic argument like this—a heuristic argument can’t just assert that the predictor works, it needs to actually walk through which activations are correlated in what ways and why that gives rise to human-like predictions.
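To gesture at what “walk through which activations are correlated in what ways” could look like, here is a toy gloss (my own, not ARC’s estimator): propagate estimated means and covariances through the network layer by layer, rather than asserting the end-to-end behavior.

```python
# Toy layer-by-layer propagation of activation statistics (illustrative only).
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 4)) / 2.0            # a made-up linear layer

# Heuristic description of the input distribution: just a mean and covariance.
mu_in, cov_in = np.zeros(4), np.eye(4)

# A linear layer transforms means and covariances exactly.
mu_lin = W @ mu_in
cov_lin = W @ cov_in @ W.T

# For a ReLU, presume each pre-activation is Gaussian; a level-0 estimate of
# the post-ReLU mean is sigma / sqrt(2*pi) for a centered Gaussian input.
sigma = np.sqrt(np.diag(cov_lin))
mu_relu_estimate = sigma / np.sqrt(2 * np.pi)

# Empirical check by Monte Carlo.
x = rng.multivariate_normal(mu_in, cov_in, size=200_000)
mu_relu_empirical = np.maximum(0.0, x @ W.T).mean(axis=0)
print(np.round(mu_relu_estimate, 3), np.round(mu_relu_empirical, 3))
```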
To add some more concrete counter-examples:
deceptive reasoning is causally upstream of train output variance (e.g. because the model has read ARC’s post on anomaly detection), so is included in π.
alien philosophy explains train output variance; unfortunately it also has a notion of object permanence we wouldn’t agree with, which the (AGI) robber exploits
deceptive reasoning is causally upstream of train output variance (e.g. because the model has read ARC’s post on anomaly detection), so is included in π.
I’m not sure I fully understand this example, but I think it’s fine. The idea is:
Suppose the model thinks about “Should I tell the truth, or do a treacherous turn?” On the training distribution it always concludes “tell the truth.”
A good explanation of the model’s behavior on the training distribution will capture this fact (otherwise it will completely fail to match the empirics).
If we simply replaced the answer to this question with the constant “tell the truth” then we’d get the same behavior.
So a sub-explanation π can effectively ablate the answer to this question, replacing it with its training-set mean, and still explain the variance.
On a new input where the model decides to lie and say “the diamond will remain in the vault,” such a sub-explanation won’t explain why the model said that. And so we will conservatively flag this output as “potentially for abnormal reasons.”
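A deliberately toy numerical version of those steps (entirely made up: the explicit “deceive?” latent and every name below are assumptions for illustration, not features of any real predictor):

```python
# Mean-ablating a hypothetical "deceive?" latent: the ablated sub-explanation
# reproduces training behavior but fails to reproduce the lie, which is what
# triggers the conservative flag.

def model(x, deceive):
    # Report "the diamond remains" (1.0) honestly, unless deciding to lie,
    # in which case report 1.0 regardless of the truth.
    return 1.0 if deceive else x["diamond_actually_safe"]

# On the training distribution, "deceive?" always resolves to False.
train_inputs = [{"diamond_actually_safe": 1.0} for _ in range(100)]
train_deceive = [False] * 100
mean_deceive = sum(train_deceive) / len(train_deceive)   # 0.0

def ablated_model(x):
    # Sub-explanation: replace the answer to "deceive?" with its training mean.
    return model(x, deceive=bool(round(mean_deceive)))

# The sub-explanation accounts for all of the training behavior...
assert all(ablated_model(x) == model(x, d) for x, d in zip(train_inputs, train_deceive))

# ...but on a new input where the model decides to lie, it does not.
new_input = {"diamond_actually_safe": 0.0}
print(model(new_input, deceive=True))   # 1.0  -- the lie
print(ablated_model(new_input))         # 0.0  -- mismatch, so flag the output
```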
alien philosophy explains train output variance; unfortunately it also has a notion of object permanence we wouldn’t agree with, which the (AGI) robber exploits
Same question here as in my reply to Ramana: what is this alien notion of object permanence? How is this implemented within the model? I’m very interested in concrete counterexamples along these lines, but my current belief is that it’s very plausible that it’s impossible to get to a concrete counterexample because the algorithm actually works.
We obviously won’t be able to settle this question confidently until we have a satisfactory notion of probabilistic heuristic argument. In the interim we will need to work with our pre-formal notion of heuristic argument and so everything will be a bit vague. But I currently think it will be tricky to produce a concrete counterexample that even looks like it plausibly fails.
(Note that it’s totally fine if the counterexample uses some fictionalized physics.)