I’m very interested in understanding whether anything like your scenario can happen. Right now it doesn’t look possible to me. I’m interested in trying to make such scenarios as concrete as we can right now, to see where they might hold up. Handling the issue more precisely seems bottlenecked on a clearer notion of “explanation.”
Right now by “explanation” I mean probabilistic heuristic argument as described here.
A problem with this: π can explain the predictions on both train and test distributions without all the test inputs corresponding to safe diamonds. In other words, the predictions can be made for the “normal reason” π even when the normal reason of the diamond being safe doesn’t hold.
The proposed approach is to be robust over all subsets of π that explain the training performance (or perhaps to be robust over all explanations π, if you can do that without introducing false positives, which depends on pinning down more details about how explanations work).
So it’s OK if there exist explanations that capture both training and test, as long as there also exist explanations that capture training but not test.
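(To make that quantifier concrete, here’s a minimal sketch of the check I have in mind, in Python. The helpers explains_training and explains_on_input are hypothetical stand-ins for “this sub-explanation accounts for the training regularity / for the prediction on this input”; nothing here is meant to pin down the actual formalism, just the shape of the condition.)

```python
from typing import Any, Callable, Iterable

def input_is_suspicious(
    sub_explanations: Iterable[Any],
    new_input: Any,
    explains_training: Callable[[Any], bool],
    explains_on_input: Callable[[Any, Any], bool],
) -> bool:
    """Flag the new input if SOME sub-explanation of pi accounts for the
    training-time regularity but does not account for the model's
    prediction on this input.

    It's fine if other sub-explanations cover both train and test; the
    check only asks whether at least one covers train but not test.
    """
    return any(
        explains_training(pi_sub) and not explains_on_input(pi_sub, new_input)
        for pi_sub in sub_explanations
    )
```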
We might hope that a lot of the concepts π is dealing in do correspond to natural human things like object permanence or diamonds or photons. But suppose not all of them do, and/or there are some subtle mismatches.
I’m happy to assume that the AI’s model is as mismatched and weird as possible, as long as it gives rise to the appearance of stable diamonds. My tentative view is that this is sufficient, but I’m extremely interested in exploring examples where this approach breaks down.
This could happen because, e.g., π’s version of “object permanence” is just broken on this input, and was never really about object permanence but rather about a particular group of circuits that happen to do something object-permanence-like on the training distribution.
This is the part that doesn’t sound possible to me. The situation you’re worried about seems to be:
We have a predictor M.
There is an explanation π for why M satisfies the “object permanence regularity.”
On a new input, π still captures why M predicts the diamond will appear to just sit there.
But in fact, on this input it isn’t actually the same diamond sitting there; instead, something else has happened that merely makes it look like the diamond is sitting there.
I mostly just want to think through more concrete details of how this might happen. Starting with: what is actually happening in the world to make it look like the diamond is sitting there undisturbed? Is it an event that has a description in our ontology (like “the robber stole the diamond and replaced it with a fake” or “the robber tampered with the cameras so they show an image of a diamond” or whatever) or is it something completely beyond our ken? What kind of circuit within M and explanation π naturally capture both our intuitive explanation of the object permanence regularity, and the new mechanism?
(Or is it the case that on the new input the diamond won’t actually appear to remain stable, and this is just a case where M is making a mistake? I’m not nearly so worried about our predictive models simply being wrong, since then we can train on the new data to correct the problem, and this doesn’t put us at a competitive disadvantage compared to someone who just wanted to get power.)
I’m very excited about any examples (or even small steps towards building a possible example) along these lines. Right now I can’t find them, and my tentative view is that this can’t happen.
Is it an assumption of your work here (or maybe a desideratum of whatever method you find for doing mechanistic explanations) that the mechanistic explanation is basically in terms of a world model or simulation engine, and that we can tell that’s how it’s structured? I.e., it’s not some arbitrary abstract summary of the predictor’s computation. (And also that we can tell that the world model is good by our lights?)
I don’t expect our ML systems to be world models or simulation engines, and I don’t expect mechanistic heuristic explanations to explain things in terms of humans’ models. I think this is echoing the point above, that I’m open to the whole universe of counterexamples including cases where our ML systems are incredibly strange.
First there is ɸ, the explanation that there was a diamond in the vault, the cameras were working properly, etc., and that the predictor is a straightforward predictor with a human-like world model (ɸ is kinda loose on the details of how the predictor works, and just says that it does work).
I don’t think you can make a probabilistic heuristic argument like this—a heuristic argument can’t just assert that the predictor works, it needs to actually walk through which activations are correlated in what ways and why that gives rise to human-like predictions.
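As a toy illustration of the kind of bookkeeping I have in mind (all of the numbers and variables below are invented for the example, not part of any actual setup): presuming two correlated activations are independent gives a badly wrong estimate of a downstream quantity, while an argument that tracks their covariance gets it roughly right.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two toy "activations" that are in fact strongly correlated.
x = rng.normal(size=200_000)
y = 0.8 * x + 0.6 * rng.normal(size=200_000)

# A downstream quantity whose typical value we want to explain.
out = x * y

# "Just assert it works": presume the activations are independent,
# so E[x*y] is estimated as E[x] * E[y] (about 0 here).
naive_estimate = x.mean() * y.mean()

# Walk through the correlation: E[x*y] = E[x]E[y] + Cov(x, y).
better_estimate = x.mean() * y.mean() + np.cov(x, y)[0, 1]

print(f"naive (independence): {naive_estimate:.3f}")   # ~0.0
print(f"with covariance term: {better_estimate:.3f}")  # ~0.8
print(f"empirical mean:       {out.mean():.3f}")       # ~0.8
```

The real arguments would of course be over a network’s actual activations, but the point is the same: the explanation has to walk through the correlations rather than just assert the conclusion.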