It seems to me like there are a couple of different notions of “being able to distinguish between mechanisms” we might want to use:
1. There exists an efficient algorithm which, when run on a model input x, will output which mechanism model M uses when run on x.
2. There exists an efficient algorithm which, when run on a model M, will output another programme P such that, when we run P on a model input x, it will output (in reasonable time) which mechanism M uses when run on x.
In general, being able to do (2) implies that we are able to do (1). It seems that in practice we’d like to be able to do (2), since then we can apply this to our predictive model and get an algorithm for anomaly detection in any particular case. (In contrast, the first statement gives us no guide on how to construct the relevant distinguishing algorithm.)
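To make the distinction concrete, here is a minimal sketch of the two notions as hypothetical Python signatures (the names and types are placeholders, not real code):

```python
# Illustrative only: Model, Input, and Mechanism are placeholders, not part of
# any existing library.
from typing import Any, Callable

Model = Any      # the model M
Input = Any      # a model input x
Mechanism = str  # a label for the mechanism used on x, e.g. "normal" / "anomalous"

# Notion (1): for a fixed model M, a single efficient discriminator.
def which_mechanism(x: Input) -> Mechanism:
    """Output which mechanism M uses when run on x."""
    ...

# Notion (2): an efficient procedure that, given M, produces such a discriminator P.
def find_discriminator(model: Model) -> Callable[[Input], Mechanism]:
    """Output a programme P such that P(x) says which mechanism `model` uses on x."""
    ...
```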
In your “prime detection” example, we can do (1) using standard primality tests. However, we don’t know of a method for (2) that could be used to generate this particular (or any) solution to (1).
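To make the prime-detection case concrete, here is one standard choice of discriminator for (1): a deterministic Miller–Rabin variant (with this fixed witness set the test is known to be exact for n up to about 3.3×10^24, far more than needed for illustration):

```python
# A standard primality test playing the role of the efficient discriminator in (1):
# given n, it reports which "mechanism" (prime vs. composite) is in play.
def is_prime(n: int) -> bool:
    if n < 2:
        return False
    small_primes = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37]
    for p in small_primes:
        if n % p == 0:
            return n == p
    # Write n - 1 as d * 2^r with d odd.
    d, r = n - 1, 0
    while d % 2 == 0:
        d //= 2
        r += 1
    # Miller-Rabin rounds with fixed witnesses (deterministic for n < ~3.3e24).
    for a in small_primes:
        x = pow(a, d, n)
        if x in (1, n - 1):
            continue
        for _ in range(r - 1):
            x = pow(x, 2, n)
            if x == n - 1:
                break
        else:
            return False
    return True
```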
It’s not clear to me which notion you want to use at various points in your argument. In several places you talk about there not existing an efficient discriminator (i.e. (1)), for example as a requirement for interpretability, but I think in this case we’d really need (2) in order for these methods to be useful in general.
Thinking about what we expect to be true in the real world, I share your intuition that (1) is probably false in the fully general setting (but could possibly be true). That means we probably shouldn’t hope for a general solution to (2).
But, I also think that for us to have any chance of aligning a possibly-sensor-tampering AGI, we require (1) to be true in the sensor-tampering case. This is because if it were false, that would mean there’s no algorithm at all that can distinguish between actually-good and sensor-tampering outcomes, which would suggest that whether an AGI is aligned is undecidable in some sense. (This is similar to the first point Charlie makes.)
My intuition for why (2) is false in general mostly runs through my intuition that (1) is false in general, but I think (1) is true in the sensor-tampering case (or am at least inclined to focus on such worlds). So I’m optimistic that there might be key differences between the sensor-tampering case and the general setting which can be exploited to provide a solution to (2) in the cases we care about. I’m less sure about what those differences should be.
Your overall picture sounds pretty similar to mine. A few differences:
I don’t think the literal version of (2) is plausible. For example, consider an obfuscated circuit.
The reason that’s OK is that finding the de-obfuscation is just as easy as finding the obfuscated circuit, so if gradient descent can do one it can do the other. So I’m really interested in some modified version of (2), call it (2′). This is roughly like adding an advice string as input to P, with the requirement that the advice string is no harder to learn than M itself (though this isn’t exactly right).
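In the same illustrative vein as the sketch above, (2′) roughly changes the signature so that P also receives an advice string learned alongside M, rather than something that must be recovered from M’s description alone (again, all names here are hypothetical):

```python
from typing import Any, Callable

# Sketch of (2'); everything here is illustrative. The advice is assumed to be
# produced alongside M (e.g. during training), with the informal requirement
# that it be no harder to learn than M itself.
def find_discriminator_with_advice(model: Any, advice: Any) -> Callable[[Any], str]:
    """Output a programme P such that P(x) says which mechanism `model` uses on x.

    In the obfuscated-circuit example, `advice` would play the role of the
    de-obfuscation: as easy to find as the obfuscated circuit itself, but
    possibly infeasible to recover from the circuit's description alone.
    """
    def P(x: Any) -> str:
        # Placeholder: use `advice` to determine which mechanism `model` used on x.
        raise NotImplementedError
    return P
```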
I mostly care about (1) because the difficulty of (1) seems like the main reason to think that (2′) is impossible. Conversely, if we understood why (1) was always possible it would likely give some hints for how to do (2). And generally working on an easier warm-up is often good.
I think that if (1) is false in general, we should be able to find a concrete example where it fails. So that’s a fairly high priority for me, given that this is a crucial question for the feasibility or character of the worst-case approach.
That said, I’m also still worried about the leap from (1) to (2′), and as mentioned in my other comment I’m very interested in finding a way to solve the harder problem in the case of primality testing.