I think there’s something subtly but deeply confused about the core idea of “internal representation”, and also that it’s not that hard to fix.
I think it’s important that our safety concepts around trained AI models/policies respect extensional equivalence, because safety or unsafety supervenes on their behaviour as opaque mathematical functions (except for very niche threat models where external adversaries are corrupting the weights or activations directly). If two models have the same input/output mapping, and only one of them has “internally represented goals”, calling the other one safer—or different at all, from a safety perspective—would be a mistake. (And, in the long run, a mistake that opens Goodhartish loopholes for models to avoid having the internal properties we don’t like without actually being safer.)
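To make the extensional-equivalence point concrete, here is a minimal toy sketch of my own (the names, the integer state space, and the action strings are purely illustrative, not anything from the paper): two policies with identical input/output behaviour, only one of which contains anything that looks like an explicitly represented goal.

```python
# Toy illustration of extensional equivalence (hypothetical names throughout).
# `policy_with_goal` compares the state to an explicit goal variable;
# `policy_lookup` is a bare lookup table with no goal-like structure anywhere.

GOAL = 7  # explicit "internally represented" goal


def policy_with_goal(state: int) -> str:
    """Move toward an explicitly represented goal state."""
    if state < GOAL:
        return "right"
    if state > GOAL:
        return "left"
    return "stay"


# Extensionally identical policy: a table enumerating the same mapping,
# with no variable anywhere that plays the role of a goal.
POLICY_TABLE = {s: policy_with_goal(s) for s in range(16)}


def policy_lookup(state: int) -> str:
    return POLICY_TABLE[state]


# Same input/output mapping, hence (on the threat models above) the same
# safety properties, whatever we say about their internals.
assert all(policy_with_goal(s) == policy_lookup(s) for s in range(16))
```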
The fix is, roughly, to identify the “internal representations” in any suitable causal explanation of the extensionally observable behavior, including but not limited to a causal explanation whose variables correspond naturally to the “actual” computational graph that implements the policy/model.
If we can identify internally represented [goals, etc] in the actual computational graph, that is of course strong evidence of (and in the limit of certainty and non-approximation, logically implies) internally represented goals in some suitable causal explanation. But neither the converse nor the inverse of that implication always holds.
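Here is a sketch of what I mean by finding a represented goal in a causal explanation of the behaviour rather than in the implementing graph (again my own toy construction, not a method from the paper): treat the policy as a black box and search for a goal-directed explanation that reproduces its behaviour exactly. The lookup-table policy from the previous sketch has no goal variable in its computational graph, yet the best explanation of its extensional behaviour does represent one.

```python
# Hypothetical sketch: infer a "represented goal" from behaviour alone.

# A black-box policy implemented purely as a lookup table; nothing in its
# computational graph corresponds to a goal variable.
POLICY_TABLE = {
    s: ("right" if s < 7 else "left" if s > 7 else "stay") for s in range(16)
}


def policy_lookup(state: int) -> str:
    return POLICY_TABLE[state]


def behaviour(policy, states):
    """The policy's extensional behaviour: its input/output mapping on `states`."""
    return tuple(policy(s) for s in states)


def goal_directed_explanation(goal: int):
    """Candidate causal explanation: 'the system acts so as to reach `goal`'."""
    def explain(state: int) -> str:
        if state < goal:
            return "right"
        if state > goal:
            return "left"
        return "stay"
    return explain


def infer_represented_goal(policy, states=range(16)):
    """Return a goal whose goal-directed explanation reproduces the behaviour, if any."""
    observed = behaviour(policy, states)
    for candidate in states:
        if behaviour(goal_directed_explanation(candidate), states) == observed:
            return candidate
    return None


# The lookup table contains no goal, but a suitable causal explanation of its
# behaviour does represent one.
print(infer_represented_goal(policy_lookup))  # prints 7
```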
I believe many non-safety ML people suspect that safety people are making a vaguely superstitious error by assigning such significance to “internal representations”. I don’t think they are exactly right, but I think the view above is the steelman of their position, and that if you can adjust to this perspective, it will make those kinds of critics take a second look. Extensional properties that scientists give causal interpretations to are far more intuitively “real” to people with a culturally stats-y background than supposedly internal/intensional properties.
Sorry these comments come on the last day; I wish I had read the paper in more depth a few days ago.
Overall the paper is good and I’m glad you’re doing it!