Recalling the 3 subclaims of the Natural Abstraction Hypothesis, which I will quote verbatim (a toy sketch of the first claim follows the list):
Abstractability: for most physical systems, the information relevant “far away” can be represented by a summary much lower-dimensional than the system itself.
Human-Compatibility: These summaries are the abstractions used by humans in day-to-day thought/language.
Convergence: a wide variety of cognitive architectures learn and use approximately-the-same summaries.
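To make the Abstractability claim concrete, here is a deliberately simple toy in Python (my own illustration with made-up numbers, not the formal version of the claim): a “far away” observable is influenced by a 200-dimensional micro-state only through its mean, so a 1-dimensional summary predicts it about as well as the full state does.

```python
import numpy as np

# Toy illustration of Abstractability (an assumed construction, not the formal claim):
# the "far away" observable depends on the 200-dimensional micro-state only
# through a 1-dimensional summary (its mean).
rng = np.random.default_rng(0)
n_samples, n_micro = 5000, 200

micro_state = rng.normal(size=(n_samples, n_micro))        # the "system"
summary = micro_state.mean(axis=1, keepdims=True)          # low-dimensional summary
noise = rng.normal(scale=0.1 * summary.std(), size=n_samples)
far_away = summary[:, 0] + noise                           # information "far away"

def r_squared(X, y):
    """In-sample R^2 of an ordinary least-squares fit of y on X."""
    X1 = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(X1, y, rcond=None)
    resid = y - X1 @ coef
    return 1.0 - resid.var() / y.var()

print("R^2 from the full 200-dim state:", round(r_squared(micro_state, far_away), 3))
print("R^2 from the 1-dim summary:     ", round(r_squared(summary, far_away), 3))
# Both come out around 0.99: the other 199 dimensions add essentially no
# usable information about the far-away observable.
```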
I will note that, despite the ordering, I think claim 2 (Human-Compatibility) is the weakest. I also strongly disagree that these claims being partially, or even completely, correct means we can expect an AI not to be deceptive.
The Natural Abstraction Hypothesis is not a statement that systems converge to valuing certain abstractions the same way, merely that they will use similar summaries of the data in their decision-making processes. Two opposing chess players may use completely identical abstractions in how they view the board (material advantage, king safety, etc.) but have directly opposed goals.
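As a toy illustration of that (a made-up position encoding, not a real engine): both players below compute the exact same low-dimensional summary of the position, and their objectives are exact negatives of each other.

```python
# Two players, one shared abstraction, directly opposed goals.
# The position encoding and the evaluation weights are hypothetical stand-ins.
PIECE_VALUES = {"P": 1, "N": 3, "B": 3, "R": 5, "Q": 9}

def shared_abstraction(position):
    """The summary both players agree on: (material balance, king-safety balance)."""
    material = (sum(PIECE_VALUES[p] for p in position["white_pieces"])
                - sum(PIECE_VALUES[p] for p in position["black_pieces"]))
    king_safety = position["white_king_shield"] - position["black_king_shield"]
    return material, king_safety

def evaluate(position, weights=(1.0, 0.3)):
    material, king_safety = shared_abstraction(position)
    return weights[0] * material + weights[1] * king_safety

white_objective = lambda pos: evaluate(pos)     # White maximises this
black_objective = lambda pos: -evaluate(pos)    # Black maximises the exact negation

position = {"white_pieces": ["Q", "R", "P", "P"],
            "black_pieces": ["R", "B", "P", "P", "P"],
            "white_king_shield": 2, "black_king_shield": 3}

print(shared_abstraction(position))                          # same summary for both
print(white_objective(position), black_objective(position))  # opposite valuations
```

Same abstraction of the board, zero agreement about what should happen on it.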
Extending that last point, we realise that some understanding of human abstractions is a powerful tool for effective deceptive alignment. If there is a behaviour that is selected for when humans are unaware of it (say, reward hacking) but strongly selected against when humans are aware of it, the AI may well learn “humans dislike this”, but that doesn’t mean that it will itself “dislike this”.
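A toy version of that selection pressure (illustrative numbers, not a real training setup): enumerate policies over “observed” and “unobserved” contexts, apply the penalty only when the behaviour is seen, and look at which policy scores highest.

```python
# Hypothetical numbers: the behaviour pays off, but is penalised only when seen.
HACK_GAIN = 1.0     # reward from the unwanted behaviour
PENALTY = 10.0      # applied only if a human observes it
P_OBSERVED = 0.3    # fraction of episodes with oversight

def expected_reward(policy):
    """policy maps context -> whether to hack, e.g. {"observed": False, "unobserved": True}."""
    total = 0.0
    for context, p_context in [("observed", P_OBSERVED), ("unobserved", 1 - P_OBSERVED)]:
        if policy[context]:
            total += p_context * (HACK_GAIN - (PENALTY if context == "observed" else 0.0))
    return total

policies = [{"observed": a, "unobserved": b} for a in (False, True) for b in (False, True)]
best = max(policies, key=expected_reward)
print(best, expected_reward(best))
# -> {'observed': False, 'unobserved': True} wins: the selected policy encodes
#    "humans dislike this" without itself disliking the behaviour.
```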
The point of the natural abstractions hypothesis is really the question of how far we can get using interpretability on AI without Goodharting. And the natural abstractions hypothesis says that we can functionally interpret the abstractions the AI is using, even at really high levels of capability.
My interpretation is very wrong in that case. Could you spell out the goodharting connection for me?
Obviously it’s a broader question than what I said, but from an AI safety perspective, the value of the natural abstractions hypothesis, conditional on it being at least partially right, is the following:
Interpretability becomes easier, as we get at least some guarantees about how AIs form abstractions.
Given that they’re lower-dimensional summaries, there’s a chance we can understand the abstractions the AI is using, even when they are alien to us.
As far as Goodhart goes: a scenario that could come up is that trying to make the model explain itself instead pushes us towards the failure mode where we don’t have any real understanding, just simple-sounding summaries that don’t reveal much of anything. The natural abstractions hypothesis says that, by default, AIs will become more interpretable as they become more capable, avoiding the Goodharting of interpretability efforts.
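A minimal sketch of that failure mode as Goodhart-by-selection (my own toy, with made-up numbers): suppose each candidate explanation has a true faithfulness and a measurable “sounds simple and plausible” score that is only loosely correlated with it. Selecting hard on the proxy hands us an explanation that sounds great but is far from the most faithful one available.

```python
import numpy as np

# Each candidate "explanation" has a true property (faithfulness) and a proxy
# we actually select on (plausibility). The correlation of 0.3 is an assumption.
rng = np.random.default_rng(0)
n_candidates, rho = 10_000, 0.3

faithfulness = rng.normal(size=n_candidates)
plausibility = rho * faithfulness + np.sqrt(1 - rho**2) * rng.normal(size=n_candidates)

chosen = np.argmax(plausibility)   # what selection pressure picks: sounds best
print("faithfulness of the best-sounding explanation:", round(faithfulness[chosen], 2))
print("best faithfulness actually available:         ", round(faithfulness.max(), 2))
# Under strong selection on the proxy, the chosen explanation typically scores
# far below the best available one.
```

In the toy’s terms, the natural abstractions hypothesis is the claim that the proxy and the target stay tightly coupled, so this gap does not open up by default.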
That’s a really clear explanation.
I was thinking of the general case of Goodharting and hadn’t made the connection to Goodharting the explanations.