The point of the natural abstractions hypothesis is really the question of how far we can get using interpretability on AI without Goodharting. And the natural abstractions hypothesis says that we can, in practice, interpret the abstractions the AI is using, even at really high levels of capability.
My interpretation is very wrong in that case. Could you spell out the Goodharting connection for me?
Obviously it’s a broader question than what I said, but from an AI safety perspective, the value of the natural abstractions hypothesis, conditional on it being at least partially right, is the following:
Interpretability becomes easier, because we get at least some guarantees about how AIs form their abstractions.
Given that abstractions are lower-dimensional summaries, there’s a chance we can understand the ones the AI is using, even when they are alien to us (a toy sketch of what I mean by “lower-dimensional summary” is below).
As far as Goodhart goes: a scenario that could come up is that trying to make the model explain itself instead pushes us toward the failure mode where we don’t get any real understanding, just simple-sounding summaries that don’t reveal much of anything. The natural abstractions hypothesis says that, by default, AIs will make themselves more interpretable as they become more capable, which avoids Goodharting our interpretability efforts.
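To make “lower-dimensional summary” concrete, here is a toy sketch in Python (my own illustration, not something from the hypothesis itself; the setup and the name `far_away_observable` are made up for this example). A high-dimensional microstate affects a far-away measurement only through one summary number, so any two microstates that share that summary are interchangeable from far away; that one number plays the role of the natural abstraction.

```python
# Toy sketch of "abstraction = low-dimensional summary" (illustrative only).
# A 10,000-dimensional microstate (particle speeds) influences a far-away
# observable only through its mean squared speed, so two entirely different
# microstates with the same summary look identical from far away.
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

def far_away_observable(speeds: np.ndarray) -> float:
    """Idealized 'pressure gauge' that couples to the microstate only
    through the mean squared speed -- the low-dimensional summary."""
    return float(np.mean(speeds ** 2))

# Two unrelated microstates...
state_a = rng.normal(size=n)
state_b = rng.normal(size=n)

# ...rescaled so they share the same one-number summary.
target = far_away_observable(state_a)
state_b = state_b * np.sqrt(target / far_away_observable(state_b))

# Both print the same value: from "far away", the summary is all that matters.
print(far_away_observable(state_a))
print(far_away_observable(state_b))
```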
That’s a really clear explanation.
I was thinking of the general case of Goodharting and hadn’t made the connection to Goodharting the explanations.