I think work on the study of abstraction, one way or another, will be essential to AI alignment. Even “just” making very precise high-level predictions of (an AI’s behavior FROM its internal state) or (human values FROM measured neurological data) requires enough understanding of abstraction to know whether the simplification really captures what we want.
I don’t know whether the natural abstractions hypothesis is really necessary for this. But something like a more developed/complete version of Wentworth’s “minimal maps” representation of abstraction does seem more clearly necessary.
Maybe if the prediction is “direct” enough, we just get mech. interp. again? In my head, some kind of abstraction is necessary if we go by the “Rocket Alignment” analogy.