If I’ve understood it correctly, this is a really important point, so thanks for writing a post about it. The post highlights that mesa-objectives and base objectives will typically be of different “types”: the base objective is typically designed to evaluate things in the world as humans understand it (or as modelled by the formal training setup), whereas the mesa-objective evaluates things in the AI’s world model (or, if it doesn’t really have a world model, more local things, such as the actions themselves rather than their distant consequences).
Am I acting in bad faith?… Surely I “get what they mean”?
I’m certainly glad to see people suspending their sense of “getting it” when it comes to reference (a.k.a. pointers, a.k.a. representation), since I don’t think we have solid foundations for these topics, and I think they are core issues in AI alignment.