From my perspective, there are three levels:
Most general: The inner agent could malignly generalize in some arbitrary bad way.
Middle: The inner agent malignly generalizes in such a way that it makes sense to call it goal-directed, and the mesa-goal (= intentional-stance-goal) is different from the base-goal.
Most specific: The inner agent encodes an explicit search algorithm, an explicit world model, and an explicit utility function.
I worry about the middle case. It seems like upon reading the mesa-optimizers paper, most people start to worry about the last case. I would like people to worry about the middle case instead, and test their proposed solutions against that. (Well, ideally they’d test against the most general case, but if a solution doesn’t work against that, which it probably won’t, that isn’t necessarily a deal breaker.) I feel better about people accidentally worrying about the most general case than about people accidentally worrying about the most specific case.
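To make the contrast concrete, here is a minimal, hypothetical Python sketch of the kind of internals the most specific case posits: an explicit world model, an explicit utility function, and an explicit search over action sequences. All names and the toy environment are made up for illustration; nothing here is taken from the mesa-optimizers paper. The point of the middle case is that a learned policy can behave as if it pursues a mesa-goal without containing any parts this legible.

```python
# Hypothetical sketch (not from the discussion above): what the "most specific"
# case describes -- a policy whose internals literally contain a world model,
# a utility function, and a search procedure. The "middle" case requires none
# of these parts to exist explicitly; the system only needs to *behave* as if
# it were pursuing a mesa-goal different from the base goal.

from itertools import product

class ExplicitMesaOptimizer:
    def __init__(self, world_model, utility_fn, actions, horizon=3):
        self.world_model = world_model  # explicit model: (state, action) -> next state
        self.utility_fn = utility_fn    # explicit mesa-objective: state -> float
        self.actions = actions          # discrete action set
        self.horizon = horizon          # search depth

    def act(self, state):
        """Explicit search: enumerate action sequences, roll each out in the
        world model, and return the first action of the best-scoring plan."""
        best_score, best_first_action = float("-inf"), None
        for plan in product(self.actions, repeat=self.horizon):
            s = state
            for a in plan:
                s = self.world_model(s, a)
            score = self.utility_fn(s)
            if score > best_score:
                best_score, best_first_action = score, plan[0]
        return best_first_action

# Toy usage: a 1-D world where the mesa-goal is "move right", regardless of
# whatever base objective the training process was using.
agent = ExplicitMesaOptimizer(
    world_model=lambda s, a: s + a,
    utility_fn=lambda s: s,
    actions=[-1, 0, 1],
)
print(agent.act(0))  # -> 1
```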
The Inner Alignment Problem might be the name for the middle case, though it feels like it’s not quite specific enough.
I like “inner alignment”, and am not sure why you think it isn’t specific enough.
I think we basically agree. I would also prefer people to think more about the middle case. Indeed, when I use the term mesa-optimiser, I usually intend to talk about the middle picture, though strictly that’s sinful, as the term is tied to optimisers.
Re: inner alignment
I think it’s basically the right term. I guess in my mind I want to say something like, “Inner Alignment is the problem of aligning objectives across the Mesa≠Base gap”, which shows how the two have slightly different shapes. But the difference isn’t really important.
Inner alignment gap? Inner objective gap?