It sounds like you’re assuming that the target configuration set is built into the AI system. According to me, a major point of this post / framework is to avoid that assumption altogether, and only describe problems in terms of the actual observed system behavior.
(This is why within this framework I couldn’t formalize outer alignment, and why wireheading and the search / mesa-objective split are unnatural.)
I see the tension you’re pointing at. I think I had in mind something like “an AI is reliably optimizing utility function u over the configuration space (but not necessarily over universe-histories!) if it reliably moves into high-rated configurations”, and you could draw different epsilon-neighborhoods of optimality in configuration space. It seems like you should be able to talk about dog-maximizers without requiring that the agent robustly end up in the maximum-dog configurations (rather than, say, max-minus-one-dog configurations).
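Something like the following rough sketch (my notation, not anything the post formalizes): with $\mathcal{C}$ the configuration space and $u$ the utility function over it, let

$$\mathcal{C}_\epsilon(u) := \{\, c \in \mathcal{C} : u(c) \ge \sup_{c' \in \mathcal{C}} u(c') - \epsilon \,\},$$

and call the system an $\epsilon$-optimizer of $u$ if it reliably ends up in $\mathcal{C}_\epsilon(u)$; “dog-maximizer” then doesn’t have to mean $\epsilon = 0$.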
I’m still confused about parts of this.