I assume you instead mean all data points that it could ever encounter? Otherwise memorisation is a sufficient strategy, since it will only ever have encountered a finite number of data points.
No—“all data points that it could ever encounter” is stronger than I need and harder to define, since it relies on a counterfactual. All I need is for the model to always output the optimal-loss answer for every input that it’s ever actually given at any point.
When you say “the optimal policy on the actual MDP that it experiences”, is this just during training, or also during deployment? And if the latter, given that the world is non-stationary, in what sense are you referring to the “actual MDP”? (This is a hard question, and I’d be happy if you handwave it as long as you do so explicitly. Although I do think that the fact that the world is not an MDP is important and overlooked).
Deployment, but I agree that this one gets tricky. I don’t think that the world being non-stationary is a problem for conceptualizing it as an MDP, since the non-stationary dynamics can just be folded into a more abstract state. That being said, modeling the world as an MDP does still have problems—for example, the original reward function might not really be well-defined over the whole world. In those sorts of situations, I do think it gets to the point where outer alignment starts breaking down as a concept.
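To spell out that folding move (this is just one standard formalization on my part, not something claimed above): if the transition function and reward vary with time, you can include the time index in the state, and the augmented process is a stationary MDP again:

$$\tilde{S} = S \times \mathbb{N}, \qquad \tilde{T}\big((s, t), a\big) = \big(T_t(s, a),\, t + 1\big), \qquad \tilde{R}\big((s, t), a\big) = R_t(s, a)$$

The price is that the reward function now has to be defined over this richer state space, which is one place the well-definedness worry can show up.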
I’m not sure you have addressed Richard’s point—if you keep your current definition of outer alignment, then memorizing the answers to the finite set of data points it actually encounters is always a way to score perfect loss, but intuitively doesn’t seem like it would be intent aligned. And if memorization were never intent aligned, then your definition of outer alignment would be impossible to satisfy.
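To make the memorization point concrete, here is a toy sketch (hypothetical code, not anything from the discussion above): a model that is nothing but a lookup table over the inputs it has actually been given scores perfect loss on all of them, while clearly not being what we’d intuitively call intent aligned.

```python
# Toy illustration (hypothetical): a "model" that just memorizes the finite
# set of (input, label) pairs it actually encounters.

train_data = {
    "2+2": "4",
    "capital of France": "Paris",
}

def memorizing_model(x):
    # Perfect on every input it was actually given; arbitrary everywhere else.
    return train_data.get(x, "<undefined>")

def loss(x, y):
    # 0/1 loss for the sake of the example.
    return 0.0 if memorizing_model(x) == y else 1.0

# Optimal (zero) loss on every input the model is ever actually given...
assert all(loss(x, y) == 0.0 for x, y in train_data.items())

# ...but nothing about this behaviour looks intent aligned on new inputs.
print(memorizing_model("capital of Germany"))  # -> "<undefined>"
```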