Proposal: just as we might try to infer human values from the state of the world, could we infer a set of high-level features such that existing agents like us appear to optimize simple functions of those features? We would then penalize actions that cause irreversible changes with respect to these high-level features.
This might be entirely within the framework of similarity-based reachability. This might also be exactly what you were just suggesting.
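To make the second half of the proposal concrete, here is a rough toy sketch in Python. Everything in it is an assumption for illustration: the tiny world, the names `reachable_features` and `irreversibility_penalty`, and especially the feature map `phi`, which is written by hand here rather than inferred from agent behavior (the inference step is the hard part and is not attempted). It also uses plain binary reachability rather than a graded similarity measure.

```python
from collections import deque

def reachable_features(start, transitions, phi):
    """Feature values phi(s) over all states reachable from `start` (including it)."""
    seen = {start}
    queue = deque([start])
    while queue:
        s = queue.popleft()
        for nxt in transitions.get(s, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return {phi(s) for s in seen}

def irreversibility_penalty(state, next_state, transitions, phi):
    """1.0 if the feature value of `state` cannot be recovered from `next_state`, else 0.0."""
    recoverable = phi(state) in reachable_features(next_state, transitions, phi)
    return 0.0 if recoverable else 1.0

# Hypothetical toy world: state = (vase status, whether a footprint exists).
# Both leaving a footprint and breaking the vase are irreversible at the raw-state level.
transitions = {
    ("intact", False): [("intact", True), ("broken", False)],
    ("intact", True): [("broken", True)],
    ("broken", False): [("broken", True)],
    ("broken", True): [],
}

# Assumed high-level feature: only the vase status matters.
def vase_only(state):
    return state[0]

# Baseline for comparison: every raw-state detail counts as a feature.
def identity(state):
    return state

start = ("intact", False)
step_penalty_features = irreversibility_penalty(start, ("intact", True), transitions, vase_only)
step_penalty_raw = irreversibility_penalty(start, ("intact", True), transitions, identity)
break_penalty_features = irreversibility_penalty(start, ("broken", False), transitions, vase_only)

print(step_penalty_features, step_penalty_raw, break_penalty_features)
```

With the hand-picked `vase_only` features, leaving a footprint is not penalized while breaking the vase is; with raw states as features, both are penalized. That difference is the intended payoff of measuring irreversibility in feature space, assuming the inferred features really are the ones agents care about.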