Yes, this is basically what I had in mind! I really like this grounding; thanks for writing it out. If there were a value fragility research agenda, this might be a good start; I haven’t yet decided whether I think there are good theorems to be found here, though.
Can you expand on
> including when the maximization is subject to fairly general constraints… Ideally, we’d find some compact criterion for which perturbations preserve value under which constraints.
This is max_{w∈W} u(w), right? And then you might just constrain the subset of W which the agent can search over? Or did you have something else in mind?
> This is max_{w∈W} u(w), right? And then you might just constrain the subset of W which the agent can search over?
Exactly.
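A minimal sketch of that formalism, assuming a toy finite sample of world-states and an illustrative quadratic utility (all names and numbers here are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy setup: world-states are 2-D vectors, and the utility
# u peaks at the origin.
def u(w):
    return -np.sum(w ** 2)

# A finite sample standing in for the world-state space W.
W = rng.uniform(-1.0, 1.0, size=(1000, 2))

# Unconstrained maximization: argmax of u over all of W.
best_unconstrained = max(W, key=u)

# Constrained maximization: the agent may only search the subset of W
# satisfying some constraint, here w[0] >= 0.5.
W_constrained = W[W[:, 0] >= 0.5]
best_constrained = max(W_constrained, key=u)

# The constrained optimum is weakly worse under u.
assert u(best_constrained) <= u(best_unconstrained)
```

The question in the thread is then which constraints keep `u(best_constrained)` close to `u(best_unconstrained)` even when the agent optimizes a perturbed utility.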
One toy model to conceptualize what a “compact criterion” might look like: imagine we take a second-order expansion of u around some u-maximal world-state w*. Then the eigendecomposition of the Hessian of u at w* tells us which directions-of-change in the world-state u cares about a little or a lot. If the constraints lock the accessible world-states into the directions u doesn’t care about much (i.e. eigenvalues near 0), then any accessible world-state near w* compatible with the constraints will have near-maximal u. On the other hand, if the constraints allow variation in directions u does care about a lot (i.e. large eigenvalues), then u will be fragile to perturbations from u to u′ which move the u′-optimal world-state along those directions.
That toy model has a very long list of problems with it, but I think it conveys roughly what kind of things are involved in modelling value fragility.
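A numerical sketch of that toy model, assuming a quadratic u with a diagonal Hessian (the eigenvalues 100 and 0.01 are arbitrary stand-ins for a direction u cares about a lot and one it barely cares about):

```python
import numpy as np

# Hypothetical quadratic utility around its maximum w* = 0:
# u(w) = -(1/2) w^T H w, where H is the negated Hessian of u at w*.
H = np.diag([100.0, 0.01])

def u(w):
    return -0.5 * w @ H @ w

# Eigendecomposition of the Hessian picks out the directions u cares about.
eigvals, eigvecs = np.linalg.eigh(H)
soft_dir = eigvecs[:, np.argmin(eigvals)]  # eigenvalue near 0: u nearly flat
hard_dir = eigvecs[:, np.argmax(eigvals)]  # large eigenvalue: u steep

eps = 0.1  # size of an accessible perturbation away from w*
loss_soft = -u(eps * soft_dir)  # value lost moving along the soft direction
loss_hard = -u(eps * hard_dir)  # value lost moving along the hard direction

# Constraints that confine variation to the soft direction preserve
# near-maximal u; variation along the hard direction does not.
assert loss_soft < 1e-3 < loss_hard
```

The “compact criterion” this gestures at: constraints are value-preserving (to second order, near w*) when the accessible variation lies in the near-null eigenspace of the Hessian.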