So one example would be: fix an EU maximizer. To compute value sensitivity, we consider the sensitivity of outcome value with respect to a range of feasible perturbations to the agent’s utility function. The perturbations only affect the utility function, so everything else is considered part of the dynamics of the situation. You might swap out the EU maximizer for a quantilizer, or change the broader society in which the agent is deployed, but these wouldn’t count as ‘perturbations’ in the original ontology.
Let me know if this is what you’re saying:
we have an agent which chooses X to maximize E[u(X)] (maybe with a do() operator in there)
we perturb the utility function to u’(X)
we then ask whether max E[u(X)] is approximately E[u(X’)], where X’ is the decision that maximizes E[u’(X)]
… so basically it’s a Goodhart model, where we have some proxy utility function and want to check whether the proxy achieves similar value to the original.
Then the value-fragility question asks: under which perturbation distributions are the two values approximately the same? Or, the distance function version: if we assume that u’ is “close to” u, then under what distance functions does that imply the values are close together?
Then your argument would be: the answer to that question depends on the dynamics, specifically on how X influences u. Is that right?
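To make that comparison concrete, here is a minimal sketch in Python; the quadratic u, the shifted proxy u’, the Gaussian outcome noise, and the finite decision grid are all made-up stand-ins rather than anything specified above.

```python
# A minimal sketch of the comparison above, with made-up utility
# functions and a finite decision set. Everything here (the quadratic
# u, the perturbed u_prime, the noise model) is illustrative only.
import numpy as np

rng = np.random.default_rng(0)

def u(outcome):
    # "True" utility: prefers outcomes near the origin.
    return -np.sum(outcome**2, axis=-1)

def u_prime(outcome):
    # Perturbed / proxy utility: optimum shifted slightly.
    return -np.sum((outcome - np.array([0.3, 0.0]))**2, axis=-1)

def expected_value(utility, x, n_samples=10_000):
    # Decision x produces a noisy outcome; estimate E[utility(outcome) | do(x)].
    outcomes = x + 0.1 * rng.standard_normal((n_samples, 2))
    return utility(outcomes).mean()

# Finite decision set to keep the maximization trivial.
decisions = [np.array([a, b]) for a in np.linspace(-1, 1, 21)
                              for b in np.linspace(-1, 1, 21)]

best_for_u = max(decisions, key=lambda x: expected_value(u, x))
best_for_u_prime = max(decisions, key=lambda x: expected_value(u_prime, x))

attainable = expected_value(u, best_for_u)          # max_x E[u(X)]
achieved   = expected_value(u, best_for_u_prime)    # E[u(X')] with X' optimizing u'

print(f"value under true optimum:  {attainable:.3f}")
print(f"value under proxy optimum: {achieved:.3f}")
print(f"value lost to the perturbation: {attainable - achieved:.3f}")
```

The gap between the two printed values is the quantity the value-fragility question asks about: how much u is lost when the proxy u’ does the optimizing.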
Assuming all that is what you’re saying… I’m imagining another variable, which is roughly a world-state W. When we write utility as a function of X directly (i.e. u(X)), we’re implicitly integrating over world states. Really, the utility function is u(W(X)): X influences the world-state, and then the utility is over (estimated) world-states. When I talk about “factoring out the dynamics”, I mean that we think about the function u(W), ignoring X. The sensitivity question is then something like: under what perturbations is u’(W) a good approximation of u(W), and in particular when are maxima of u’(W) near-maximal for u(W), including when the maximization is subject to fairly general constraints. The maximization is no longer over X, but instead over world-states W directly—we’re asking which world-states (compatible with the constraints) maximize each utility. (For specific scenarios, the constraints would encode the world-states reachable by the dynamics.) Ideally, we’d find some compact criterion for which perturbations preserve value under which constraints.
(Meta: this was useful, I understand this better for having written it out.)
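And the same check with the dynamics factored out, as a sketch: maximize u and a perturbed u’ directly over world-states subject to a constraint set (here an arbitrary made-up constraint standing in for “world-states reachable by the dynamics”) and ask whether the u’-optimum is near-maximal for u.

```python
# A minimal sketch of the "factored out" version: maximize u and u'
# directly over a constrained set of world-states and check whether
# the u'-optimal world-state is near-optimal for u. The particular
# u, u', and constraint are made up for illustration.
import numpy as np

def u(w):
    return -np.sum(w**2)

def u_prime(w):
    return -np.sum((w - np.array([0.5, 0.0]))**2)

def feasible(w):
    # Constraint standing in for "world-states reachable by the dynamics".
    return np.abs(w[0] - w[1]) <= 0.2

# Crude grid search over constrained world-states.
grid = [np.array([a, b]) for a in np.linspace(-1, 1, 41)
                         for b in np.linspace(-1, 1, 41)]
W_feasible = [w for w in grid if feasible(w)]

w_star = max(W_feasible, key=u)          # argmax of u over the constraint set
w_proxy = max(W_feasible, key=u_prime)   # argmax of u' over the constraint set

print(f"u at its own constrained optimum: {u(w_star):.3f}")
print(f"u at the u'-constrained optimum:  {u(w_proxy):.3f}")
```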
Yes, this is basically what I had in mind! I really like this grounding; thanks for writing it out. If there were a value fragility research agenda, this might be a good start; I haven’t yet decided whether I think there are good theorems to be found here, though.
Can you expand on:
“including when the maximization is subject to fairly general constraints… Ideally, we’d find some compact criterion for which perturbations preserve value under which constraints.”
This is max_{w ∈ W} u(w), right? And then you might just constrain the subset of W which the agent can search over? Or did you have something else in mind?
“This is max_{w ∈ W} u(w), right? And then you might just constrain the subset of W which the agent can search over?”
Exactly.
One toy model to conceptualize what a “compact criterion” might look like: imagine we take a second-order expansion of u around some u-maximal world-state w∗. Then, the eigendecomposition of the Hessian of u at w∗ tells us which directions-of-change in the world-state u cares about a little or a lot. If the constraints lock the accessible world-states into the directions which u doesn’t care about much (i.e. eigenvalues near 0), then any accessible world-state near w∗ compatible with the constraints will have near-maximal u. On the other hand, if the constraints allow variation in directions which u does care about a lot (i.e. large eigenvalues), then u will be fragile to perturbations from u to u’ which move the u’-optimal world-state along those directions.
That toy model has a very long list of problems with it, but I think it conveys roughly what kind of things are involved in modelling value fragility.
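A rough numerical version of that toy model, with a made-up diagonal Hessian: shifting the optimum a fixed distance along the small-eigenvalue direction costs almost no u, while the same-sized shift along the large-eigenvalue direction costs a lot.

```python
# A rough numerical illustration of the eigenvalue criterion sketched
# above. u is quadratic, so the second-order expansion is exact; the
# specific Hessian, shift size, and perturbation directions are made up.
import numpy as np

# True utility u(w) = -0.5 * w^T H w, maximized at w* = 0.
# H's eigenvalues say how much u cares about each direction.
H = np.diag([10.0, 0.01])          # cares a lot about axis 0, barely about axis 1
def u(w):
    return -0.5 * w @ H @ w

eigvals, eigvecs = np.linalg.eigh(H)
print("eigenvalues of the Hessian:", eigvals)

# Perturb u so its optimum moves a fixed distance along one eigendirection,
# i.e. u'(w) = u(w - delta). Compare the u-value at the new optimum.
step = 0.5
for lam, direction in zip(eigvals, eigvecs.T):
    w_proxy_opt = step * direction     # optimum of the perturbed utility
    loss = u(np.zeros(2)) - u(w_proxy_opt)
    print(f"shift along eigenvalue {lam:5.2f} direction -> u lost {loss:.4f}")
```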