Rather than asking “are human values fragile?”, we ask “under what distance metric(s) are human values fragile?”—that’s the new “API” of the value-fragility question.
In other words: “against which compact ways of generating perturbations is human value fragile?”. But don’t you still need to consider some dynamics for this question to be well-defined? So it doesn’t seem like it captures all of the regularities implied by:
Distance metrics allow us to “factor out” that context-dependence, to wrap it in a clean API.
But I do presently agree that it’s a good conceptual handle for exploring robustness against different sets of perturbations.
In other words: “against which compact ways of generating perturbations is human value fragile?”. But don’t you still need to consider some dynamics for this question to be well-defined?
Not quite. If we frame the question as “which compact ways of generating perturbations”, then that’s implicitly talking about dynamics, since we’re asking how the perturbations were generated. But if we know what perturbations are generated, then we can say whether human value is fragile against those perturbations, regardless of how they’re generated. So, rather than framing the question as “which compact ways of generating perturbations”, we frame it as “which sets of perturbations” or “densities of perturbations” or a distance function on perturbations.
Ideally, we come up with a compact criterion for when human values are fragile against such sets/densities/distance functions.
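To make the "distance function" version a bit more concrete, here is a toy sketch (the sup-norm metric, the random sampling, and all the numbers are purely illustrative choices, not anything established in this discussion): fix a distance d on utility functions, and ask how badly true value can drop at the optimum of some perturbed u’ lying within distance eps of u.

```python
# One way to make "fragile against a distance function" concrete (toy construction, not a
# claim from the discussion): fix a distance d on utility functions, and ask how badly true
# value can drop at the optimum of some perturbed u' within d-distance eps of u.
import numpy as np

def sup_distance(u, u_prime):
    # d(u, u') = max over world-states of |u(w) - u'(w)|  (one possible choice of metric)
    return np.max(np.abs(u - u_prime))

def worst_sampled_value_loss(u, eps, n_samples=10_000, seed=0):
    # Crude search over the eps-ball around u: sample bounded perturbations and track the
    # largest drop in true value at the perturbed optimum.
    rng = np.random.default_rng(seed)
    worst = 0.0
    for _ in range(n_samples):
        u_prime = u + rng.uniform(-eps, eps, size=u.shape)
        assert sup_distance(u, u_prime) <= eps       # perturbation stays inside the eps-ball
        worst = max(worst, u.max() - u[np.argmax(u_prime)])
    return worst

u = np.random.default_rng(1).normal(size=100)        # toy utility over 100 world-states
print(worst_sampled_value_loss(u, eps=0.5))
```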
(I meant to say ‘perturbations’, not ‘permutations’)
Not quite. If we frame the question as “which compact ways of generating permutations”, then that’s implicitly talking about dynamics, since we’re asking how the permutations were generated.
Hm, maybe we have two different conceptions. I’ve been imagining singling out a variable (e.g. the utility function) and perturbing it in different ways, and then filing everything else under the ‘dynamics.’
So one example would be, fix an EU maximizer. To compute value sensitivity, we consider the sensitivity of outcome value with respect to a range of feasible perturbations to the agent’s utility function. The perturbations only affect the utility function, and so everything else is considered to be part of the dynamics of the situation. You might swap out the EU maximizer for a quantilizer, or change the broader society in which the agent is deployed, but these wouldn’t classify as ‘perturbations’ in the original ontology.
Point is, these perturbations aren’t actually generated within the imagined scenarios, but we generate them outside of the scenarios in order to estimate outcome sensitivity.
Perhaps this isn’t clean, and perhaps I should rewrite parts of the review with a clearer decomposition.
So one example would be, fix an EU maximizer. To compute value sensitivity, we consider the sensitivity of outcome value with respect to a range of feasible perturbations to the agent’s utility function. The perturbations only affect the utility function, and so everything else is considered to be part of the dynamics of the situation. You might swap out the EU maximizer for a quantilizer, or change the broader society in which the agent is deployed, but these wouldn’t classify as ‘perturbations’ in the original ontology.
Let me know if this is what you’re saying:
we have an agent which chooses X to maximize E[u(X)] (maybe with a do() operator in there)
we perturb the utility function to u’(X)
we then ask whether max E[u(X)] is approximately E[u(X’)], where X’ is the decision chosen to maximize E[u’(X)]
… so basically it’s a Goodhart model, where we have some proxy utility function and want to check whether the proxy achieves similar value to the original.
Then the value-fragility question asks: under which perturbation distributions are the two values approximately the same? Or, the distance function version: if we assume that u’ is “close to” u, then under what distance functions does that imply the values are close together?
Then your argument would be: the answer to that question depends on the dynamics, specifically on how X influences u. Is that right?
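Here is a minimal numerical sketch of that Goodhart-style check, treating E[u(x)] as just a vector of values over a finite action set (the perturbation model and all numbers are illustrative assumptions, nothing more):

```python
# Minimal Goodhart-style check (illustrative only): treat E[u(x)] as a plain vector of
# values over a finite action set, perturb it to get the proxy u', and compare the true
# value of the proxy-optimal action against the true optimum.
import numpy as np

rng = np.random.default_rng(0)

n_actions = 50
u_vals = rng.normal(size=n_actions)                       # stand-in for E[u(x)] per action
u_proxy_vals = u_vals + 0.1 * rng.normal(size=n_actions)  # stand-in for E[u'(x)]

x_proxy = np.argmax(u_proxy_vals)            # the decision X' chosen under the proxy
value_gap = u_vals.max() - u_vals[x_proxy]   # max E[u(X)] minus E[u(X')]

print(f"true optimum: {u_vals.max():.3f}   value of proxy-optimal action: {u_vals[x_proxy]:.3f}")
print(f"value gap: {value_gap:.3f}")
```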
Assuming all that is what you’re saying… I’m imagining another variable, which is roughly a world-state W. When we write utility as a function of X directly (i.e. u(X)), we’re implicitly integrating over world states. Really, the utility function is u(W(X)): X influences the world-state, and then the utility is over (estimated) world-states. When I talk about “factoring out the dynamics”, I mean that we think about the function u(W), ignoring X. The sensitivity question is then something like: under what perturbations is u’(W) a good approximation of u(W), and in particular when are maxima of u’(W) near-maximal for u(W), including when the maximization is subject to fairly general constraints. The maximization is no longer over X, but instead over world-states W directly—we’re asking which world-states (compatible with the constraints) maximize each utility. (For specific scenarios, the constraints would encode the world-states reachable by the dynamics.) Ideally, we’d find some compact criterion for which perturbations preserve value under which constraints.
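A small sketch of that world-state framing, under toy assumptions (linear utilities over world-states, and a norm-ball constraint set standing in for the "reachable" states): maximize u’ over the constrained world-states and check whether the result is near-maximal for u.

```python
# Toy sketch of the world-state framing (illustrative assumptions throughout):
# u and u' are functions of world-states, the constraint set stands in for the
# world-states reachable by the dynamics, and we ask whether the u'-optimal
# reachable state is near-maximal for u.
import numpy as np

rng = np.random.default_rng(1)

d = 5
W = rng.normal(size=(1000, d))                   # candidate world-states (toy sample)
theta = rng.normal(size=d)                       # direction u cares about (toy linear utility)
theta_proxy = theta + 0.05 * rng.normal(size=d)  # perturbed direction defining u'

u_vals = W @ theta                               # u(W) for each candidate state
u_proxy_vals = W @ theta_proxy                   # u'(W) for each candidate state

feasible = np.linalg.norm(W, axis=1) <= 1.5      # constraint: which states are reachable
u_feasible = u_vals[feasible]
u_proxy_feasible = u_proxy_vals[feasible]

w_star_proxy = np.argmax(u_proxy_feasible)       # index of the u'-optimal reachable state
print("max u over the constraint set:      ", u_feasible.max())
print("u at the u'-optimal reachable state:", u_feasible[w_star_proxy])
```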
(Meta: this was useful, I understand this better for having written it out.)
Yes, this is basically what I had in mind! I really like this grounding; thanks for writing it out. If there were a value fragility research agenda, this might be a good start; I haven’t yet decided whether I think there are good theorems to be found here, though.
Can you expand on
including when the maximization is subject to fairly general constraints… Ideally, we’d find some compact criterion for which perturbations preserve value under which constraints.
This is max_{w ∈ W} u(w), right? And then you might just constrain the subset of W which the agent can search over? Or did you have something else in mind?
This is max_{w ∈ W} u(w), right? And then you might just constrain the subset of W which the agent can search over?
Exactly.
One toy model to conceptualize what a “compact criterion” might look like: imagine we take a second-order expansion of u around some u-maximal world-state w∗. Then, the eigendecomposition of the Hessian of u around w∗ tells us which directions-of-change in the world state u cares about a little or a lot. If the constraints lock the accessible world-states into the directions which u doesn’t care about much (i.e. eigenvalues near 0), then any accessible world-state near w∗ compatible with the constraints will have near-maximal u. On the other hand, if the constraints allow variation in directions which u does care about a lot (i.e. large eigenvalues), then u will be fragile to perturbed utilities u’ whose u’-optimal world-states move along those directions.
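A quick numerical version of that toy model, with everything (the random Hessian, the single allowed direction, the step size) chosen purely for illustration:

```python
# Numerical version of the toy model: quadratic u around a maximum w*, Hessian
# eigendecomposition, and a constraint that only allows movement along one direction.
# The Hessian, the chosen directions, and the step size are all illustrative.
import numpy as np

rng = np.random.default_rng(2)

d = 4
A = rng.normal(size=(d, d))
H = -(A @ A.T + 0.1 * np.eye(d))       # Hessian of u at w*: negative definite at a maximum

eigvals, eigvecs = np.linalg.eigh(H)   # ascending eigenvalues; columns are directions-of-change

# Suppose the constraints only allow displacement along a single direction v away from w*.
v_dont_care = eigvecs[:, -1]           # eigenvalue closest to 0: u barely cares
v_care_a_lot = eigvecs[:, 0]           # most negative eigenvalue: u cares a lot

for name, v in [("don't-care direction", v_dont_care), ("care-a-lot direction", v_care_a_lot)]:
    delta = 0.5 * v                    # accessible displacement from w*
    drop = -0.5 * delta @ H @ delta    # u(w*) - u(w* + delta) under the quadratic model
    print(f"{name}: drop in u = {drop:.4f}")

print("Hessian eigenvalues:", np.round(eigvals, 3))
```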
That toy model has a very long list of problems with it, but I think it conveys roughly what kind of things are involved in modelling value fragility.