I read through the first part of this review, and generally thought “yep, this is basically right, except it should factor out the distance metric explicitly rather than dragging in all this stuff about dynamics”. I had completely forgotten that I said the same thing a year ago, so I was pretty amused when I reached the quote.
Anyway, I’ll defend the distance metric thing a bit here.
But what exactly happens between “we write down something too distant from the ‘truth’” and the result? The AI happens. But this part, the dynamics, it’s kept invisible.
I claim that “keeping the dynamics invisible” is desirable here.
The reason that “fragility of human values” is a useful concept/hypothesis in the first place is that it cuts reality at the joints. What does that mean? Roughly speaking, it means that there’s a broad class of different questions for which “are human values fragile?” is an interesting and useful subquestion, without needing a lot of additional context. We can factor out the “are human values fragile?” question, and send someone off to go think about that question, without a bunch of context about why exactly we want to answer the question. Conversely, because the answer isn’t highly context-dependent, we can think about the question once and then re-use the answer when thinking about many different scenarios—e.g. foom or CAIS or multipolar takeoff or …. Fragility of human values is a gear in our models, and once we’ve made the investment to understand that gear, we can re-use it over and over again as the rest of the model varies.
Of course, that only works to the extent that fragility of human values actually doesn’t depend on a bunch of extra context. Which it obviously does, as this review points out. Distance metrics allow us to “factor out” that context-dependence, to wrap it in a clean API.
Rather than asking “are human values fragile?”, we ask “under what distance metric(s) are human values fragile?”—that’s the new “API” of the value-fragility question. Then, when someone comes along with a specific scenario (like foom or CAIS or …), we ask what distance metric is relevant to the dynamics of that scenario. For instance, in a foom scenario, the relevant distance metric is probably determined by the AI’s ontology—i.e. what things the AI thinks are “similar”. In a corporate-flavored multipolar takeoff scenario, the relevant distance metric might be driven by economic/game-theoretic considerations: outcomes with similar economic results (e.g. profitability of AI-run companies) will be “similar”.
The point is that these distance metrics tell us what particular aspects/properties of each scenario are relevant to value fragility.
Rather than asking “are human values fragile?”, we ask “under what distance metric(s) are human values fragile?”—that’s the new “API” of the value-fragility question.
In other words: “against which compact ways of generating perturbations is human value fragile?”. But don’t you still need to consider some dynamics for this question to be well-defined? So it doesn’t seem like it captures all of the regularities implied by:
Distance metrics allow us to “factor out” that context-dependence, to wrap it in a clean API.
But I do presently agree that it’s a good conceptual handle for exploring robustness against different sets of perturbations.
In other words: “against which compact ways of generating perturbations is human value fragile?”. But don’t you still need to consider some dynamics for this question to be well-defined?
Not quite. If we frame the question as “which compact ways of generating perturbations”, then that’s implicitly talking about dynamics, since we’re asking how the perturbations were generated. But if we know what perturbations are generated, then we can say whether human value is fragile against those perturbations, regardless of how they’re generated. So, rather than framing the question as “which compact ways of generating perturbations”, we frame it as “which sets of perturbations” or “densities of perturbations” or a distance function on perturbations.
Ideally, we come up with a compact criterion for when human values are fragile against such sets/densities/distance functions.
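To make "fragile against a set of perturbations" concrete, here is a minimal sketch under invented assumptions (a finite outcome space, perturbations sampled from an L-infinity ball of radius eps, and worst-case value loss as the fragility measure; none of this is fixed by the discussion above):

```python
import numpy as np

# Toy sketch: "is u fragile against this set of perturbations?" read as
# "how much true value can we lose by optimizing a perturbed utility instead?"
# The finite outcome space and the L-infinity ball of perturbations are
# illustrative assumptions, not anything specified in the discussion.
rng = np.random.default_rng(0)
n_outcomes = 50
u = rng.normal(size=n_outcomes)              # "true" utility over a finite outcome set

def value_loss(u, u_perturbed):
    """True value lost by picking the outcome that the perturbed utility prefers."""
    return u.max() - u[np.argmax(u_perturbed)]

def fragility(u, perturbation_set):
    """Worst-case loss over a set of perturbed utilities."""
    return max(value_loss(u, u_p) for u_p in perturbation_set)

# One choice of "set of perturbations": utilities within L-infinity distance eps
# of u (approximated here by sampling).
eps = 0.1
perturbation_set = [u + rng.uniform(-eps, eps, size=n_outcomes) for _ in range(1000)]
print(fragility(u, perturbation_set))        # provably at most 2 * eps
```

For this particular metric the answer has exactly the compact form asked for: the worst-case loss is at most 2·eps. A metric that only tracks average differences (so a few outcomes can move arbitrarily far while staying "close") admits no such bound, which is the sense in which fragility depends on the distance function.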
(I meant to say ‘perturbations’, not ‘permutations’)
Not quite. If we frame the question as “which compact ways of generating permutations”, then that’s implicitly talking about dynamics, since we’re asking how the permutations were generated.
Hm, maybe we have two different conceptions. I’ve been imagining singling out a variable (e.g. the utility function) and perturbing it in different ways, and then filing everything else under the ‘dynamics.’
So one example would be, fix an EU maximizer. To compute value sensitivity, we consider the sensitivity of outcome value with respect to a range of feasible perturbations to the agent’s utility function. The perturbations only affect the utility function, and so everything else is considered to be part of the dynamics of the situation. You might swap out the EU maximizer for a quantilizer, or change the broader society in which the agent is deployed, but these wouldn’t classify as ‘perturbations’ in the original ontology.
Point is, these perturbations aren’t actually generated within the imagined scenarios, but we generate them outside of the scenarios in order to estimate outcome sensitivity.
Perhaps this isn’t clean, and perhaps I should rewrite parts of the review with a clearer decomposition.
So one example would be, fix an EU maximizer. To compute value sensitivity, we consider the sensitivity of outcome value with respect to a range of feasible perturbations to the agent’s utility function. The perturbations only affect the utility function, and so everything else is considered to be part of the dynamics of the situation. You might swap out the EU maximizer for a quantilizer, or change the broader society in which the agent is deployed, but these wouldn’t classify as ‘perturbations’ in the original ontology.
Let me know if this is what you’re saying:
we have an agent which chooses X to maximize E[u(X)] (maybe with a do() operator in there)
we perturb the utility function to u’(X)
we then ask whether max_X E[u(X)] is approximately E[u(X’)], where X’ is the decision maximizing E[u’(X’)]
… so basically it’s a Goodhart model, where we have some proxy utility function and want to check whether the proxy achieves similar value to the original.
Then the value-fragility question asks: under which perturbation distributions are the two values approximately the same? Or, the distance function version: if we assume that u’ is “close to” u, then under what distance functions does that imply the values are close together?
Then your argument would be: the answer to that question depends on the dynamics, specifically on how X influences u. Is that right?
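(For concreteness, here is a toy numeric version of that setup; the decision space, outcome model, utilities, and the size of the perturbation are all invented for illustration.)

```python
import numpy as np

rng = np.random.default_rng(1)
decisions = range(10)                                   # finite decision space for X

def outcome_samples(x, n=1000):
    return x + rng.normal(0.0, 0.1, size=n)             # stochastic consequence of decision X

def u(w):
    return -(w - 5.0) ** 2                               # "true" utility over outcomes

def u_prime(w):
    return -(w - 5.6) ** 2                               # perturbed / proxy utility

def expected(util, x):
    return util(outcome_samples(x)).mean()               # Monte Carlo estimate of E[util | X = x]

best_true_value = max(expected(u, x) for x in decisions)        # max_X E[u(X)]
x_proxy = max(decisions, key=lambda x: expected(u_prime, x))    # X' maximizing E[u'(X')]

# The value-fragility question: is E[u(X')] close to max_X E[u(X)]?
print(best_true_value, expected(u, x_proxy))
```

Here the perturbation moves the proxy optimum from X = 5 to X = 6, and the true expected utility drops from roughly 0 to roughly −1; whether that counts as "fragile" is exactly what the perturbation set / distance metric has to decide.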
Assuming all that is what you’re saying… I’m imagining another variable, which is roughly a world-state W. When we write utility as a function of X directly (i.e. u(X)), we’re implicitly integrating over world states. Really, the utility function is u(W(X)): X influences the world-state, and then the utility is over (estimated) world-states. When I talk about “factoring out the dynamics”, I mean that we think about the function u(W), ignoring X. The sensitivity question is then something like: under what perturbations is u’(W) a good approximation of u(W), and in particular when are maxima of u’(W) near-maximal for u(W), including when the maximization is subject to fairly general constraints. The maximization is no longer over X, but instead over world-states W directly—we’re asking which world-states (compatible with the constraints) maximize each utility. (For specific scenarios, the constraints would encode the world-states reachable by the dynamics.) Ideally, we’d find some compact criterion for which perturbations preserve value under which constraints.
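(Writing that out, with C ⊆ W as the constraint set of reachable world-states and ε as a tolerance, the question is: for which perturbations u → u’ does

u( argmax_{w ∈ C} u’(w) ) ≥ max_{w ∈ C} u(w) − ε

hold, and across which families of constraint sets C? The ε and the exact quantifier order are my own gloss, not something fixed above.)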
(Meta: this was useful, I understand this better for having written it out.)
Yes, this is basically what I had in mind! I really like this grounding; thanks for writing it out. If there were a value fragility research agenda, this might be a good start; I haven’t yet decided whether I think there are good theorems to be found here, though.
Can you expand on
including when the maximization is subject to fairly general constraints… Ideally, we’d find some compact criterion for which perturbations preserve value under which constraints.
This is max_{w ∈ W} u(w), right? And then you might just constrain the subset of W which the agent can search over? Or did you have something else in mind?
This is max_{w ∈ W} u(w), right? And then you might just constrain the subset of W which the agent can search over?
Exactly.
One toy model to conceptualize what a “compact criterion” might look like: imagine we take a second-order expansion of u around some u-maximal world-state w∗. Then, the eigendecomposition of the Hessian of u at w∗ tells us which directions-of-change in the world-state u cares about a little or a lot. If the constraints lock the accessible world-states into the directions which u doesn’t care about much (i.e. eigenvalues near 0), then any accessible world-state near w∗ compatible with the constraints will have near-maximal u. On the other hand, if the constraints allow variation in directions which u does care about a lot (i.e. large eigenvalues), then u will be fragile to perturbations from u to u’ which move the u’-optimal world-state along those directions.
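A tiny numeric version of that toy model (the Hessian, the step size, and the directions are made-up numbers, just to make the eigenvalue picture concrete):

```python
import numpy as np

# Second-order model of u around its maximum w*: u(w) ≈ 0.5 * (w - w*)^T H (w - w*),
# with H the (negative definite) Hessian at w*.  Numbers are illustrative only.
w_star = np.zeros(3)
H = np.diag([-10.0, -10.0, -0.01])

def u(w):
    d = w - w_star
    return 0.5 * d @ H @ d

eigvals, eigvecs = np.linalg.eigh(H)
# Near-zero eigenvalues: directions u barely cares about.
# Large-magnitude eigenvalues: directions u cares about a lot.
for lam, v in zip(eigvals, eigvecs.T):
    drop = -u(w_star + v)          # loss in u from a unit step along this eigenvector
    print(f"eigenvalue {lam:7.2f}: a unit step along this direction costs {drop:.3f} of u")

# If the constraints only allow movement along the near-zero-eigenvalue direction,
# every accessible state near w* stays near-maximal for u.  If they allow movement
# along a large-eigenvalue direction, a perturbed u' whose optimum lies out along
# that direction gives up a lot of u-value.
```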
That toy model has a very long list of problems with it, but I think it conveys roughly what kind of things are involved in modelling value fragility.