Someone at the coffee hour (Viktoriya? Apologies if I’ve gotten the name wrong) gave a short explanation of this using cycles. If you imagine an agent moving either to the left or the right along a hallway, you can change its utility function in a cycle such that it repeatedly ends up in the same place in the hallway with the same utility function.
This basically rules out expected utility maximization (with utility a discounted sum of state utilities) as what produces this behavior. But you can still imagine selecting a policy such that it takes the right actions in response to the signals you send it. I think a sensible way to do this is along the lines of tailcalled’s recent post, with causal counterfactuals for sending one signal or another.
🤔 I was about to say that I felt like my approach could still be done in terms of state rewards, and that it just violates some of the technical assumptions in the OP. After all, you could just reward being in a state such that the various counterfactuals apply when rolling out from this state; this would assign higher utility to the blue states than the red states, encouraging corrigibility, and contradicting TurnTrout’s assumption that utility would be assigned solely based on the letter.
But then I realized that this introduces a policy dependence into the reward function: the way you roll out from a state depends on which policy you have. (Well, in principle; in practice some MDPs may not have much dependence on it.) The special thing about state-based rewards is that you can assign utilities to trajectories without considering the policy that generates the trajectory at all. (Which to me seems bad for corrigibility, since corrigibility depends on the reasons for the trajectories, and not just the trajectories themselves.)
But now consider the following: if you have the policy, you can figure out which actions were taken, just by applying the policy to the state/history. And instrumental convergence does not apply to utility functions over action-observation histories. So it doesn’t apply to utility functions over (policy, observation history) pairs either. (I think?? At least if the set of policies is closed under replacing an action under a specified condition, and there are no Newcomblike issues that create non-causal dependencies between policies and observation histories.)
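As a minimal sketch of the first point (the policy plus the observations pins down the actions, so nothing a u-AOH can see is lost), here is a toy reconstruction; the deterministic policy and the observation strings below are hypothetical:

```python
# Sketch: given a deterministic policy (a function of the observation history so far)
# and the observations actually received, we can reconstruct the full
# action-observation history. So a utility over (policy, observation history)
# can "see" everything a u-AOH can see.

def reconstruct_actions(policy, observations):
    """Recover the action sequence by replaying the policy on the growing history."""
    history = []   # list of (observation, action) pairs seen so far
    actions = []
    for obs in observations:
        act = policy(history, obs)  # deterministic policy: past history + current obs -> action
        actions.append(act)
        history.append((obs, act))
    return actions

# Toy deterministic policy: go 'left' after seeing "L", 'right' after seeing "R".
toy_policy = lambda history, obs: 'left' if obs == 'L' else 'right'

print(reconstruct_actions(toy_policy, ['R', 'R', 'L', 'R']))
# ['right', 'right', 'left', 'right']
```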
So a lot of the power of instrumental convergence comes from restricting what the utility function is allowed to depend on. u-AOH is clearly too broad, since it allows assigning utilities to arbitrary sequences of actions with identical effects; and at the same time u-AOH, u-OH, and ordinary state-based reward functions (can we call that u-S?) are all too narrow, since none of them allow assigning utilities to counterfactuals, which is required in order to phrase things like “humans have control over the AI” (as this is a causal statement and thus depends on the AI).
We could consider u-P, utility functions over policies. This is the most general sort of utility function (I think??), and as such it is also way too general, just like u-AOH is. I think maybe what I should try to do is define some causal/counterfactual generalizations of u-AOH, u-OH, and u-S, which allow better-behaved utility functions.
I think instrumental convergence should still apply to some utility functions over policies, specifically the ones that seem to produce “smart” or “powerful” behavior from simple rules. But I don’t know how to formalize this or if anyone else has.
Since you can convert a utility function over states or observation-histories into a utility function over policies (well, as long as you have a model for measuring the utility of a policy), and since utility functions over states/observation-histories do satisfy instrumental convergence, yes you are correct.
I feel like, in a way, one could see the restriction to defining the utility in terms of e.g. states as a definition of “smart” behavior; if you define a reward in terms of states, then the policy must “smartly” generate those states, rather than just yield some sort of arbitrary behavior.
🤔 I wonder if this approach could generalize TurnTrout’s approach. I’m not entirely sure how, but we might imagine that a structured utility function u(π) over policies could be decomposed as r(f(π)), where f extracts the features that the utility function pays attention to, and r is the utility function expressed in terms of those features. E.g. for state-based rewards, one might take f to be a model that yields the distribution of states visited by the policy, and r to be the reward function on the individual states. (Some sort of modification would have to be made to address the fact that f outputs a distribution but r takes in a single state… I guess this could be handled by working in the category of vector spaces and linear transformations, but I’m not sure if that’s the best approach in general; though since Set can be embedded into this category, it surely can’t hurt too much.)
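As a rough illustration of the u(π) = r(f(π)) decomposition (the 3-state MDP, horizon, and policies below are all made up; f is taken to be a discounted state-visit distribution and r a linear reward, so u is literally the composition):

```python
import numpy as np

# Toy sketch of u(pi) = r(f(pi)):
#   f maps a policy to its (discounted) state-visit distribution,
#   r is a linear reward on states, so u is their composition.
# The MDP below is hypothetical: 3 states, 2 actions, deterministic transitions.

N_STATES, GAMMA, HORIZON = 3, 0.9, 200
# next_state[s][a] = deterministic successor of state s under action a
next_state = [[0, 1], [0, 2], [1, 2]]

def f(policy, start=0):
    """Feature map: discounted state-visit distribution of a policy (a vector in R^3)."""
    visits = np.zeros(N_STATES)
    s, discount = start, 1.0
    for _ in range(HORIZON):
        visits[s] += discount
        s = next_state[s][policy(s)]
        discount *= GAMMA
    return visits * (1 - GAMMA)  # normalize so the weights roughly sum to 1

def u(policy, r):
    """Utility of a policy = linear reward r applied to its features."""
    return float(r @ f(policy))

r = np.array([0.0, 0.0, 1.0])        # only state 2 is rewarded
stay_put   = lambda s: 0             # drifts toward state 0
seek_right = lambda s: 1             # drifts toward state 2
print(u(stay_put, r), u(seek_right, r))   # seek_right scores higher
```

Here the “work in Vect” move just amounts to f landing in R^3 and r being a linear functional on it, so the distribution-vs-single-state mismatch disappears.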
Then the power-seeking situation boils down to this: the vast majority of policies π lead to essentially the same features f(π), but there is a small set of power-seeking policies that lead to a vastly greater range of different features? And so for most r, a π that optimizes/satisfices/etc. r∘f will come from this small set of power-seeking policies.
I’m not sure how to formalize this. I think it won’t hold for generic vector spaces, since almost all linear transformations are invertible? But it seems to me that in reality, there’s a great degree of non-injectivity. The idea of “chaos inducing abstractions” seems relevant, in the sense that parameter changes in π will mostly tend to lead to completely unpredictable/unsystematic/dissipated effects, and partly tend to lead to predictable and systematic effects. If most of the effects are unpredictable/unsystematic, then f must be extremely non-injective, and this non-injectivity then generates power-seeking.
(Or does it? I guess you’d have to have some sort of interaction effect, where some parameters control the degree to which the function is injective with regards to other parameters. But that seems to hold in practice.)
I’m not sure whether I’ve said anything new or useful.
though since Set can be embedded into [Vect], it surely can’t hurt too much
As an aside, can you link to/say more about this? Do you mean that there exists a faithful functor from Set to Vect (the category of vector spaces)? If you mean that, then every concrete category can be embedded into Vect, no? And if that’s what you’re saying, maybe the functor Set → Vect is something like the “Group to its group algebra over field k” functor.
As an aside, can you link to/say more about this? Do you mean that there exists a faithful functor from Set to Vect (the category of vector spaces)? If you mean that, then every concrete category can be embedded into Vect, no?
Yes, the free vector space functor. For a finite set X, it’s just the functions X→R, with operations defined pointwise. For infinite sets, it is the subset of those functions that have finite support. It’s essentially the same as what you’ve been doing by considering R^d for an outcome set with d outcomes, except with members of a set as indices, rather than numerically numbering the outcomes.
Actually I just realized I should probably clarify how it lifts functions to linear transformations too, because it doesn’t do so in the obvious way. If F is the free vector space functor and f : X→Y is a function, then F(f) : F(X)→F(Y) sends g ∈ F(X) to the vector F(f)(g) defined by F(f)(g)(y) = ∑_{x ∈ f⁻¹({y})} g(x). (One way of understanding why the functions X→R must have finite support is that it ensures this sum is well-defined. Though there are alternatives to requiring finite support, as long as one is willing to embed a more structured category than Set into a more structured category than Vect.)
It may be more intuitive to see the free vector space over X as containing formal sums c_0 x_0 + ⋯ + c_n x_n for x_i ∈ X and c_i ∈ R. The downside to this is that it requires a bunch of quotients, e.g. to ensure commutativity, associativity, distributivity, etc.
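A small sketch of this construction (finitely-supported functions represented as dicts, with the lift acting by summing coefficients over each fiber, as in the formula above; the sets and the map f here are made up):

```python
from collections import defaultdict

# Sketch of the free vector space functor F:
#   F(X) = finitely-supported functions X -> R, represented here as dicts,
#   F(f) pushes coefficients forward: F(f)(g)(y) = sum of g(x) over x with f(x) = y.

def add(g, h):
    """Pointwise sum of two finitely-supported vectors."""
    out = defaultdict(float)
    for vec in (g, h):
        for x, c in vec.items():
            out[x] += c
    return dict(out)

def lift(f):
    """F(f): F(X) -> F(Y), the linear map induced by f: X -> Y."""
    def F_f(g):
        out = defaultdict(float)
        for x, c in g.items():
            out[f(x)] += c          # sum coefficients over each fiber f^{-1}({y})
        return dict(out)
    return F_f

# Formal sums 2*"a" + 3*"b" and 1*"b" - 1*"c", added and then pushed forward
# along a map f that collapses "a" and "b" to a single element "ab".
g = {"a": 2.0, "b": 3.0}
h = {"b": 1.0, "c": -1.0}
print(add(g, h))               # {'a': 2.0, 'b': 4.0, 'c': -1.0}
f = lambda x: "ab" if x in ("a", "b") else x
print(lift(f)(add(g, h)))      # {'ab': 6.0, 'c': -1.0}
```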
Imagine that policies decompose into two components, π=ρ⊗σ. For instance, they may be different sets of parameters in a neural network. We can then talk about the effect of one of the components by considering how it influences the power/injectivity of the features with respect to the other component.
Suppose, for instance, that ρ is such that the policy just ends up acting in a completely random-twitching way. Technically σ has a lot of effect too, in that it chaotically controls the pattern of the twitching, but in terms of the features f, varying σ makes essentially no difference. This is a low-power situation, and if one actually specified what f would be, then a TurnTrout-style argument could probably prove that such values of ρ would be avoided for power-seeking reasons. On the other hand, if ρ made the policy act like an optimizer over the features f, with the utility function being specified by σ, then that would lead to a lot more power/injectivity.
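As a crude toy version of this (the hallway environment, the feature, and the two values of ρ are all hypothetical; the “power” of a choice of ρ is proxied by how many distinct feature values σ can reach):

```python
import random

# Crude sketch: policies decompose as (rho, sigma). We measure how "injective"
# the feature map is with respect to sigma for two different choices of rho,
# by counting the distinct feature values reachable as sigma varies.
# Everything here (the feature, the policies, the environment) is hypothetical.

POSITIONS = range(10)  # a 10-cell hallway; the feature is the final position

def final_position(rho, sigma, steps=30, seed=0):
    rng = random.Random(seed)
    pos = 0
    for _ in range(steps):
        if rho == "twitch":
            step = rng.choice([-1, 1])        # sigma is ignored here; in the real story it
                                              # would perturb the twitching chaotically
                                              # without moving the coarse feature
        else:                                 # rho == "seek": walk toward cell sigma
            step = 1 if pos < sigma else (-1 if pos > sigma else 0)
        pos = min(max(pos + step, 0), max(POSITIONS))
    return pos

for rho in ("twitch", "seek"):
    features = {final_position(rho, sigma) for sigma in POSITIONS}
    print(rho, "-> reachable feature values:", sorted(features))
# "twitch" reaches essentially one feature value regardless of sigma (low power),
# "seek" reaches a different value for each sigma (high power / injective in sigma).
```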
On the other hand, I wonder if there’s a limit to this style of argument. Too much non-injectivity would require crazy interaction effects to fill out the space in a Hilbert-curve-style way, which would be hard to optimize?

Actually, upon thinking further, I don’t think this argument works, at least not as it is written right now.
I think instrumental convergence should still apply to some utility functions over policies, specifically the ones that seem to produce “smart” or “powerful” behavior from simple rules.
I share an intuition in this area, but “powerful” behavior tendencies seem nearly equivalent to instrumental convergence to me. It feels logically downstream of instrumental convergence.
from simple rules
I already have a (somewhat weak) result on power-seeking wrt the simplicity prior over state-based reward functions. This isn’t about utility functions over policies, though.
So a lot of the power of instrumental convergence comes from restricting what the utility function is allowed to depend on. u-AOH is clearly too broad, since it allows assigning utilities to arbitrary sequences of actions with identical effects; and at the same time u-AOH, u-OH, and ordinary state-based reward functions (can we call that u-S?) are all too narrow, since none of them allow assigning utilities to counterfactuals, which is required in order to phrase things like “humans have control over the AI” (as this is a causal statement and thus depends on the AI).
Note that we can get a u-AOH which mostly solves ABC-corrigibility:
u(history) := 0 if the disable action is taken in the history, and R(last state) otherwise.
(Credit to AI_WAIFU on the EleutherAI Discord)
Here R is some positive reward function over terminal states. Do note that there isn’t a “get yourself corrected on your own” incentive. EDIT: Note that manipulation can still be weakly optimal.
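A minimal sketch of this u-AOH (the encoding of histories as (action, state) pairs, the “disable” action name, and the particular R are all hypothetical):

```python
# Minimal sketch of the proposed u-AOH. A history is taken to be a list of
# (action, state) pairs; "disable" is the action that disables the correction
# mechanism, and R is some hypothetical positive reward on terminal states.

def R(state):
    return {"A": 1.0, "B": 2.0, "C": 3.0}.get(state, 0.5)   # positive everywhere

def u(history):
    actions = [a for a, _ in history]
    if "disable" in actions:          # the agent disabled its correction mechanism
        return 0.0
    last_state = history[-1][1]
    return R(last_state)              # otherwise, reward the terminal state as usual

print(u([("right", "A"), ("right", "B")]))        # 2.0
print(u([("disable", "A"), ("right", "C")]))      # 0.0: disabling forfeits all reward
```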
This seems hacky; we’re just ruling out the incorrigible policies directly. We aren’t doing any counterfactual reasoning; we just pick out the “bad action.”
change its utility function in a cycle such that it repeatedly ends up in the same place in the hallway with the same utility function.
I’m not parsing this. You change the utility function, but it ends up in the same place with the same utility function? Did we change it or not? (I think simply rewording it will communicate your point to me)
So we have a switch with two positions, “R” and “L.”
When the switch is “R,” the agent is supposed to want to go to the right end of the hallway, and vice versa for “L” and left. It’s not that you want this agent to be uncertain about the “correct” value of the switch and so it’s learning more about the world as you send it signals—you just want the agent to want to go to the left when the switch is “L,” and to the right when the switch is “R.”
If you start with the agent going to the right along this hallway, and you change the switch to “L,” and then a minute later change your mind and switch back to “R,” it will have turned around and passed through the same spot in the hallway multiple times.
The point is that if you try to define a utility as a function of the state for this agent, you run into an issue with cycles—if you’re continuously moving “downhill”, you can’t get back to where you were before.
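Here is a very simplified sketch of that cycle point, under the reading that a state-based utility maximizer keeps moving to strictly better states; the trajectory below (head right, get flipped to “L”, turn around, get flipped back to “R”) is made up:

```python
# Sketch of the cycle argument. The desired corrigible behavior (switch set to R,
# then flipped to L, then back to R) revisits the same hallway positions. If the
# agent were simply moving to strictly-higher-utility positions under one fixed
# state-based utility, every transition would increase utility, so the trajectory
# could never revisit a position. A revisited position is therefore a cycle that
# no such utility function can rank consistently.

# Hypothetical desired trajectory of hallway positions: heading right under "R",
# turning around when flipped to "L", then heading right again under "R".
trajectory = [3, 4, 5, 4, 3, 4, 5, 6]

transitions = list(zip(trajectory, trajectory[1:]))

def admits_strictly_increasing_utility(transitions):
    """True iff some assignment u(position) satisfies u(b) > u(a) for every step a -> b.
    This fails exactly when the transitions contain a cycle (here: a revisited position)."""
    edges = set(transitions)
    nodes = {x for e in edges for x in e}
    # Repeatedly remove nodes with no outgoing edges; a cycle blocks full removal.
    while True:
        sinks = {n for n in nodes if not any(a == n for a, _ in edges)}
        if not sinks:
            break
        nodes -= sinks
        edges = {(a, b) for a, b in edges if b not in sinks}
    return not nodes   # leftover nodes <=> a cycle <=> no strictly increasing utility

print(admits_strictly_increasing_utility(transitions))   # False: the behavior cycles
```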
Yeah, thanks for remembering me! You can also posit that the agent is omniscient from the start, so it did not change its policy due to learning. This argument proves that an agent cannot be corrigible and a maximizer of the same expected utility function of world states across multiple shutdowns. But it still leaves the possibility for the agent to be corrigible while rewriting its utility function after every correction.