Steven Byrnes comments on Consequentialism & corrigibility

Steven Byrnes 14 Dec 2021 21:35 UTC
2 points
I’m confused about “preference over policies”. I thought people usually describe an MDP agent as having a policy, not a preference over policies. Right?
My framework instead is: I’m not thinking of MDP agents with policies, I’m thinking of planning agents which are constantly choosing actions / plans based on a search over a wide variety of possible actions / plans. We can thus describe them as having a “preference” for whatever objective that search is maximizing (at any given time). A universe-history is “anything in the world, both present and future”, which struck me as sufficiently broad to capture any aspect of a plan that we might care about. But I’m open-minded to the possibility that maybe I should have said “preferences-over-future-states versus preferences-over-whatever-else” rather than “preferences-over-future-states versus preferences-over-trajectories”, and just not used the word “trajectories” at all.
Let’s take an agent that, in any possible situation, wiggles its arm. That’s all it does. From my perspective, I would not call that “a consequentialist agent”. But my impression is that you would call it a consequentialist agent, because it has a policy, and the “consequence” of the policy is that the agent wiggles its arm. Did I get that right?
- tailcalled 14 Dec 2021 22:03 UTC
  1 point
  Parent
  I’m confused about “preference over policies”. I thought people usually describe an MDP agent as having a policy, not a preference over policies. Right?
  Yes. But there are many different possible policies, and usually for an MDP agent, you select only one. This one policy is typically selected to be the one that leads to the optimal consequences. So you have a function over the consequences, ranking them by how good they are, and you have a function over policies, mapping them to the consequences (this function is determined by the MDP dynamics), and if you compose them, you get a function over policies.
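As a minimal sketch of the composition described above (a ranking over consequences, composed with the dynamics that map policies to consequences, yields a ranking over policies), consider the toy example below. The two-state MDP, the names `step`, `consequences`, `utility`, and `policy_preference`, and the "time spent at the shop" utility are all assumptions made for illustration; none of them come from the thread.

```python
from itertools import product

# Hypothetical toy MDP (an assumption for illustration, not from the thread):
# two states, two actions, deterministic dynamics, fixed horizon.
STATES = ["home", "shop"]
ACTIONS = ["stay", "move"]
HORIZON = 3


def step(state, action):
    """Deterministic transition: 'move' flips the state, 'stay' keeps it."""
    if action == "stay":
        return state
    return "shop" if state == "home" else "home"


def consequences(policy, start="home"):
    """Dynamics: map a policy (dict state -> action) to the trajectory it produces."""
    state = start
    trajectory = [start]
    for _ in range(HORIZON):
        state = step(state, policy[state])
        trajectory.append(state)
    return tuple(trajectory)


def utility(trajectory):
    """Assumed ranking over consequences: number of time steps spent at the shop."""
    return sum(1 for s in trajectory if s == "shop")


def policy_preference(policy):
    """Composition: policy -> consequences -> score, i.e. a function over policies."""
    return utility(consequences(policy))


# Enumerate every deterministic policy and select the one with the best consequences.
all_policies = [dict(zip(STATES, choice)) for choice in product(ACTIONS, repeat=len(STATES))]
best_policy = max(all_policies, key=policy_preference)
```

Here `policy_preference` never inspects the policy directly; it only sees the trajectory the policy produces, which is the sense in which the induced preference over policies is consequentialist.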
  My framework instead is: I’m not thinking of MDP agents with policies, I’m thinking of planning agents which are constantly choosing actions / plans based on a search over a wide variety of possible actions / plans. We can thus describe them as having a “preference” for whatever objective that search is maximizing (at any given time). A universe-history is “anything in the world, both present and future”, which struck me as sufficiently broad to capture any aspect of a plan that we might care about. But I’m open-minded to the possibility that maybe I should have said “preferences-over-future-states versus preferences-over-whatever-else” rather than “preferences-over-future-states versus preferences-over-trajectories”, and just not used the word “trajectories” at all.
  My framework isn’t restricted to MDPs with policies; it’s applicable to any case where you have a fixed search space. Instead of a function that ranks policies, you could consider a function that ranks plans or ranks actions. Such a function is then consequentialist if it ranks them on the basis of the consequences of these plans/actions.
  Let’s take an agent that, in any possible situation, wiggles its arm. That’s all it does. From my perspective, I would not call that “a consequentialist agent”. But my impression is that you would call it a consequentialist agent, because it has a policy, and the “consequence” of the policy is that the agent wiggles its arm. Did I get that right?
  I’d say that consequentialism is more a property of the optimization process than the agent. If the agent itself contains an optimizer, then one can talk about whether the agent’s optimizer is consequentialist, as well as about whether the process that picked the agent is consequentialist.
  So if you sit down and write a piece of code that makes a robot wiggle its arm, then your choice of code would probably be (partly) consequentialist because you would select the code on the basis of the consequences it has. (Probably far from entirely consequentialist, because you would likely also care about the code’s readability and such, rather than just its consequences.) The code would most likely not have an inner optimizer which searches over possible actions, so it would not even be coherent to talk about whether it was consequentialist. (I.e. it would not be coherent to talk about whether its inner action-selecting optimizer considered the consequences of its actions, because it does not have an inner action-selecting optimizer.) But even if it did have an inner action-selecting optimizer, the code’s selection of actions would likely not be consequentialist, because there would probably be easier ways of ranking actions than by simulating the world to guess the consequences of the actions and then picking the one that does the arm-wiggling best.
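The distinction drawn above can be sketched in code under some loudly stated assumptions: the action set, the `world_model` lookup table, and both scoring functions are hypothetical and chosen only to make the contrast concrete. The sketch shows three cases: an inner optimizer that ranks actions by predicted consequences, an inner optimizer that ranks actions by a feature of the action itself, and a fixed arm-wiggler with no inner optimizer at all.

```python
# Hypothetical action set (an assumption for illustration).
ACTIONS = ["wiggle_arm", "press_button", "do_nothing"]


def world_model(action):
    """Assumed predictor of what the world looks like after each action."""
    predictions = {
        "wiggle_arm": {"arm_moved": True, "light_on": False},
        "press_button": {"arm_moved": True, "light_on": True},
        "do_nothing": {"arm_moved": False, "light_on": False},
    }
    return predictions[action]


def consequentialist_score(action):
    """Ranks an action by its predicted consequences (here: whether the light ends up on)."""
    outcome = world_model(action)
    return 1.0 if outcome["light_on"] else 0.0


def superficial_score(action):
    """Ranks an action by a feature of the action itself; no outcome is predicted."""
    return 1.0 if action.startswith("wiggle") else 0.0


def choose(score):
    """Inner action-selecting optimizer: search the fixed action space for the top-ranked action."""
    return max(ACTIONS, key=score)


def fixed_wiggler(observation):
    """No inner optimizer: the same action in every situation, so the question does not arise."""
    return "wiggle_arm"
```

Both `choose(consequentialist_score)` and `choose(superficial_score)` are searches over the same action space; only the first consults the world model. `fixed_wiggler` produces the same behavior as the superficial search without containing any search to classify.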
  - Steven Byrnes 15 Dec 2021 14:42 UTC
    2 points
    Parent
    Right, what I call “planning agent” is the same as what you call “the agent itself contains an optimizer”, and I was talking about whether that optimizer is selecting plans for their long-term consequences, versus for other things (superficial aspects of plans, or their immediate consequences, etc.).
    I suspect that you have in mind a “Risks from learned optimization” type picture where we have little control over whether the agent contains an optimizer or not, or what the optimizer is selecting for. But there’s also lots of other possibilities, e.g. in MuZero the optimizer inside the AI agent is written by the human programmers into the source code (but involves queries to learned components like a world-model and value function). I happen to think the latter (humans write code for the agent’s optimizer) is more probable for reasons here, and that assumption is underlying the discussion under “My corrigibility proposal sketch”, which otherwise probably would seem pretty nonsensical to you, I imagine.
    In general, the “humans write code for the agent’s optimizer” approach still has an inner alignment problem, but it’s different in some respects, see here and here.
    - tailcalled 15 Dec 2021 17:03 UTC
      1 point
      Parent
      Right, what I call “planning agent” is the same as what you call “the agent itself contains an optimizer”, and I was talking about whether that optimizer is selecting plans for their long-term consequences, versus for other things (superficial aspects of plans, or their immediate consequences, etc.).
      I think one way we differ is that I would group {superficial aspects of plans} vs {long-term consequences, short-term consequences}, with the latter both being consequentialist.
      I suspect that you have in mind a “Risks from learned optimization” type picture where we have little control over whether the agent contains an optimizer or not, or what the optimizer is selecting for. But there’s also lots of other possibilities, e.g. in MuZero the optimizer inside the AI agent is written by the human programmers into the source code (but involves queries to learned components like a world-model and value function). I happen to think the latter (humans write code for the agent’s optimizer) is more probable for reasons here, and that assumption is underlying the discussion under “My corrigibility proposal sketch”, which otherwise probably would seem pretty nonsensical to you, I imagine.
      In general, the “humans write code for the agent’s optimizer” approach still has an inner alignment problem, but it’s different in some respects, see here and here.
      Nah, in fact I’d say that your “Misaligned Model-Based RL Agent” post is one of the main inspirations for my model. 🤔 I guess one place my model differs is that I expect to have an explicit utility function (because this seems easiest to reason about, and therefore safest), whereas you split the explicit utility function into a reward signal and a learned value model. Neither of these translates straightforwardly into my model:
      the reward signal is external to the AI, probably determined from the human’s point of view (🤔 I guess that explains the confusion in the other thread, where I had assumed the AI’s point of view, and you had assumed the human’s point of view), and so discussions about whether it is consequentialist or not do not fit straightforwardly into my framework
      the value function is presumably something like $E[\sum R \mid \pi, O]$ where $R$ is the reward, $\pi$ is the current planner/actor, and $O$ is the agent’s epistemic state in its own world-model; this “bakes in” the policy to the value function in a way that is difficult to fit into my framework; implicitly in order to fit it, you need myopic optimization (as is often done in RL), which I would like to get away from (at least in the formalism; for efficiency we would probably need to apply myopic optimization in practice)
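To illustrate what it means for the policy to be "baked in" to a value function of the form $E[\sum R \mid \pi, O]$, here is a minimal sketch on a deterministic toy chain, so the expectation reduces to a single rollout. The chain, the reward rule, the horizon, and the names `value`, `always_right`, and `always_stay` are assumptions for illustration only, and the integer `state` stands in for the epistemic state $O$.

```python
def value(policy, state, horizon=4):
    """Sum of rewards obtained by following `policy` from `state`.

    Assumed toy dynamics: 'right' moves one position up the chain, anything else stays put.
    Assumed reward: 1 for every step that ends at position >= 2.
    """
    total = 0
    for _ in range(horizon):
        action = policy(state)
        if action == "right":
            state += 1
        total += 1 if state >= 2 else 0
    return total


def always_right(state):
    """Hypothetical policy that keeps moving up the chain."""
    return "right"


def always_stay(state):
    """Hypothetical policy that never moves."""
    return "stay"
```

The same starting state gets a different value depending on which policy is assumed to act afterwards: `value(always_right, 0)` is 3 while `value(always_stay, 0)` is 0. A number attached to a state therefore only makes sense relative to a policy, which is the dependence the comment above describes.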
      - Steven Byrnes 15 Dec 2021 17:32 UTC
        2 points
        Parent
        I think one way we differ is that I would group {superficial aspects of plans} vs {long-term consequences, short-term consequences}, with the latter both being consequentialist.
        Hmm, I guess I try to say “long-term-consequentialist” for long-term consequences. I might have left out the “long-term” part by accident, or if I thought it was clear from context… (And also to make a snappier post title.)
        I do think there’s a meaningful notion of, let’s call it, “stereotypically consequentialist behavior” for both humans and AIs, and long-term consequentialists tend to match it really well, and short-term-consequentialists tend to match it less well.
        I guess one place my model differs is that I expect to have an explicit utility function (because this seems easiest to reason about, and therefore safest)
        Have you written or read anything about how that might work? My theory is: (1) the world is complicated, (2) the AI needs to learn a giant vocabulary of abstract patterns (latent variables) in order to understand or do or want anything of significance in the world, (3) therefore it’s tricky to just write down an explicit utility function. The “My corrigibility proposal sketch” gets around that by something like supervised-learning a way to express the utility function’s ingredients (e.g. “the humans will remain in control”, “I am being helpful”) in terms of these unlabeled latent variables in the world-model. That in turn requires labeled training data and OOD detection and various other details that seem hard to get exactly right, but are nevertheless our best bet. BTW that stuff is not in the “My AGI threat model” post, I grew more fond of them a few months afterwards. :)
        tailcalled 15 Dec 2021 18:32 UTC
        3 points
        Parent
        Hmm, I guess I try to say “long-term-consequentialist” for long-term consequences. I might have left out the “long-term” part by accident, or if I thought it was clear from context… (And also to make a snappier post title.)
        Fair enough.
        I do think there’s a meaningful notion of, let’s call it, “stereotypically consequentialist behavior” for both humans and AIs, and long-term consequentialists tend to match it really well, and short-term-consequentialists tend to match it less well.
        I agree. I think TurnTrout’s approach is a plausible strategy for formalizing it. If we apply his approach to the long-term vs short-term distinction, then we can observe that the vast majority of trajectory rankings are long-term consequentialist, and therefore most permutations mostly permute with long-term consequentialists; therefore the power-seeking arguments don’t go through with short-term consequentialists.
          I think the nature of the failure of the power-seeking arguments for short-term consequentialists is ultimately different from the nature of the failure for non-consequentialists, though; for short-term consequentialists, it happens as a result of dropping the features that power helps you control, while for non-consequentialists, it happens as a result of valuing features beyond the ones you can control with power.
        Have you written or read anything about how that might work? My theory is: (1) the world is complicated, (2) the AI needs to learn a giant vocabulary of abstract patterns (latent variables) in order to understand or do or want anything of significance in the world, (3) therefore it’s tricky to just write down an explicit utility function. The “My corrigibility proposal sketch” gets around that by something like supervised-learning a way to express the utility function’s ingredients (e.g. “the humans will remain in control”, “I am being helpful”) in terms of these unlabeled latent variables in the world-model. That in turn requires labeled training data and OOD detection and various other details that seem hard to get exactly right, but are nevertheless our best bet. BTW that stuff is not in the “My AGI threat model” post, I grew more fond of them a few months afterwards. :)
          Ah, I think we are in agreement then. I would also agree with using something like supervised learning to get the ingredients of the utility function. (Though I don’t yet know that the ingredients would directly be the sorts of things you mention, or more like “These are the objects in the world” + “Is each object a strawberry?” + etc.)
        (I would also want to structurally force the world model to be more interpretable; e.g. one could require it to reason in terms of objects living in 3D space.)