Right, what I call “planning agent” is the same as what you call “the agent itself contains an optimizer”, and I was talking about whether that optimizer is selecting plans for their long-term consequences, versus for other things (superficial aspects of plans, or their immediate consequences, etc.).
I think one way we differ is that I would group {superficial aspects of plans} vs {long-term consequences, short-term consequences}, with the latter both being consequentialist.
I suspect that you have in mind a “Risks from learned optimization” type picture where we have little control over whether the agent contains an optimizer or not, or what the optimizer is selecting for. But there are also lots of other possibilities; e.g. in MuZero the optimizer inside the AI agent is written by the human programmers into the source code (but involves queries to learned components like a world-model and value function). I happen to think the latter (humans write code for the agent’s optimizer) is more probable for reasons here, and that assumption underlies the discussion under “My corrigibility proposal sketch”, which otherwise probably would seem pretty nonsensical to you, I imagine.
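To make that concrete, here is a minimal sketch of the structure I mean; the function and variable names are my own invention (not from any actual MuZero codebase), and the learned world-model and value function are stand-ins for whatever trained networks you would actually plug in. The point is just that the search loop itself is ordinary human-written code that queries learned components:

```python
# Minimal sketch of "hand-written optimizer, learned components" (MuZero-like).
# The search loop below is ordinary human-written code; only world_model and
# value_fn are learned. All names here are illustrative.

from typing import Any, Callable, List

def plan(state: Any,
         actions: List[Any],
         world_model: Callable[[Any, Any], Any],  # learned: (state, action) -> predicted next state
         value_fn: Callable[[Any], float],        # learned: state -> estimated value
         depth: int = 3) -> Any:
    """Depth-limited search, written by the programmers, that scores each
    candidate action by querying the learned components."""

    def rollout_value(s: Any, d: int) -> float:
        if d == 0:
            return value_fn(s)
        # Imagine each action's consequences with the learned world-model,
        # then recurse to the remaining depth.
        return max(rollout_value(world_model(s, a), d - 1) for a in actions)

    # Pick the action whose imagined consequences score best.
    return max(actions, key=lambda a: rollout_value(world_model(state, a), depth - 1))
```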
In general, the “humans write code for the agent’s optimizer” approach still has an inner alignment problem, but it’s different in some respects; see here and here.
Nah, in fact I’d say that your “Misaligned Model-Based RL Agent” post is one of the main inspirations for my model. 🤔 I guess one place my model differs is that I expect to have an explicit utility function (because this seems easiest to reason about, and therefore safest), whereas you split the explicit utility function into a reward signal and a learned value model. Neither of these translates straightforwardly into my model:
the reward signal is external to the AI, probably determined from the human’s point of view (🤔 I guess that explains the confusion in the other thread, where I had assumed the AI’s point of view, and you had assumed the human’s point of view), and so discussions about whether it is consequentialist or not do not fit straightforwardly into my framework
the value function is presumably something like E[∑R | π, O], where R is the reward, π is the current planner/actor, and O is the agent’s epistemic state in its own world-model. This “bakes in” the policy to the value function in a way that is difficult to fit into my framework; implicitly, in order to fit it, you need myopic optimization (as is often done in RL), which I would like to get away from, at least in the formalism (for efficiency we would probably need to apply myopic optimization in practice). See the toy sketch below.
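To illustrate the contrast I mean, here is a toy sketch; all names are invented, and determinism plus a finite horizon are assumed purely for brevity. The first function plans against an explicit utility function applied to imagined outcomes, with no particular policy baked in; the second is the E[∑R | π, O]-style object, whose estimate changes whenever the policy π does:

```python
# Toy contrast between an explicit utility function over imagined outcomes and
# a value estimate that bakes in the current policy. Names are illustrative;
# the expectation over environment stochasticity is dropped for brevity.

from typing import Any, Callable, List

def explicit_utility_plan(state: Any,
                          actions: List[Any],
                          world_model: Callable[[Any, Any], Any],
                          utility_fn: Callable[[Any], float],
                          horizon: int = 5) -> Any:
    """Score each action by imagining trajectories and applying an explicit
    utility function to the outcomes; no policy is baked in."""
    def best_outcome(s: Any, d: int) -> float:
        if d == 0:
            return utility_fn(s)
        return max(best_outcome(world_model(s, a), d - 1) for a in actions)
    return max(actions, key=lambda a: best_outcome(world_model(state, a), horizon - 1))

def policy_baked_value(state: Any,
                       policy: Callable[[Any], Any],
                       world_model: Callable[[Any, Any], Any],
                       reward_fn: Callable[[Any], float],
                       horizon: int = 5) -> float:
    """The E[sum R | pi, O]-style object: it answers "how well does *this*
    policy do from here?", so its estimate depends on the current policy."""
    total, s = 0.0, state
    for _ in range(horizon):
        a = policy(s)
        s = world_model(s, a)
        total += reward_fn(s)
    return total
```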
I think one way we differ is that I would group {superficial aspects of plans} vs {long-term consequences, short-term consequences}, with the latter both being consequentialist.
Hmm, I guess I try to say “long-term-consequentialist” for long-term consequences. I might have left out the “long-term” part by accident, or if I thought it was clear from context… (And also to make a snappier post title.)
I do think there’s a meaningful notion of, let’s call it, “stereotypically consequentialist behavior” for both humans and AIs, and long-term consequentialists tend to match it really well, and short-term consequentialists tend to match it less well.
I guess one place my model differs is that I expect to have an explicit utility function (because this seems easiest to reason about, and therefore safest)
Have you written or read anything about how that might work? My theory is: (1) the world is complicated, (2) the AI needs to learn a giant vocabulary of abstract patterns (latent variables) in order to understand or do or want anything of significance in the world, (3) therefore it’s tricky to just write down an explicit utility function. The “My corrigibility proposal sketch” section gets around that by something like supervised-learning a way to express the utility function’s ingredients (e.g. “the humans will remain in control”, “I am being helpful”) in terms of these unlabeled latent variables in the world-model. That in turn requires labeled training data and OOD detection and various other details that seem hard to get exactly right, but that nevertheless seem like our best bet. BTW that stuff is not in the “My AGI threat model” post; I grew more fond of those ideas a few months afterwards. :)
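Here is roughly the flavor of structure I have in mind, as a toy sketch. Everything below is invented for illustration: the “world-model latents” would really be whatever unlabeled representation the agent learns, the probe is a bare-bones ridge regression, and the OOD check is just a placeholder distance threshold, not a claim about how to do OOD detection properly.

```python
# Toy sketch: supervised-learning "utility ingredients" on top of unlabeled
# world-model latents, with a crude placeholder OOD check.

import numpy as np

class UtilityIngredient:
    """Maps world-model latent vectors to a score for one labeled concept,
    e.g. "the humans will remain in control" or "I am being helpful"."""

    def __init__(self, name: str):
        self.name = name
        self.w = None            # linear probe weights (fit from human labels)
        self.train_mean = None   # stored for the placeholder OOD check
        self.train_scale = None

    def fit(self, latents: np.ndarray, labels: np.ndarray) -> None:
        # Ridge-regression probe from unlabeled latents to human-provided labels.
        d = latents.shape[1]
        A = latents.T @ latents + 1e-3 * np.eye(d)
        self.w = np.linalg.solve(A, latents.T @ labels)
        self.train_mean = latents.mean(axis=0)
        self.train_scale = latents.std(axis=0).mean() + 1e-8

    def score(self, latent: np.ndarray) -> float:
        # Placeholder OOD check: refuse to extrapolate far from the training data.
        if np.linalg.norm(latent - self.train_mean) > 5.0 * self.train_scale:
            raise ValueError(f"{self.name}: latent looks out-of-distribution")
        return float(latent @ self.w)

def utility(latent: np.ndarray, ingredients: list) -> float:
    # The explicit utility function: an (unweighted, for simplicity)
    # combination of the supervised-learned ingredient scores.
    return sum(ing.score(latent) for ing in ingredients)
```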
Hmm, I guess I try to say “long-term-consequentialist” for long-term consequences. I might have left out the “long-term” part by accident, or if I thought it was clear from context… (And also to make a snappier post title.)
Fair enough.
I do think there’s a meaningful notion of, let’s call it, “stereotypically consequentialist behavior” for both humans and AIs, and long-term consequentialists tend to match it really well, and short-term consequentialists tend to match it less well.
I agree. I think TurnTrout’s approach is a plausible strategy for formalizing it. If we apply his approach to the long-term vs. short-term distinction, then we can observe that the vast majority of trajectory rankings are long-term consequentialist, so most permutations mostly take you to long-term-consequentialist rankings; hence the power-seeking arguments don’t go through for short-term consequentialists.
I think the nature of the failure of the power-seeking arguments for short-term consequentialists is ultimately different from the nature of the failure for non-consequentialists, though: for short-term consequentialists, it happens as a result of dropping the features that power helps you control, while for non-consequentialists, it happens as a result of valuing additional features beyond the ones you can control with power.
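As a sanity check on that intuition, here is a toy I made up; it is not TurnTrout’s actual orbit/permutation formalism, just an illustration of the asymmetry. One action locks in a single outcome, the other keeps three outcomes reachable; randomly sampled long-term (final-outcome) utility functions favor the option-preserving action about 2/3 of the time, while randomly sampled short-term (next-state) utility functions favor it only about half the time.

```python
# Toy illustration (not TurnTrout's actual formalism) of why a power-seeking
# tendency shows up for long-term consequentialists but not short-term ones.
# Invented two-step environment:
#   start --"narrow"--> terminal state A
#   start --"power" --> hub P --> any of terminal states A, B, C
# "power" keeps more options open. Sample utility functions at random and
# count how often each kind of optimizer picks "power".

import random

TERMINALS = ["A", "B", "C"]
N_SAMPLES = 100_000

long_term_picks_power = 0
short_term_picks_power = 0

for _ in range(N_SAMPLES):
    # Long-term consequentialist: utility over final outcomes only.
    u_final = {s: random.random() for s in TERMINALS}
    # "power" lets it reach the best of A/B/C; "narrow" locks in A.
    if max(u_final.values()) > u_final["A"]:
        long_term_picks_power += 1

    # Short-term consequentialist: utility over the immediate next state,
    # i.e. a comparison between reaching A now and reaching the hub P.
    u_next = {"A": random.random(), "P": random.random()}
    if u_next["P"] > u_next["A"]:
        short_term_picks_power += 1

print(f"long-term optimizers picking the option-preserving move:  "
      f"{long_term_picks_power / N_SAMPLES:.2f}")   # ~0.67
print(f"short-term optimizers picking the option-preserving move: "
      f"{short_term_picks_power / N_SAMPLES:.2f}")  # ~0.50
```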
Have you written or read anything about how that might work? My theory is: (1) the world is complicated, (2) the AI needs to learn a giant vocabulary of abstract patterns (latent variables) in order to understand or do or want anything of significance in the world, (3) therefore it’s tricky to just write down an explicit utility function. The “My corrigibility proposal sketch” section gets around that by something like supervised-learning a way to express the utility function’s ingredients (e.g. “the humans will remain in control”, “I am being helpful”) in terms of these unlabeled latent variables in the world-model. That in turn requires labeled training data and OOD detection and various other details that seem hard to get exactly right, but that nevertheless seem like our best bet. BTW that stuff is not in the “My AGI threat model” post; I grew more fond of those ideas a few months afterwards. :)
Ah, I think we are in agreement then. I would also agree with using something like supervised learning to get the ingredients of the utility function. (Though I don’t yet know whether the ingredients would directly be the sorts of things you mention, or something more like “These are the objects in the world” + “Is each object a strawberry?” + etc.)
(I would also want to structurally force the world model to be more interpretable; e.g. one could require it to reason in terms of objects living in 3D space.)
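(For concreteness, here is the flavor of “ingredient” I have in mind, as a toy sketch; everything below is invented for illustration, and a real system would have to learn the object list and class probabilities rather than being handed them.)

```python
# Toy sketch: an object-centric, 3D-structured world-model state, plus an
# "is each object a strawberry?" style utility ingredient. All invented.

from dataclasses import dataclass
from typing import Dict, List, Tuple

@dataclass
class WorldObject:
    position: Tuple[float, float, float]  # the object's location in 3D space
    class_probs: Dict[str, float]         # e.g. {"strawberry": 0.9, "rock": 0.1}

@dataclass
class WorldState:
    objects: List[WorldObject]            # "these are the objects in the world"

def expected_strawberry_count(state: WorldState) -> float:
    """One candidate utility ingredient: "is each object a strawberry?",
    aggregated into an expected count."""
    return sum(obj.class_probs.get("strawberry", 0.0) for obj in state.objects)

# Usage:
state = WorldState(objects=[
    WorldObject(position=(0.1, 0.0, 0.5), class_probs={"strawberry": 0.95}),
    WorldObject(position=(1.2, 0.3, 0.0), class_probs={"rock": 0.8, "strawberry": 0.2}),
])
print(expected_strawberry_count(state))  # ~1.15
```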