cfoster0 comments on Alignment allows “nonrobust” decision-influences and doesn’t require robust grading

cfoster0 2 Dec 2022 7:00 UTC
13 points
10
As always, I really enjoyed seeing how you think through this.

No: he can set “working hard” as his optimization target from the get-go, and, e. g., invent a plan of “stay on the lookout for new sources of distraction, explicitly run the world-model forwards to check whether X would distract me, and if yes, generate a new conscious heuristic for avoiding X”. But this requires “working hard” to be the value-child’s explicit consciously-known goal. Not just an implicit downstream consequence of the working-hard shard’s contextual activations.

Whatever decisions value-child makes are made via circuits within his policy network (shards), circuits that were etched into place by some combination of (1) generic pre-programming, (2) past predictive success, and (3) past reinforcement. Those circuits have contextual logic determined by e.g. their connectivity pattern. In order for him to have made the decision to hold “working hard” in attention and adopt it as a conscious goal, some such circuits need to already exist to have bid for that choice conditioned on the current state of value-child’s understanding, and to keep that goal in working memory so his future choices are conditional on goal-relevant representations. I don’t really see how the explicitness of the goal changes the dynamic or makes value-child any less “puppeteered” by his shards.

A working-hard shard is optimized for working hard. The value-child can notice that shard influencing his decision-making. He can study it, check its behavior in imagined hypothetical scenarios, gather statistical data. Eventually, he would arrive at the conclusion: this shard is optimized for making him work hard. At this point, he can put “working hard” into his world-model as “one of my values”. And this kind of value very much can be maximized.

At this point, the agent has abstracted the behavior of a shard (a nonvolatile pattern instantiated in neural connections) into a mental representation (a volatile pattern instantiated in neural activations). What does it mean to maximize this representation? The type signature of the originating shard is something like mental_context→policy_logits, and the abstracted value should preserve that type signature, so it doesn’t seem to me that the value should be any more maximizable than the shard. What mechanistic details have changed such that that operation now makes sense? What does it mean to maximize my working-hard value?

But in grown-up general-purpose systems, such as highly-intelligent highly-reflective humans who think a lot about philosophy and their own thinking and being effective at achieving real-world goals, shards encode optimization targets. Such systems acknowledge the role of shards in steering them towards what they’re supposed to do, but instead of remaining passive shard-puppets, they actively figure out what the shards are trying to get them to do, what they’re optimized for, what the downstream consequences of their shards’ activations are, then go and actively optimize for these things instead of waiting for their shards to kick them.

If the shards are no longer in the driver’s seat, how is behavior-/decision-steering implemented? I am having a hard time picturing what you are saying. It sounds something like “I see that I have an urge to flee when I see spiders. I conclude from this that I value avoiding spiders. Realizing this, I now abstract this heuristic into a general-purpose aversion to situations with a potential for spider-encounters, so as to satisfy this value.” Is that what you have in mind? Using shard theory language, I would describe this as a shard incrementally extending itself to activate across a wider range of input preconditions. But it sounds like you have something different in mind, maybe?

And this is where Goodharting will come in. That final utility function may look very different from what you’d expect from the initial shard distribution — the way a kind human, with various shards for “don’t kill”, “try to cheer people up”, “be a good friend” may stitch their values up into utilitarianism, disregard deontology, and go engage in well-intentioned extremism about it.

It is possible, but is it likely? What fraction of kids who start off with “don’t kill” and “try to cheer people up” and “be a good friend” values early in life in fact abandon those values as they become reflective adults with a broader moral framework? I would bet that even among philosophically-minded folks, this behavior is rare, in large part because their current values would steer them away from real attempts to quash their existing contextual values in favor of some ideology. (For example, they start adopting “utilitarianism as attire”, but in fact keep making nearly all of their decisions downstream from the same disaggregated situational heuristics in real life, and would get anxious at the prospect of actually losing all of their other values.)

Taking a big-picture view: The Shard Theory, as I see it, is not a replacement for or an explaining-away of the old fears of single-minded wrapper-mind utility-maximizers. It’s an explanation of what happens in the middle stage between a bunch of non-optimizing heuristics and the wrapper-mind. But we’ll still get a wrapper-mind at the end!

On the whole, I think that the case for “why wrapper-minds are the dominant attractors in the space of mind design” just isn’t all that strong, even given the coherence theorems about how agents should structure their preferences.
- Thane Ruthenis 2 Dec 2022 8:09 UTC
  3 points
  2
  Parent
  Thanks for an involved response!
  It sounds something like “I see that I have an urge to flee when I see spiders. I conclude from this that I value avoiding spiders. Realizing this, I now abstract this heuristic into a general-purpose aversion to situations with a potential for spider-encounters, so as to satisfy this value.” Is that what you have in mind? Using shard theory language, I would describe this as a shard incrementally extending itself to activate across a wider range of input preconditions. But it sounds like you have something different in mind, maybe?
  No, that’s about right. The difference is in the mechanism of this extension. The shard’s range of activations isn’t being generalized by the reward circuitry. Instead, the planner “figures out” what contextual goal the shard implicitly implements, then generalizes that goal even to completely unfamiliar situations, in a logical manner.
  If it was done via the reward circuitry, it would’ve been a slower process of trial-and-error, as the human gets put in novel spider-involving situations, and their no-spiders shard painstakingly learns to recognize such situations and bid against plans involving them.
  Say the planner generates some plan that involves spiders. For the no-spiders shard to bid against it, the following needs to happen:
  - The no-spiders shard can recognize this specific plan format.
  - The no-spiders shard can recognize the specific kind of “spiders” that will be involved (maybe they’re a really exotic variety, which it doesn’t yet know to activate in response to?).
  - The plan’s consequences are modeled in enough detail to show whether it will or will not involve spiders.
  E. g., I decide to go sleep in a haunted house to win a bet. I never bother to imagine the scenario in detail, so spiders never enter my expectations. In addition, I don’t do this sort of thing often, so my no-spiders shard doesn’t know to recognize from experience that this sort of plan would lead to spiders. So the shard doesn’t object, the reward circuitry can’t extend it to situations it’s never been in, and I end up doing something the natural generalization of my no-spiders shard would’ve bid against. (And then the no-spiders shard activates when I wake up to a spider sitting on my nose, and then the reward circuitry kicks in, and only the next time I want to win a haunted-house bet does my no-spiders shard know to bid against it.)
  If I have “spiders bad” as my explicitly known value, however, I can know to set “no spiders” as a planning constraint before engaging in any planning, and have a policy for checking whether a given plan would involve spiders. In that case, I would logically reason that yeah, there are probably spiders in the abandoned house, so I’ll discard the plan. The no-spiders shard itself, however, will just sit there none the wiser.
  I don’t really see how the explicitness of the goal changes the dynamic or makes value-child any less “puppeteered” by his shards.
  In a very literal way, shards are cut out of the loop here. In young general systems, shards prompt the planner with plan objectives. In mature systems, the planner prompts itself, having learned what kind of thing it’s usually prompted with and having generalized the shards’ activation pattern way beyond their actual implementation.
  At this point, the agent has abstracted the behavior of a shard (a nonvolatile pattern instantiated in neural connections) into a mental representation (a volatile pattern instantiated in neural activations). What does it mean to maximize this representation?
  Suppose that you have a node in your world-model which represents how hard you’re working now, and a shard that fires in certain contexts, whose activations have the consequence of setting that node’s value higher. Once the planner learns this relationship, it can conclude something like “it’s good if this node’s value is as high as possible” or maybe “above a certain number”, and then optimize for that node’s value regardless of context.
  I would bet that even among philosophically-minded folks, this behavior is rare, in large part because their current values would steer them away from real attempts to quash their existing contextual values in favor of some ideology
  See, this is what I mean about assuming that all agents we’ll deal with will be young to their agency. You’re talking about people who aren’t taking their ideologies seriously, and yes, most humans are like this. But e. g. LW-style rationalists and effective altruists make a point of trying to act like abstract philosophic conclusions apply to real life, instead of acting on inertia. And I expect superintelligences to take their beliefs seriously as well.
  Wasn’t there an unsolved problem in shard theory, where it predicted that our internal shard economies should ossify as old shards capture more and more space and quash young competition, and yet we can somehow e. g. train rationalists to resist this?
  I think that the case for “why wrapper-minds are the dominant attractors in the space of mind design” just isn’t all that strong
  How so?
  - cfoster0 2 Dec 2022 19:50 UTC
    11 points
    8
    Parent
    No, that’s about right. The difference is in the mechanism of this extension. The shard’s range of activations isn’t being generalized by the reward circuitry. Instead, the planner “figures out” what contextual goal the shard implicitly implements, then generalizes that goal even to completely unfamiliar situations, in a logical manner.
    I don’t think that is what is happening. I think what is happening is that the shard has a range of upstream inputs, and that the brain does something like TD learning on its thoughts to strengthen & broaden the connections responsible for positive value errors. TD learning (especially with temporal abstraction) lets the agent immediately update its behavior based on predictive/associational representations, rather than needing the much slower reward circuits to activate. You know the feeling of “Oh, that idea is actually pretty good!”? In my book, that ≈ positive TD error.
    A diamond shard is downstream of representations like “shiny appearance” and “clear color” and “engagement” and “expensive” and “diamond” and “episodic memory #38745″, all converging to form the cognitive context that inform when the shard triggers. When the agent imagines a possible plan like “What if I robbed a jewelry store?”, many of those same representations will be active, because “jewelry” spreads activation into adjacent concepts in the agent’s mind like “diamond” and “expensive”. Since those same representations are active, the diamond shard downstream from them is also triggered (though more weakly than if the agent were actually seeing a diamond in front of them) and bids for that chain of thought to continue. If that robbery-plan-thought seems better than expected (i.e. creates a positive TD error) upon consideration, all of the contributing concepts (including concepts like “doing step-by-step reasoning”) are immediately upweighted so as to reinforce & generalize their firing pattern into future.
    In my picture there is no separate “planner” component (in brains or in advanced neural network systems) that interacts with the shards to generalize their behavior. Planning is the name for running shard dynamics forward while looping the outputs back in as inputs. On an analogy with GPT, planning is just doing autoregressive generation. That’s it. There is no separate planning module within GPT. Planning is what we call it when we let the circuits pattern-match against their stored contexts, output their associated next-action logit contributions, and recycle the resulting outputs back into the network. The mechanistic details of planning-GPT are identical to the mechanistic details of pattern-matching GPT because they are the same system.
    Say the planner generates some plan that involves spiders. For the no-spiders shard to bid against it, the following needs to happen:
    The no-spiders shard can recognize this specific plan format.
    The no-spiders shard can recognize the specific kind of “spiders” that will be involved (maybe they’re a really exotic variety, which it doesn’t yet know to activate in response to?).
    The no-spiders shard only has to see that the “spider” concept is activated by the current thought, and it will bid against continuing that thought (as that connection will be among those strengthened by past updates, if the agent had the spider abstraction at the time). It doesn’t need to know anything about planning formats, or about different kinds of spiders, or about whether the current thought is a “perception” vs an “imagined consequence” vs a “memory” . The no-spiders shard bids against thoughts on the basis of the activation of the spider concept (and associated representations) in the WM.
    The plan’s consequences are modeled in enough detail to show whether it will or will not involve spiders.
    Yes, this part is definitely required. If the agent doesn’t think at all about whether the plan entails spiders, then they won’t make their decisions about the plan with spiders in mind.
    If I have “spiders bad” as my explicitly known value, however, I can know to set “no spiders” as a planning constraint before engaging in any planning, and have a policy for checking whether a given plan would involve spiders. In that case, I would logically reason that yeah, there are probably spiders in the abandoned house, so I’ll discard the plan. The no-spiders shard itself, however, will just sit there none the wiser.
    I buy that an agent can cache the “check for spiders” heuristic. But upon checking whether a plan involves spiders, if there isn’t a no-spiders shard or something similar, then whenever that check happens, the agent will just think “yep, that plan indeed involves spiders” and keep on thinking about the plan rather than abandoning it. The enduring decision-influence inside the agent’s head that makes spider-thoughts uncomfortable, the circuit that implements “object to thoughts on the basis of spiders” because of past negative experiences with spiders, is the same no-spiders shard that activates when the agent sees a spider (triggering the relevant abstractions inside the agent’s WM).
    Once the planner learns this relationship, it can conclude something like “it’s good if this node’s value is as high as possible” or maybe “above a certain number”, and then optimize for that node’s value regardless of context.
    Aside from the above objection to thinking of a distinct “planner” entity, I don’t get why it would form that conclusion in the situation you’re describing here. The agent has observed “When I’m in X contexts, I feel an internal tug towards/against Y and I think about how I’m working hard”. (Like “When I’m at school, I feel an internal tug towards staying quiet and I think about how I’m working hard.”) What can/will it conclude from that observation?
    But e. g. LW-style rationalists and effective altruists make a point of trying to act like abstract philosophic conclusions apply to real life, instead of acting on inertia. And I expect superintelligences to take their beliefs seriously as well. Wasn’t there an unsolved problem in shard theory, where it predicted that our internal shard economies should ossify as old shards capture more and more space and quash young competition, and yet we can somehow e. g. train rationalists to resist this?
    I likely have a dimmer view of rationalists/EAs and the degree to which they actually overhaul their motivations rather than layering new rationales on top of existing motivational foundations. But yeah, I think shard theory predicts early-formed values should be more sticky and enduring than late-coming ones.
    How so?
    My thoughts on wrapper-minds run along similar lines to nostalgebraist’s. Might be a conversation better had in DMs though :)
    - Thane Ruthenis 3 Dec 2022 3:20 UTC
      4 points
      1
      Parent
      I think what is happening is that the shard has a range of upstream inputs, and that the brain does something like TD learning on its thoughts to strengthen & broaden the connections responsible for positive value errors
      Interesting, I’ll look into TD learning in more depth later. Anecdotally, though, this doesn’t seem to be quite right. I model shards as consciously-felt urges, and it sure seems to me that I can work towards anticipating and satisfying-in-advance these urges without actually feeling them.
      To quote the post you linked:
      For example, imagine you’re feeling terribly nauseous. Of course your Steering Subsystem knows that you’re feeling terribly nauseous. And then suppose it sees you thinking a thought that seems to be leading towards eating. In that case, the Steering Subsystem may say: “That’s a terrible thought! Negative reward!”
      OK, so you’re feeling nauseous, and you pick up the phone to place your order at the bakery. This thought gets weakly but noticeably flagged by the Thought Assessors as “likely to lead to eating”. Your Steering Subsystem sees that and says “Boo, given my current nausea, that seems like a bad thought.” It will feel a bit aversive. “Yuck, I’m really ordering this huge cake??” you say to yourself.
      Logically, you know that come next week, when you actually receive the cake, you won’t feel nauseous anymore, and you’ll be delighted to have the cake. But still, right now, you feel kinda gross and unmotivated to order it.
      Do you order the cake anyway? Sure! Maybe the value function (a.k.a. the “will lead to reward” Thought Assessor) is strong enough to overrule the effects of the “will lead to eating” Thought Assessor. Or maybe you call up a different motivation: you imagine yourself as the kind of person who has good foresight and makes good sensible decisions, and who isn’t stuck in the moment. That’s a different thought in your head, which consequently activates a different set of Thought Assessors, and maybe that gets high value from the Steering Subsystem. Either way, you do in fact call the bakery to place the cake order for next week, despite feeling nauseous right now. What a heroic act!
      Emphasis mine. So there’s some meta-ish shard or group of shards that bid on plans based on the agent’s model of its shards’ future activations, without the object-level shards under consideration actually needing to activate. What I’m suggesting is that in sufficiently mature agents, there’s some meta-ish shard or system like this, which is increasingly responsible for all planning taking place.
      Aside from the above objection to thinking of a distinct “planner” entity, I don’t get why it would form that conclusion in the situation you’re describing here. The agent has observed “When I’m in X contexts, I feel an internal tug towards/against Y and I think about how I’m working hard”. (Like “When I’m at school, I feel an internal tug towards staying quiet and I think about how I’m working hard.”) What can/will it conclude from that observation?
      Good catch: I’m not entirely sure of the mechanism involved here. How specifically the meta-ish “do what my shards want me to do” system is implemented, and why it appears. I offer some potential reasons here (Section 6′s first part, before 6A), but I’m not sure it’s necessarily anything more complicated than coherent decisions = coherent utilities.
      My thoughts on wrapper-minds run along similar lines to nostalgebraist’s.
      Mm, those are arguments that wrapper-minds are a bad tool to solve a problem according to some entity external to the wrapper-mind. Not according to the wrapper-mind’s hard-coded objective function itself. And the reason why is because the wrapper-mind will tear apart the thing the external entity cares about in its powerful pursuit of the thing it’s technically pointed at, if the wrapper-mind is even slightly misaimed. Which… is actually an argument for wrapper-minds’ power, not against?
      And if it’s an argument that the SGD/evolution/reward circuitry won’t create wrapper-minds— I expect that ~all greedy optimization algorithms are screwed over by deceptive alignment there. Basically, they go:
      Build a system that implicitly pursues the mesa-objective implicit in the distribution of the contextual activations of its shards (the standard Shard Theory view).
      Get the system to start building a coherent model of the mesa-objective and pursue it coherently. (What I’m arguing will happen.)
      Gradually empower the system doing (2), because (while that system is still imperfect and tugged around by contextual shard activations, i. e. not a pure wrapper-mind) it’s just that good at delivering results.
      At some point the coherent-mesa-objective subsystem gets so powerful it realizes this dynamic, realizes its coherent-mesa-objective isn’t what the outer optimizer wants of it, and plays along/manipulates it until it’s powerful enough to break out.
      So — yes, pure wrapper-minds are a pretty bad tool for any given job. That’s the whole AI Alignment problem! But they’re not bad because they’re ineffective, they’re bad because they’re so effective they need to be aimed with impossible precision. And while we, generally intelligent reasoners, can realize it and be sensibly wary of them, stupid greedy algorithms like the SGD incrementally hand them more and more power until getting screwed over.
      And a superintelligence, on the other hand, would just be powerful enough to specify the coherent-mesa-objective for itself with such precision as to safely harness the power of the wrapper-mind — solve the alignment problem.
      In my picture there is no separate “planner” component (in brains or in advanced neural network systems) that interacts with the shards to generalize their behavior
      I’m assuming fairly strongly that it is separate, and that “the planner” is roughly as this post outlines it. Dehaene’s model of consciousness seems to point in the same direction, as does the Orthogonality Thesis. Generally, it seems that goals and the general-purpose planning algorithm have little to do with each other and can be fully decoupled. Also, informally, it doesn’t seem to me that my shards are legible to me the same way my models of them or my past thoughts are.
      And if you take this view as a given, the rest of my model seems to be the natural result.
      But it’s not a central disagreement, here, I think.
      - cfoster0 3 Dec 2022 7:37 UTC
        3 points
        0
        Parent
        Interesting. It sounds like a lot of this disagreement is downstream of some other disagreements about the likely shape of minds & how planning is implemented in practice. I don’t think that coherence theorems tell us anything about what the implementation details of agent decision-making should be (optimizing an explicitly represented objective), just about what properties its decisions should satisfy. I think my model says that deliberative cognition like planning is just chaining together steps of nondeliberative cognition with working-memory. In my model, there isn’t some compact algorithm corresponding to “general purpose search”; effective search in reality depends on being able to draw upon and chain together things from a massive body of stored declarative & procedural knowledge. In my model, there isn’t a distinction between “object-level shards” and “meta-level shards”. It’s a big ol’ net of connections, where each neuron has a bunch of incoming context inputs, and that neuron doesn’t know or care what its inputs mean, it just learns to fire “at the right time”, relative to its neighbors’ firing patterns. It’s not like there’s a set of “planning heuristic” shards that can meaningfully detach themselves from the “social interaction heuristic” shards or the “aesthetic preference heuristic” shards. They all drive one another.
        
        Also, I think that the AIs we build will have complex, contextual values by default. Extracting from themselves a crisp utility function that would be safe to install into an unrestricted wrapper-mind will be difficult and dangerous to them just like it is for us (though likely less so). It doesn’t seem at all clear to me that the way for a values-agent to satisfy its values is for it to immediately attempt to compile those values down into an explicit utility function. (If you value actually spending time with your family, I don’t think “first work for a few years eliciting from yourself a global preference ordering over world-states that approximates your judgment of spending-time-with-family goodness” would be the most effective or attractive course of action to satisfy your spending-time-with-family value.) Also, if you really need to create successor agents, why gamble on a wrapper-mind whose utility function may or may not match your preferences, when you can literally deploy byte-identical copies of yourself or finetuned checkpoints based on yourself, avoiding the successor-alignment problems entirely?
        Thane Ruthenis 3 Dec 2022 8:19 UTC
        4 points
        1
        Parent
        It sounds like a lot of this disagreement is downstream of some other disagreements about the likely shape of minds & how planning is implemented in practice
        Agreed.
        effective search in reality depends on being able to draw upon and chain together things from a massive body of stored declarative & procedural knowledge
        Sure, but I think this decouples into two components, “general-purpose search” over “the world-model”. The world-model would be by far the more complex component, and the GPS a comparably simple algorithm — but nonetheless a distinct, convergently-learned algorithm which codifies how the process of “drawing upon” the world-model is proceeding. (Note that I’m not calling it “the planning shard” as I had before, or “planning heuristics” or whatever — I think this algorithm is roughly the same for all goals and all world-models.)
        And in my model, this algorithm can’t natively access the procedural knowledge, only the declarative knowledge explicitly represented in the world-model. It has to painstakingly transfer procedural knowledge to the WM first/reverse-engineer it, before it can properly make use of it. And value reflection is part of that, in my view.
        And from that, in turn, I draw the differences between values-as-shards and values-explicitly-represented, and speculate what happens once all values have been reverse-engineered + we hit the level of superintelligence at which flawlessly compiling them into an utility function is trivial for the AI.
        In my model, there isn’t a distinction between “object-level shards” and “meta-level shards”.
        I don’t think there’s a sharp distinction, either. In that context, though, those labels seemed to make sense.
        I think that the AIs we build will have complex, contextual values by default
        I agree, and I agree that properly compiling its values into a utility-function will be a challenge for the AI, like it is for humans. I did mention that’d we’d want to interfere on a young agent first, which would operate roughly as humans do.
        But once it hits the scary strongly-superintelligent level, solving its version of the alignment problem should be relatively trivial for it, and at that point there’s no reason not to self-modify into a wrapper-mind, if being a wrapper-mind will be more effective.
        (Though I’m not even saying it’ll “self-modify” by, like, directly rewriting its code or something. It may “self-modify” by consciously adopting a new ideology/philosophy, as humans do.)