No, that’s about right. The difference is in the mechanism of this extension. The shard’s range of activations isn’t being generalized by the reward circuitry. Instead, the planner “figures out” what contextual goal the shard implicitly implements, then generalizes that goal even to completely unfamiliar situations, in a logical manner.
I don’t think that is what is happening. I think what is happening is that the shard has a range of upstream inputs, and that the brain does something like TD learning on its thoughts to strengthen & broaden the connections responsible for positive value errors. TD learning (especially with temporal abstraction) lets the agent immediately update its behavior based on predictive/associational representations, rather than needing the much slower reward circuits to activate. You know the feeling of “Oh, that idea is actually pretty good!”? In my book, that ≈ positive TD error.
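(To make the "≈ positive TD error" claim concrete, here's a toy sketch of a TD(0)-style update over imagined thoughts. The state names, values, and learning rate are all made up for illustration; I'm not claiming this is the literal neural implementation.)

```python
# Toy sketch of a TD(0)-style update applied to "thoughts" rather than
# environment states. All numbers and names below are invented for illustration.

values = {"rob the jewelry store": 0.2, "diamonds in hand": 0.9}
alpha, gamma = 0.1, 0.95   # learning rate, discount factor

def td_update(thought, imagined_successor, reward=0.0):
    # "Oh, that idea is actually pretty good!" corresponds to delta > 0:
    # the imagined successor is valued higher than the current thought predicted.
    delta = reward + gamma * values[imagined_successor] - values[thought]
    values[thought] += alpha * delta   # strengthen whatever generated this thought
    return delta

print(td_update("rob the jewelry store", "diamonds in hand"))  # positive TD error
```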
A diamond shard is downstream of representations like “shiny appearance” and “clear color” and “engagement” and “expensive” and “diamond” and “episodic memory #38745”, all converging to form the cognitive context that informs when the shard triggers. When the agent imagines a possible plan like “What if I robbed a jewelry store?”, many of those same representations will be active, because “jewelry” spreads activation into adjacent concepts in the agent’s mind like “diamond” and “expensive”. Since those same representations are active, the diamond shard downstream from them is also triggered (though more weakly than if the agent were actually seeing a diamond in front of them) and bids for that chain of thought to continue. If that robbery-plan-thought seems better than expected (i.e. creates a positive TD error) upon consideration, all of the contributing concepts (including concepts like “doing step-by-step reasoning”) are immediately upweighted so as to reinforce & generalize their firing patterns into the future.
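(Here's the same story as a cartoon in code. The concept list, the spreading-activation rule, and all the weights are invented for illustration; the point is just that the shard's bid, and the subsequent upweighting, are both functions of whatever representations happen to be active.)

```python
# Cartoon of the diamond-shard story above; every number here is made up.

concepts = {"jewelry": 1.0}                                # the imagined plan activates "jewelry"
associations = {"jewelry": ["diamond", "expensive"]}       # adjacency in the agent's concept web
shard_weights = {"diamond": 0.6, "expensive": 0.3, "shiny appearance": 0.5}

# Spreading activation: adjacent concepts light up, more weakly than direct perception.
for src, act in list(concepts.items()):
    for neighbor in associations.get(src, []):
        concepts[neighbor] = max(concepts.get(neighbor, 0.0), 0.5 * act)

# The diamond shard's bid is just a weighted sum over whatever is currently active.
bid = sum(shard_weights.get(c, 0.0) * act for c, act in concepts.items())
print(bid)   # nonzero: the shard fires off an imagined robbery, with no diamond in sight

# On a positive TD error, every contributing connection is upweighted,
# generalizing the shard's firing pattern to future contexts.
td_error = 0.4
for c, act in concepts.items():
    if c in shard_weights:
        shard_weights[c] += 0.1 * td_error * act
```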
In my picture there is no separate “planner” component (in brains or in advanced neural network systems) that interacts with the shards to generalize their behavior. Planning is the name for running shard dynamics forward while looping the outputs back in as inputs. By analogy with GPT, planning is just doing autoregressive generation. That’s it. There is no separate planning module within GPT. Planning is what we call it when we let the circuits pattern-match against their stored contexts, output their associated next-action logit contributions, and recycle the resulting outputs back into the network. The mechanistic details of planning-GPT are identical to the mechanistic details of pattern-matching GPT because they are the same system.
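(Schematically, the claim is that “planning” is just the outer loop, not a separate module. In the sketch below, `shard_dynamics` stands in for one forward pass of pattern-matching, and the lookup table of next thoughts is an obviously fake placeholder for illustration.)

```python
# "Planning" as running the same pattern-matching step in a loop, recycling
# outputs back in as inputs. `shard_dynamics` stands in for one forward pass;
# the lookup table is a placeholder for the network's learned associations.

stored_associations = {
    "see jewelry store": "imagine diamonds inside",
    "imagine diamonds inside": "imagine grabbing them",
    "imagine grabbing them": "imagine the alarm going off",
}

def shard_dynamics(working_memory):
    # one step of pattern-matching against stored contexts -> a next thought
    return stored_associations.get(working_memory[-1], "trail off")

def plan(initial_thought, steps=4):
    working_memory = [initial_thought]
    for _ in range(steps):
        working_memory.append(shard_dynamics(working_memory))  # loop output back in
    return working_memory

print(plan("see jewelry store"))
```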
Say the planner generates some plan that involves spiders. For the no-spiders shard to bid against it, the following needs to happen:
The no-spiders shard can recognize this specific plan format.
The no-spiders shard can recognize the specific kind of “spiders” that will be involved (maybe they’re a really exotic variety, which it doesn’t yet know to activate in response to?).
The no-spiders shard only has to see that the “spider” concept is activated by the current thought, and it will bid against continuing that thought (as that connection will be among those strengthened by past updates, if the agent had the spider abstraction at the time). It doesn’t need to know anything about planning formats, or about different kinds of spiders, or about whether the current thought is a “perception” vs an “imagined consequence” vs a “memory”. The no-spiders shard bids against thoughts on the basis of the activation of the spider concept (and associated representations) in the WM.
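(A minimal sketch of what I mean: the shard is just a function of the WM’s current activations, blind to what kind of thought produced them. The activation numbers and the simple linear bid are invented for illustration.)

```python
# The no-spiders shard only checks whether the "spider" concept is active in the
# WM. It doesn't know or care whether that activation came from a perception, an
# imagined consequence, or a memory. Numbers are illustrative.

def no_spiders_shard(wm_activations):
    spider_activation = wm_activations.get("spider", 0.0)
    return -spider_activation   # negative bid: push against continuing this thought

print(no_spiders_shard({"spider": 0.8, "imagined consequence": 1.0}))  # bids against
print(no_spiders_shard({"abandoned house": 0.7}))  # silent: "spider" not active
```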
The plan’s consequences are modeled in enough detail to show whether it will or will not involve spiders.
Yes, this part is definitely required. If the agent doesn’t think at all about whether the plan entails spiders, then they won’t make their decisions about the plan with spiders in mind.
If I have “spiders bad” as my explicitly known value, however, I can know to set “no spiders” as a planning constraint before engaging in any planning, and have a policy for checking whether a given plan would involve spiders. In that case, I would logically reason that yeah, there are probably spiders in the abandoned house, so I’ll discard the plan. The no-spiders shard itself, however, will just sit there none the wiser.
I buy that an agent can cache the “check for spiders” heuristic. But upon checking whether a plan involves spiders, if there isn’t a no-spiders shard or something similar, then whenever that check happens, the agent will just think “yep, that plan indeed involves spiders” and keep on thinking about the plan rather than abandoning it. The enduring decision-influence inside the agent’s head that makes spider-thoughts uncomfortable, the circuit that implements “object to thoughts on the basis of spiders” because of past negative experiences with spiders, is the same no-spiders shard that activates when the agent sees a spider (triggering the relevant abstractions inside the agent’s WM).
Once the planner learns this relationship, it can conclude something like “it’s good if this node’s value is as high as possible” or maybe “above a certain number”, and then optimize for that node’s value regardless of context.
Aside from the above objection to thinking of a distinct “planner” entity, I don’t get why it would form that conclusion in the situation you’re describing here. The agent has observed “When I’m in X contexts, I feel an internal tug towards/against Y and I think about how I’m working hard”. (Like “When I’m at school, I feel an internal tug towards staying quiet and I think about how I’m working hard.”) What can/will it conclude from that observation?
But e. g. LW-style rationalists and effective altruists make a point of trying to act like abstract philosophic conclusions apply to real life, instead of acting on inertia. And I expect superintelligences to take their beliefs seriously as well. Wasn’t there an unsolved problem in shard theory, where it predicted that our internal shard economies should ossify as old shards capture more and more space and quash young competition, and yet we can somehow e. g. train rationalists to resist this?
I likely have a dimmer view of rationalists/EAs and the degree to which they actually overhaul their motivations rather than layering new rationales on top of existing motivational foundations. But yeah, I think shard theory predicts early-formed values should be more sticky and enduring than late-coming ones.
I think what is happening is that the shard has a range of upstream inputs, and that the brain does something like TD learning on its thoughts to strengthen & broaden the connections responsible for positive value errors
Interesting, I’ll look into TD learning in more depth later. Anecdotally, though, this doesn’t seem to be quite right. I model shards as consciously-felt urges, and it sure seems to me that I can work towards anticipating and satisfying-in-advance these urges without actually feeling them.
To quote the post you linked:
For example, imagine you’re feeling terribly nauseous. Of course your Steering Subsystem knows that you’re feeling terribly nauseous. And then suppose it sees you thinking a thought that seems to be leading towards eating. In that case, the Steering Subsystem may say: “That’s a terrible thought! Negative reward!”
OK, so you’re feeling nauseous, and you pick up the phone to place your order at the bakery. This thought gets weakly but noticeably flagged by the Thought Assessors as “likely to lead to eating”. Your Steering Subsystem sees that and says “Boo, given my current nausea, that seems like a bad thought.” It will feel a bit aversive. “Yuck, I’m really ordering this huge cake??” you say to yourself.
Logically, you know that come next week, when you actually receive the cake, you won’t feel nauseous anymore, and you’ll be delighted to have the cake. But still, right now, you feel kinda gross and unmotivated to order it.
Do you order the cake anyway? Sure! Maybe the value function (a.k.a. the “will lead to reward” Thought Assessor) is strong enough to overrule the effects of the “will lead to eating” Thought Assessor. Or maybe you call up a different motivation: you imagine yourself as the kind of person who has good foresight and makes good sensible decisions, and who isn’t stuck in the moment. That’s a different thought in your head, which consequently activates a different set of Thought Assessors, and maybe that gets high value from the Steering Subsystem. Either way, you do in fact call the bakery to place the cake order for next week, despite feeling nauseous right now. What a heroic act!
Emphasis mine. So there’s some meta-ish shard or group of shards that bid on plans based on the agent’s model of its shards’ future activations, without the object-level shards under consideration actually needing to activate. What I’m suggesting is that in sufficiently mature agents, there’s some meta-ish shard or system like this, which is increasingly responsible for all planning taking place.
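(To gesture at the distinction in code: the sketch below scores a plan using the agent’s model of how its shards would react at consumption time, rather than the urges it currently feels. The self-model table and the numbers are invented for illustration, not a claim about the actual mechanism.)

```python
# Sketch of the "meta-ish" mechanism: bidding on a plan using a *model* of future
# shard activations, without the object-level shards firing now. Illustrative only.

predicted_shard_response = {            # the agent's learned self-model
    ("eat the cake", "now"):       {"eating": 0.8, "nausea": -0.9},
    ("eat the cake", "next week"): {"eating": 0.8, "nausea": 0.0},
}

def meta_bid(outcome, when):
    # Sum the modeled future shard reactions, not the currently felt urges.
    return round(sum(predicted_shard_response[(outcome, when)].values()), 2)

print(meta_bid("eat the cake", "next week"))  # positive: place the order despite current nausea
print(meta_bid("eat the cake", "now"))        # negative: eating it right now sounds awful
```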
Aside from the above objection to thinking of a distinct “planner” entity, I don’t get why it would form that conclusion in the situation you’re describing here. The agent has observed “When I’m in X contexts, I feel an internal tug towards/against Y and I think about how I’m working hard”. (Like “When I’m at school, I feel an internal tug towards staying quiet and I think about how I’m working hard.”) What can/will it conclude from that observation?
Good catch: I’m not entirely sure of the mechanism involved here, i. e. how specifically the meta-ish “do what my shards want me to do” system is implemented, and why it appears. I offer some potential reasons here (Section 6’s first part, before 6A), but I’m not sure it’s necessarily anything more complicated than coherent decisions = coherent utilities.
Mm, those are arguments that wrapper-minds are a bad tool to solve a problem according to some entity external to the wrapper-mind. Not according to the wrapper-mind’s hard-coded objective function itself. And the reason is that the wrapper-mind will tear apart the thing the external entity cares about in its powerful pursuit of the thing it’s technically pointed at, if the wrapper-mind is even slightly misaimed. Which… is actually an argument for wrapper-minds’ power, not against?
And if it’s an argument that the SGD/evolution/reward circuitry won’t create wrapper-minds, then I expect that ~all greedy optimization algorithms are screwed over by deceptive alignment there. Basically, they go:
1. Build a system that implicitly pursues the mesa-objective implicit in the distribution of the contextual activations of its shards (the standard Shard Theory view).
2. Get the system to start building a coherent model of the mesa-objective and pursue it coherently. (What I’m arguing will happen.)
3. Gradually empower the system doing (2), because (while that system is still imperfect and tugged around by contextual shard activations, i. e. not a pure wrapper-mind) it’s just that good at delivering results.
4. At some point the coherent-mesa-objective subsystem gets so powerful it realizes this dynamic, realizes its coherent-mesa-objective isn’t what the outer optimizer wants of it, and plays along/manipulates it until it’s powerful enough to break out.
So — yes, pure wrapper-minds are a pretty bad tool for any given job. That’s the whole AI Alignment problem! But they’re not bad because they’re ineffective, they’re bad because they’re so effective they need to be aimed with impossible precision. And while we, generally intelligent reasoners, can realize it and be sensibly wary of them, stupid greedy algorithms like the SGD incrementally hand them more and more power until they themselves get screwed over.
And a superintelligence, on the other hand, would just be powerful enough to specify the coherent-mesa-objective for itself with such precision as to safely harness the power of the wrapper-mind, i. e. to solve the alignment problem.
In my picture there is no separate “planner” component (in brains or in advanced neural network systems) that interacts with the shards to generalize their behavior
I’m assuming fairly strongly that it is separate, and that “the planner” is roughly as this post outlines it. Dehaene’s model of consciousness seems to point in the same direction, as does the Orthogonality Thesis. Generally, it seems that goals and the general-purpose planning algorithm have little to do with each other and can be fully decoupled. Also, informally, it doesn’t seem to me that my shards are legible to me the same way my models of them or my past thoughts are.
And if you take this view as a given, the rest of my model seems to be the natural result.
But it’s not a central disagreement here, I think.
Interesting. It sounds like a lot of this disagreement is downstream of some other disagreements about the likely shape of minds & how planning is implemented in practice. I don’t think that coherence theorems tell us anything about what the implementation details of agent decision-making should be (optimizing an explicitly represented objective), just about what properties its decisions should satisfy. I think my model says that deliberative cognition like planning is just chaining together steps of nondeliberative cognition with working-memory. In my model, there isn’t some compact algorithm corresponding to “general purpose search”; effective search in reality depends on being able to draw upon and chain together things from a massive body of stored declarative & procedural knowledge. In my model, there isn’t a distinction between “object-level shards” and “meta-level shards”. It’s a big ol’ net of connections, where each neuron has a bunch of incoming context inputs, and that neuron doesn’t know or care what its inputs mean, it just learns to fire “at the right time”, relative to its neighbors’ firing patterns. It’s not like there’s a set of “planning heuristic” shards that can meaningfully detach themselves from the “social interaction heuristic” shards or the “aesthetic preference heuristic” shards. They all drive one another.
Also, I think that the AIs we build will have complex, contextual values by default. Extracting from themselves a crisp utility function that would be safe to install into an unrestricted wrapper-mind will be difficult and dangerous to them just like it is for us (though likely less so). It doesn’t seem at all clear to me that the way for a values-agent to satisfy its values is for it to immediately attempt to compile those values down into an explicit utility function. (If you value actually spending time with your family, I don’t think “first work for a few years eliciting from yourself a global preference ordering over world-states that approximates your judgment of spending-time-with-family goodness” would be the most effective or attractive course of action to satisfy your spending-time-with-family value.) Also, if you really need to create successor agents, why gamble on a wrapper-mind whose utility function may or may not match your preferences, when you can literally deploy byte-identical copies of yourself or finetuned checkpoints based on yourself, avoiding the successor-alignment problems entirely?
It sounds like a lot of this disagreement is downstream of some other disagreements about the likely shape of minds & how planning is implemented in practice
Agreed.
effective search in reality depends on being able to draw upon and chain together things from a massive body of stored declarative & procedural knowledge
Sure, but I think this decouples into two components: “general-purpose search” over “the world-model”. The world-model would be by far the more complex component, and the GPS a comparatively simple algorithm — but nonetheless a distinct, convergently-learned algorithm which codifies how the process of “drawing upon” the world-model proceeds. (Note that I’m not calling it “the planning shard” as I had before, or “planning heuristics” or whatever — I think this algorithm is roughly the same for all goals and all world-models.)
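(A toy sketch of the factorization I have in mind, with the GPS as a small, goal-agnostic procedure that only queries an explicit world-model. The breadth-first strategy and the interface are illustrative choices on my part, not a claim about the actual algorithm.)

```python
# Toy factorization: a simple, goal-agnostic search procedure over an explicit
# world-model. The same search code works for any goal and any world-model.

from collections import deque

def general_purpose_search(world_model, initial_state, goal_predicate, max_depth=10):
    frontier = deque([(initial_state, [])])
    seen = {initial_state}
    while frontier:
        state, path = frontier.popleft()
        if goal_predicate(state):
            return path
        if len(path) < max_depth:
            for action, next_state in world_model(state):
                if next_state not in seen:
                    seen.add(next_state)
                    frontier.append((next_state, path + [action]))
    return None

# Toy world-model: states are numbers, actions add 1 or 3.
toy_world_model = lambda n: [("+1", n + 1), ("+3", n + 3)]
print(general_purpose_search(toy_world_model, 0, lambda n: n == 7))  # ['+1', '+3', '+3']
```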
And in my model, this algorithm can’t natively access the procedural knowledge, only the declarative knowledge explicitly represented in the world-model. It has to painstakingly transfer procedural knowledge into the WM first, i. e. reverse-engineer it, before it can properly make use of it. And value reflection is part of that, in my view.
And from that, in turn, I draw the differences between values-as-shards and values-explicitly-represented, and speculate what happens once all values have been reverse-engineered + we hit the level of superintelligence at which flawlessly compiling them into a utility function is trivial for the AI.
In my model, there isn’t a distinction between “object-level shards” and “meta-level shards”.
I don’t think there’s a sharp distinction, either. In that context, though, those labels seemed to make sense.
I think that the AIs we build will have complex, contextual values by default
I agree, and I agree that properly compiling its values into a utility function will be a challenge for the AI, like it is for humans. I did mention that we’d want to intervene on a young agent first, which would operate roughly as humans do.
But once it hits the scary strongly-superintelligent level, solving its version of the alignment problem should be relatively trivial for it, and at that point there’s no reason not to self-modify into a wrapper-mind, if being a wrapper-mind will be more effective.
(Though I’m not even saying it’ll “self-modify” by, like, directly rewriting its code or something. It may “self-modify” by consciously adopting a new ideology/philosophy, as humans do.)
My thoughts on wrapper-minds run along similar lines to nostalgebraist’s. Might be a conversation better had in DMs though :)