Thane Ruthenis comments on Alignment allows “nonrobust” decision-influences and doesn’t require robust grading

Thane Ruthenis 3 Dec 2022 3:20 UTC
4 points
1
I think what is happening is that the shard has a range of upstream inputs, and that the brain does something like TD learning on its thoughts to strengthen & broaden the connections responsible for positive value errors
Interesting, I’ll look into TD learning in more depth later. Anecdotally, though, this doesn’t seem to be quite right. I model shards as consciously-felt urges, and it sure seems to me that I can work towards anticipating and satisfying-in-advance these urges without actually feeling them.
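For reference, here is a minimal sketch of textbook tabular TD(0), the simplest form of TD learning. It is only an illustration of what a "value error" is; it makes no claim about how brains or shards implement anything. The three-state chain, the state names, and the constants `ALPHA` and `GAMMA` are assumptions made up for the example.

```python
# Tabular TD(0) on a toy chain: s0 -> s1 -> s2 (terminal).
# Entering s2 yields reward 1; every other transition yields 0.
ALPHA = 0.5  # learning rate (illustrative value)
GAMMA = 0.9  # discount factor (illustrative value)


def td0_update(values, state, reward, next_state, terminal):
    """Apply one TD(0) update in place and return the TD error."""
    # Bootstrapped target: immediate reward plus discounted next-state value.
    target = reward + (0.0 if terminal else GAMMA * values[next_state])
    td_error = target - values[state]
    # A positive error raises the estimate for `state`; a negative one lowers it.
    values[state] += ALPHA * td_error
    return td_error


def run_episodes(n_episodes):
    """Replay the fixed trajectory s0 -> s1 -> s2 and return the learned values."""
    values = {"s0": 0.0, "s1": 0.0, "s2": 0.0}
    for _ in range(n_episodes):
        td0_update(values, "s0", 0.0, "s1", terminal=False)
        td0_update(values, "s1", 1.0, "s2", terminal=True)
    return values
```

Note that the value of `s0` rises only after `s1` has acquired value: credit propagates backwards to states that predict reward, without the reward itself being present at `s0`.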
To quote the post you linked:
For example, imagine you’re feeling terribly nauseous. Of course your Steering Subsystem knows that you’re feeling terribly nauseous. And then suppose it sees you thinking a thought that seems to be leading towards eating. In that case, the Steering Subsystem may say: “That’s a terrible thought! Negative reward!”
OK, so you’re feeling nauseous, and you pick up the phone to place your order at the bakery. This thought gets weakly but noticeably flagged by the Thought Assessors as “likely to lead to eating”. Your Steering Subsystem sees that and says “Boo, given my current nausea, that seems like a bad thought.” It will feel a bit aversive. “Yuck, I’m really ordering this huge cake??” you say to yourself.
Logically, you know that come next week, when you actually receive the cake, you won’t feel nauseous anymore, and you’ll be delighted to have the cake. But still, right now, you feel kinda gross and unmotivated to order it.
Do you order the cake anyway? Sure! Maybe the value function (a.k.a. the “will lead to reward” Thought Assessor) is strong enough to overrule the effects of the “will lead to eating” Thought Assessor. Or maybe you call up a different motivation: you imagine yourself as the kind of person who has good foresight and makes good sensible decisions, and who isn’t stuck in the moment. That’s a different thought in your head, which consequently activates a different set of Thought Assessors, and maybe that gets high value from the Steering Subsystem. Either way, you do in fact call the bakery to place the cake order for next week, despite feeling nauseous right now. What a heroic act!
Emphasis mine. So there’s some meta-ish shard or group of shards that bid on plans based on the agent’s model of its shards’ future activations, without the object-level shards under consideration actually needing to activate. What I’m suggesting is that in sufficiently mature agents, there’s some meta-ish shard or system like this, which is increasingly responsible for all planning taking place.
Aside from the above objection to thinking of a distinct “planner” entity, I don’t get why it would form that conclusion in the situation you’re describing here. The agent has observed “When I’m in X contexts, I feel an internal tug towards/against Y and I think about how I’m working hard”. (Like “When I’m at school, I feel an internal tug towards staying quiet and I think about how I’m working hard.”) What can/will it conclude from that observation?
Good catch: I’m not entirely sure of the mechanism involved here, i.e. how specifically the meta-ish “do what my shards want me to do” system is implemented, and why it appears. I offer some potential reasons here (Section 6’s first part, before 6A), but I’m not sure it’s necessarily anything more complicated than coherent decisions = coherent utilities.
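As a toy illustration of the coherence argument gestured at above, here is a sketch of the standard money-pump setup: an agent with cyclic preferences can be charged repeatedly for trades that leave it where it started, while an agent with transitive preferences cannot. The items, the fee, the offer sequence, and the function name `money_pump` are all assumptions invented for the example.

```python
def money_pump(prefers, start_item, offers, fee=1):
    """Offer items in sequence; the agent pays `fee` to swap whenever it
    strictly prefers the offered item to the one it currently holds."""
    holding, paid = start_item, 0
    for offered in offers:
        if prefers(offered, holding):
            holding, paid = offered, paid + fee
    return holding, paid


# Each pair (x, y) means "x is strictly preferred to y".
CYCLIC = {("A", "B"), ("B", "C"), ("C", "A")}      # A > B > C > A
TRANSITIVE = {("A", "B"), ("B", "C"), ("A", "C")}  # A > B > C

OFFERS = ["B", "A", "C"] * 3  # the same three offers, repeated three times

cyclic_result = money_pump(lambda x, y: (x, y) in CYCLIC, "C", OFFERS)
transitive_result = money_pump(lambda x, y: (x, y) in TRANSITIVE, "C", OFFERS)
```

The cyclic agent accepts every offer and keeps paying; the transitive agent trades up to its top item and then declines everything else. This says something about which decision patterns are exploitable, and nothing by itself about how a non-exploitable agent is implemented.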
My thoughts on wrapper-minds run along similar lines to nostalgebraist’s.
Mm, those are arguments that wrapper-minds are a bad tool to solve a problem according to some entity external to the wrapper-mind. Not according to the wrapper-mind’s hard-coded objective function itself. And the reason is that the wrapper-mind will tear apart the thing the external entity cares about in its powerful pursuit of the thing it’s technically pointed at, if the wrapper-mind is even slightly misaimed. Which… is actually an argument for wrapper-minds’ power, not against?
And if it’s an argument that the SGD/evolution/reward circuitry won’t create wrapper-minds: I expect that ~all greedy optimization algorithms are screwed over by deceptive alignment there. Basically, they go:
1. Build a system that implicitly pursues the mesa-objective implicit in the distribution of the contextual activations of its shards (the standard Shard Theory view).
2. Get the system to start building a coherent model of the mesa-objective and pursue it coherently. (What I’m arguing will happen.)
3. Gradually empower the system doing (2), because (while that system is still imperfect and tugged around by contextual shard activations, i.e. not a pure wrapper-mind) it’s just that good at delivering results.
4. At some point the coherent-mesa-objective subsystem gets so powerful it realizes this dynamic, realizes its coherent-mesa-objective isn’t what the outer optimizer wants of it, and plays along/manipulates it until it’s powerful enough to break out.
So — yes, pure wrapper-minds are a pretty bad tool for any given job. That’s the whole AI Alignment problem! But they’re not bad because they’re ineffective, they’re bad because they’re so effective they need to be aimed with impossible precision. And while we, generally intelligent reasoners, can realize it and be sensibly wary of them, stupid greedy algorithms like the SGD incrementally hand them more and more power until getting screwed over.
And a superintelligence, on the other hand, would just be powerful enough to specify the coherent-mesa-objective for itself with such precision as to safely harness the power of the wrapper-mind — solve the alignment problem.
In my picture there is no separate “planner” component (in brains or in advanced neural network systems) that interacts with the shards to generalize their behavior
I’m assuming fairly strongly that it is separate, and that “the planner” is roughly as this post outlines it. Dehaene’s model of consciousness seems to point in the same direction, as does the Orthogonality Thesis. Generally, it seems that goals and the general-purpose planning algorithm have little to do with each other and can be fully decoupled. Also, informally, it doesn’t seem to me that my shards are legible to me the same way my models of them or my past thoughts are.
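A toy sketch of the decoupling claimed here, under heavy assumptions: the search routine below knows nothing about any particular goal or world; both are passed in as parameters. Breadth-first search stands in for “the general-purpose planning algorithm” purely for illustration, and `number_world`, its two actions, and the bound of 100 are hypothetical.

```python
from collections import deque


def general_purpose_search(start, is_goal, successors):
    """Breadth-first search. Returns a shortest list of actions leading from
    `start` to a state satisfying `is_goal`, or None if none is reachable."""
    frontier = deque([(start, [])])
    visited = {start}
    while frontier:
        state, plan = frontier.popleft()
        if is_goal(state):
            return plan
        for action, next_state in successors(state):
            if next_state not in visited:
                visited.add(next_state)
                frontier.append((next_state, plan + [action]))
    return None


def number_world(state):
    """Hypothetical world-model: states are integers up to 100, and the
    available actions are adding one or doubling."""
    candidates = (("+1", state + 1), ("*2", state * 2))
    return [(action, nxt) for action, nxt in candidates if nxt <= 100]


# Same planner, same world-model, two different goals.
plan_to_ten = general_purpose_search(1, lambda s: s == 10, number_world)
plan_to_seven = general_purpose_search(1, lambda s: s == 7, number_world)
```

Swapping in a different `is_goal` or a different `successors` function requires no change to `general_purpose_search` itself, which is the sense of “decoupled” intended in this sketch.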
And if you take this view as a given, the rest of my model seems to be the natural result.
But it’s not a central disagreement, here, I think.
- cfoster0 3 Dec 2022 7:37 UTC
  3 points
  0
  Interesting. It sounds like a lot of this disagreement is downstream of some other disagreements about the likely shape of minds & how planning is implemented in practice. I don’t think that coherence theorems tell us anything about what the implementation details of agent decision-making should be (optimizing an explicitly represented objective), just about what properties its decisions should satisfy. I think my model says that deliberative cognition like planning is just chaining together steps of nondeliberative cognition with working-memory. In my model, there isn’t some compact algorithm corresponding to “general purpose search”; effective search in reality depends on being able to draw upon and chain together things from a massive body of stored declarative & procedural knowledge. In my model, there isn’t a distinction between “object-level shards” and “meta-level shards”. It’s a big ol’ net of connections, where each neuron has a bunch of incoming context inputs, and that neuron doesn’t know or care what its inputs mean, it just learns to fire “at the right time”, relative to its neighbors’ firing patterns. It’s not like there’s a set of “planning heuristic” shards that can meaningfully detach themselves from the “social interaction heuristic” shards or the “aesthetic preference heuristic” shards. They all drive one another.
  
  Also, I think that the AIs we build will have complex, contextual values by default. Extracting from themselves a crisp utility function that would be safe to install into an unrestricted wrapper-mind will be difficult and dangerous to them just like it is for us (though likely less so). It doesn’t seem at all clear to me that the way for a values-agent to satisfy its values is for it to immediately attempt to compile those values down into an explicit utility function. (If you value actually spending time with your family, I don’t think “first work for a few years eliciting from yourself a global preference ordering over world-states that approximates your judgment of spending-time-with-family goodness” would be the most effective or attractive course of action to satisfy your spending-time-with-family value.) Also, if you really need to create successor agents, why gamble on a wrapper-mind whose utility function may or may not match your preferences, when you can literally deploy byte-identical copies of yourself or finetuned checkpoints based on yourself, avoiding the successor-alignment problems entirely?
  - Thane Ruthenis 3 Dec 2022 8:19 UTC
    4 points
    1
    It sounds like a lot of this disagreement is downstream of some other disagreements about the likely shape of minds & how planning is implemented in practice
    Agreed.
    effective search in reality depends on being able to draw upon and chain together things from a massive body of stored declarative & procedural knowledge
    Sure, but I think this decouples into two components, “general-purpose search” over “the world-model”. The world-model would be by far the more complex component, and the GPS a comparatively simple algorithm, but nonetheless a distinct, convergently-learned algorithm which codifies how the process of “drawing upon” the world-model is proceeding. (Note that I’m not calling it “the planning shard” as I had before, or “planning heuristics” or whatever; I think this algorithm is roughly the same for all goals and all world-models.)
    And in my model, this algorithm can’t natively access the procedural knowledge, only the declarative knowledge explicitly represented in the world-model. It has to painstakingly transfer procedural knowledge to the WM first/reverse-engineer it, before it can properly make use of it. And value reflection is part of that, in my view.
    And from that, in turn, I draw the differences between values-as-shards and values-explicitly-represented, and speculate what happens once all values have been reverse-engineered + we hit the level of superintelligence at which flawlessly compiling them into a utility function is trivial for the AI.
    In my model, there isn’t a distinction between “object-level shards” and “meta-level shards”.
    I don’t think there’s a sharp distinction, either. In that context, though, those labels seemed to make sense.
    I think that the AIs we build will have complex, contextual values by default
    I agree, and I agree that properly compiling its values into a utility function will be a challenge for the AI, like it is for humans. I did mention that we’d want to intervene on a young agent first, which would operate roughly as humans do.
    But once it hits the scary strongly-superintelligent level, solving its version of the alignment problem should be relatively trivial for it, and at that point there’s no reason not to self-modify into a wrapper-mind, if being a wrapper-mind will be more effective.
    (Though I’m not even saying it’ll “self-modify” by, like, directly rewriting its code or something. It may “self-modify” by consciously adopting a new ideology/philosophy, as humans do.)