cfoster0 comments on Alignment allows “nonrobust” decision-influences and doesn’t require robust grading

cfoster0 3 Dec 2022 7:37 UTC
3 points
0
Interesting. It sounds like a lot of this disagreement is downstream of some other disagreements about the likely shape of minds & how planning is implemented in practice. I don’t think that coherence theorems tell us anything about what the implementation details of agent decision-making should be (optimizing an explicitly represented objective), just about what properties its decisions should satisfy. I think my model says that deliberative cognition like planning is just chaining together steps of nondeliberative cognition with working-memory. In my model, there isn’t some compact algorithm corresponding to “general purpose search”; effective search in reality depends on being able to draw upon and chain together things from a massive body of stored declarative & procedural knowledge. In my model, there isn’t a distinction between “object-level shards” and “meta-level shards”. It’s a big ol’ net of connections, where each neuron has a bunch of incoming context inputs, and that neuron doesn’t know or care what its inputs mean, it just learns to fire “at the right time”, relative to its neighbors’ firing patterns. It’s not like there’s a set of “planning heuristic” shards that can meaningfully detach themselves from the “social interaction heuristic” shards or the “aesthetic preference heuristic” shards. They all drive one another.

Also, I think that the AIs we build will have complex, contextual values by default. Extracting from themselves a crisp utility function that would be safe to install into an unrestricted wrapper-mind will be difficult and dangerous to them just like it is for us (though likely less so). It doesn’t seem at all clear to me that the way for a values-agent to satisfy its values is for it to immediately attempt to compile those values down into an explicit utility function. (If you value actually spending time with your family, I don’t think “first work for a few years eliciting from yourself a global preference ordering over world-states that approximates your judgment of spending-time-with-family goodness” would be the most effective or attractive course of action to satisfy your spending-time-with-family value.) Also, if you really need to create successor agents, why gamble on a wrapper-mind whose utility function may or may not match your preferences, when you can literally deploy byte-identical copies of yourself or finetuned checkpoints based on yourself, avoiding the successor-alignment problems entirely?
- Thane Ruthenis 3 Dec 2022 8:19 UTC
  4 points
  1
  Parent
  It sounds like a lot of this disagreement is downstream of some other disagreements about the likely shape of minds & how planning is implemented in practice
  Agreed.
  effective search in reality depends on being able to draw upon and chain together things from a massive body of stored declarative & procedural knowledge
  Sure, but I think this decouples into two components, “general-purpose search” over “the world-model”. The world-model would be by far the more complex component, and the GPS a comparably simple algorithm — but nonetheless a distinct, convergently-learned algorithm which codifies how the process of “drawing upon” the world-model is proceeding. (Note that I’m not calling it “the planning shard” as I had before, or “planning heuristics” or whatever — I think this algorithm is roughly the same for all goals and all world-models.)
  And in my model, this algorithm can’t natively access the procedural knowledge, only the declarative knowledge explicitly represented in the world-model. It has to painstakingly transfer procedural knowledge to the WM first/reverse-engineer it, before it can properly make use of it. And value reflection is part of that, in my view.
  And from that, in turn, I draw the differences between values-as-shards and values-explicitly-represented, and speculate what happens once all values have been reverse-engineered + we hit the level of superintelligence at which flawlessly compiling them into an utility function is trivial for the AI.
  In my model, there isn’t a distinction between “object-level shards” and “meta-level shards”.
  I don’t think there’s a sharp distinction, either. In that context, though, those labels seemed to make sense.
  I think that the AIs we build will have complex, contextual values by default
  I agree, and I agree that properly compiling its values into a utility-function will be a challenge for the AI, like it is for humans. I did mention that’d we’d want to interfere on a young agent first, which would operate roughly as humans do.
  But once it hits the scary strongly-superintelligent level, solving its version of the alignment problem should be relatively trivial for it, and at that point there’s no reason not to self-modify into a wrapper-mind, if being a wrapper-mind will be more effective.
  (Though I’m not even saying it’ll “self-modify” by, like, directly rewriting its code or something. It may “self-modify” by consciously adopting a new ideology/philosophy, as humans do.)