It sounds like a lot of this disagreement is downstream of some other disagreements about the likely shape of minds & how planning is implemented in practice
Agreed.
effective search in reality depends on being able to draw upon and chain together things from a massive body of stored declarative & procedural knowledge
Sure, but I think this decouples into two components: a “general-purpose search” over “the world-model”. The world-model would be by far the more complex component, and the GPS a comparatively simple algorithm — but nonetheless a distinct, convergently-learned algorithm which codifies how the process of “drawing upon” the world-model proceeds. (Note that I’m not calling it “the planning shard” as I had before, or “planning heuristics” or whatever — I think this algorithm is roughly the same for all goals and all world-models.)
And in my model, this algorithm can’t natively access procedural knowledge, only the declarative knowledge explicitly represented in the world-model. It has to painstakingly reverse-engineer procedural knowledge and transfer it to the WM first, before it can properly make use of it. And value reflection is part of that, in my view.
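To gesture at the distinction concretely, here’s a rough toy sketch (entirely my own illustration, not a claim about real implementations; the names DECLARATIVE_WM, opaque_skill, and general_purpose_search are made up for the example). A simple breadth-first planner stands in for the GPS: it can only chain through transitions that are explicitly written into the world-model, while a learned policy it can execute but not inspect stands in for procedural knowledge.

```python
from collections import deque

# Declarative world-model: transitions explicitly represented, so the search
# can inspect and chain through them.
DECLARATIVE_WM = {
    "start":  {"walk": "field", "drive": "road"},
    "field":  {"walk": "river"},
    "road":   {"drive": "bridge"},
    "river":  {},
    "bridge": {"walk": "goal"},
}

def opaque_skill(state: str) -> str:
    """Procedural knowledge: a learned reflex/policy. The agent can run it,
    but the planner can't look inside to see where it leads ahead of time."""
    return {"start": "road", "road": "bridge", "bridge": "goal"}.get(state, state)

def general_purpose_search(start: str, goal: str) -> list[str] | None:
    """A goal-agnostic breadth-first planner. It only ever consults
    DECLARATIVE_WM; it has no way to plan *through* opaque_skill unless that
    skill's behaviour is first reverse-engineered into explicit transitions."""
    frontier = deque([(start, [])])
    visited = {start}
    while frontier:
        state, plan = frontier.popleft()
        if state == goal:
            return plan
        for action, nxt in DECLARATIVE_WM.get(state, {}).items():
            if nxt not in visited:
                visited.add(nxt)
                frontier.append((nxt, plan + [action]))
    return None

print(general_purpose_search("start", "goal"))  # ['drive', 'drive', 'walk']
```

The point of the toy is just that the same search routine works for any goal and any world-model you hand it, which is the sense in which I mean the GPS is a simple, goal-agnostic algorithm sitting on top of the (much more complex) world-model.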
And from that, in turn, I derive the differences between values-as-shards and values-explicitly-represented, and speculate about what happens once all values have been reverse-engineered and we hit the level of superintelligence at which flawlessly compiling them into a utility function is trivial for the AI.
In my model, there isn’t a distinction between “object-level shards” and “meta-level shards”.
I don’t think there’s a sharp distinction, either. In that context, though, those labels seemed to make sense.
I think that the AIs we build will have complex, contextual values by default
I agree, and I agree that properly compiling its values into a utility function will be a challenge for the AI, like it is for humans. I did mention that we’d want to intervene on a young agent first, one which would operate roughly as humans do.
But once it hits the scary strongly-superintelligent level, solving its version of the alignment problem should be relatively trivial for it, and at that point there’s no reason not to self-modify into a wrapper-mind, if being a wrapper-mind will be more effective.
(Though I’m not even saying it’ll “self-modify” by, like, directly rewriting its code or something. It may “self-modify” by consciously adopting a new ideology/philosophy, as humans do.)