Hmm, I guess I try to say “long-term-consequentialist” when I mean caring about long-term consequences. I might have left out the “long-term” part by accident, or because I thought it was clear from context… (And also to make a snappier post title.)
Fair enough.
I do think there’s a meaningful notion of, let’s call it, “stereotypically consequentialist behavior” for both humans and AIs; long-term consequentialists tend to match it really well, and short-term consequentialists tend to match it less well.
I agree. I think TurnTrout’s approach is a plausible strategy for formalizing it. If we apply his approach to the long-term vs. short-term distinction, we can observe that the vast majority of trajectory rankings are long-term consequentialist, so the permuted variants of any given ranking are mostly long-term consequentialist too; therefore the power-seeking arguments don’t go through for short-term consequentialists.
I think the nature of the failure of the power-seeking arguments is ultimately different for short-term consequentialists than for non-consequentialists, though; for short-term consequentialists, it happens as a result of dropping the features that power helps you control, while for non-consequentialists, it happens as a result of valuing additional features beyond the ones you can control with power.
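(To make the “vast majority” claim above concrete, here’s a toy counting sketch of my own, not TurnTrout’s actual formalism, which as I understand it works with orbits of reward functions under environment symmetries. Model a trajectory ranking as a utility function from length-T binary trajectories to a small finite value set, and call it “short-term” if its value depends only on the first K steps; then short-term rankings are an exponentially tiny fraction of all rankings, so almost any permuted variant of a ranking is long-term consequentialist.)

```python
# Toy counting sketch (illustrative only; T, K, and VALUES are arbitrary choices).
T, K, VALUES = 4, 1, 3         # 4-step trajectories; "short-term" = cares only about step 1

num_trajectories = 2 ** T      # 16 distinct trajectories
num_prefixes = 2 ** K          # 2 distinct length-K prefixes

total_rankings = VALUES ** num_trajectories   # any assignment of values to trajectories
short_term_rankings = VALUES ** num_prefixes  # value is fixed by the first K steps alone

print(f"total rankings:      {total_rankings}")       # 43046721
print(f"short-term rankings: {short_term_rankings}")  # 9
print(f"fraction short-term: {short_term_rankings / total_rankings:.2e}")  # ~2.1e-07
```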
Have you written or read anything about how that might work? My theory is: (1) the world is complicated, (2) the AI needs to learn a giant vocabulary of abstract patterns (latent variables) in order to understand or do or want anything of significance in the world, (3) therefore it’s tricky to just write down an explicit utility function. The approach in “My corrigibility proposal sketch” gets around that by something like supervised-learning a way to express the utility function’s ingredients (e.g. “the humans will remain in control”, “I am being helpful”) in terms of these unlabeled latent variables in the world-model. That in turn requires labeled training data and OOD detection and various other details that seem hard to get exactly right, but are nevertheless our best bet. BTW, that stuff is not in the “My AGI threat model” post; I grew fonder of those ideas a few months afterwards. :)
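Here’s a minimal sketch of the kind of pipeline I have in mind; everything below (the logistic probe, the toy labels, the distance-based OOD rule) is a hypothetical stand-in for illustration, not the actual proposal. The only assumption is that the world-model exposes an unlabeled latent vector per situation, and we supervised-learn a probe that scores one utility-function ingredient, refusing to trust it off-distribution:

```python
import numpy as np

rng = np.random.default_rng(0)
D = 32                                   # dimensionality of the world-model's latent space

# Stand-in labeled data: latent vectors z with human-provided labels y in {0, 1}
# for one utility ingredient, e.g. "the humans will remain in control".
z_train = rng.normal(size=(500, D))
y_train = (z_train[:, 0] + 0.5 * z_train[:, 1] > 0).astype(float)   # toy ground truth

# Logistic-regression probe trained by plain gradient descent.
w, b = np.zeros(D), 0.0
for _ in range(2000):
    p = 1.0 / (1.0 + np.exp(-(z_train @ w + b)))
    w -= 0.1 * (z_train.T @ (p - y_train)) / len(y_train)
    b -= 0.1 * float(np.mean(p - y_train))

# Crude OOD check: distance from the training latents, in units of their spread.
mu, sigma = z_train.mean(axis=0), z_train.std(axis=0) + 1e-8

def humans_remain_in_control(z, ood_scale=1.5):
    """Probe score for one utility ingredient, plus a flag for out-of-distribution latents."""
    score = 1.0 / (1.0 + np.exp(-(z @ w + b)))
    in_distribution = np.linalg.norm((z - mu) / sigma) < ood_scale * np.sqrt(D)
    return score, in_distribution

score, trusted = humans_remain_in_control(rng.normal(size=D))     # familiar situation
score2, trusted2 = humans_remain_in_control(10 * np.ones(D))      # novel situation -> flagged
```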
Ah, I think we are in agreement then. I would also agree with using something like supervised learning to get the ingredients of the utility function. (Though I don’t yet know whether the ingredients would directly be the sorts of things you mention, or something more like “These are the objects in the world” + “Is each object a strawberry?” + etc.)
(I would also want to structurally force the world model to be more interpretable; e.g. one could require it to reason in terms of objects living in 3D space.)
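For what it’s worth, here’s a hypothetical sketch (my own illustration; all names are made up) of what that structural constraint plus object-level utility ingredients could look like: the world-model state is forced to be a list of objects with explicit 3D positions, and ingredients like “Is each object a strawberry?” become ordinary predicates over that state.

```python
from dataclasses import dataclass

@dataclass
class WorldObject:
    kind: str                              # e.g. "strawberry", "cup", "human"
    position: tuple[float, float, float]   # explicit 3D coordinates

@dataclass
class WorldState:
    objects: list[WorldObject]             # a list of discrete objects, not an opaque vector

def is_strawberry(obj: WorldObject) -> bool:
    """One candidate utility-function ingredient, as in the comment above."""
    return obj.kind == "strawberry"

def strawberries_near(state: WorldState, point, radius=0.1) -> int:
    """Another ingredient that only makes sense because 3D positions are explicit."""
    def near(obj):
        return sum((p - q) ** 2 for p, q in zip(obj.position, point)) <= radius ** 2
    return sum(1 for obj in state.objects if is_strawberry(obj) and near(obj))

state = WorldState(objects=[WorldObject("strawberry", (0.0, 0.0, 0.0)),
                            WorldObject("cup", (0.05, 0.0, 0.0))])
print(strawberries_near(state, (0.0, 0.0, 0.0)))   # -> 1
```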