A dangerous intuition pump here would be something like, “If you take a human who was trained really hard in childhood to have faith in God and show epistemic deference to the Bible, and inspecting the internal contents of their thought at age 20 showed that they still had great faith, if you kept amping up that human’s intelligence their epistemology would at some point explode.”
Yes, a value grounded in a factual error will get blown up by better epistemics, just as “be uncertain about the human’s goals” will get blown up by your beliefs getting their entropy deflated to zero by the good ole process we call “learning about reality.” But insofar as corrigibility is “chill out and just do some good stuff without contorting 4D spacetime into the perfect shape or whatever”, there are versions of that which don’t automatically get blown up by reality when you get smarter. As far as I can tell, some humans are living embodiments of the latter. I have some “benevolent libertarian” values pushing me toward Pareto-improving everyone’s resource counts and letting them do as they will with their compute budgets. What’s supposed to blow that one up?
that in real life if we were faced with a very bizarre alien we would be unlikely to want to defer to it. Our lack of scalable desire to defer in all ways to an extremely bizarre alien that ate babies, is not something that you could fix just by giving us an emotion of great deference or respect toward that very bizarre alien. We would have our own thought processes that were unlike its thought processes, and if we scaled up our intelligence and reflection to further see the consequences implied by our own thought processes, they wouldn’t imply deference to the alien even if we had great respect toward it and had been trained hard in childhood to act corrigibly towards it.
This paragraph as a whole seems to make a lot of unsupported-to-me claims and seemingly equivocates between the two bolded claims, which are quite different. The first is that we (as adult humans with relatively well-entrenched values) would not want to defer to a strange alien. I agree.
The second is that we wouldn’t want to defer “even if we had great respect toward it and had been trained hard in childhood to act corrigibly towards it.” I don’t see why you believe that. Perhaps if we were otherwise socialized normally, we would end up unendorsing that value and not deferring? But I conjecture that if a person weren’t raised with normal cultural influences, you could probably brainwash them into being aligned baby-eaters through reward shaping with brain stimulation reward.
Acting corrigibly towards a baby-eating virtue ethicist when you are a utilitarian is an equally weird shape for a decision theory.
A utilitarian? Like, as Thomas Kwa asked, what are the type signatures of the utility functions you’re imagining the AI to have? Your comment makes more sense to me if I imagine the utility function is computed over “conventional” objects-of-value.
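To make the type-signature question a bit more concrete, here is a minimal sketch of two candidate signatures. Every name in it (`WorldState`, `History`, `total_resources`, `never_worse_off`) is a hypothetical stand-in of mine, not something anyone in this thread has proposed, and the resource-count example is only loosely modeled on the "Pareto-improving" value mentioned earlier.

```python
from typing import Callable, Sequence

# Hypothetical stand-ins for illustration only.
WorldState = dict[str, float]      # e.g. one resource count per agent
History = Sequence[WorldState]     # a trajectory of world states over time

# Candidate 1: utility computed over a single "conventional" outcome.
StateUtility = Callable[[WorldState], float]
# Candidate 2: utility computed over an entire world-history.
HistoryUtility = Callable[[History], float]


def total_resources(state: WorldState) -> float:
    """A StateUtility: scores an end state by the sum of resource counts."""
    return sum(state.values())


def never_worse_off(history: History) -> float:
    """A HistoryUtility: 1.0 if no agent's count ever drops between
    consecutive states, else 0.0. A missing agent counts as holding 0."""
    for before, after in zip(history, history[1:]):
        if any(after.get(agent, 0.0) < amount for agent, amount in before.items()):
            return 0.0
    return 1.0
```

The two functions can disagree. A history that ends with a large total but passes through a step where someone loses resources scores well under the first signature and badly under the second, so it matters which kind of object the "utilitarian" is assumed to be scoring.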