I think it's hard to distinguish a lack of metaphilosophical sophistication from having different values. The (hypothetical) angsty teen says that they want to kill everyone. If they had the power to, they would. How do we tell whether they are mistaken about their utility function, or just have killing everyone as their utility function? If they clearly state some utility function that depends on some real-world parameter, and they are mistaken about that parameter, then we could know. Say they want to kill everyone if and only if the moon is made of green cheese. They are confident that the moon is made of green cheese, so they don't even bother checking before killing everyone.
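A toy sketch of that distinction (entirely my own illustration, hypothetical Python, not anything from the original post): an agent whose stated utility depends on a world parameter changes its choice when its belief about that parameter is corrected, while an agent whose utility rewards the act directly does not, no matter what we show it.

```python
# Toy illustration (my own, not from the comment): being mistaken about a
# world parameter vs. simply having a different utility function.

ACTIONS = ["kill everyone", "do nothing"]

def utility_conditional(action, moon_is_green_cheese):
    # Wants to kill everyone if and only if the moon is made of green cheese.
    if action == "kill everyone":
        return 1.0 if moon_is_green_cheese else -1.0
    return 0.0

def utility_terminal(action, moon_is_green_cheese):
    # Has killing everyone as a terminal value; the world parameter is irrelevant.
    return 1.0 if action == "kill everyone" else 0.0

# Correcting the first agent's false belief about the moon changes its choice;
# no observation about the moon changes the second agent's choice.
for moon_is_green_cheese in (True, False):
    print(
        moon_is_green_cheese,
        max(ACTIONS, key=lambda a: utility_conditional(a, moon_is_green_cheese)),
        max(ACTIONS, key=lambda a: utility_terminal(a, moon_is_green_cheese)),
    )
```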
Alternatively, we could look at whether they could be persuaded not to kill everyone, but some people can be persuaded of all sorts of things. The fact that you could be persuaded to do X says more about the persuasive ability of the persuader and the vulnerabilities of your brain than about whether you wanted X.
Alternatively, we could look at whether they will regret it later. If I self-modify into a paperclip maximiser, I won't regret it, because that action maximised paperclips. However, a hypothetical self who hadn't been modified would regret it.
Suppose there are some nanobots in my brain that will slowly rewire me into a paperclip maximiser. I decide to remove them. The real me doesn't regret this decision; the hypothetical me who wasn't modified does. Now suppose there is a part of my brain that will make me power-hungry and self-centered once I become sufficiently powerful. I remove it. Which case is this? Am I damaging my alignment or preventing it from being damaged?
We don't understand the concept of a philosophical mistake well enough to say whether someone is making one. It seems likely that, to the extent that humans have utility functions at all, some humans have utility functions that favour killing most humans.
"who almost certainly care about the future well-being of humanity"

This claim is mistaken. I think that only a relatively small proportion of humans care about the future well-being of humanity in any way similar to what those words mean to a modern rationalist.
To a rationalist, “future wellbeing of humanity” might mean a superintelligent AI filling the universe with simulated human minds.
To a random modern first-world person, it might mean a fairly utopian "sustainable" future, full of renewable energy, electric cars, etc.
To a North Sentinel Islander, who might have little idea that any humans beyond their tribe exist, it might mean several years of good weather and rich harvests.
To a 10th-century monk, it might mean that judgement day comes soon, and that all the righteous souls go to heaven.
To a barbarian warlord, it might mean that their tribe conquers many other tribes.
The only sensible definition of "care about the future of humanity" that covers all these cases is that their utility function has some term relating to things happening to some humans. Their terminal values reference some humans in some way, as opposed to a paperclip maximiser's, which treat humans as entirely instrumental.