I don’t know how you would describe “true niceness”, but I think it’s neither of the above.
Agreed. I think “true niceness” is something like: act to maximize people’s preferences, while also taking into account the fact that people often have a preference for their preferences to continue evolving, and for resolving any of their preferences that contradict each other in painful ways.
Niceness is natural for agents of similar strengths because lots of values point towards the same “nice” behavior. But when you’re much more powerful than anyone else, the target becomes much smaller, right?
Depends on the specifics, I think.
As an intuition pump, imagine the kindest, wisest person that you know. Suppose that that person was somehow boosted into a superintelligence and became the most powerful entity in the world.
Now, it’s certainly possible that for any human in that situation, it’s inevitable that evolutionary drives optimized for exploiting power would kick in and corrupt them… but let’s further suppose that the process of turning them into a superintelligence also somehow removed those drives, and instead made the person experience a permanent state of love towards everybody.
I think it’s at least plausible that the person would then continue to exhibit “true niceness” towards everyone, despite being that much more powerful than anyone else.
So at least if the agent had started out at a similar power level as everyone else—or if it at least simulates the kinds of agents that did—it might retain that motivation when boosted to a higher level of power.
Do you have reasons to expect “slight RL on niceness” to give you “true niceness” as opposed to a kind of pseudo-niceness?
I don’t have a strong reason to expect that it’d happen automatically, but if people are thinking about the best ways to actually make the AI have “true niceness”, then possibly! That’s my hope, at least.
I would be scared of an AI that had been trained to be nice if there were no way to check whether, once it got more powerful, it would try to modify people’s preferences or prevent them from changing.
Me too!