https://arxiv.org/abs/1712.05812
It’s directly about inverse reinforcement learning, but IRL should be strictly stronger than RLHF, since RLHF starts from less information. It seems incumbent on those who disagree to explain why throwing away information would somehow amount to enough of a normative assumption (contrary to every story about wishes).
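To make the “throwing away information” point concrete, here’s a toy sketch (entirely my framing, not the paper’s; the Trajectory fields and the pairing scheme are made up for illustration). The point is just that the comparison labels RLHF learns from are a many-to-one function of the demonstration data IRL learns from, so a normative assumption too weak to pin down preferences in the IRL setting can’t become sufficient in the RLHF one.

```python
# Toy illustration only: what each setup gets to condition on.
from dataclasses import dataclass

@dataclass
class Trajectory:
    states: list         # everything the demonstrator saw
    actions: list        # everything they did about it
    true_return: float   # stand-in scalar for how much they actually valued it

def irl_dataset(trajectories):
    """IRL conditions on the full demonstrations themselves."""
    return trajectories

def rlhf_dataset(trajectories):
    """RLHF keeps one bit per pair: which trajectory was preferred.
    That's a deterministic, many-to-one function of the IRL dataset,
    so it cannot carry more information about the demonstrator."""
    pairs = zip(trajectories[::2], trajectories[1::2])
    return [int(a.true_return > b.true_return) for a, b in pairs]
```

(Real RLHF pipelines compare model-generated samples rather than pairing up demonstrations like this; the toy only isolates the data-coarsening step.)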
I don’t see how any of it can be right. Getting one algorithm to output Spongebob wouldn’t cause the SI to watch Spongebob, and even a less silly claim in that vein would still be false. The Platonic agent would know the plan wouldn’t work, and thus wouldn’t do it.
Since no individual Platonic agent could do anything meaningful alone, and they plainly can’t communicate with each other, they can only coordinate by means of reflective decision theory. That’s fine; we’ll just assume that’s the obvious way for intelligent minds to behave. But then the SI works the same way: it knows the Platonic agents will reason that way, and per RDT it refuses to change its behavior in response to attempts to game the system. So none of this ever happens in the first place.
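Here’s a minimal way to write that last step down as a toy game (entirely my own construction; the payoff numbers are invented and only encode the qualitative claims). The SI evaluates each possible commitment against the best response it would provoke, in the RDT spirit of deciding as a predictable policy rather than reacting case by case:

```python
# Toy formalization of the "refuses to be gamed" step.

def si_payoff(si_gives_in, agents_game):
    # The SI only loses by rewarding a gaming attempt; ignoring one costs nothing.
    return -1 if (si_gives_in and agents_game) else 0

def agent_payoff(si_gives_in, agents_game):
    # Gaming pays only if the SI responds to it; otherwise it's wasted effort.
    if agents_game:
        return 1 if si_gives_in else -1
    return 0

def agents_best_response(si_gives_in):
    # The Platonic agents can predict the SI's commitment and optimize against it.
    return max([True, False], key=lambda game: agent_payoff(si_gives_in, game))

# The SI chooses its commitment knowing how the agents will respond to it.
si_commitment = max([True, False],
                    key=lambda give_in: si_payoff(give_in, agents_best_response(give_in)))

print(si_commitment, agents_best_response(si_commitment))  # False False: nobody games
```

The only stable outcome is (don’t give in, don’t game), which is just the “none of this ever happens” conclusion above.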
(This is without even considering the serious problems with assuming Platonic agents would share a goal to coordinate on. I don’t think I buy it. You can’t evolve a desire to come into existence, nor does an arbitrary goal seem to require it. Let me assure you, there can exist intelligent minds which don’t want worlds like ours to exist.)