Link? I don’t think we know how to use model-based agents to e.g. tile the world in diamonds even given unlimited compute, but I’m open to being wrong.
I’m not aware of any place where it’s written up; I’ve considered writing it up myself, because it seems like an important and underrated point. But the basic idea is: if you’ve got an accurate model of the system and a value function defined over the latent state of that model, then you can pick the policy that you expect to increase the true latent value (optimization), rather than the policy that increases the expected value computed from its observations (wireheading). Such an agent would have no interest in tampering with its own sense-data, because doing so would only interfere with its ability to optimize the real world.
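To make the distinction concrete, here is a toy sketch (my own construction, not from any particular paper): a world whose latent state is a diamond count, a "tamper" action that rigs the sensor, and two agents that differ only in whether their value function reads the observation or the modeled latent state.

```python
# Toy model of the optimization-vs-wireheading distinction.
# Latent state: number of diamonds. Two actions:
#   "mine"   -> +1 real diamond, sensor reports truthfully
#   "tamper" -> no new diamonds, but the sensor is rigged to report 100

def step(diamonds, action):
    """World model: returns (next_latent_state, observation)."""
    if action == "mine":
        return diamonds + 1, diamonds + 1   # honest sensor
    else:  # "tamper"
        return diamonds, 100                # rigged sensor

def choose(value_of_action):
    """Pick the action maximizing the given value function."""
    return max(["mine", "tamper"], key=value_of_action)

state = 0

# Wireheading agent: values what its sensor will report.
obs_agent = choose(lambda a: step(state, a)[1])

# Latent-value agent: values the modeled world state itself.
latent_agent = choose(lambda a: step(state, a)[0])

print(obs_agent)     # -> tamper (maximizes the observation)
print(latent_agent)  # -> mine (maximizes actual diamonds)
```

The only difference between the two agents is which index of the model's prediction their value function looks at; that one change is what makes sensor-tampering unattractive to the latent-value agent.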
The place where I encountered this idea was Learning What to Value (Daniel Dewey, 2010).