Humans do display many alignment properties, and unlocking that mechanistic understanding is 1,000x more informative than other methods. Though this may not be worth arguing until you’ve read the actual posts showing the mechanistic understandings (the genome post and future ones), and we could argue about specifics then?
If you’re convinced by them, then you’ll understand the reaction of “Fuck, we’ve been wasting so much time, and studying humans makes so much sense” described in this post (e.g. TurnTrout’s idea on corrigibility and his statement “I wrote this post as someone who previously needed to read it.”). What I’m saying is that it doesn’t make sense for me to argue “you should feel this way now, before being convinced of specific mechanistic understandings.”
That makes sense. I mean, if you’ve found some good results that others have missed, then it may be very worthwhile. I’m just not sure what they look like.
Link? I don’t think we know how to use model-based agents to e.g. tile the world in diamonds even given unlimited compute, but I’m open to being wrong.
I’m not aware of any place where it’s written up; I’ve considered writing it up myself, because it seems like an important and underrated point. But basically the idea is: if you’ve got an accurate model of the system and a value function defined over the latent state of that model, then you can pick a policy that you expect to increase the true latent value (optimization), rather than a policy that increases the expected value computed from its observations (wireheading). Such an agent would not be interested in interfering with its own sense-data, because that would interfere with its ability to optimize the real world.
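A minimal toy sketch of the distinction (all names, dynamics, and numbers here are hypothetical illustrations, not anyone’s actual proposal): an agent with an accurate model can score candidate actions either by the predicted latent state or by the predicted observation, and only the latter is tempted to tamper with its sensors.

```python
# Toy illustration: latent-state value (optimization) vs. observation value
# (wireheading). Everything here is a made-up minimal example.

from dataclasses import dataclass, replace

@dataclass(frozen=True)
class State:
    diamonds: int         # the true latent quantity we care about
    sensor_hacked: bool   # whether the agent has tampered with its camera

def model_step(state: State, action: str) -> State:
    """Accurate world model: predicts the next latent state."""
    if action == "mine":
        return replace(state, diamonds=state.diamonds + 1)
    if action == "hack_sensor":
        return replace(state, sensor_hacked=True)
    return state  # "noop"

def observe(state: State) -> int:
    """Observation channel: a hacked sensor always reports a huge count."""
    return 10**6 if state.sensor_hacked else state.diamonds

def latent_value(state: State) -> int:
    """Value as a function of the model's latent state."""
    return state.diamonds

def observed_value(state: State) -> int:
    """Value as a function of the agent's observations."""
    return observe(state)

state = State(diamonds=0, sensor_hacked=False)
actions = ["mine", "hack_sensor", "noop"]

# One-step lookahead through the (accurate) model with each value function:
print(max(actions, key=lambda a: latent_value(model_step(state, a))))
# -> "mine": hacking the sensor doesn't change the latent diamond count
print(max(actions, key=lambda a: observed_value(model_step(state, a))))
# -> "hack_sensor": tampering maximizes what the agent *sees*
```

The latent-value agent has no incentive to touch its sensors, because its value function is computed from the model’s state, not from what the camera reports.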
I don’t think we know how to write an accurate model of the universe together with a function computing the diamonds in its latent state, even given infinite compute, so I don’t think this can be used to solve the diamond-tiling problem.
The place where I encountered this idea was “Learning What to Value” (Daniel Dewey, 2010).
“Reward Tampering Problems and Solutions in Reinforcement Learning” describes how to do what you outlined.