Non-deceptive failures are easy to notice, but they’re not necessarily easy to eliminate—and if you don’t eliminate them, they’ll keep happening until some do slip through. I think I take them more seriously than you.
Non-deceptive failures are easy to notice, but they’re not necessarily easy to eliminate
I agree; I was trying to note this in my second paragraph, but I guess it was insufficiently clear.
I added the sentence “Being easy-to-study doesn’t imply easy-to-solve”.
I think I take them more seriously than you.
Seems too hard to tell based on this limited context. I think non-scheming failures are about 50% of the risk and probably should be about 50% of the effort of the AI safety-from-misalignment community. (I can see some arguments for scheming/deceptive alignment being more important to work on in advance, but it also might be that non-scheming is more tractable and a higher fraction of the risk in short timelines, so IDK overall.)