The way you write this (especially the last sentence) makes me think that you see this attempt as close to the only one that makes sense to you atm. Which makes me curious:
Do you think that you are internally trying to approximate your own Uideal?
Do you think that you have ever made the decision (either implicitly or explicitly) not to evaluate all or most plans because you don’t trust your ability to do so for adversarial examples (as opposed to, for example, tractability issues)?
Can you think of concrete instances where you improved your own Eval?
Can you think of concrete instances where you thought you improved your own Eval but then regretted it later?
Do you think that your own changes to your Eval have been moving in the direction of your Uideal?