I’m focusing on the code in Appendix B.
What happens when `self.diamondShard`’s assessment of whether some consequences contain diamonds differs from ours? (Assume the agent’s world model is especially good.)
The same thing that happens when the assessment matches ours: the agent is more likely to take that plan, all else equal.
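To make that answer concrete, here is a minimal Python sketch, not the post’s actual Appendix B code. All names in it (`choose_plan`, `world_model`, `contains_diamonds`) are illustrative assumptions. The point it shows: the agent scores plans by its own shard’s assessment of its own world model’s predictions, so whether that assessment agrees with ours never enters the computation.

```python
from typing import Callable, Dict, List

Plan = str
Consequences = Dict[str, bool]

def choose_plan(
    plans: List[Plan],
    world_model: Callable[[Plan], Consequences],        # the agent's (very good) world model
    contains_diamonds: Callable[[Consequences], bool],  # the diamond shard's own assessment
) -> Plan:
    # All else equal, a plan whose predicted consequences the shard assesses
    # as containing diamonds receives a higher bid and is more likely chosen.
    def bid(plan: Plan) -> float:
        return 1.0 if contains_diamonds(world_model(plan)) else 0.0
    return max(plans, key=bid)

# Example (hypothetical): the shard's predicate, not ours, drives the choice.
world_model = lambda plan: {"diamond_present": plan == "mine_asteroid"}
shard_assessment = lambda c: c["diamond_present"]
print(choose_plan(["do_nothing", "mine_asteroid"], world_model, shard_assessment))
# -> "mine_asteroid"
```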