I was taking it as “solves” or “gets pretty close to solving”. Maybe that’s a misinterpretation on my part. What did you mean here?
No, that is not a misinterpretation: I do think that this research agenda has the potential to get pretty close to solving outer alignment. More specifically, if it is (practically) possible to solve outer alignment through some form of reward learning, then I think this research agenda will establish how that can be done (and prove that this method works), and if it isn’t possible, then I think this research agenda will produce a precise understanding of why that isn’t possible (which would in turn help to inform subsequent research). I don’t think this research agenda is the only way to solve outer alignment, but I think it is the most promising way to do it.