Thanks for your comment, Charlie! A few things:
I appreciate you making the point about Boltzmann rationality. Indeed, I think this is where my lack of familiarity with actually implementing IRL systems begins to show. Would it be fair to say that, even with a model that accounts for the fact that humans aren't perfect, the system still assumes there is an ultimate human reward function? The error model would then just be another tool for helping the system get at that reward function: the system assumes humans don't always act according to this function, but that there is still some "ultimate" one they occasionally diverge from.
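To make sure I'm picturing the structure correctly, here is a minimal sketch in my own toy notation (not any particular library or implementation): under a Boltzmann-rational error model, the human's action probabilities are a softmax over the Q-values induced by a single latent reward function, and all the "human error" lives in the inverse-temperature parameter; Bayesian inference still produces a posterior over candidate reward functions, one of which is presumed to be the true one.

```python
import numpy as np

# Toy Bayesian IRL with a Boltzmann-rational error model (illustrative only).
# The key structural point: the likelihood is defined relative to a single
# latent reward function R; "human error" enters only through beta.

def boltzmann_likelihood(q_values, action, beta=1.0):
    """P(action | state, R): softmax over the Q-values induced by reward R."""
    logits = beta * np.asarray(q_values)
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return probs[action]

def posterior_over_rewards(candidate_q_tables, demos, prior, beta=1.0):
    """P(R | demos) ∝ P(demos | R) * P(R), over a finite set of candidate rewards.

    Each candidate reward is represented by its Q-table q[state, action];
    demos is a list of (state, action) pairs from the human.
    """
    posterior = np.array(prior, dtype=float)
    for i, q in enumerate(candidate_q_tables):
        for state, action in demos:
            posterior[i] *= boltzmann_likelihood(q[state], action, beta)
    return posterior / posterior.sum()
```

Even though no individual demonstration is assumed to be optimal, the output is still a distribution over reward functions, so the model presupposes that one of them is the "true" one the human is noisily optimizing.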
I think I buy the philosophy-versus-engineering distinction you mention. During this project I felt a bit like the philosopher, going through possible hypotheses and trying to evaluate them logically against the goals of alignment. At some point you definitely need to build on the ones you find promising, and indeed that's the next step for me. I suppose both mindsets are necessary: you need to build on initial proposals, but not just any proposals, rather the ones you assign a higher probability of getting you to your goal.
I agree with the point about metaethics, and I think we could even go beyond this. In giving rules for what good rules are, or for how to weigh past and future selves, and so on, one is also operating by certain principles of reasoning that themselves could use justification, and so on in a seemingly infinite regress. Intuitively, for practical purposes, we want to stop at some level, and I view the equilibrium point of Demski's proposal as essentially the level of meta-meta-...-meta-ethics at which further justification makes no meaningful difference. Demski discusses this more in his "Normativity" post in the Learning Normativity sequence.