Nice post.

> the simplistic view that IRL agents hold about ground truth in human values (i.e. the human behavior they’re observing is always perfectly displaying the values)
IRL typically involves an error model—a model of how humans make errors. If you’ve ever seen the phrase “Boltzmann-rational” in an IRL paper, it’s the assumption that humans most often do the best thing but can sometimes do arbitrarily bad things (just with an exponentially decreasing probability).
This is still simplistic, but it’s simplistic on a higher level :P
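Concretely, the standard Boltzmann-rational assumption looks something like the following (a toy sketch only; the β value and the example rewards are made up for illustration):

```python
import numpy as np

def boltzmann_action_probs(rewards, beta=2.0):
    """P(action) ∝ exp(beta * reward): the best action is the most likely,
    but every action keeps nonzero probability, decaying exponentially
    as its reward gets worse."""
    rewards = np.asarray(rewards, dtype=float)
    logits = beta * rewards
    logits -= logits.max()  # for numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()

# Toy example: three actions with rewards 1.0, 0.5, and -3.0.
# The clearly bad action still gets picked occasionally.
print(boltzmann_action_probs([1.0, 0.5, -3.0]))
```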
If you haven’t read Reducing Goodhart, it’s pretty related to the topic of this post.
Ultimately I’m not satisfied with any proposals we have so far. There’s sort of a philosophy versus engineering culture difference, where in philosophy we’d want to hoard all of these unsatisfying proposals, and occasionally take them out of their drawer and look at them again with fresh eyes, while in engineering the intuition would be that the effort is better spent looking for ways to make progress towards new and different ideas.
I think there’s a divide here between implementing ethics, and implementing meta-ethics. E.g. trying to give rules for how to weight your past and future selves, vs. trying to give rules for what good rules are. When in doubt, shift gears towards implementing metaethics: it’s cheaper because we don’t have the time to write down a complete ethics for an AI to follow, it’s necessary because we can’t write down a complete ethics for an AI to follow, and it’s unavoidable because AIs in the real world will naturally do meta-ethics.
To expand on that last point—a sufficiently clever AI operating in the real world will notice that it itself is part of the real world. Actions like modifying itself are on the table, and have meta-ethical implications. This simultaneously makes it hard to prove convergence for any real-world system, while also making it seem likely that all sufficiently clever AIs in the real world will converge to a state that’s stable under consideration of self-modifying actions.
Thanks for your comment, Charlie! A few things:

I appreciate you making the point about Boltzmann rationality. Indeed, I think this is where my lack of familiarity with actually implementing IRL systems begins to show. Would it be fair to claim that, even with a model taking into account the fact that humans aren’t perfect, it still assumes that there is an ultimate human reward function? The error model would then just be another tool to help the system get at this reward function: the system assumes that humans don’t act according to this function all the time, but that there is still some “ultimate” one that they occasionally diverge from.
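To make sure I’m reading it the same way, here is a toy sketch of what I have in mind (the candidate reward functions, demonstrations, and β are invented purely for illustration): the Boltzmann error model reshapes the likelihood, but the hypothesis space is still “which single reward function is the true one.”

```python
import numpy as np

# Hypothesis space: each candidate is a guess at "the" human reward function
# over three possible actions (values are illustrative).
candidate_rewards = {
    "R1": np.array([1.0, 0.5, -3.0]),
    "R2": np.array([0.0, 1.0, 0.5]),
}
beta = 2.0  # Boltzmann rationality parameter (illustrative)

def likelihood(action, rewards):
    # Boltzmann-rational error model: P(action | R) ∝ exp(beta * R[action])
    logits = beta * rewards
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return probs[action]

observed_actions = [0, 0, 1]  # toy human demonstrations

# Bayesian update over which candidate is the "ultimate" reward function.
posterior = {name: 1.0 / len(candidate_rewards) for name in candidate_rewards}
for a in observed_actions:
    for name, R in candidate_rewards.items():
        posterior[name] *= likelihood(a, R)
total = sum(posterior.values())
posterior = {name: p / total for name, p in posterior.items()}

# The error model tolerates "mistakes" in the demonstrations, but the system
# still ends up with a belief over a single true reward function.
print(posterior)
```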
I think I buy the philosophy versus engineering distinction you mention. During this project I felt a bit like the philosopher, going through possible hypotheses and trying to logically evaluate their content with respect to the goals of alignment. I think at some point you definitely need to build upon the ones you find promising, and indeed that’s the next step for me. I suppose both mindsets are necessary—you need to build upon initial proposals, but seemingly not just any proposals, rather ones to which you assign a higher probability of getting to your goal.
I agree with the point about metaethics, and I think we could even go beyond this. In giving rules for what good rules are, or how to weigh past and future selves, and so on, one is also operating by certain principles of reasoning that could themselves use justification; those in turn could use justification, in a seemingly infinite regress. Intuitively, for practical purposes, we do want to stop at some level, and I view the equilibrium point of Demski’s proposal as essentially the “level” of meta-meta...-meta ethics at which further justification makes no meaningful difference. Demski talks about this more in his “Normativity” post in the Learning Normativity sequence.