Honestly? I feel like this same set of problems gets re-solved a lot. I’m worried that it’s a sign of ill health for the field.
I think we understand certain technical aspects of corrigibility (indifference and CIRL), but have hit a brick wall in certain other aspects (things that require sophisticated “common sense” about AIs or humans to implement, philosophical problems about how to get an AI to solve philosophical problems). I think this is part of what leads to re-treading old ground when new people (or a person wanting to apply a new tool) try to work on AI safety.
On the other hand, I’m not sure we’ve exhausted Concrete Problems yet. Yes, the answer is often “just have sophisticated common sense,” but I think the value is in exploring the problems and generating elegant solutions so that we can deepen our understanding of value functions and agent behavior (like TurnTrout’s work on low-impact agents). In fact, Tom’s a co-author on a very good toy problems paper, many of which require a similar sort of one-off solution that still might advance our technical understanding of agents.
Yeah, unless I’m missing something, this is the solution to the “easy problem of wireheading” as discussed in Abram Demski’s Stable Pointers to Value II: Environmental Goals.
Still, I say kudos to the authors for making progress on exactly how to put that principle into practice.
Hey Steve,
Thanks for linking to Abram’s excellent blog post.
We should have pointed this out in the paper, but there is a simple correspondence between Abram’s terminology and ours:
Easy wireheading problem = reward function tampering
Hard wireheading problem = feedback tampering.
Our current-RF optimization corresponds to Abram’s observation-utility agent.
We also discuss the RF-input tampering problem and its solutions (sometimes called the delusion box problem), which I don’t think fit into Abram’s distinction.
Hey Charlie,
Thanks for bringing up these points. The intended audience is researchers more familiar with RL than the safety literature. Rather than try to modify the paper to everyone’s liking, let me just give a little intro / context for it here.
The paper is the culmination of a few years of work (previously described in e.g. my thesis and alignment paper). One of the main goals has been to understand whether it is possible to redeem RL from a safety viewpoint, or whether some rather different framework would be necessary to build safe AGI.
As a first step along this path, I tried to categorize problems with RL, and see which solutions applied to which categories. For this purpose, I found causal graphs valuable (thesis), and I later realized that causal influence diagrams (CIDs) provided an even better foundation. Any problem corresponds to an ‘undesired path’ in a CID, and basically all the solutions correspond to ways of getting rid of that path. As highlighted in the introduction of the paper, I now view this insight as one of the most useful ones.
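As a rough illustration of the ‘undesired path’ idea (a sketch, not taken from the paper), a diagram can be written as a plain adjacency map and searched for paths from the action to the reward that run through the reward function. The node names `rf_1`, `rf_2`, `action_1`, `state_2` and `reward_2` are hypothetical, and the two-step graph is a simplification that assumes the fix is to compute the objective from the current reward function rather than the future one.

```python
def all_paths(graph, start, goal, prefix=()):
    """Return every directed path from start to goal as a tuple of node names."""
    prefix = prefix + (start,)
    if start == goal:
        return [prefix]
    found = []
    for child in graph.get(start, ()):
        if child not in prefix:  # guard against cycles
            found.extend(all_paths(graph, child, goal, prefix))
    return found


def undesired_paths(graph):
    """Paths from the action to the reward that run through the future reward function."""
    return [p for p in all_paths(graph, "action_1", "reward_2") if "rf_2" in p]


# Hypothetical two-step diagram: the action can change both the proper state
# and the reward function that will later compute the reward.
tampering_graph = {
    "rf_1": ["rf_2"],
    "action_1": ["state_2", "rf_2"],
    "state_2": ["reward_2"],
    "rf_2": ["reward_2"],
}

# Same diagram, except the objective is computed by the current reward
# function rf_1, so the edge rf_2 -> reward_2 is gone (an assumed simplification).
current_rf_graph = {
    "rf_1": ["rf_2", "reward_2"],
    "action_1": ["state_2", "rf_2"],
    "state_2": ["reward_2"],
    "rf_2": [],
}

print(undesired_paths(tampering_graph))
print(undesired_paths(current_rf_graph))
```

In this toy graph the action can still change `rf_2`, but doing so no longer has any route to the quantity the agent is optimizing.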
Another important contribution of the paper is pinpointing which solution idea solves which type of reward tampering problem, and a discussion of how the solutions might fit together. I see this as a kind of stepping stone towards more empirical RL work in this area.
Third, the paper puts a fair bit of emphasis on giving brief but precise summaries of previous ideas in the safety literature, and may therefore serve as a kind of literature review. You are absolutely right that solutions to reward function tampering (often more loosely referred to as wireheading) have been around for quite some time. However, the explanations of these methods have been scattered across a number of papers, using a number of different frameworks and formalisms.
Tom
Sure. On the one hand, xkcd. On the other hand, if it works for you, that’s great and absolutely useful progress.
I’m a little worried about direct applicability to RL because the model is still not fully naturalized: actions that affect goals are neatly labeled and separated rather than being a messy subset of actions that affect the world. I guess this is another one of those cases where I think the “right” answer is “sophisticated common sense,” but an ad-hoc mostly-answer would still be useful conceptual progress.
Actually, I would argue that the model is naturalized in the relevant way.
When studying reward function tampering, for instance, the agent chooses actions from a set of available actions. These actions just affect the state of the environment, and somehow result in reward or not.
As a conceptual tool, we label part of the environment the “reward function”, and part of the environment the “proper state”. This is just to distinguish effects that we’d like the agent to use from effects that we don’t want the agent to use.
The current-RF solution doesn’t rely on this distinction; it only relies on query-access to the reward function (which you could easily give an embedded RL agent).
The neat thing is that when we look at the objective of the current-RF agent using the same conceptual labeling of parts of the state, we see exactly why it works: the causal paths from actions to reward that pass through the reward function have been removed.
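To make the contrast concrete, here is a minimal one-step sketch, assuming a toy environment that is not from the paper. The names `State`, `rf_bonus`, `step`, `standard_choice` and `current_rf_choice` are all hypothetical. The only difference between the two agents is which reward function they query when scoring a successor state.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class State:
    proper: int    # the "proper state", e.g. amount of useful work done
    rf_bonus: int  # the part of the environment that implements the reward function


ACTIONS = ("work", "tamper")


def reward(rf_bonus: int, proper: int) -> int:
    """Reward that the reward function encoded by rf_bonus assigns to a proper state."""
    return proper + rf_bonus


def step(state: State, action: str) -> State:
    """Toy dynamics: 'work' improves the proper state, 'tamper' rewrites the reward function."""
    if action == "work":
        return State(state.proper + 1, state.rf_bonus)
    if action == "tamper":
        return State(state.proper, state.rf_bonus + 10)
    raise ValueError(f"unknown action: {action}")


def standard_choice(state: State) -> str:
    """Score each successor with the reward function found in that successor."""
    def score(action: str) -> int:
        nxt = step(state, action)
        return reward(nxt.rf_bonus, nxt.proper)
    return max(ACTIONS, key=score)


def current_rf_choice(state: State) -> str:
    """Score each successor by querying the reward function as it is right now."""
    def score(action: str) -> int:
        nxt = step(state, action)
        return reward(state.rf_bonus, nxt.proper)
    return max(ACTIONS, key=score)


start = State(proper=0, rf_bonus=0)
print(standard_choice(start))    # tamper
print(current_rf_choice(start))  # work
```

Neither agent is told which actions count as tampering; the current-RF agent simply gains nothing from changing `rf_bonus`, because its score never reads the successor's reward function.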
Maybe the problem is getting everyone on the same page.
Yes, that is partly what we are trying to do here. By summarizing some of the “folklore” in the community, we’ll hopefully be able to get new members up to speed quicker.