After rereading the sequence and reflecting on this further, I disagree with your interpretation of the Reframing Impact concept of impact. The concept is “change in my ability to get what I want”, i.e. change in the true human utility function. This is a broad statement that does not specify how to measure “change”, in particular what it is measured with respect to (the baseline) or how to take the difference from the baseline (e.g. whether to apply absolute value). Your interpretation of this statement uses the previous state as a baseline and does not apply an absolute value to the difference. This is a specific and nonstandard instantiation of the impact concept, and the undesirable property you described does not hold for other instantiations—e.g. using a stepwise inaction baseline and an absolute value: Impact(s, a) = |E[V(s, a)] - E[V(s, noop)]|. So I don’t think it’s fair to argue based on this instantiation that it doesn’t make sense to regularize the RI notion of impact.
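For concreteness, here is a minimal sketch of that instantiation in code. The `value(state, action)` estimator is hypothetical: it stands in for E[V], the expected true human utility after taking the action in the state.

```python
def impact(state, action, value, noop="noop"):
    """Impact(s, a) = |E[V(s, a)] - E[V(s, noop)]|.

    `value(state, action)` is a hypothetical estimator of the expected true
    human utility after taking `action` in `state`; `noop` is the agent's
    do-nothing action, giving a stepwise inaction baseline.
    """
    return abs(value(state, action) - value(state, noop))
```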
I think that AUP-the-method and RR are also instantiations of the RI notion of impact. These methods can be seen as approximating the change in the true human utility function (which is usually unknown) by using some set of utility functions (e.g. random ones) to cover the possible outcomes that could be part of the true human utility function. Thus, they instantiate the idealized notion of impact using the actually available information.
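A sketch of how I understand that approximation, with a hypothetical list `aux_values` of auxiliary value estimators (e.g. for random utility functions) standing in for the unknown true human utility. This is in the spirit of AUP / RR, not either method's exact penalty:

```python
def auxiliary_impact(state, action, aux_values, noop="noop"):
    """Average absolute change across a set of auxiliary value functions.

    A sketch of the AUP / relative-reachability idea: since the true human
    utility is unknown, penalize change as measured by a set of stand-in
    value functions instead.
    """
    penalties = [abs(v(state, action) - v(state, noop)) for v in aux_values]
    return sum(penalties) / len(penalties)
```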
AU theory says that people feel impacted as new observations change their on-policy value estimate (so it’s the TD error). I agree with Rohin’s interpretation as I understand it.
However, AU theory is descriptive – it describes when and how we feel impacted, but not how to build agents which don’t impact us much. That’s what the rest of the sequence talked about.
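To make that descriptive claim concrete, here is a minimal sketch of "felt impact as TD error", where the reward and value estimates are hypothetical stand-ins for the human's own on-policy estimates:

```python
def felt_impact(reward, value_next, value_current, discount=0.99):
    """Magnitude of the TD error: |r + gamma * V(s') - V(s)|.

    A new observation feels impactful to the extent that it moves the
    on-policy estimate of how good the future will be.
    """
    td_error = reward + discount * value_next - value_current
    return abs(td_error)
```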
The thing that I believe (irrespective of whether RI says it or not) is:
“Humans find new information ‘impactful’ to themselves when it changes how good they expect their future to be.” (In practice, there’s a host of complications because humans are messy, e.g. uncertainty about how good the future is also tends to feel impactful.)
In particular, if humans had perfect beliefs and knew exactly what would happen at all times, no information could ever change how good they expect their future to be, and so nothing could ever be impactful.
Since this is tied to new information changing what you expect, it seems like the natural baseline is the previous state.
Separately, I also think that RI was trying to argue for this conclusion, but I’ll defer to Alex about what he was or wasn’t trying to claim / argue for.
I think the previous state is a natural baseline if you are interested in the total impact on the human from all sources. If you are interested in the impact on the human that is caused by the agent (where the agent is the source), the natural choice would be the stepwise inaction baseline (comparing to the agent doing nothing).
As an example, suppose I have an unpleasant ride on a crowded bus, where person X steps on my foot and person Y steals my wallet. The total impact on me would be computed relative to the previous state before I got on the bus, and would include both the injury to my foot and the loss of my wallet. The impact of person X on me would be computed relative to the stepwise inaction baseline where person X does nothing (but person Y still steals my wallet), and vice versa for person Y.
When we use impact as a regularizer, we are interested in the impact caused by the agent, so we use the stepwise inaction baseline. It wouldn’t make sense to use total impact as a regularizer, since it would penalize the agent for impact from all sources.
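A toy numerical version of the bus example (my own numbers, chosen arbitrarily): the stepped-on foot is worth -1 and the stolen wallet -5, and each baseline isolates a different source of impact.

```python
def utility(foot_stepped_on, wallet_stolen):
    # Arbitrary toy utilities: -1 for a stepped-on foot, -5 for a stolen wallet.
    return -1 * foot_stepped_on - 5 * wallet_stolen

before_bus = utility(False, False)   # previous state: 0
actual = utility(True, True)         # both events happen: -6

total_impact = abs(actual - before_bus)            # 6: impact from all sources
impact_of_x = abs(actual - utility(False, True))   # 1: X does nothing, Y still steals
impact_of_y = abs(actual - utility(True, False))   # 5: Y does nothing, X still steps
```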
If you are interested in the impact on the human that is caused by the agent (where the agent is the source), the natural choice would be the stepwise inaction baseline (comparing to the agent doing nothing).
To the extent that there is a natural choice (counterfactuals are hard), I think it would be “what the human expected the agent to do” (the same sort of reasoning that led to the previous state baseline).
This gives the same answer as the stepwise inaction baseline in your example (because usually we don’t expect a specific person to step on our feet or to steal our wallet).
An example where it gives a different answer is in driving. The stepwise inaction baseline says “impact is measured relative to all the other drivers going comatose”, so in the baseline state many accidents happen, and you get stuck in a huge traffic jam. Thus, all the other drivers are constantly having a huge impact on you by continuing to drive!
In contrast, the baseline of “what the human expected the agent to do” gets the intuitive answer—the human expected all the other drivers to drive normally, and so normal driving has ~zero impact, whereas if someone actually did fall comatose and cause an accident, that would be quite impactful.
EDIT: Tbc, I think this is the “natural choice” if you want to predict what humans would say is impactful; I don’t have a strong opinion on what the “natural choice” would be if you wanted to successfully prevent catastrophe via penalizing “impact”. (Though in this case the driving example still argues against stepwise inaction.)
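A toy version of the driving example (again with made-up numbers), showing how the two baselines disagree about the impact of everyone else driving normally:

```python
# Arbitrary toy utilities for how my drive goes in each scenario.
NORMAL_DRIVING = 0.0        # everyone drives as usual
DRIVERS_COMATOSE = -10.0    # pile-ups and a huge traffic jam

actual = NORMAL_DRIVING

# Stepwise inaction baseline: compare to the other drivers doing nothing.
impact_stepwise = abs(actual - DRIVERS_COMATOSE)   # 10: normal driving counts as huge impact

# "What I expected them to do" baseline: I expected normal driving.
impact_expected = abs(actual - NORMAL_DRIVING)     # 0: normal driving is ~no impact
```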
I certainly agree that there are problems with the stepwise inaction baseline and it’s probably not the final answer for impact penalization. I should have said that the inaction counterfactual is a natural choice, rather than specifically its stepwise form. Using the inaction baseline in the driving example compares to the other driver never leaving their garage (rather than falling asleep at the wheel). Of course, the inaction baseline has other issues (like offsetting), so I think it’s an open question how to design a baseline that satisfies all the criteria we consider sensible (and whether it’s even possible).
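To make the inaction vs stepwise-inaction distinction explicit, here is a sketch with a hypothetical environment interface (`env.reset()` returns a state, `env.step(state, action)` returns the next state; this is not the Gym API): the inaction baseline replays the whole episode with the agent doing nothing, while the stepwise version only replaces the current action.

```python
NOOP = "noop"

def inaction_baseline_state(env, t):
    # The agent does nothing from the episode start
    # (the other driver never leaves their garage).
    state = env.reset()
    for _ in range(t):
        state = env.step(state, NOOP)
    return state

def stepwise_inaction_baseline_state(env, history, t):
    # The agent acts as it actually did up to now and only does nothing
    # at the current step (the driver falls comatose at the wheel).
    state = env.reset()
    for past_action in history[:t]:
        state = env.step(state, past_action)
    return state
```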
I agree that counterfactuals are hard, but I’m not sure that difficulty can be avoided. Your baseline of “what the human expected the agent to do” is also a counterfactual, since you need to model what would have happened if the world unfolded as expected. It also requires a lot of information from the human, which is subjective and may be hard to elicit. What a human expected to happen in a given situation may not even be well-defined if they have internal disagreement—e.g. even if I feel surprised by someone’s behavior, there is often a voice in my head saying “this was actually predictable from their past behavior so I should have known better”. On the other hand, since (as you mentioned) this is not intended as a baseline for impact penalization, maybe it doesn’t need to be well-defined or efficient in terms of human input, and it is a good source of intuition on what feels impactful to humans.
Using the inaction baseline in the driving example compares to the other driver never leaving their garage (rather than falling asleep at the wheel).
Maybe? How do you decide where to start the inaction baseline? In RL the episode start is an obvious choice, but it’s not clear how to apply that for humans.
(I only have this objection when trying to explain what “impact” means to humans; it seems fine in the RL setting. I do think we’ll probably stop relying on the episode abstraction eventually, so we would eventually need to not rely on it ourselves, but plausibly that can be dealt with in the future.)
Also, under this inaction baseline, the roads are perpetually empty, and so you’re always feeling impact from the fact that you can’t zoom down the road at 120 mph, which seems wrong.
I agree that counterfactuals are hard, but I’m not sure that difficulty can be avoided. Your baseline of “what the human expected the agent to do” is also a counterfactual, since you need to model what would have happened if the world unfolded as expected.
Sorry, what I meant to imply was “baselines are counterfactuals, and counterfactuals are hard, so maybe no ‘natural’ baseline exists”. I certainly agree that my baseline is a counterfactual.
On the other hand, since (as you mentioned) this is not intended as a baseline for impact penalization, maybe it doesn’t need to be well-defined or efficient in terms of human input, and it is a good source of intuition on what feels impactful to humans.
Yes, that’s my main point. I agree that there’s no clear way to take my baseline and implement it in code, and that it depends on fuzzy concepts that don’t always apply (even when interpreted by humans).