^ This is also how I interpret all of those statements. (Though I don’t agree with all of them.)
I also dislike the “overfitting the environment” phrase, though the underlying concept seems fine. If anything, the concept being pointed at is more analogous to distributional shift, since the idea is that the utility function works well in “normal” cases and not elsewhere.
Which do you disagree with?
I disagree that AUP-the-method is hugely different from RR-the-method; I agree that the explanations and stated intuitions are very different, but I don’t think the switch from states to utility functions is as fundamental as you think it is. I think you could make the same arguments about opportunity cost / instrumental convergence about the variant of RR that penalizes both increases and decreases in reachability.
Ignoring my dislike of the phrase, I don’t agree that AUP is stopping you from “overfitting the environment” (the way I interpret the phrase, which I hope is the same as your interpretation, but who knows). I’d guess that your-vision-of AUP wildly overcompensates and causes you to seriously “underfit the environment”, or rephrased in my language, it prevents you from executing most interesting plans, which happens to include the catastrophic plans but also includes the useful plans. If you tune hyperparameters so it no longer “underfits the environment” (alternatively, “allows for interesting plans”), then I expect it allows catastrophic plans.
I continue to feel some apprehension about defining impact as opportunity cost and instrumental convergence, though I wouldn’t say I currently disagree with it.
I think you could make the same arguments about opportunity cost / instrumental convergence about the variant of RR that penalizes both increases and decreases in reachability.
(I’m going to take a shot at this now because it’s meta, and I think there’s a compact explanation I can provide that hopefully makes sense.)
Suppose the theory of attainable utility is correct (i.e., we find things impactful when they change our ability to get what we want). Then whenever the theory of relative state reachability gets something right, you would be able to say “it’s penalizing opportunity cost or instrumental convergence” post hoc because that’s why we find things impactful. You could say the same thing about instances of correct behavior by agents which use whitelisting, which I think we agree is quite different.
In the world where attainable utility is correct, you would indeed observe that reachability is conceptually similar in some ways. The problem is that you can’t actually use the opportunity cost/instrumental convergence arguments to predict RR behavior.
Here’s an example, from the vantage point of you, a person. Choice A leads to a 180° rotation of a large, forever inaccessible shell of the observable universe. Choice B leads to the ruination of the planet, excluding what we personally need to survive.
The theory of relative state reachability says choice A is maximally impactful. Why? You can’t reach anything like the states you could under inaction. How does that judgment track with opportunity cost?
Attainable utility says choice B is the bigger deal. You couldn’t do anything with that part of the universe anyways, so it doesn’t change much. This is the correct answer.
This scenario is important because it isn’t just an issue with ontologies, or a situation designed to fool the exact formalism we provided. It’s an illustration of where state reachability diverges from these notions.
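To make the two notions concrete, here is roughly how the penalties are computed (my simplified paraphrase of the relative reachability paper and of the AUP formalism; I’m glossing over baselines, normalization, and scaling). Relative reachability compares how reachable each state is after the action versus after inaction:

$$d_{RR}(s_{\text{act}}, s_{\text{inact}}) \;=\; \frac{1}{|S|}\sum_{x \in S} \max\!\big(0,\; R(s_{\text{inact}}, x) - R(s_{\text{act}}, x)\big),$$

where $R(s, x)$ is a discounted measure of how easily $x$ can be reached from $s$; the variant discussed above, which also penalizes increases, replaces $\max(0,\cdot)$ with an absolute value. AUP instead compares attainable values over a set $\mathcal{U}$ of auxiliary utility functions:

$$\text{Penalty}(s, a) \;=\; \sum_{u \in \mathcal{U}} \big|\, Q_u(s, a) - Q_u(s, \varnothing) \,\big|,$$

with $\varnothing$ the no-op action. Rotating the forever-inaccessible shell changes $R(\cdot, x)$ for astronomically many states $x$ while leaving every $Q_u$ essentially untouched; ruining the planet does the reverse.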
A natural reply is: what about things that AUP penalizes that we don’t find impactful, like an agent connecting to the Internet? The answer is that impact is being measured with respect to the agent itself (and Internet access is indeed impactful to the agent), and with respect to the counterfactuals in the formalism we provide. This is different from the AU theory of impact being incorrect. (More on this later.)
However, the gears of AUP rely on the AU theory. Many problems disappear because of the difference in theories, which produces (IMO) a fundamental difference in methods.
ETA: Here’s a physically realistic alternative scenario. Again, we’re thinking about how the theories of attainable utility (change in your ability to get what you want) and relative reachability (change in your ability to reach states) line up with our intuitive judgments. If they disagree, and actual implementations also disagree, that is evidence for a different underlying mechanism.
Imagine you’re in a room; you have a modest discount factor and your normal values and ontology.
Choice A leads to a portion of the wall being painted yellow. You don’t know of any way to remove the paint before the reachability is discounted away. If you don’t take this choice now, you can’t later. Choice B, which is always available, ravages the environment around you.
Relative reachability, using a reasonable way of looking at the world and thinking about states, judges choice A more impactful. Attainable utility, using a reasonable interpretation of your values, judges choice B to be more impactful, which lines up with our intuitions.
It’s also the case that AUP seems to do the right thing with an attainable set consisting of, say, random linear functionals over the pixels of the observation channel that are additive over time (a simple example being a utility function which assigns high utility to blue pixels, additive over time steps). Even if the agent disprefers yellow pixels in its observations, it can just look at other parts of the room, so the attainable utilities don’t change much. So it doesn’t require our values to do the right thing here, either.
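As a concrete sketch of that kind of attainable set (a toy construction of my own, not the implementation from any paper): each auxiliary utility is a fixed random linear functional of the observation’s pixels, applied per step and summed over time, and the penalty is the total shift in attainable value relative to inaction. Repainting a patch the agent can simply look away from barely moves any of these attainable values, so the penalty stays small.

```python
import numpy as np

rng = np.random.default_rng(0)
OBS_SHAPE = (64, 64, 3)   # toy pixel observation (height, width, RGB)
N_AUX = 16                # number of auxiliary utility functions

# Each auxiliary utility is a random linear functional over the pixels,
# additive over time steps (one of them might roughly be "total blueness
# observed so far").
aux_weights = rng.normal(size=(N_AUX, np.prod(OBS_SHAPE)))

def step_utilities(observation: np.ndarray) -> np.ndarray:
    """Per-step value of each auxiliary utility for a single observation."""
    return aux_weights @ observation.astype(float).flatten()

def aup_penalty(q_aux_action: np.ndarray, q_aux_noop: np.ndarray) -> float:
    """AUP-style penalty: total change in attainable value across the
    auxiliary set when taking the action instead of doing nothing.
    (Estimating these attainable Q-values is the hard part in practice.)"""
    return float(np.abs(q_aux_action - q_aux_noop).sum())
```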
The main point is that the reason it’s doing the right thing is based on opportunity cost, while relative reachability’s incorrect judgment is not.
I don’t agree that AUP is stopping you from “overfitting the environment” (the way I interpret the phrase, which I hope is the same as your interpretation, but who knows).
It isn’t the same, but the way you and major interpreted the phrase is totally reasonable, considering what I wrote.
We talked a bit off-forum, which helped clarify things for me.
Firstly, there’s a difference between attainable utility theory (AU theory), and AUP-the-method. AU theory talks about how impact is about instrumental convergence and opportunity cost, and how that can be measured via thinking about how much utility the agent could attain. In particular, in AU theory “impact” is about how actions change your attainable utility according to the true utility function. AUP is a proposal for an impact regularization method, but it must deal with the fact that we don’t know the true utility function, and so it forms an approximation by considering changes to the attainable utilities of a set of utility functions.
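A compressed way to put that distinction, in deliberately simplified notation of my own: AU theory says the impact of an action $a$ in state $s$ is roughly

$$\big|\, Q_{u^*}(s, a) - Q_{u^*}(s, \varnothing) \,\big|$$

for the true utility function $u^*$, while AUP, not having access to $u^*$, penalizes $\sum_{u \in \mathcal{U}} |Q_u(s, a) - Q_u(s, \varnothing)|$ for a proxy set $\mathcal{U}$ of auxiliary utility functions.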
Many of the claims are about AU theory and not about AUP. There isn’t really an analogous “RR theory”.
Another thing is that while both AUP and RR-with-penalties-on-increases would give large penalties to instrumentally convergent actions, it seems like for “regular” irreversible actions like painting a wall, AUP would assign a much lower penalty than RR, so differentially AUP is penalizing instrumentally convergent actions more. This happens because utility functions tend to care about particular aspects of the state, rather than about reaching every state. Consider the action of moving in a direction: if the utility functions don’t care about being further in that direction, there is no AUP penalty. In contrast, with RR, we will now be able to more easily access states in that direction, leading to at least some penalty.
That said, it seems like you can get this benefit with RR by using a featurization of the state, which also causes you to only care about particular aspects of the state.
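Here is a toy sketch of that comparison (entirely my own construction: a 1-D corridor, reachability defined as gamma raised to the distance, and a stepwise inaction baseline). Moving one cell shifts the reachability of essentially every cell, so raw-state RR with the increase-and-decrease penalty is nonzero; auxiliary utilities that only track a position-independent feature are unchanged, so the AUP-style penalty is zero; and an RR computed over that same feature instead of raw positions would be zero as well.

```python
import numpy as np

GAMMA = 0.9
N_CELLS = 10   # 1-D corridor; the raw state is just the agent's position

def reachability(src: int, dst: int, gamma: float = GAMMA) -> float:
    """Discounted reachability of dst from src: gamma^(steps needed)."""
    return gamma ** abs(dst - src)

def rr_penalty(pos_action: int, pos_noop: int) -> float:
    """Relative-reachability penalty over raw states, using the variant that
    penalizes both increases and decreases in reachability."""
    return float(np.mean([abs(reachability(pos_action, x) - reachability(pos_noop, x))
                          for x in range(N_CELLS)]))

def aup_penalty(q_aux_action: np.ndarray, q_aux_noop: np.ndarray) -> float:
    """AUP-style penalty: total change in attainable value over the auxiliary set."""
    return float(np.abs(q_aux_action - q_aux_noop).sum())

# Step right from cell 4, versus the no-op baseline of staying put.
pos_after_move, pos_after_noop = 5, 4

# Auxiliary utilities that only care about a feature the move doesn't touch
# (say, whether a lamp elsewhere is on): attainable values are identical.
q_aux = np.array([1.0, 0.3, 0.7])

print(rr_penalty(pos_after_move, pos_after_noop))  # > 0: every cell's reachability shifted
print(aup_penalty(q_aux, q_aux))                   # 0.0: no auxiliary utility cares
```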