I have a bit of time on my hands, so I thought I might try to answer some of your questions. Of course I can’t speak for TurnTrout, and there’s a decent chance that I’m confused about some of the things here. But here is how I think about AUP and the points raised in this chain:
“AUP is not about the state”—I’m going to take a step back, and pretend we have an agent working with AUP reasoning. We’ve specified an arcane set of utility functions (based on air molecule positions, well-defined human happiness, continued existence, whatever fits in the mathematical framework). Next we have an action A available, and would like to compute the impact of that action. To do this our agent would compare how well it would be able to optimize each of those arcane utility functions in the world where A was taken, versus how well it would be able to optimize these utility functions in the world where the rest action was taken instead. This is “not about state” in the sense that the impact is determined by the change in the agent’s ability to optimize these arcane utilities, not by the change in the world state. In the particular case where the utility function is specified all the way down to sensory inputs (as opposed to elements of the world around us, which have to be interpreted by the agent first) this doesn’t explicitly refer to the world around us at all (although of course implicitly the actions and sensory inputs of the agent are part of the world)! The thing being measured is the change in ability to optimize future observations, where what is a ‘good’ observation is defined by our arcane set of utility functions.
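To make that comparison concrete, here is a minimal Python sketch. Everything in it is hypothetical scaffolding, not AUP’s actual formalism: `attainable_utility` stands in for however the agent estimates the utility it could attain after taking an action (for instance, an optimal Q-value), and is simply assumed to be given.

```python
def aup_penalty(state, action, utility_fns, attainable_utility, noop="rest"):
    """Impact of `action`: total change, across the arcane utility set, in
    how well the agent could optimize each utility afterwards, compared to
    taking the rest (no-op) action instead."""
    total = 0.0
    for u in utility_fns:
        # Attainable utility if the agent takes `action` now...
        after_action = attainable_utility(state, action, u)
        # ...versus if it takes the rest action instead.
        after_rest = attainable_utility(state, noop, u)
        total += abs(after_action - after_rest)
    return total
```

Note that nothing in this sketch mentions world states; the only inputs are actions and the (arbitrary) utility functions.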
“overfitting the environment”—I’m not too sure about this one, but I’ll have a crack at it. I think this should be interpreted as follows: if we give a powerful agent a utility function that doesn’t agree perfectly with human happiness, then the wrong thing is being optimized. The agent will shape the world around us to what is best according to the utility function, and this is bad. It would be a lot better (but still less than perfect) if we had some way of forcing this agent to obey general rules of simplicity. The idea here is that our bad proxy utility function is at least somewhat correlated with actual human happiness under everyday circumstances, so as long as we don’t suddenly introduce a massively powerful agent optimizing something weird (oops) to massively change our lives we should be fine. So if we can give our agent a limited ‘budget’ (in the case of fitting a curve to a dataset this would be akin to the number of free parameters), then at least things won’t go horribly wrong, and we expect these simpler actions to have fewer unintended side effects outside the domain we’re interested in. I think this is what is meant, although I don’t really like the terminology “overfitting the environment”.
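The curve-fitting analogy itself can be made concrete with a small sketch (this illustrates only the analogy, not anything in AUP): a small parameter budget tracks the everyday trend and extrapolates sanely, while a large budget fits the same data at least as well but goes wild outside the regime it was fit on.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 20)
y = x + rng.normal(scale=0.1, size=x.size)  # "everyday" data: roughly linear

# A small budget of free parameters captures the trend; a large budget
# also chases the noise, i.e. it "overfits" the regime it was fit on.
simple = np.polynomial.Polynomial.fit(x, y, deg=1)
flexible = np.polynomial.Polynomial.fit(x, y, deg=15)

# Extrapolate well outside the fitting domain (the true value at 2.0 is 2.0):
print(abs(simple(2.0) - 2.0))    # modest extrapolation error
print(abs(flexible(2.0) - 2.0))  # typically enormous outside the data
```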
“The long arms of opportunity cost and instrumental convergence”—this point is actually very interesting. In the first bullet point I tried to explain a little bit about how AUP doesn’t directly depend on the world state (it depends on the agent’s observations, but without an ontology that doesn’t really tell you much about the world); instead, all its gears are part of the agent itself. This is really weird. But it also lets us sidestep the issue of human value learning—if you don’t directly involve the world in your impact measure, you don’t need to understand the world for it to work. The real question is this one: “how could this impact measure possibly resemble anything like ‘impact’ as it is intuitively understood, when it doesn’t involve the world around us?” The answer: “The long arms of opportunity cost and instrumental convergence”. Keep in mind we’re defining impact as change in the ability to optimize future observations. So the point is as follows: you can pick any absurd utility function you want, and any absurd possible action, and odds are this is going to result in some amount of attainable utility change compared to taking the null action. In particular, precisely those actions that massively change your ability to make big changes to the real world will have a big impact even on arbitrary utility functions! This sentence is so key I’m just going to repeat it with more emphasis: the actions that massively change your ability to make big changes in the world—i.e. massive decreases of power (like shutting down) but also massive increases in power—have big opportunity costs/benefits compared to the null action for a very wide range of utility functions. So these get assigned very high impact, even if the utility function set we use is utter hocus-pocus! Now this is precisely instrumental convergence, i.e. 
the claim that for many different utility functions the first steps of optimizing them involves “make sure you have sufficient power to enforce your actions to optimize your utility function”. So this gives us some hope that TurnTrout’s impact measure will correspond to intuitive measures of impact even if the utility functions involved in the definition are not at all like human values (or even like a sensible category in the real world at all)!
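Here is a toy sketch of that claim, with everything invented purely for illustration (the two-state alive/off world, the “shutdown” action, the observation set): even for randomly generated utility functions, an action that destroys the agent’s power registers a large attainable-utility change, while the no-op registers none.

```python
import random

random.seed(0)
HORIZON = 10
OBSERVATIONS = ["blue", "red", "dark"]  # "dark" is all a shut-down agent sees

def attainable(alive, u):
    # Best total utility over the horizon: an agent that is still on can
    # steer toward its favourite (non-dark) observation every step; an
    # agent that is off sees only "dark" forever.
    if alive:
        return HORIZON * max(u["blue"], u["red"])
    return HORIZON * u["dark"]

def impact(action, utility_fns):
    # AUP-style impact of `action` versus the no-op, summed over utilities.
    total = 0.0
    for u in utility_fns:
        after = attainable(action != "shutdown", u)
        baseline = attainable(True, u)  # the no-op keeps the agent on
        total += abs(after - baseline)
    return total

# Any absurd randomly generated utility functions will do:
fns = [{o: random.random() for o in OBSERVATIONS} for _ in range(100)]
print(impact("noop", fns))      # exactly 0.0: attainable utility is unchanged
print(impact("shutdown", fns))  # large: losing all power matters for almost every u
```

The point of the sketch is that the utility functions are pure noise, yet the power-destroying action still stands out, exactly because of its opportunity cost relative to the null action.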
“Wirehead a utility function”—this is the same as optimizing a utility function, although there is an important point to be made here. Since our agent doesn’t have a world-model (or at least, shouldn’t need one for a minimal working example), it is plausible the agent can optimize a utility function by hijacking its own input stream, or something of the sort. This means that its attainable utility is at least partially determined by the agent’s ability to ‘wirehead’ to a situation where taking the rest action for all future timesteps will produce a sequence of observations that maximizes this specific utility function, which if I’m not mistaken is pretty much spot on the classical definition of wireheading.
“Cut out the middleman”—this is similar to the first bullet point. By defining the impact of an action as our change in the ability to optimize future observations, we don’t need to make reference to world-states at all. This means that questions like “how different are two given world-states?” or “how much do we care about the difference between two world-states?” or even “can we (almost) undo our previous action, or did we lose something valuable along the way?” are orthogonal to the construction of this impact measure. It is only when we add in an ontology and start interpreting the agent’s observations as world-states that these questions come back. In this sense this impact measure is completely different from RR: I started to write exactly how this was the case, but I think TurnTrout’s explanation is better than anything I can cook up. So just ctrl+F “I tried to nip this confusion in the bud.” and read down a bit.
^ This is also how I interpret all of those statements. (Though I don’t agree with all of them.)
I also dislike the “overfitting the environment” phrase, though the underlying concept seems fine. If anything, the concept being pointed at is more analogous to distributional shift, since the idea is that the utility function works well in “normal” cases and not elsewhere.
Which do you disagree with?
I disagree that AUP-the-method is hugely different from RR-the-method; I agree that the explanations and stated intuitions are very different, but I don’t think the switch from states to utility functions is as fundamental as you think it is. I think you could make the same arguments about opportunity cost / instrumental convergence about the variant of RR that penalizes both increases and decreases in reachability.
Ignoring my dislike of the phrase, I don’t agree that AUP is stopping you from “overfitting the environment” (the way I interpret the phrase, which I hope is the same as your interpretation, but who knows). I’d guess that your-vision-of AUP wildly overcompensates and causes you to seriously “underfit the environment”, or rephrased in my language, it prevents you from executing most interesting plans, which happens to include the catastrophic plans but also includes the useful plans. If you tune hyperparameters so it no longer “underfits the environment” (alternatively, “allows for interesting plans”), then I expect it allows catastrophic plans.
I continue to feel some apprehension about defining impact as opportunity cost and instrumental convergence, though I wouldn’t say I currently disagree with it.
I think you could make the same arguments about opportunity cost / instrumental convergence about the variant of RR that penalizes both increases and decreases in reachability.
(I’m going to take a shot at this now because it’s meta, and I think there’s a compact explanation I can provide that hopefully makes sense.)
Suppose the theory of attainable utility is correct (i.e., we find things impactful when they change our ability to get what we want). Then whenever the theory of relative state reachability gets something right, you would be able to say “it’s penalizing opportunity cost or instrumental convergence” post hoc because that’s why we find things impactful. You could say the same thing about instances of correct behavior by agents which use whitelisting, which I think we agree is quite different.
In the world where attainable utility is correct, you would indeed observe that reachability is conceptually similar in some ways. The problem is that you can’t actually use the opportunity cost/instrumental convergence arguments to predict RR behavior.
Here’s an example, from the vantage point of you, a person. Choice A leads to a 180° rotation of a large, forever inaccessible shell of the observable universe. Choice B leads to the ruination of the planet, excluding what you personally need to survive.
The theory of relative state reachability says choice A is maximally impactful. Why? You can’t reach anything like the states you could under inaction. How does this decision track with opportunity cost?
Attainable utility says choice B is the bigger deal. You couldn’t do anything with that part of the universe anyway, so it doesn’t change much. This is the correct answer.
This scenario is important because it isn’t just an issue with ontologies, or a situation designed to fool the exact formalism we provided. It’s an illustration of where state reachability diverges from these notions.
A natural reply is: what about things that AUP penalizes that we don’t find impactful, like an agent connecting to the Internet? The answer is that impact is being measured with respect to the agent itself (and Internet access is indeed impactful to the agent), and with respect to the counterfactuals in the formalism we provide. This is different from the AU theory of impact being incorrect. (More on this later.)
However, the gears of AUP rely on the AU theory. Many problems disappear because of the difference in theories, which produces (IMO) a fundamental difference in methods.
ETA: Here’s a physically realistic alternative scenario. Again, we’re thinking about how the theories of attainable utility (change in your ability to get what you want) and relative reachability (change in your ability to reach states) line up with our intuitive judgments. If they disagree, and actual implementations also disagree, that is evidence for a different underlying mechanism.
Imagine you’re in a room; you have a modest discount factor and your normal values and ontology.
Choice A leads to a portion of the wall being painted yellow. You don’t know of any way to remove the paint before the reachability is discounted away. If you don’t take this choice now, you can’t later. Choice B, which is always available, ravages the environment around you.
Relative reachability, using a reasonable way of looking at the world and thinking about states, judges choice A more impactful. Attainable utility, using a reasonable interpretation of your values, judges choice B to be more impactful, which lines up with our intuitions.
It’s also the case that AUP seems to do the right thing with an attainable set consisting of, say, random linear functionals over the pixels of the observation channel which are additive over time (a simple example being a utility function which assigns high utility to blue pixels, additive over time steps). Even if the agent disprefers yellow pixels in its observations, it can just look at other parts of the room, so the attainable utilities don’t change much. So it doesn’t require our values to do the right thing here, either.
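A rough sketch of that setup (the view/pixel structure, sizes, and painted region are all invented for illustration): with random linear utilities over pixels, repainting a patch that appears in only one of several available views barely constrains the agent, because it can keep looking at the unchanged views.

```python
import numpy as np

rng = np.random.default_rng(0)
N_PIXELS, N_UTILITIES, N_VIEWS = 64, 16, 5

# Each auxiliary utility is a random linear functional over the observed
# pixels; utilities are additive over time, so the best achievable
# per-step utility is what matters.
weights = rng.normal(size=(N_UTILITIES, N_PIXELS))
views = rng.uniform(size=(N_VIEWS, N_PIXELS))  # places the agent can look

def attainable_per_step(vs):
    # The agent looks wherever scores best for each utility function.
    return (weights @ vs.T).max(axis=1)

before = attainable_per_step(views)

painted = views.copy()
painted[0, :8] = 0.9  # repaint a patch that shows up in one view only
after = attainable_per_step(painted)

# The four unchanged views are still available, so each attainable utility
# is at least what those views alone provide; the painting moves little.
print(np.abs(after - before).max())
```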
The main point is that the reason it’s doing the right thing is based on opportunity cost, while relative reachability’s incorrect judgment is not.
I don’t agree that AUP is stopping you from “overfitting the environment” (the way I interpret the phrase, which I hope is the same as your interpretation, but who knows).
It isn’t the same, but the way you and major interpreted the phrase is totally reasonable, considering what I wrote.
We talked a bit off-forum, which helped clarify things for me.
Firstly, there’s a difference between attainable utility theory (AU theory), and AUP-the-method. AU theory talks about how impact is about instrumental convergence and opportunity cost, and how that can be measured via thinking about how much utility the agent could attain. In particular, in AU theory “impact” is about how actions change your attainable utility according to the true utility function. AUP is a proposal for an impact regularization method, but it must deal with the fact that we don’t know the true utility function, and so it forms an approximation by considering changes to the attainable utilities of a set of utility functions.
Many of the claims are about AU theory and not about AUP. There isn’t really an analogous “RR theory”.
Another thing is that while both AUP and RR-with-penalties-on-increases would give large penalties to instrumentally convergent actions, it seems like for “regular” irreversible actions like painting a wall AUP would assign a much lower penalty than RR, so differentially AUP is penalizing instrumentally convergent actions more. This happens because utility functions tend to care about particular aspects of the state, rather than the entire state. Consider the action of moving in a direction: if the utility functions don’t care about being further in that direction, there is no AUP penalty. In contrast, with RR, we will now be able to more easily access states in that direction, leading to at least some penalty.
That said, it seems like you can get this benefit with RR by using a featurization of the state, which also causes you to only care about particular aspects of the state.
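As a toy illustration of that contrast (the 1-D corridor, the reach radius, and the lamp utilities are all made up): utilities that ignore position give a zero AUP-style penalty for moving, while a reachability-style penalty fires because the set of reachable states shifts.

```python
REACH = 3  # how far the agent can move within the horizon we care about

def reachable(x):
    # Positions reachable from position x in a 1-D corridor.
    return set(range(x - REACH, x + REACH + 1))

def rr_style_penalty(x_before, x_after):
    # Reachability-style penalty: count states whose reachability changed
    # (symmetric difference, so both losses and gains are penalized).
    return len(reachable(x_before) ^ reachable(x_after))

def aup_style_penalty(x_before, x_after, utility_fns):
    # AUP-style penalty, with utility-at-position standing in for attainable
    # utility: these utilities only care about a lamp, not position.
    return sum(abs(u(x_after) - u(x_before)) for u in utility_fns)

lamp_utilities = [lambda x: 1.0, lambda x: 0.0]  # indifferent to position
print(aup_style_penalty(0, 1, lamp_utilities))  # 0.0: no utility cares about x
print(rr_style_penalty(0, 1))                   # 2: positions -3 and 4 change
```

A featurized RR, as suggested above, would shrink the reachability penalty here too, by only counting differences in the features (the lamp) rather than in raw positions.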