So there’s a thing people do when they talk about AUP which I don’t understand. They think it’s about state, even though I insist it’s fundamentally different, and try to explain why (note that AUP in the MDP setting is necessarily over states, because states are the observations). My explanations apparently haven’t been very good; in the given conversation, they acknowledge that it’s different, but then regress a little while later. I think they might try to understand the explanation, remain confused, and then subconsciously slip back to their old model. Out of everyone I’ve talked to, I can probably count on my hands the number of people who get this – note that agreeing with specific predictions of mine is different.
Now, it’s the author’s job to communicate their ideas. When I say “as far as I can tell, few others have internalized how AUP actually works”, this doesn’t connote “gosh, I can’t stand you guys, how could you do this”, it’s more like “somehow I messed up the explanations; I wonder what key ideas are missing still? How can I fix this?”.
My goal with this comment isn’t to explain, but rather to figure out what’s happening. Let’s go through some of my past comments about this.
Surprisingly, the problem comes from thinking about “effects on the world”. Let’s begin anew.
…
To scale, relative reachability requires solution of several difficult ontological problems which may not have anything close to a simple core, including both a sensible world state representation and a perfect distance metric. Relative reachability isn’t ontology-agnostic.
…
In the long term, the long arms of opportunity cost and instrumental convergence plausibly allow us to toss in a random set of utility functions. I expect this to work for the same reasons we worry about instrumental convergence to begin with.
I have a theory that AUP seemingly works for advanced agents not because the content of the attainable set’s utilities actually matters, but rather because there exists a common utility achievement currency of power.
…
Here, we’re directly measuring the agent’s power: its ability to wirehead a trivial utility function.
The plausibility of [this] makes me suspect that even though most of the measure in the unbounded case is not concentrated on complex human-relevant utility functions, the penalty still captures shifts in power.
…
By changing our perspective from “what effects on the world are ‘impactful’?” to “how can we stop agents from overfitting their environments?”, a natural, satisfying definition of impact falls out.
When I read this, it seems like I’m really trying to emphasize that I don’t think the direct focus should be on the world state in any way. But it was a long post, and I said a lot of things, so I’m not too surprised.
I tried to nip this confusion in the bud.
“The biggest difference from relative reachability, as I see it, is that you penalize increasing the ability to achieve goals, as well as decreasing it.”
I strongly disagree that this is the largest difference, and I think your model of AUP might be some kind of RR variant.
Consider RR in the real world, as I imagine it (I could be mistaken about the details of some of these steps, but I expect my overall point holds). We receive observations, which, in combination with some predetermined ontology and an observation history → world state function, we use to assign a distribution over possible physical worlds. We also need another model, since we need to know what we can do and reach from a specific world configuration. Then, we calculate another distribution over world states that we’d expect to be in if we did nothing. We also need a distance metric weighting the importance of different discrepancies between states. We have to calculate the coverage reduction of each action-state (or use representative examples, which is also hard-seeming), with respect to each start-state, weighted using our initial and post-action distributions. We also need to figure out which states we care about and which we don’t, so that’s another weighting scheme. But what about ontological shift?
This approach is fundamentally different. We cut out the middleman, considering impact to be a function of our ability to string together favorable action-observation histories, requiring only a normal model. The “state importance”/locality problem disappears. Ontological problems disappear. Some computational constraints (imposed by coverage) disappear. The “state difference weighting” problem disappears. Two concepts of impact are unified.
I’m not saying RR isn’t important—just that it’s quite fundamentally different, and that AUP cuts away a swath of knotty problems because of it.
Even more confusing: when I say “there are fundamental concepts here you’re missing”, people don’t seem to become any less confident in their predictions about what AUP does. If people think that AUP is penalizing effects in the world, why don’t they notice their confusion when they read a comment like the one above?
A little earlier,
Thinking in terms of “effects” seems like a subtle map/territory confusion. That is, it seems highly unlikely that there exists a robust, value-agnostic means of detecting “effects” that makes sense across representations and environments.
As a more obscure example, some people with a state interpretation might wonder why I’m no longer worried about the issues I mentioned in the whitelisting post, since I (strangely, to them) don’t think the representation/state similarity metric matters for AUP:
due to entropy, you may not be able to return to the exact same universe configuration.
(This is actually your “chaotic world” concern.)
Right now, I’m just chalking this up to “Since the explanations don’t make any sense because they’re too inferentially distant/it just looks like I built a palace of equations, it probably seems like I’m not on the same page with their concerns, so there’s nothing to be curious about.” Can you give me some of your perspective? (Others are welcome to chime in.)
To directly answer your question: no, the real world version of AUP which I proposed doesn’t reward based on state, and would not have its behavior influenced solely by different possible arrangements of air molecules. (I guess I’m directly responding to this concern, but I don’t see any other way to get information on why this phenomenon is happening.)
As for the question – I was just curious. I think you’ll see why I asked when I send you some drafts of the new sequence. :)
I think in a conversation I had with you last year, I kept going back to ‘state’ despite protests because I kept thinking “if AUP works, surely it would be because some of the utility functions calculate a sensible state estimate in a humanlike ontology and then define utility from this”. It isn’t necessarily the right way to critique AUP, but I think I was right to think those thoughts conditional on that assumption—i.e., even if it isn’t the argument you’re trying to make for AUP, it seems like a not-unreasonable position to consider, and so thinking about how AUP does in terms of state can be a reasonable and important part of a thought-process assessing AUP. I believe I stopped making the assumption outright at some point, but kept bringing out the assumption as a tool for analysis—for example, supporting a thought experiment with the argument that there would at least be some utility functions which thought about the external world enough to care about such-and-such. I think in our conversation I managed to appropriately flag these sorts of assumptions such that you were OK with the role it was playing in the wider argument (well… not in the sense of necessarily accepting the arguments, but in the sense of not thinking I was just repeatedly making the mistake of thinking it has to be about state, I think).
Other people could be thinking along similar lines without flagging it so clearly.
I also might’ve expected some people to wonder, given their state interpretation, how come I’m not worried about stuff I mentioned in the whitelisting post anymore
I don’t read everything that you write, and when I do read things there seems to be some amount of dropout that occurs resulting in me missing certain clauses (not just in long posts by you, even while proofreading the introduction section of a friend’s paper draft!) that I don’t notice until quizzed in detail—I suspect this is partially due to me applying lossy compression that preserves my first guess about the gist of a paragraph, and maybe partially due to literal saccades while reading. The solution is repetition and redundancy: for example, I assume that you tried to do that in your quotes after the phrase “Let’s go through some of my past comments about this”, but only the quote
[R]elative reachability requires solution of several difficult ontological problems which may not have anything close to a simple core, including both a sensible world state representation and a perfect distance metric
implies to me that we’re moving away from a state-based way of thinking, and it doesn’t directly say anything about AUP.
I don’t read everything that you write, and when I do read things there seems to be some amount of dropout that occurs resulting in me missing certain clauses
Yes, this is fine and understandable. I wasn’t meaning to imply that responsible people should have thought of all these things, but rather pointing to different examples. I’ll edit my phrasing there.
but only the quote
I had a feeling that there was some illusion of transparency (which is why I said “when I read it”), but I had no idea it was that strong. Good data point, thanks.
This isn’t a full response, but it seems to me that Vika is largely talking about problems she perceives with impact measures in general, as defined by “measures of how much impact things have on the world”, and is thinking of AUP as an element of this class (as would I, had I not read this comment). Reasons to think this include:
A perception of your research as primarily being the development of AUP, and of this post as being research for that development and exposition.
The introduction of AUP being in a post titled “Towards a New Impact Measure”.
If AUP is not in fact about restricting an agent’s impact on the world (or, in other words, on the state of the world), then I would describe it as something other than an “impact measure”, since that term is primarily used by people using the way of thinking you denounce (and I believe was invented that way: it seems to have morphed from ‘side effects’, which strongly suggests effects on parts of the world, according to my quick looking-over of the relevant section of Concrete Problems in AI Safety). Perhaps “optimisation regularisation technique” would be better, although I don’t presume to understand your way of thinking about it.
If AUP is not in fact about restricting an agent’s impact on the world (or, in other words, on the state of the world)
So the end result is this, but AUP doesn’t get there by treating impact as a thing that happens primarily to the state; it treats impact as a thing that happens to agents: not “how different is the state?”, but “how big of a deal is this to me?”. The objective is to limit the agent’s impact on us, which I think is the more important thing. I think this still falls under normal colloquial use of ‘impact’, but I agree that this is different from the approaches so far. I’m going to talk about this distinction quite a bit in the future.
Thanks for the detailed explanation—I feel a bit less confused now. I was not intending to express confidence about my prediction of what AU does. I was aware that I didn’t understand the state representation invariance claim in the AUP proposal, though I didn’t realize that it is as central to the proposal as you describe here.
I am still confused about what you mean by penalizing ‘power’ and what exactly it is a function of. The way you describe it here sounds like it’s a measure of the agent’s optimization ability that does not depend on the state at all. Did you mean that in the real world the agent always receives the same AUP penalty no matter which state it is in? If that is what you meant, then I’m not sure how to reconcile your description of AUP in the real world (where the penalty is not a function of the state) and AUP in an MDP (where it is a function of the state). I would find it helpful to see a definition of AUP in a POMDP as an intermediate case.
I agree with Daniel’s comment that if AUP is not penalizing effects on the world, then it is confusing to call it an ‘impact measure’, and something like ‘optimization regularization’ would be better.
Since I still have lingering confusions after your latest explanation, I would really appreciate if someone else who understands this could explain it to me.
I am still confused about what you mean by penalizing ‘power’ and what exactly it is a function of. The way you describe it here sounds like it’s a measure of the agent’s optimization ability that does not depend on the state at all.
It definitely does depend on the state. If the agent moves to a state where it has taken over the world, that’s a huge increase in its ability to achieve arbitrary utility functions, and it would get a large penalty.
I think the claim is more that while the penalty does depend on the state, it’s not central to think about the state to understand the major effects of AUP. (As an analogy, if you want to predict whether I’m about to leave my house, it’s useful to see whether or not I’m wearing shoes, but if you want to understand why I am or am not about to leave my house, whether I’m wearing shoes is not that relevant—you’d want to know what my current subgoal or plan is.)
Similarly, with AUP, the claim is that while you can predict what the penalty is going to be by looking at particular states and actions, and the penalty certainly does change with different states/actions, the overall effect of AUP can be stated without reference to states and actions. Roughly speaking, this is that it prevents agents from achieving convergent instrumental subgoals like acquiring resources (because that would increase attainable utility across a variety of utility functions—this is what is meant by “power”), and it also prevents agents from changing the world irreversibly (because that would make a variety of utility functions much harder to attain).
This is somewhat analogous to the concept of empowerment in ML—while empowerment is defined in terms of states and actions, the hope is that it corresponds to an agent’s ability to influence its environment, regardless of the particular form of state or action representation.
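For concreteness, here is one way to write down the shape of the penalty being described, assuming an MDP with a set of auxiliary utility functions $\mathcal{U}$, Q-values $Q_u$ for each, and a no-op action $\varnothing$ (this is a sketch of the general form, not necessarily the exact definition from the AUP post):

$$\text{Penalty}(s, a) \;=\; \sum_{u \in \mathcal{U}} \bigl|\, Q_u(s, a) - Q_u(s, \varnothing) \bigr|$$

Gaining power pushes many of the $Q_u(s, a)$ up relative to $Q_u(s, \varnothing)$, while irreversibly closing off options pushes them down, so both directions show up as a large penalty.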
Thanks Rohin! Your explanations (both in the comments and offline) were very helpful and clarified a lot of things for me. My current understanding as a result of our discussion is as follows.
AU is a function of the world state, but is intended to capture some general measure of the agent’s influence over the environment that does not depend on the state representation.
Here is a hierarchy of objects, where each object is a function of the previous one: world states / microstates (e.g. quark configuration) → observations (e.g. pixels) → state representation / coarse-graining (which defines macrostates as equivalence classes over observations) → featurization (a coarse-graining that factorizes into features). The impact measure is defined over the macrostates.
Consider the set of all state representations that are consistent with the true reward function (i.e. if two microstates have different true rewards, then their state representation is different). The impact measure is representation-invariant if it has the same values for any state representation in this reward-compatible set. (Note that if representation invariance was defined over the set of all possible state representations, this set would include the most coarse-grained representation with all observations in one macrostate, which would imply that the impact measure is always 0.) Now consider the most coarse-grained representation R that is consistent with the true reward function.
An AU measure defined over R would remain the same for a finer-grained representation. For example, if the attainable set contains a reward function that rewards having a vase in the room, and the representation is refined to distinguish green and blue vases, then macrostates with different-colored vases would receive the same reward. Thus, this measure would be representation-invariant. However, for an AU measure defined over a finer-grained representation (e.g. distinguishing blue and green vases), a random reward function in the attainable set could assign a different reward to macrostates with blue and green vases, and the resulting measure would be different from the measure defined over R.
An RR measure that only uses reachability functions of single macrostates is not representation-invariant, because the observations included in each macrostate depend on the coarse-graining. However, if we allow the RR measure to use reachability functions of sets of macrostates, then it would be representation-invariant if it is defined over R. Then a function that rewards reaching a macrostate with a vase can be defined in a finer-grained representation by rewarding macrostates with green or blue vases. Thus, both AU and this version of RR are representation-invariant iff they are defined over the most coarse-grained representation consistent with the true reward.
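If it helps, here is a toy version of the vase example, just to illustrate the invariance claim; all names and the tiny “attainable utility” function are made up for illustration.

```python
# Observations the agent might receive.
observations = ["green_vase", "blue_vase", "no_vase"]

# Most coarse-grained representation R consistent with the true reward
# ("is there a vase?"), and a finer one that also distinguishes color.
coarse = {"green_vase": "vase", "blue_vase": "vase", "no_vase": "empty"}
fine = {obs: obs for obs in observations}

# A reward function defined over the macrostates of R...
reward_coarse = {"vase": 1.0, "empty": 0.0}
# ...lifted to the finer representation through the coarse-graining,
# so blue and green vases receive the same reward.
reward_fine = {macro: reward_coarse[coarse[obs]] for obs, macro in fine.items()}

def attainable(reward, representation, reachable_obs):
    """Toy 'attainable utility': best reward over observations the agent can still reach."""
    return max(reward[representation[obs]] for obs in reachable_obs)

reachable = ["green_vase", "no_vase"]
assert attainable(reward_coarse, coarse, reachable) == attainable(reward_fine, fine, reachable)
```

Refining the representation changes nothing here because the lifted reward cannot distinguish observations inside a macrostate of R; a random reward defined directly over the finer macrostates could, which is the loss of invariance described above.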
There are various parts of your explanation that I find vague and could use a clarification on:
“AUP is not about state”—what does it mean for a method to be “about state”? Same goes for “the direct focus should not be on the state”—what does “direct focus” mean here?
“Overfitting the environment”—I know what it means to overfit a training set, but I don’t know what it means to overfit an environment.
“The long arms of opportunity cost and instrumental convergence”—what do “long arms” mean?
“Wirehead a utility function”—is this the same as optimizing a utility function?
“Cut out the middleman”—what are you referring to here?
I think these intuitive phrases may be a useful shorthand for someone who already understands what you are talking about, but since I do not understand, I have not found them illuminating.
I sympathize with your frustration about the difficulty of communicating these complex ideas clearly. I think the difficulty is caused by the vague language rather than missing key ideas, and making the language more precise would go a long way.
I have a bit of time on my hands, so I thought I might try to answer some of your questions. Of course I can’t speak for TurnTrout, and there’s a decent chance that I’m confused about some of the things here. But here is how I think about AUP and the points raised in this chain:
“AUP is not about the state”—I’m going to take a step back, and pretend we have an agent working with AUP reasoning. We’ve specified an arcane set of utility functions (based on air molecule positions, well-defined human happiness, continued existence, whatever fits in the mathematical framework). Next we have an action A available, and would like to compute the impact of that action. To do this, our agent compares how well it would be able to optimize each of those arcane utility functions in the world where A was taken, versus how well it would be able to optimize them in the world where the rest action was taken instead. This is “not about state” in the sense that the impact is determined by the change in the agent’s ability to optimize these arcane utilities, not by the change in the world state. In the particular case where the utility function is specified all the way down to sensory inputs (as opposed to elements of the world around us, which have to be interpreted by the agent first), this doesn’t explicitly refer to the world around us at all (although of course implicitly the actions and sensory inputs of the agent are part of the world)! The thing being measured is the change in ability to optimize future observations, where what is a ‘good’ observation is defined by our arcane set of utility functions. (There is a minimal code sketch of this comparison after this list.)
“overfitting the environment”—I’m not too sure about this one, but I’ll have a crack at it. I think it should be interpreted as follows: if we give a powerful agent a utility function that doesn’t agree perfectly with human happiness, then the wrong thing is being optimized. The agent will shape the world around us into whatever is best according to the utility function, and this is bad. It would be a lot better (but still less than perfect) if we had some way of forcing this agent to obey general rules of simplicity. The idea here is that our bad proxy utility function is at least somewhat correlated with actual human happiness under everyday circumstances, so as long as we don’t suddenly introduce a massively powerful agent optimizing something weird (oops) to massively change our lives, we should be fine. So if we can give our agent a limited ‘budget’ (in the case of fitting a curve to a dataset, this would be akin to the number of free parameters), then at least things won’t go horribly wrong, plus we expect these simpler actions to have fewer unintended side effects outside the domain we’re interested in. I think this is what is meant, although I don’t really like the terminology “overfitting the environment”.
“The long arms of opportunity cost and instrumental convergence”—this point is actually very interesting. In the first bullet point I tried to explain a little bit about how AUP doesn’t directly depend on the world state (it depends on the agent’s observations, but without an ontology those don’t really tell you much about the world); instead, all its gears are part of the agent itself. This is really weird. But it also lets us sidestep the issue of human value learning—if you don’t directly involve the world in your impact measure, you don’t need to understand the world for it to work. The real question is this one: “how could this impact measure possibly resemble anything like ‘impact’ as it is intuitively understood, when it doesn’t involve the world around us?” The answer: “the long arms of opportunity cost and instrumental convergence”. Keep in mind we’re defining impact as change in the ability to optimize future observations. So the point is as follows: you can pick any absurd utility function you want, and any absurd possible action, and odds are this is going to result in some amount of attainable utility change compared to taking the null action. In particular, precisely those actions that massively change your ability to make big changes to the real world will have a big impact even on arbitrary utility functions! This sentence is so key I’m just going to repeat it with more emphasis: the actions that massively change your ability to make big changes in the world—i.e. massive decreases of power (like shutting down) but also massive increases in power—have big opportunity costs/benefits compared to the null action for a very wide range of utility functions. So these get assigned very high impact, even if the utility function set we use is utter hocus-pocus! Now this is precisely instrumental convergence, i.e. the claim that for many different utility functions the first steps of optimizing them involve “make sure you have sufficient power to enforce your actions to optimize your utility function”. So this gives us some hope that TurnTrout’s impact measure will correspond to intuitive measures of impact even if the utility functions involved in the definition are not at all like human values (or even like a sensible category in the real world at all)!
“Wirehead a utility function”—this is the same as optimizing a utility function, although there is an important point to be made here. Since our agent doesn’t have a world-model (or at least, shouldn’t need one for a minimal working example), it is plausible the agent can optimize a utility function by hijacking its own input stream, or something of the sort. This means that its attainable utility is at least partially determined by the agent’s ability to ‘wirehead’ to a situation where taking the rest action for all future timesteps will produce a sequence of observations that maximizes this specific utility function, which, if I’m not mistaken, is pretty much spot on the classical definition of wireheading.
“Cut out the middleman”—this is similar to the first bullet point. By defining the impact of an action as our change in the ability to optimize future observations, we don’t need to make reference to world-states at all. This means that questions like “how different are two given world-states?” or “how much do we care about the difference between two world-states?” or even “can we (almost) undo our previous action, or did we lose something valuable along the way?” are orthogonal to the construction of this impact measure. It is only when we add in an ontology and start interpreting the agent’s observations as world-states that these questions come back. In this sense this impact measure is completely different from RR: I started to write exactly how this was the case, but I think TurnTrout’s explanation is better than anything I can cook up. So just ctrl+F “I tried to nip this confusion in the bud.” and read down a bit.
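To make the first bullet concrete, here is a minimal sketch of that comparison, assuming we already have a Q-value estimate for each utility function in the arcane set; the names (q_u, noop) are mine, and the actual proposal also handles scaling and other details.

```python
from typing import Callable, Sequence

# One Q-value estimator per arcane utility: q(state, action) says how well the agent
# could optimize that utility from `state` if it committed to `action` right now.
QValue = Callable[[object, object], float]

def aup_style_penalty(q_u: Sequence[QValue], state, action, noop) -> float:
    """Sum, over the arcane utilities, of how much `action` changes the agent's
    ability to optimize each one, relative to taking the no-op instead."""
    return sum(abs(q(state, action) - q(state, noop)) for q in q_u)
```

Nothing in this computation asks “how different is the resulting world state?”; it only asks how the agent’s attainable utilities shift.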
^ This is also how I interpret all of those statements. (Though I don’t agree with all of them.)
I also dislike the “overfitting the environment” phrase, though the underlying concept seems fine. If anything, the concept being pointed at is more analogous to distributional shift, since the idea is that the utility function works well in “normal” cases and not elsewhere.
Which do you disagree with?
I disagree that AUP-the-method is hugely different from RR-the-method; I agree that the explanations and stated intuitions are very different, but I don’t think the switch from states to utility functions is as fundamental as you think it is. I think you could make the same arguments about opportunity cost / instrumental convergence about the variant of RR that penalizes both increases and decreases in reachability.
Ignoring my dislike of the phrase, I don’t agree that AUP is stopping you from “overfitting the environment” (the way I interpret the phrase, which I hope is the same as your interpretation, but who knows). I’d guess that your-vision-of AUP wildly overcompensates and causes you to seriously “underfit the environment”, or rephrased in my language, it prevents you from executing most interesting plans, which happens to include the catastrophic plans but also includes the useful plans. If you tune hyperparameters so it no longer “underfits the environment” (alternatively, “allows for interesting plans”), then I expect it allows catastrophic plans.
I continue to feel some apprehension about defining impact as opportunity cost and instrumental convergence, though I wouldn’t say I currently disagree with it.
I think you could make the same arguments about opportunity cost / instrumental convergence about the variant of RR that penalizes both increases and decreases in reachability.
(I’m going to take a shot at this now because it’s meta, and I think there’s a compact explanation I can provide that hopefully makes sense.)
Suppose the theory of attainable utility is correct (i.e., we find things impactful when they change our ability to get what we want). Then whenever the theory of relative state reachability gets something right, you would be able to say “it’s penalizing opportunity cost or instrumental convergence” post hoc because that’s why we find things impactful. You could say the same thing about instances of correct behavior by agents which use whitelisting, which I think we agree is quite different.
In the world where attainable utility is correct, you would indeed observe that reachability is conceptually similar in some ways. The problem is that you can’t actually use the opportunity cost/instrumental convergence arguments to predict RR behavior.
Here’s an example, from the vantage point of you, a person. Choice A leads to a 180° rotation of a large, forever inaccessible shell of the observable universe. Choice B leads to the ruination of the planet, excluding what we personally need to survive.
The theory of relative state reachability says choice A is maximally impactful. Why? You can’t reach anything like the states you could under inaction. How does this decision track with opportunity cost?
Attainable utility says choice B is the bigger deal. You couldn’t do anything with that part of the universe anyways, so it doesn’t change much. This is the correct answer.
This scenario is important because it isn’t just an issue with ontologies, or a situation designed to fool the exact formalism we provided. It’s an illustration of where state reachability diverges from these notions.
A natural reply is: what about things that AUP penalizes that we don’t find impactful, like an agent connecting to the Internet? The answer is that impact is being measured with respect to the agent itself (and Internet access is indeed impactful to the agent), and with respect to the counterfactuals in the formalism we provide. This is different from the AU theory of impact being incorrect. (More on this later.)
However, the gears of AUP rely on the AU theory. Many problems disappear because of the difference in theories, which produces (IMO) a fundamental difference in methods.
ETA: Here’s a physically realistic alternative scenario. Again, we’re thinking about how the theories of attainable utility (change in your ability to get what you want) and relative reachability (change in your ability to reach states) line up with our intuitive judgments. If they disagree, and actual implementations also disagree, that is evidence for a different underlying mechanism.
Imagine you’re in a room; you have a modest discount factor and your normal values and ontology.
Choice A leads to a portion of the wall being painted yellow. You don’t know of any way to remove the paint before the reachability is discounted away. If you don’t take this choice now, you can’t later. Choice B, which is always available, ravages the environment around you.
Relative reachability, using a reasonable way of looking at the world and thinking about states, judges choice A more impactful. Attainable utility, using a reasonable interpretation of your values, judges choice B to be more impactful, which lines up with our intuitions.
It’s also the case that AUP seems to do the right thing with an attainable set consisting of, say, random linear functionals over the pixels of the observation channel which are additive over time (a simple example being a utility function which assigns high utility to blue pixels, additive over time steps). Even if the agent disprefers yellow pixels in its observations, it can just look at other parts of the room, so the attainable utilities don’t change much. So it doesn’t require our values to do the right thing here, either.
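Here is a rough toy version of that attainable set, with made-up sizes and a single-timestep simplification (the utilities described above would be additive over time), to show why a local paint job should barely move these attainable utilities while drastically changing the whole room moves them a lot:

```python
import numpy as np

rng = np.random.default_rng(0)

H, W = 16, 16                      # size of one camera frame
scene = rng.random((H, 4 * W))     # a wider "room" the agent can look around in

def observe(scene, gaze):
    """The camera sees an HxW window of the room at horizontal offset `gaze`."""
    return scene[:, gaze:gaze + W]

# Attainable set: random linear functionals over the pixels of the observation channel.
utilities = [rng.normal(size=(H, W)) for _ in range(10)]

def attainable_utilities(scene):
    """Best value each linear utility can reach by choosing where to look."""
    gazes = range(scene.shape[1] - W + 1)
    return np.array([max(float((u * observe(scene, g)).sum()) for g in gazes)
                     for u in utilities])

base = attainable_utilities(scene)

painted = scene.copy()             # choice A: one patch of one wall gets painted
painted[:4, :4] = 1.0

changed = rng.random(scene.shape)  # choice B (stand-in): the whole room is changed drastically

print("penalty for A:", np.abs(attainable_utilities(painted) - base).sum())
print("penalty for B:", np.abs(attainable_utilities(changed) - base).sum())
```

Because the agent can just look at an unpainted part of the room, choice A should leave most of the maxima untouched, while choice B shifts most of them.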
The main point is that the reason it’s doing the right thing is based on opportunity cost, while relative reachability’s incorrect judgment is not.
I don’t agree that AUP is stopping you from “overfitting the environment” (the way I interpret the phrase, which I hope is the same as your interpretation, but who knows).
It isn’t the same, but the way you and major interpreted the phrase is totally reasonable, considering what I wrote.
We talked a bit off-forum, which helped clarify things for me.
Firstly, there’s a difference between attainable utility theory (AU theory), and AUP-the-method. AU theory talks about how impact is about instrumental convergence and opportunity cost, and how that can be measured via thinking about how much utility the agent could attain. In particular, in AU theory “impact” is about how actions change your attainable utility according to the true utility function. AUP is a proposal for an impact regularization method, but it must deal with the fact that we don’t know the true utility function, and so it forms an approximation by considering changes to the attainable utilities of a set of utility functions.
Many of the claims are about AU theory and not about AUP. There isn’t really an analogous “RR theory”.
Another thing is that while both AUP and RR-with-penalties-on-increases would give large penalties to instrumentally convergent actions, it seems like for “regular” irreversible actions like painting a wall AUP would assign a much lower penalty than RR, so differentially AUP is penalizing instrumentally convergent actions more. This happens because utility functions tend to care about particular aspects of the state, rather than all states. Consider the action of moving in a direction: if the utility functions don’t care about being further in that direction, there is no AUP penalty. In contrast, with RR, we will now be able to more easily access states in that direction, leading to at least some penalty.
That said, it seems like you can get this benefit with RR by using a featurization of the state, which also causes you to only care about particular aspects of the state.
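A minimal sketch of the “moving in a direction” comparison, with made-up numbers (a 1-D corridor, discounted reachability as a stand-in for RR’s reachability function, and an attainable set that only cares about a feature movement doesn’t touch):

```python
GAMMA = 0.9
positions = range(5)                    # a 1-D corridor; the agent starts at position 2
after_right, after_noop = 3, 2          # where the agent ends up after "move right" vs. no-op

def reachability(frm, to):
    """Discounted reachability of `to` from `frm`: gamma^(steps needed)."""
    return GAMMA ** abs(frm - to)

# RR-style penalty (the variant that also penalizes increases): how much the
# reachability of each state changes, relative to doing nothing.
rr_penalty = sum(abs(reachability(after_right, s) - reachability(after_noop, s))
                 for s in positions)

# AUP-style penalty with utilities that only depend on whether a lamp is on,
# which moving right doesn't affect, so attainable utility is position-independent.
def attainable(u, pos):
    return u["lamp_on"]                 # `pos` deliberately unused in this toy

utilities = [{"lamp_on": 1.0}, {"lamp_on": 0.3}]
aup_penalty = sum(abs(attainable(u, after_right) - attainable(u, after_noop))
                  for u in utilities)

print(rr_penalty)   # > 0: moving changes which states are easy to reach
print(aup_penalty)  # == 0: no attainable utility changed
```

This is only meant to show the differential emphasis described here; as noted, featurizing the state for RR would close much of this gap.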
These are good questions.
As I mentioned, my goal here isn’t to explain the object level, so I’m going to punt on these for now. I think these will be comprehensible after the sequence, which is being optimized to answer this in the clearest way possible.
“AUP is not about state”—what does it mean for a method to be “about state”?
Here’s a potentially helpful analogy. Imagine I program a calculator. Although its computation is determined by the state of the solar system, the computation isn’t “about” the state of the solar system.