(I didn’t understand this post; this comment is me trying to make sense of it. After writing the comment, I think I understand the post better, and the comment is effectively an answer to it.)
Here’s a picture for full agency, where we want a Cartesian agent that optimizes some utility function over the course of the history of the universe. We’re going to create an algorithm “outside” the universe, but by design we only care about performance during the “actual” time in the universe.
Idealized, Outside-the-universe Algorithm: Simulate the entire history of the universe, from beginning to end, letting the agent take actions along the way. Compute the reward at the end, and use that to improve the agent so it does better. (Don’t use a discount factor; if you have a pure rate of time preference, that should be part of the utility function.) Repeat until the agent is optimal. (Ignore difficulties with optimization.)
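Here is a toy, runnable sketch of that loop (the names simulate_universe, utility, and the random-search update are made-up stand-ins, not anything real; the point is only the shape of the procedure: whole-history rollouts, reward computed once at the end, no discounting, repeat):

```python
# Toy, runnable sketch of the idealized, outside-the-universe loop above.
# Everything here (simulate_universe, utility, the random-search update) is a
# made-up stand-in; only the shape matters: whole-history rollouts, reward
# computed once at the end, no discounting, repeat until (ideally) optimal.
import random

def simulate_universe(policy, seed, horizon=10):
    """Roll out one entire universe-history, letting the agent act along the way."""
    rng = random.Random(seed)
    state, history = 0, []
    for _ in range(horizon):                 # stand-in for "beginning to end of time"
        action = policy(state)
        state = state + action + rng.choice([-1, 0, 1])
        history.append((state, action))
    return history

def utility(history):
    """Utility of the full universe-history, evaluated only at the end (no discount)."""
    return sum(state for state, _ in history)

def make_policy(param):
    return lambda state: 1 if param > 0 else -1

# The outer loop lives "outside" the universe: random local search over policies,
# scored only by end-of-history utility. Optimization difficulties are ignored.
rng = random.Random(0)
best_param, best_score = 0.0, float("-inf")
for step in range(200):
    candidate = best_param + rng.uniform(-0.1, 0.1)
    score = utility(simulate_universe(make_policy(candidate), seed=step))
    if score > best_score:
        best_param, best_score = candidate, score
print("best policy parameter:", round(best_param, 3), "score:", best_score)
```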
Such an agent will exhibit all the aspects of full agency within the universe. If, during the universe-history, something within the universe starts to predict it well, it will behave as within-universe FDT would predict. The agent will not be myopic: it will be optimizing over the entire universe-history. If it ends up in a game with some within-universe agent, it will not play the Nash equilibrium; it will correctly reason about the other agent’s beliefs about it and exploit those as best it can.
Now obviously, this is a Cartesian agent, not an embedded one: the learning procedure takes place “outside the universe”, and any other agents are required to be “in the universe”, which is why everything is straightforward. But it does seem like this gives us full agency.
When both your agent and learning process have to be embedded within the environment, you can’t have this simple story any more. There isn’t a True Embedded Learning Process to find; at the very minimum any such process could be diagonalized against by the environment. Any embedded learning process must be “misspecified” in some way, relative to the idealized learning process above, if you are evaluating on the metric “is the utility function on universe-histories maximized”. (This is part of my intuition against realism about rationality.) These misspecifications lead to “partial agency”.
To add more gears to this: learning algorithms work by generating/collecting data points, and then training an agent on that data, under the assumption that each data point is an iid sample. Since the data points cannot be full universe-histories, they will necessarily leave out some aspects of reality that the Idealized Outside-the-Universe Algorithm could capture. Examples:
- In supervised learning, each data point is a single pair (x,y). The iid assumption means that the algorithm cannot model the fact that y1 could influence the pair (x2,y2), and so the gradients don’t incentivize using that influence.
- In RL, each data point is a small fragment of a universe-history (i.e. an episode). The iid assumption means that the algorithm cannot model the fact that changes to the first fragment can affect future fragments, which leads to myopia (a toy sketch of this case follows the list).
- In closed-source games, each data point is a transcript of what happened in a particular instance of the game. The iid assumption means that the algorithm cannot model the opponent changing its policy, and so treats it as a one-shot game instead of an iterated game. (What exactly happens depends a lot on the specific setup.)
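Here is the toy sketch promised in the RL bullet, with an invented two-action environment: an action that sacrifices within-episode reward to boost the next episode looks worse under per-episode (iid-style) credit assignment, even though it wins across episodes.

```python
# Toy sketch of the RL bullet (the two-action environment here is invented).
# Each episode the agent picks "invest" (1) or "consume" (0): investing costs 1
# reward now but adds 2 to the *next* episode's reward. Per-episode (iid-style)
# credit assignment only ties an episode's return to the action taken inside it,
# so the cross-episode effect is invisible and consuming looks better -- myopia --
# even though always investing does much better over the whole sequence.
import random

def step(action, bonus):
    """Run one episode; return (this episode's reward, bonus left for the next)."""
    if action == 1:        # invest
        return bonus - 1.0, 2.0
    return bonus, 0.0      # consume

# 1) What per-episode credit assignment sees while acting randomly:
rng = random.Random(0)
bonus, totals, counts = 0.0, {0: 0.0, 1: 0.0}, {0: 0, 1: 0}
for _ in range(100_000):
    action = rng.randint(0, 1)
    reward, bonus = step(action, bonus)
    totals[action] += reward
    counts[action] += 1
print("per-episode estimate -- consume:", round(totals[0] / counts[0], 2),
      "invest:", round(totals[1] / counts[1], 2))   # consume looks better by ~1

# 2) What actually matters across episodes:
for name, action in [("always consume", 0), ("always invest", 1)]:
    bonus = total = 0.0
    for _ in range(100):
        reward, bonus = step(action, bonus)
        total += reward
    print(name, "-> total over 100 episodes:", total)  # always invest wins easily
```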
So my position is “partial agency arises because any embedded learning algorithm will necessarily leave out aspects that the idealized learning algorithm can identify”. And as a subclaim, that this often happens because of the effective iid assumption between data points in a learning algorithm.
The reality --> beliefs optimization seems like a different thing: bidirectional optimization of that would correspond to minimizing the delta between beliefs and reality. No one actually wants to literally minimize that; having accurate beliefs is an instrumental goal for some other goal, not a terminal one.
That said, I’m not optimistic about creating incentives for particular kinds of partial agency: as soon as the model is able to reason, it can do all the same reasoning I did, and if it is actually trying to maximize some simple function of universe-histories, then it should move towards full agency upon doing this reasoning.
I wrote up a long reply to this and then accidentally lost it :(
Let me first say that I definitely sympathize with skepticism/confusion about this whole line of thinking.
I roughly agree with your picture of what’s going on with “full agency”—it’s best thought of as fully Cartesian idealized UDT, “learning” by searching for the best policy.
Initially I was on board with your connection to iid, but now I think it’s a red herring.
I illustrated my idea with an iid example, but I can make a similar argument for algorithms which explicitly discard iid, such as Solomonoff induction. Solomonoff induction still won’t systematically learn to produce answers which manipulate the data. This is because SI’s judgement of the quality of a hypothesis doesn’t pay any attention to how dominant the hypothesis was during a given prediction—completely unlike RL, where you need to pay attention to what action you actually took. So if the current-most-probable hypothesis is a manipulator, throwing around its weight to make things easy to predict, and a small-probability hypothesis is “parasitically” taking advantage of the ease of prediction without paying the cost of implementing the manipulative strategy, then the parasite will keep rising in probability until the manipulator no longer has enough weight to shift the output probabilities the way its strategy requires.
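A toy numerical sketch of that dynamic, with a two-hypothesis Bayesian mixture standing in for Solomonoff induction (a large simplification, and the numbers are invented):

```python
# Toy numerical sketch of the parasite dynamic, with a two-hypothesis Bayesian
# mixture standing in for Solomonoff induction (a big simplification; the numbers
# are invented). The data stream is easy to predict -- in the story, *because* the
# manipulator has been steering it -- but the update below only ever multiplies a
# hypothesis's weight by the probability it assigned to what actually happened.
# It never checks which hypothesis was dominant, so the manipulator, which spends
# some probability mass on the outputs that implement its manipulation, steadily
# loses weight to a parasite that just predicts the easy data at full confidence.

MANIPULATOR_P = 0.8   # prob. it assigns to the observed symbol (0.2 goes to manipulation)
PARASITE_P = 0.99     # free-rides on the predictability without paying that cost

weights = {"manipulator": 0.99, "parasite": 0.01}   # manipulator starts dominant
for step in range(1, 61):
    weights["manipulator"] *= MANIPULATOR_P         # Bayes: weight *= likelihood
    weights["parasite"] *= PARASITE_P
    total = sum(weights.values())
    weights = {name: w / total for name, w in weights.items()}
    if step % 10 == 0:
        print(step, {name: round(w, 3) for name, w in weights.items()})
# The manipulator's posterior share shrinks toward zero; once it is no longer
# dominant, it can't shift the mixture's outputs enough to keep manipulating.
```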
So, actually, iid isn’t what’s going on at all, although iid cases do seem like particularly clear illustrations. This further convinces me that there’s an interesting phenomenon to formalize here.
> The reality --> beliefs optimization seems like a different thing: bidirectional optimization of that would correspond to minimizing the delta between beliefs and reality. No one actually wants to literally minimize that; having accurate beliefs is an instrumental goal for some other goal, not a terminal one.
I’m not sure what you’re saying here. I agree that “no one wants that”. That’s what I meant when I said that partial agency seems to be a necessary subcomponent of full agency—even idealized full agents need to implement partial-agency optimizations for certain sub-processes, at least in the one case of reality->belief optimization. (Although, perhaps this is not true, since we should think of full agency as UDT which doesn’t update at all… maybe it is more accurate to say that full-er agents often want to use partial-er optimizations for sub-processes.)
So I don’t know what you mean when you say it seems like a different thing. I agree with Wei’s point that myopia isn’t fully sufficient to get reality->belief directionality; but, at least, it gets a whole lot of it, and reality->belief directionality implies myopia.
> That said, I’m not optimistic about creating incentives for particular kinds of partial agency: as soon as the model is able to reason, it can do all the same reasoning I did, and if it is actually trying to maximize some simple function of universe-histories, then it should move towards full agency upon doing this reasoning.
I’m not sure what you mean here, so let me give another example and see what you think.
Evolution incentivizes a form of partial agency because it incentivizes comparative reproductive advantage, rather than absolute. A gene that reduces the reproductive rate of other organisms is as incentivized as one that increases its own carrier’s. This leads to evolving-to-extinction and other less extreme inefficiencies—this is just usually not that bad, because it is difficult for a gene to reduce the fitness of organisms it isn’t in, and methods of doing so usually have countermeasures. As a result, we can’t exactly think of evolution as optimizing something. It’s myopic in the sense that it prefers genes which are point-improvements for their carriers even at a cost to global fitness; it’s stop-gradient-y in that it optimizes with respect to the relatively fixed population which exists during an organism’s lifetime, ignoring the fact that increasing the frequency of a gene changes that population (thereby creating the maximum-of-a-fixed-point-of-our-maximum effect for evolutionarily stable equilibria).
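A toy replicator-dynamics sketch of the “comparative, not absolute” point, with invented fitness numbers: a gene that boosts its carrier and a gene that handicaps everyone else by the same ratio spread along identical frequency trajectories, even though they have opposite effects on mean absolute fitness.

```python
# Toy replicator-dynamics sketch (invented numbers) of "comparative, not absolute":
# gene A raises its carrier's fitness; gene B instead lowers everyone else's by the
# same *ratio*. Selection only sees the carrier/non-carrier fitness ratio, so both
# genes sweep through the population along identical frequency trajectories, even
# though one raises mean absolute fitness and the other lowers it.

def simulate(carrier_fitness, other_fitness, p0=0.01, generations=60):
    p, trajectory = p0, []
    for _ in range(generations):
        mean_fitness = p * carrier_fitness + (1 - p) * other_fitness
        trajectory.append((p, mean_fitness))
        p = p * carrier_fitness / mean_fitness      # discrete replicator update
    return trajectory

helper = simulate(carrier_fitness=1.2, other_fitness=1.0)     # boosts its carrier
spiteful = simulate(carrier_fitness=0.9, other_fitness=0.75)  # handicaps everyone else

for gen in (0, 20, 40, 59):
    (p_h, w_h), (p_s, w_s) = helper[gen], spiteful[gen]
    print(f"gen {gen:2d}: freq {p_h:.3f} vs {p_s:.3f} | mean fitness {w_h:.3f} vs {w_s:.3f}")
# Identical frequency columns; but mean fitness heads to 1.2 in the first population
# and only to 0.9 in the second, below the 1.0 everyone had before the gene arose.
```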
So, understanding partial agency better could help us think about what kind of agents are incentivized by evolution.
It’s true that a very intelligent organism such as humans can come along and change the rules of the game, but I’m not claiming that incentivising partial agency gets rid of inner alignment problems. I’m only claiming that **if the rules of the game remain intact** we can incentivise partial agency.
Sorry for the very late reply, I’ve been busy :/
To be clear, I don’t think iid explains it in all cases; I also think iid is just a particularly clean example. Hence why I said (emphasis added now):
> So my position is “partial agency arises because any embedded learning algorithm will necessarily leave out aspects that the idealized learning algorithm can identify”. And as a subclaim, that this **often** happens because of the effective iid assumption between data points in a learning algorithm.
Re:
> I’m not sure what you’re saying here. I agree that “no one wants that”.
My point is that the relevant distinction in that case seems to be “instrumental goal” vs. “terminal goal”, rather than “full agency” vs. “partial agency”. In other words, I expect that a map that split things up based on instrumental vs. terminal would do a better job of understanding the territory than one that used full vs. partial agency.
Re: evolution example, I agree that particular learning algorithms can be designed such that they incentivize partial agency. I think my intuition is that all of the particular kinds of partial agency we could incentivize would be too much of a handicap on powerful AI systems (or won’t work at all, e.g. if the way to get powerful AI systems is via mesa optimization).
> I’m only claiming that **if the rules of the game remain intact** we can incentivise partial agency.
Definitely agree with that.
> My point is that the relevant distinction in that case seems to be “instrumental goal” vs. “terminal goal”, rather than “full agency” vs. “partial agency”. In other words, I expect that a map that split things up based on instrumental vs. terminal would do a better job of understanding the territory than one that used full vs. partial agency.
Ah, I see. I definitely don’t disagree that epistemics is instrumental. (Maybe we have some terminal drive for it, but, let’s set that aside.) BUT:
I don’t think we can account for what’s going on here just by pointing that out. Yes, the fact that it’s instrumental means that we cut it off when it “goes too far”, and there’s not a nice encapsulation of what “goes too far” means. However, I think even when we set that aside there’s still an alter-the-map-to-fit-the-territory-not-the-other-way-around phenomenon. IE, yes, it’s a subgoal, but how can we understand the subgoal? Is it best understood as optimization, or something else?
When designing machine learning algorithms, this is essentially built in as a terminal goal; the training procedure incentivises predicting the data, not manipulating it. Or, if it does indeed incentivize manipulation of the data, we would like to understand that better; and we’d like to be able to design things which don’t have that incentive structure.
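As a minimal sketch of that built-in incentive structure (invented toy data): in an ordinary supervised update, the targets enter the loss as constants, so the gradient can only move the model toward the data, never the data toward the model.

```python
# Minimal sketch (invented toy data) of a standard supervised update: the loss is
# differentiated only through the prediction, with the targets treated as constants,
# so the only gradient direction available is "move the map toward the territory".
data = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2)]   # fixed (x, y) pairs: the "territory"

w = 0.0          # the "map": a one-parameter model, prediction = w * x
lr = 0.01
for _ in range(500):
    grad = 0.0
    for x, y in data:
        prediction = w * x
        # d/dw of (prediction - y)^2, with y entering purely as a constant; there is
        # no term that rewards changing y to make it easier to predict.
        grad += 2 * (prediction - y) * x
    w -= lr * grad / len(data)
print("learned w:", round(w, 3))   # ~2.0: the beliefs moved toward the data, not vice versa
```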
> To be clear, I don’t think iid explains it in all cases; I also think iid is just a particularly clean example.
Ah, sorry for misinterpreting you.