I wrote up a long reply to this and then accidentally lost it :(
Let me first say that I definitely sympathize with skepticism/confusion about this whole line of thinking.
I roughly agree with your picture of what’s going on with “full agency”: it’s best thought of as fully Cartesian idealized UDT, “learning” by searching for the best policy.
Initially I was on-board with your connection to iid, but now I think it’s a red herring.
I illustrated my idea with an iid example, but I can make a similar argument for algorithms which explicitly discard iid, such as Solomonoff induction. Solomonoff induction still won’t systematically learn to produce answers which manipulate the data. This is because SI’s judgement of the quality of a hypothesis pays no attention to how dominant that hypothesis was during a given prediction, completely unlike RL, where you need to pay attention to which action you actually took. So suppose the current-most-probable hypothesis is a manipulator, throwing its weight around to make things easy to predict, while a small-probability hypothesis “parasitically” takes advantage of that ease of prediction without paying the cost of implementing the manipulative strategy. The parasite will keep rising in probability until the manipulator no longer has enough weight to shift the output probabilities the way its strategy requires.
So, actually, iid isn’t what’s going on at all, although iid cases do seem like particularly clear illustrations. This further convinces me that there’s an interesting phenomenon to formalize here.
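To make the parasite dynamic concrete, here is a toy Bayesian-mixture simulation (a sketch, not Solomonoff induction itself; the three fixed hypotheses, their predicted probabilities, and the “manipulable world” threshold are all invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Three fixed hypotheses, each predicting P(bit = 1) every round.
# "manipulator" pays a prediction cost (overconfidence) to push the mixture's
# output to an extreme; "parasite" simply predicts the manipulated
# distribution; "honest" predicts a fair coin.
preds = {"manipulator": 0.99, "parasite": 0.90, "honest": 0.50}
weights = {"manipulator": 0.90, "parasite": 0.05, "honest": 0.05}

for t in range(200):
    # The mixture's published prediction (posterior-weighted average).
    p_mix = sum(weights[h] * preds[h] for h in preds)

    # Toy manipulable world: only while the mixture's output is extreme enough
    # does the data become easy to predict; otherwise it is a fair coin.
    p_true = 0.9 if p_mix >= 0.95 else 0.5
    bit = rng.random() < p_true

    # Bayesian update: each hypothesis is scored only on the probability it
    # assigned to the observed bit. Its current weight (how much it influenced
    # the world) plays no role in how it is scored.
    for h in weights:
        weights[h] *= preds[h] if bit else 1.0 - preds[h]
    total = sum(weights.values())
    for h in weights:
        weights[h] /= total

    if t % 20 == 0:
        regime = "manipulated" if p_true == 0.9 else "fair coin"
        print(f"t={t:3d} {regime:11s} " +
              " ".join(f"{h}={weights[h]:.3f}" for h in weights))
```

In typical runs the parasite eats the manipulator’s probability mass, the mixture’s output eventually stops being extreme enough to keep the world in the easy-to-predict regime, and after that the honest fair-coin hypothesis takes over.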
> The reality --> beliefs optimization seems like a different thing: bidirectional optimization of that would correspond to minimizing the delta between beliefs and reality. No one actually wants to literally minimize that; having accurate beliefs is an instrumental goal for some other goal, not a terminal one.
I’m not sure what you’re saying here. I agree that “no one wants that”. That’s what I meant when I said that partial agency seems to be a necessary subcomponent of full agency—even idealized full agents need to implement partial-agency optimizations for certain sub-processes, at least in the one case of reality->belief optimization. (Although, perhaps this is not true, since we should think of full agency as UDT which doesn’t update at all… maybe it is more accurate to say that full-er agents often want to use partial-er optimizations for sub-processes.)
So I don’t know what you mean when you say it seems like a different thing. I agree with Wei’s point that myopia isn’t fully sufficient to get reality->belief directionality; but it gets a lot of the way there, and reality->belief directionality does imply myopia.
> That said, I’m not optimistic about creating incentives for particular kinds of partial agency: as soon as the model is able to reason, it can do all the same reasoning I did, and if it is actually trying to maximize some simple function of universe-histories, then it should move towards full agency upon doing this reasoning.
I’m not sure what you mean here, so let me give another example and see what you think.
Evolution incentivizes a form of partial agency because it rewards comparative reproductive advantage rather than absolute reproductive success. A gene that reduces the reproductive rate of other organisms is as incentivized as one which increases that of its own carriers. This leads to evolving-to-extinction and other, less extreme, inefficiencies; it’s just usually not that bad, because it is difficult for a gene to reduce the fitness of organisms it isn’t in, and methods of doing so usually have countermeasures. As a result, we can’t exactly think of evolution as optimizing something. It’s myopic in the sense that it prefers genes which are point-improvements for their carriers even at a cost to global fitness; it’s stop-gradient-y in that it optimizes with respect to the relatively fixed population which exists during an organism’s lifetime, ignoring the fact that increasing the frequency of a gene changes that population (thereby creating the maximum-of-a-fixed-point-of-our-maximum effect for evolutionarily stable equilibria).
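As a toy illustration of the comparative-vs-absolute point, here is a minimal replicator-dynamics sketch (the fitness model and every number in it are assumptions made up for this example): an allele with a small private edge keeps spreading because selection only compares it against the population as it currently is, even though its spread drags mean fitness below replacement and shrinks the population.

```python
# Illustrative replicator dynamics: a "spiteful" allele gets a small private
# benefit b while its presence imposes a larger shared cost s on everyone,
# so it is always favoured *relative* to the wild type even as absolute
# (mean) fitness, and eventually the population itself, declines.
b, s = 0.05, 0.20            # private benefit vs. shared harm (assumed values)
base_growth = 1.05           # wild-type growth rate in a spite-free population
p, N = 0.01, 1_000_000.0     # initial allele frequency and population size

for gen in range(200):
    w_wild = base_growth - s * p         # everyone pays for the spite around them
    w_spite = base_growth - s * p + b    # ...but carriers also keep their edge
    w_mean = (1 - p) * w_wild + p * w_spite

    # Selection only compares w_spite against w_wild at the *current*
    # frequency p; it never asks what w_mean will be once the allele is common.
    p = p * w_spite / w_mean
    N = N * w_mean

    if gen % 25 == 0:
        print(f"gen {gen:3d}  freq={p:.3f}  mean fitness={w_mean:.3f}  pop={N:,.0f}")
```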
So, understanding partial agency better could help us think about what kind of agents are incentivized by evolution.
It’s true that a very intelligent organism such as humans can come along and change the rules of the game, but I’m not claiming that incentivising partial agency gets rid of inner alignment problems. I’m only claiming that **if the rules of the game remain intact** we can incentivise partial agency.
Sorry for the very late reply, I’ve been busy :/
To be clear, I don’t think iid explains it in all cases; I also think iid is just a particularly clean example. Hence why I said (emphasis added now):
> So my position is “partial agency arises because any embedded learning algorithm will necessarily leave out aspects that the idealized learning algorithm can identify”. And as a subclaim, that this *often* happens because of the effective iid assumption between data points in a learning algorithm.
Re:
> I’m not sure what you’re saying here. I agree that “no one wants that”.
My point is that the relevant distinction in that case seems to be “instrumental goal” vs. “terminal goal”, rather than “full agency” vs. “partial agency”. In other words, I expect that a map that split things up based on instrumental vs. terminal would do a better job of understanding the territory than one that used full vs. partial agency.
Re: evolution example, I agree that particular learning algorithms can be designed such that they incentivize partial agency. I think my intuition is that all of the particular kinds of partial agency we could incentivize would be too much of a handicap on powerful AI systems (or won’t work at all, e.g. if the way to get powerful AI systems is via mesa optimization).
> I’m only claiming that **if the rules of the game remain intact** we can incentivise partial agency.

Definitely agree with that.
> My point is that the relevant distinction in that case seems to be “instrumental goal” vs. “terminal goal”, rather than “full agency” vs. “partial agency”. In other words, I expect that a map that split things up based on instrumental vs. terminal would do a better job of understanding the territory than one that used full vs. partial agency.
Ah, I see. I definitely don’t disagree that epistemics is instrumental. (Maybe we have some terminal drive for it, but, let’s set that aside.) BUT:
I don’t think we can account for what’s going on here just by pointing that out. Yes, the fact that it’s instrumental means we cut it off when it “goes too far”, and there’s no nice encapsulation of what “goes too far” means. But even setting that aside, there’s still an alter-the-map-to-fit-the-territory, not-the-other-way-around phenomenon. That is: yes, it’s a subgoal, but how should we understand that subgoal? Is it best understood as optimization, or as something else?
When designing machine learning algorithms, this is essentially built in as a terminal goal: the training procedure incentivises predicting the data, not manipulating it. Or, to the extent that it does incentivise manipulation of the data, we would like to understand that better, and we’d like to be able to design things which don’t have that incentive structure.
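As a minimal sketch of what “built in” means here (the dataset and model are made up), note that in a standard supervised update the labels enter only as fixed targets: the gradient can move the beliefs toward the data, but there is no term through which making future data easier to predict could ever be rewarded.

```python
import numpy as np

# Logistic regression on a fixed batch. The labels y appear in the update
# only as constants to be matched (effectively behind a stop-gradient), so
# the training procedure rewards predicting the data, not manipulating it.
rng = np.random.default_rng(0)
X = rng.normal(size=(256, 3))                    # features (fixed dataset)
y = (X @ np.array([1.5, -2.0, 0.5]) > 0) * 1.0   # labels, treated as given
w = np.zeros(3)

for step in range(500):
    p = 1.0 / (1.0 + np.exp(-X @ w))     # model's predicted P(y = 1)
    grad = X.T @ (p - y) / len(y)        # d(log loss)/dw, with y held constant
    w -= 0.5 * grad                      # update beliefs toward the data

print("learned weights:", np.round(w, 2))
```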
Ah, sorry for misinterpreting you.