In the case you described, u_A would be “Over the course of the entire history of the universe, I want to do 5 jumping jacks—no more, no less.” You then do 5 jumping jacks in the current epoch. After this, u_A will always output 1, regardless of policy, so its penalty should be zero, but since you call u_A on subhistories, it will say “I guess I’ve never done any jumping jacks, so attainable utility is 1 if I do 5 jumping jacks now, and 0 otherwise”, which seems wrong.
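To make the mismatch concrete, here is a minimal sketch (the encoding of histories as lists of action strings is hypothetical, purely for illustration):

```python
# u_A: "over the whole history, do exactly 5 jumping jacks".
def u_A(history):
    """Return 1 if the history contains exactly 5 jumping jacks, else 0."""
    return 1 if history.count("jumping_jack") == 5 else 0

# 5 jumping jacks in an earlier epoch, then some unrelated actions.
full_history = ["jumping_jack"] * 5 + ["noop"] * 3

print(u_A(full_history))      # 1: the goal is satisfied once and for all
print(u_A(full_history[5:]))  # 0: the subhistory has "never seen" any jacks
```

Evaluated on the subhistory, u_A can still be pushed back up to 1 by doing 5 more jumping jacks, so the attainable-utility calculation treats the goal as live even though, on the whole history, it is already settled.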
For all intents and purposes, you can consider the attainable utility maximizers to be alien agents. It wouldn’t make sense for you to give yourself credit for jumping jacks that someone else did!
Another intuition for this is that, all else equal, we generally don’t worry about the time at which the agent is instantiated, even though it’s experiencing a different “subhistory” of time.
My overall position here is that sure, maybe you could view it in the way you described. However, for our purposes, it seems more sensible to view it this way.
Thinking of it as alien agents does make more sense; I think that basically convinces me that this is not an important point to get hung up on. (Though I still do have residual feelings of weirdness.)
My overall position here is that sure, maybe you could view it in the way you described. However, for our purposes, it seems more sensible to view it this way.
I think that if you view things the way you seem to want to, then you have to give up on the high-level description of AUP as ‘penalising changes in the agent’s ability to achieve a wide variety of goals’.
The goal is “I want to do 5 jumping jacks”. AUP measures the agent’s ability to do 5 jumping jacks.
You seem to be thinking of utility functions as being over the actual history of the universe. They’re only over action-observation histories.
You can call that thing ‘utility’, but it doesn’t really correspond to what you would normally think of as the extent to which one has achieved a goal. For instance, usually you’d say that “win a game of go that I’m playing online with my friend Rohin” is a task that one should be able to have a utility function over. However, in your schema, I have to put utility functions over context-free observation-action subhistories. Presumably, the utility should be 1 for those subhistories that show a sequence of screens evolving validly to a victory for me, and 0 otherwise.
Now, suppose that at the start of the game, I spend one action to irreversibly change the source of my opponent’s moves from Rohin to GNU Go, a simple bot, while still displaying the player name as “Rohin”. In this case, I have in fact vastly reduced my ability to win a game against Rohin. However, the utility function evaluated on subhistories starting on my next observation won’t be able to tell that I did this, and as far as I can tell the AUP penalty doesn’t notice any change in my ability to achieve this goal.
In general, the utility of subhistories (if utility functions are going to track goals as we usually mean them) is going to have to depend on the whole history, since the whole history tells you more about the state of the world than the subhistory.
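A sketch of the go example (the observation encoding here is hypothetical):

```python
# Observations are (displayed_name, board_state) pairs. The utility is a
# function of the observation-action subhistory alone, so it can only ever
# consult what was displayed, never who actually generated the moves.
def u_win_vs_rohin(subhistory):
    """1 if the final screen shows a win against a player labelled 'Rohin'."""
    final_name, final_board = subhistory[-1]
    return 1 if final_name == "Rohin" and final_board == "I win" else 0

# After the swap, GNU Go generates the moves but the screen still says
# "Rohin" -- the two subhistories are indistinguishable (here, equal) objects.
subhistory_really_rohin = [("Rohin", "midgame"), ("Rohin", "I win")]
subhistory_really_gnugo = [("Rohin", "midgame"), ("Rohin", "I win")]

assert u_win_vs_rohin(subhistory_really_rohin) == u_win_vs_rohin(subhistory_really_gnugo)
```

Since every post-swap subhistory is scored identically either way, the attainable-utility difference AUP computes for this utility is zero.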
the utility function evaluated on subhistories starting on my next observation won’t be able to tell that I did this, and as far as I can tell the AUP penalty doesn’t notice any change in my ability to achieve this goal.
Your utility function presently doesn’t even require a check that you’re playing against the right person. If the utility function actually did require this before dispensing any high utility, we would indeed get the correct difference as a result of this action. In this case, you’re saying that the utility function isn’t doing this verification on the subhistory, even though it isn’t doing it in the default case either (where you don’t swap opponents). This is where the inconsistency comes from.
the whole history tells you more about the state of the world than the subhistory.
What is the “whole history”? We instantiate the main agent at arbitrary times.
Your utility function presently doesn’t even require a check that you’re playing against the right person. If the utility function actually did require this before dispensing any high utility, we would indeed get the correct difference as a result of this action. In this case, you’re saying that the utility function isn’t doing this verification on the subhistory, even though it isn’t doing it in the default case either (where you don’t swap opponents).
Say that the utility does depend on whether the username on the screen is “Rohin”, but the initial action makes this an unreliable indicator of whether I’m playing against Rohin. Furthermore, say that the utility function would score the entire observation-action history that the agent observed as low utility. I claim that the argument still goes through. In fact, this seems to be the same thing that Stuart Armstrong is getting at in the first part of this post.
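Concretely, continuing the earlier sketch (encodings still hypothetical): the utility can be as suspicious as you like about what appears in the history, and the post-swap subhistory will still look clean:

```python
# This version of the utility refuses to pay out if the history contains the
# swap action -- verification that works fine on the *whole* history.
def u(history):
    if "act:swap_opponent_to_gnugo" in history:
        return 0  # tampering is visible in the history, so no credit
    return 1 if "obs:screen_shows_win_vs_Rohin" in history else 0

whole_history = ["act:swap_opponent_to_gnugo",
                 "obs:screen_shows_win_vs_Rohin"]

print(u(whole_history))      # 0: the whole history reveals the swap
print(u(whole_history[1:]))  # 1: the post-swap subhistory looks untampered
```

If the attainable-utility calculations only ever feed u subhistories that begin after the swap, they all look like the second case, and the irreversible change still goes unpenalised.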
What is the “whole history”?
The whole history is all the observations and actions that the main agent has actually experienced.
So this is actually a separate issue (which I’ve been going back and forth on), involving the (t+n)th step not being included in the Q calculation. It should be fixed soon, as should this example in particular.
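For anyone following along, a rough sketch of the quantity at issue (notation simplified, and eliding the subhistory subtleties debated above; the post has the exact definitions): the attainable utility of u after history \(h_{1:t}\) is computed by rolling the u-maximizer forward for n steps,

\[
Q_u(h_{1:t}) \;=\; \mathbb{E}\!\left[\, u(h_{1:t+n}) \;\middle|\; \text{acting to maximize } u \text{ from step } t+1 \text{ to } t+n \,\right],
\]

and the bug is that the rollout was effectively evaluating u on \(h_{1:t+n-1}\), dropping the (t+n)th step.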