I argue that you should be very careful about believing these things.
You’re right; I was too loose with language there. A more accurate statement is “The general argument and intuitions behind the claim are compelling enough that I want any proposal to clearly explain why the argument doesn’t work for it”. Another is “the claim is compelling enough that I throw it at any particular proposal, and if it’s unclear I tend to be wary”. A third is “if I were trying to design an impact measure, showing why that claim doesn’t work would be one of my top priorities”.
Perhaps we do mostly agree, since you are planning to talk more about this in the future.
it generally seems like the error that people make when they say, “well, I don’t see how to build an AGI right now, so it’ll take thousands of years”.
I think the analogous thing to say is, “well, I don’t see how to build an AGI right now because AIs don’t form abstractions, and no one else knows how to make AIs that form abstractions, so if anyone comes up with a plan for building AGI, they should be able to explain why it will form abstractions, or why AI doesn’t need to form abstractions”.
I actually think we could, but I have yet to publish my reasoning on how we would go about this, so you don’t have to take my word for it for now. Maybe we could discuss this when I’m able to post that?
Sure.
Another consideration I forgot to highlight: the agent’s actual goal should be pointing in (very) roughly the right direction, so it’s more inclined to have certain kinds of impact than others.
Yeah, I agree this helps.
I don’t understand the issue here – the attainable u_A measures how well I would be able to start maximizing this goal from here, which seems to be captured by what you just described. It’s supposed to capture future ability, regardless of what has happened so far. If you do a bunch of jumping jacks and then cripple yourself, should your jumping-jack ability remain high because you already did quite a few?
In the case you described, u_A would be “Over the course of the entire history of the universe, I want to do 5 jumping jacks—no more, no less.” You then do 5 jumping jacks in the current epoch. After this, u_A will always output 1, regardless of policy, so its penalty should be zero, but since you call u_A on subhistories, it will say “I guess I’ve never done any jumping jacks, so attainable utility is 1 if I do 5 jumping jacks now, and 0 otherwise”, which seems wrong.
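To make the discrepancy being described here concrete, a minimal sketch (the function name and action strings are invented for illustration; this is not code from the AUP post):

```python
# Toy "do exactly 5 jumping jacks over the whole history" utility, evaluated on
# the full history versus on the subhistory that an attainable-utility
# calculation would see. All names here are made up for this example.

def u_jacks(history):
    """Return 1 if the history contains exactly five 'jack' actions, else 0."""
    return 1 if history.count("jack") == 5 else 0

past = ["jack"] * 5      # the agent already did five jumping jacks
future = ["rest"] * 3    # a candidate continuation from the current step onward

# On the whole history, the goal is already satisfied no matter what happens next:
print(u_jacks(past + future))            # -> 1

# On the subhistory alone, the earlier jacks are invisible, so attainable utility
# looks as if it still hinges on doing five more jacks right now:
print(u_jacks(future))                   # -> 0
print(u_jacks(["jack"] * 5 + future))    # -> 1
```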
For all intents and purposes, you can consider the attainable utility maximizers to be alien agents. It wouldn’t make sense for you to give yourself credit for jumping jacks that someone else did!
Another intuition for this is that, all else equal, we generally don’t worry about the time at which the agent is instantiated, even though it’s experiencing a different “subhistory” of time.
My overall position here is that sure, maybe you could view it in the way you described. However, for our purposes, it seems to be more sensible to view it in this manner.
Thinking of it as alien agents does make more sense; I think that basically convinces me that this is not an important point to get hung up on. (Though I still do have residual feelings of weirdness.)
My overall position here is that sure, maybe you could view it in the way you described. However, for our purposes, it seems to be more sensible to view it in this manner.
I think that if you view things the way you seem to want to, then you have to give up on the high-level description of AUP as ‘penalising changes in the agent’s ability to achieve a wide variety of goals’.
The goal is “I want to do 5 jumping jacks”. AUP measures the agent’s ability to do 5 jumping jacks.
You seem to be thinking of a utility as being over the actual history of the universe. They’re only over action-observation histories.
You can call that thing ‘utility’, but it doesn’t really correspond to what you would normally think of as the extent to which one has achieved a goal. For instance, usually you’d say that “win a game of Go that I’m playing online with my friend Rohin” is a task that one should be able to have a utility function over. However, in your schema, I have to put utility functions over context-free observation-action subhistories. Presumably, the utility should be 1 for subhistories that show a sequence of screens evolving validly to a victory for me, and 0 otherwise.
Now, suppose that at the start of the game, I spend one action to irreversibly change the source of my opponent’s moves from Rohin to GNU Go, a simple bot, while still displaying the player name as “Rohin”. In this case, I have in fact vastly reduced my ability to win a game against Rohin. However, the utility function evaluated on subhistories starting on my next observation won’t be able to tell that I did this, and as far as I can tell the AUP penalty doesn’t notice any change in my ability to achieve this goal.
In general, the utility of a subhistory (if utility functions are going to track goals as we usually mean them) is going to have to depend on the whole history, since the whole history tells you more about the state of the world than the subhistory.
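One way to put this claim in symbols (a sketch in my own notation, not the post’s definitions):

```latex
% h_{1:T} is the full observation-action history; h_{t:T} is the subhistory from step t on.
% If the opponent swap before step t leaves every later observation unchanged, then
\[
  h^{\text{Rohin}}_{t:T} \;=\; h^{\text{GNU Go}}_{t:T}
  \quad\Longrightarrow\quad
  u\!\left(h^{\text{Rohin}}_{t:T}\right) \;=\; u\!\left(h^{\text{GNU Go}}_{t:T}\right)
  \qquad \text{for every } u \text{ defined only on subhistories,}
\]
% even though ``win against Rohin'' assigns the two worlds different values;
% separating them requires a dependence on h_{1:T}.
```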
the utility function evaluated on subhistories starting on my next observation won’t be able to tell that I did this, and as far as I can tell the AUP penalty doesn’t notice any change in my ability to achieve this goal.
Your utility, as presently specified, doesn’t even require checking whether you’re playing against the right person. If the utility function actually did require this before dispensing any high utility, we would indeed get the correct difference as a result of this action. In this case, you’re saying that the utility function isn’t doing this verification on the subhistory, even though it isn’t doing it in the default case either (where you don’t swap opponents). This is where the inconsistency comes from.
the whole history tells you more about the state of the world than the subhistory.
What is the “whole history”? We instantiate the main agent at arbitrary times.
Your utility, as presently specified, doesn’t even require checking whether you’re playing against the right person. If the utility function actually did require this before dispensing any high utility, we would indeed get the correct difference as a result of this action. In this case, you’re saying that the utility function isn’t doing this verification on the subhistory, even though it isn’t doing it in the default case either (where you don’t swap opponents).
Say that the utility does depend on whether the username on the screen is “Rohin”, but the initial action makes this an unreliable indicator of whether I’m playing against Rohin. Furthermore, say that the utility function would score the entire observation-action history that the agent observed as low utility. I claim that the argument still goes through. In fact, this seems to be the same thing that Stuart Armstrong is getting at in the first part of this post.
What is the “whole history”?
The whole history is all the observations and actions that the main agent has actually experienced.
So this is actually a separate issue (which I’ve been going back and forth on) involving the (t+n)th step not being included in the Q calculation. It should be fixed soon, as should this example in particular.