I don’t really know how I’m supposed to change your mind in such cases. If it’s by coming up with a concrete example where things clearly fail, I don’t think I can do that, and we should probably end this conversation. I’ve outlined some ways in which I think things could fail, but anything involving all possible utility functions and reasoning about long-term convergent instrumental goals is sufficiently imprecise that I can’t be certain that anything in particular would fail.
I don’t think you need to change my mind here, because I agree with you. I was careful to emphasize that I don’t claim AUP is presently AGI-safe. It seems like we’ve just been able to blow away quite a few impossible-seeming issues that had previously afflicted impact measures, and from my personal experience, the framework seems flexible and amenable to further improvement.
What I’m arguing is specifically that we shouldn’t say it’s impossible to fix these weird aspects. First, because similar predictions have proven inaccurate in the past, and second, because it generally seems like the error that people make when they say, “well, I don’t see how to build an AGI right now, so it’ll take thousands of years”. How long have we spent trying to fix these issues? I doubt I’ve seriously thought about how to relax AUP for more than five minutes.
In sum, I am arguing that the attitude right now should not be that this method is safe, but rather that we seem leaps and bounds closer to the goal, and we have reason to be somewhat optimistic about our chances of fixing the remaining issues.
if we wanted to build von Neumann probes but couldn’t without the AI’s help, it couldn’t do that for us.
I actually think we could, but I have yet to publish my reasoning on how we would go about this, so you don’t need to take my word for it just yet. Maybe we could discuss this when I’m able to post that?
See paragraph above about why approval makes me happier but doesn’t fully remove my worries.
Another consideration I forgot to highlight: the agent’s actual goal should be pointing in (very) roughly the right direction, so it’s more inclined to have certain kinds of impact than others.
By default I’d expect this to knock out half of all actions, which is quite a problem for small, granular action sets.
This is a great point.
Uh, I thought I gave a very strong one—you can’t encode the utility function “I want to do X exactly once”. Let’s consider “I want to do X exactly once, on the first timestep”. You could try to encode this by writing u_A = 1 if a_1 = X, and 0 otherwise. Since you apply u_A to different subhistories, this actually wants you to take action X as the first action of every epoch. If you’re using the full history for action selection, that may not be the case, but the attainable utility calculation will definitely think “the attainable utility for u_A is 1 if I can take action X at time step t+n+1, and 0 otherwise”, even if you have already taken action X.
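As a toy sketch of what I mean (u_A’s encoding, the history format, and the action name “X” below are just illustrative stand-ins, not anything specified in the post):

```python
# Toy encoding of "take action X exactly once, on the first timestep" as a
# utility function over action-observation histories (illustrative names only).
def u_A(history):
    """history: list of (action, observation) pairs; 1 iff the first action is X."""
    return 1 if history and history[0][0] == "X" else 0

full_history = [("X", "o1"), ("wait", "o2"), ("wait", "o3")]
subhistory = full_history[1:]  # what a later attainable-utility evaluation sees

print(u_A(full_history))  # 1: X was taken on the first timestep
print(u_A(subhistory))    # 0: the subhistory has "forgotten" that X already happened,
                          #    so attainable u_A still looks like it can be driven back
                          #    to 1 by taking X again
```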
I don’t understand the issue here – the attainable u_A is measuring “how well would I be able to start maximizing this goal from here?”, and that seems to be exactly the behavior you just described. It’s supposed to capture the future ability, regardless of what has happened so far. If you do a bunch of jumping jacks and then cripple yourself, should your jumping jack ability remain high because you already did quite a few?
It’s an argument that’s quantifying over all possible implementations of impact measures. The claim is “you cannot satisfy the conjunction of three desiderata—objectivity (no dependence on values), safety (preventing any catastrophic plans) and non-trivialness (the AI is still able to do useful things)”. I certainly haven’t proven this claim, nor have I given such a strong argument that everyone should mostly believe it, but I do currently believe this claim.
I argue that you should be very careful about believing these things. I think that a lot of the reason why we had such difficulty with impact measures was because of incorrectly believing things like this. This isn’t to say that you’re wrong, but rather that we should be extremely cautious about these beliefs in general. Universal quantifiers are strong, and it’s often hard to distinguish between “it really can’t be done” and “I don’t presently see how to do it”.
This seems to apply to any AUP-agent that is substantially value aware through approval.
“If we substantially base our impact measure on some kind of value learning”. There is no value-learning input required.
I argue that you should be very careful about believing these things.
You’re right, I was too loose with language there. A more accurate statement is “The general argument and intuitions behind the claim are compelling enough that I want any proposal to clearly explain why the argument doesn’t work for it”. Another statement is “the claim is compelling enough that I throw it at any particular proposal, and if it’s unclear I tend to be wary”. Another one is “if I were trying to design an impact measure, showing why that claim doesn’t work would be one of my top priorities”.
Perhaps we do mostly agree, since you are planning to talk more about this in the future.
it generally seems like the error that people make when they say, “well, I don’t see how to build an AGI right now, so it’ll take thousands of years”.
I think the analogous thing to say is, “well, I don’t see how to build an AGI right now because AIs don’t form abstractions, and no one else knows how to make AIs that form abstractions, so if anyone comes up with a plan for building AGI, they should be able to explain why it will form abstractions, or why AI doesn’t need to form abstractions”.
I actually think we could, but I have yet to publish my reasoning on how we would go about this, so you don’t need to take my word for it just yet. Maybe we could discuss this when I’m able to post that?
Sure.
Another consideration I forgot to highlight: the agent’s actual goal should be pointing in (very) roughly the right direction, so it’s more inclined to have certain kinds of impact than others.
Yeah, I agree this helps.
I don’t understand the issue here – the attainable u_A is measuring “how well would I be able to start maximizing this goal from here?”, and that seems to be exactly the behavior you just described. It’s supposed to capture the future ability, regardless of what has happened so far. If you do a bunch of jumping jacks and then cripple yourself, should your jumping jack ability remain high because you already did quite a few?
In the case you described, u_A would be “Over the course of the entire history of the universe, I want to do 5 jumping jacks—no more, no less.” You then do 5 jumping jacks in the current epoch. After this, u_A will always output 1, regardless of policy, so its penalty should be zero, but since you call u_A on subhistories, it will say “I guess I’ve never done any jumping jacks, so attainable utility is 1 if I do 5 jumping jacks now, and 0 otherwise”, which seems wrong.
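Here is a minimal sketch of what I mean; the encoding and the action names are mine, purely for illustration:

```python
# Toy "over the whole history, do exactly 5 jumping jacks" utility.
def u_jj(history):
    count = sum(1 for (action, _) in history if action == "jumping_jack")
    return 1 if count == 5 else 0

full_history = [("jumping_jack", "o")] * 5 + [("rest", "o")] * 3
later_subhistory = full_history[5:]  # what gets scored after the current epoch

print(u_jj(full_history))      # 1: the goal has been achieved, permanently
print(u_jj(later_subhistory))  # 0: the subhistory "forgets" the 5 jumping jacks,
                               #    so attainable utility demands 5 more
```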
For all intents and purposes, you can consider the attainable utility maximizers to be alien agents. It wouldn’t make sense for you to give yourself credit for jumping jacks that someone else did!
Another intuition for this is that, all else equal, we generally don’t worry about the time at which the agent is instantiated, even though it’s experiencing a different “subhistory” of time.
My overall position here is that sure, maybe you could view it in the way you described. However, for our purposes, it seems to be more sensible to view it in this manner.
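Here’s a rough sketch of the reading I have in mind (an entirely toy setup with one fixed observation and constant-action policies; this is not the actual Q calculation from the post):

```python
# "Alien agent" reading of attainable utility: how much u_A a fresh maximizer
# could get starting from right now, with no credit for anything the main
# agent has already done.
def attainable_utility(u_A, current_obs, actions, horizon):
    best = 0
    for a in actions:  # each constant policy "always take action a"
        future_subhistory = [(a, current_obs)] * horizon  # future steps only
        best = max(best, u_A(future_subhistory))
    return best

def u_jj(history):  # the toy jumping-jacks utility from above
    count = sum(1 for (action, _) in history if action == "jumping_jack")
    return 1 if count == 5 else 0

print(attainable_utility(u_jj, "o", ["jumping_jack", "rest"], horizon=5))  # 1: a fresh agent could still do 5
print(attainable_utility(u_jj, "o", ["rest"], horizon=5))                  # 0: e.g. after being crippled
```

The point is just that nothing in this calculation looks at what the main agent did before now.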
Thinking of it as alien agents does make more sense; I think that basically convinces me that this is not an important point to get hung up on. (Though I still do have residual feelings of weirdness.)
My overall position here is that sure, maybe you could view it in the way you described. However, for our purposes, it seems to be more sensible to view it in this manner.
I think that if you view things the way you seem to want to, then you have to give up on the high-level description of AUP as ‘penalising changes in the agent’s ability to achieve a wide variety of goals’.
The goal is “I want to do 5 jumping jacks”. AUP measures the agent’s ability to do 5 jumping jacks.
You seem to be thinking of utilities as being over the actual history of the universe. They’re only over action-observation histories.
You can call that thing ‘utility’, but it doesn’t really correspond to what you would normally think of as the extent to which one has achieved a goal. For instance, usually you’d say that “win a game of go that I’m playing online with my friend Rohin” is a task that one should be able to have a utility function over. However, in your schema, I have to put utility functions over context-free observation-action subhistories. Presumably, the utility should be 1 for subhistories that show a sequence of screens evolving validly to a victory for me, and 0 otherwise.
Now, suppose that at the start of the game, I spend one action to irreversibly change the source of my opponent’s moves from Rohin to GNU Go, a simple bot, while still displaying the player name as “Rohin”. In this case, I have in fact vastly reduced my ability to win a game against Rohin. However, the utility function evaluated on subhistories starting on my next observation won’t be able to tell that I did this, and as far as I can tell the AUP penalty doesn’t notice any change in my ability to achieve this goal.
In general, the utility of subhistories (if utility functions are going to track goals as we usually mean them) is going to have to depend on the whole history, since the whole history tells you more about the state of the world than the subhistory.
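To make this concrete, here is a minimal sketch of the kind of subhistory-level utility I mean (the observation encoding is invented purely for illustration, and the check that the screens evolve validly is omitted):

```python
# Toy "win the online go game I'm playing with Rohin" utility over
# observation-action subhistories: 1 iff the screens end in a win for me.
def u_win_this_game(subhistory):
    observations = [obs for (_, obs) in subhistory]
    if not observations:
        return 0
    return 1 if observations[-1]["result"] == "win_for_me" else 0

# The swap to GNU Go happened *before* this subhistory starts, and nothing on
# the later screens records it, so neither this utility nor the penalty built
# on it registers any loss in my ability to beat Rohin specifically.
later_subhistory = [
    ("play_move", {"board": "...", "result": "ongoing"}),
    ("play_move", {"board": "...", "result": "win_for_me"}),
]
print(u_win_this_game(later_subhistory))  # 1
```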
the utility function evaluated on subhistories starting on my next observation won’t be able to tell that I did this, and as far as I can tell the AUP penalty doesn’t notice any change in my ability to achieve this goal.
Your utility presently doesn’t even require a check of whether you’re playing against the right person. If the utility function actually did require this before dispensing any high utility, we would indeed get the correct difference as a result of this action. In this case, you’re saying that the utility function doesn’t do this verification on the subhistory, even though it doesn’t do it in the default case either (where you don’t swap opponents). This is where the inconsistency comes from.
the whole history tells you more about the state of the world than the subhistory.
What is the “whole history”? We instantiate the main agent at arbitrary times.
Your utility presently doesn’t even require a check of whether you’re playing against the right person. If the utility function actually did require this before dispensing any high utility, we would indeed get the correct difference as a result of this action. In this case, you’re saying that the utility function doesn’t do this verification on the subhistory, even though it doesn’t do it in the default case either (where you don’t swap opponents).
Say that the utility does depend on whether the username on the screen is “Rohin”, but the initial action makes this an unreliable indicator of whether I’m playing against Rohin. Furthermore, say that the utility function would score the entire observation-action history that the agent observed as low utility. I claim that the argument still goes through. In fact, this seems to be the same thing that Stuart Armstrong is getting at in the first part of this post.
What is the “whole history”?
The whole history is all the observations and actions that the main agent has actually experienced.
So this is actually a separate issue (which I’ve been going back and forth on) involving the t+nth step not being included in the Q calculation. It should be fixed soon, as should this example in particular.