Nice job! This does meet a bunch of desiderata in impact measures that weren’t there before :)
My main critique is that it’s not clear to me that an AUP-agent would be able to do anything useful, and I think this should be included as a desideratum. I wrote more about this on the desiderata post, but it’s worth noting that the impact penalty that is always 1.01 meets all of the desiderata except natural kind.
For example, perhaps the action used to define the impact unit is well-understood and accepted, but any other action makes humans a little bit more likely to turn off the agent. Then the agent won’t be able to take those actions. Generally, I think that it’s hard to satisfy the conjunction of three desiderata—objectivity (no dependence on values), safety (preventing any catastrophic plans) and non-trivialness (the AI is still able to do some useful things).
Questions and comments:
We now formalize impact as change in attainable utility. One might imagine this being with respect to the utilities that we (as in humanity) can attain. However, that’s pretty complicated, and it turns out we get more desirable behavior by using the agent’s attainable utilities as a proxy.
An impact measure that penalized change in utility attainable by humans seems pretty bad—the AI would never help us do anything. To the extent that the AI’s ability to do things is meant to be similar to our ability to do things, I would expect that to be bad for us in the same way.
Breaking a vase seems like it is restricting outcome space. Do you think it is an example of opportunity cost? That doesn’t feel right to me, but I suspect I could be quickly convinced.
Nitpick: Overfitting typically refers to situations where the training distribution _does_ equal the test distribution (but the training set is different from the test set, since they are samples from the same distribution).
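(A standard illustration of this usage, not specific to AUP — the model can overfit even though train and test come from the same distribution:)

```python
# Tiny illustration: train and test are drawn from the *same* distribution,
# yet the model can still overfit -- near-perfect training accuracy, worse
# test accuracy.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, flip_y=0.2, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)  # unconstrained depth => memorizes noise
print(model.score(X_tr, y_tr), model.score(X_te, y_te))         # e.g. ~1.0 vs. noticeably lower
```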
One might intuitively define “bad impact” as “decrease in our ability to achieve our goals”.
Nitpick: This feels like a definition of “bad outcomes” to me, not “bad impact”.
we avoid overfitting the environment to an incomplete utility function and thereby achieve low impact.
This sounds very similar to me to “let’s have uncertainty over the utility function and be risk-averse” (similar to eg. Inverse Reward Design), but the actual method feels nothing like that, especially since we penalize _increases_ in our ability to pursue other goals.
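(For reference, the penalty under discussion is roughly of the following form — this is my paraphrase, omitting details like the attainable-utility horizon and the ImpactUnit normalization:)

$$\text{Penalty}(a) \;=\; \sum_{u \in \mathcal{U}} \big|\, Q_u(h, a) - Q_u(h, \varnothing) \,\big|$$

where h is the history so far and ∅ is the no-op. The absolute value is what makes increases in attainable utility count as impact, not just decreases.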
I view Theorem 1 as showing that the penalty biases the agent towards inaction (as opposed to eg. showing that AUP measures impact, or something like that). Do you agree with that?
Random note: Theorem 1 depends on U containing all computable utility functions, and may not hold for other sets of utility functions, even infinite ones. Consider an environment where breaking vases and flowerpots is irreversible. Let u_A be 1 if you stand at a particular location and 0 otherwise. Let U contain only utility functions that assign different weights to having intact vases vs. flowerpots, but always assigns 0 utility to environments with broken vases and flowerpots. (There are infinitely many of these.) Then if you start in a state with broken vases and flowerpots, there will never be any impact penalty for any action.
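(A minimal sketch of this construction in code — names and the exact penalty form are my own simplification, assuming the penalty sums absolute changes in attainable utility over U:)

```python
# Toy version of the counterexample: every u in U assigns 0 utility whenever
# both vases and flowerpots are broken, so from a broken-start state no action
# changes any attainable utility, and the penalty is always 0.

def make_u(vase_weight, pot_weight):
    def u(state):
        if state["vases_broken"] and state["pots_broken"]:
            return 0.0
        return vase_weight * (not state["vases_broken"]) + pot_weight * (not state["pots_broken"])
    return u

# Stand-in for the (infinite) set of utility functions weighting vases vs. flowerpots.
U = [make_u(v, p) for v in range(1, 6) for p in range(1, 6) if v != p]

def attainable(u, state):
    # Breakage is irreversible, so in this toy setting the best achievable
    # utility from `state` is just u(state).
    return u(state)

def penalty(state_after_action, state_after_noop):
    return sum(abs(attainable(u, state_after_action) - attainable(u, state_after_noop)) for u in U)

broken_start = {"vases_broken": True, "pots_broken": True}
print(penalty(broken_start, broken_start))  # 0.0 -- no action is ever penalized
```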
To prevent the agent from intentionally increasing ImpactUnit, simply apply 1.01 penalty to any action which is expected to do so.
How do you tell which action is expected to do so?
Simple extensions of this idea drastically reduce the chance that a_unit happens to have unusually-large objective impact; for example, one could set ImpactUnit to be the non-zero minimum of the impacts of 50 similar actions.
I think this makes it much more likely that your AI is unable to do anything. (This is an example of why I wanted a desideratum of “your AI is able to do things”.)
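(A quick sketch of why, with made-up numbers, assuming the penalty is scaled by dividing by ImpactUnit:)

```python
# The proposed tweak: ImpactUnit = smallest non-zero impact among ~50 similar
# calibration actions. Taking a minimum can only make ImpactUnit smaller, and a
# smaller unit scales every penalty up, so more actions become too "impactful".
candidate_impacts = [0.0, 0.8, 1.3, 0.02, 0.5]            # impacts of "similar" actions (made up)
impact_unit = min(x for x in candidate_impacts if x > 0)  # 0.02 instead of, say, 0.8

def scaled_penalty(raw_penalty):
    return raw_penalty / impact_unit                      # smaller unit => larger scaled penalty

print(scaled_penalty(0.4))  # 20.0 -- a modest raw impact now looks enormous
```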
We crisply defined instrumental convergence and opportunity cost and proved their universality.
I’m not sure what this is referring to. Are the crisp definitions the increase/decrease in available outcome-space? Where was the proof of universality?
An alternative definition such as “an agent’s ability to take the outside view on its own value-learning algorithm’s efficacy in different scenarios” implies a value-learning setup which AUP does not require.
That definition can be relaxed to “an agent’s ability to take the outside view on the trustworthiness of its own algorithms” to get rid of the value-learning setup. How does AUP fare on this definition?
I also share several of Daniel’s thoughts, for example, that utility functions on subhistories are sketchy (you can’t encode the utility function “I want to do X exactly once ever”), and that the “no offsetting” desideratum may not be one we actually want (and similarly for the “shutdown safe” desideratum as you phrase it), and that as a result there may not be any impact measure that we actually want to use.
(Fwiw, I think that when Daniel says he thinks offsetting is useful and I say that I want as a desideratum “the AI is able to do useful things”, we’re using similar intuitions, but this is entirely a guess that I haven’t confirmed with Daniel.)
Fwiw, I think that when Daniel says he thinks offsetting is useful and I say that I want as a desideratum “the AI is able to do useful things”, we’re using similar intuitions, but this is entirely a guess that I haven’t confirmed with Daniel.
Update: we discussed this, and came to the conclusion that these aren’t based on similar intuitions.
it’s worth noting that the impact penalty that is always 1.01 meets all of the desiderata except natural kind.
But natural kind is a desideratum! I’m thinking about adding one, though.
I think that it’s hard to satisfy the conjunction of three desiderata—objectivity (no dependence on values), safety (preventing any catastrophic plans) and non-trivialness (the AI is still able to do some useful things).
So notice that although AUP is by design value agnostic, it has moderate value awareness via approval. I think this helps us get around some issues you may be considering—I expect the approval incentives to be fairly strong.
any other action makes humans a little bit more likely to turn off the agent.
This is maybe true, and I note it in Future Directions. So I go back and forth on whether this is good or not. Imagine action a is desirable and sufficiently low-impact to be chosen, except there’s random approval noise. Then the more we approve of the action, the closer the mean noise is to 0 and the more likely it is that the agent takes the action.
Or this could be too restrictive—I honestly don’t know yet.
An impact measure that penalized change in utility attainable by humans seems pretty bad—the AI would never help us do anything. To the extent that the AI’s ability to do things is meant to be similar to our ability to do things, I would expect that to be bad for us in the same way.
You might not be considering the asymmetry imposed by approval.
Breaking a vase seems like it is restricting outcome space. Do you think it is an example of opportunity cost?
Yes, because you’re sacrificing world-with-vase-in-it (or future energy to get back to similar outcomes). You’re imposing a change to expedite your current goals in a way that isn’t trivially-reversible. Now, it isn’t a large cost, but it is a cost.
Overfitting typically refers to situations where the training distribution does equal the test distribution (but the training set is different from the test set, since they are samples from the same distribution).
Is this not covered by “in the limit of data sampled”? If so, I’ll tweak.
I view Theorem 1 as showing that the penalty biases the agent towards inaction (as opposed to eg. showing that AUP measures impact, or something like that). Do you agree with that?
I view it as saying “there’s no clever complete plan which moves you towards your goal while not changing other things” (ofer has an interesting example for incomplete plans which doesn’t trigger Theorem 1’s conditions). This implies somewhat that it’s measuring impact in a universal way, although it only holds for all computable u.
Theorem 1 depends on U containing all computable utility functions, and may not hold for other sets of utility functions, even infinite ones.
Yes, this is true, although I think there are informal reasons to suspect it holds in the real world for many finite sets (due to power). As long as it isn’t always 0, that is!
How do you tell which action is expected to do so?
Any action for which E[Penalty(a_unit)] is strictly increased?
I think this makes it much more likely that your AI is unable to do anything. (This is an example of why I wanted a desideratum of “your AI is able to do things”.)
Yes, and I think we probably want to avoid this. I focused on ensuring no bad things are allowed. I don’t think it’ll be too hard to ease up in certain ways while maintaining safety.
I’m not sure what this is referring to. Are the crisp definitions the increase/decrease in available outcome-space? Where was the proof of universality?
Theorem 1.
That definition can be relaxed to “an agent’s ability to take the outside view on the trustworthiness of its own algorithms” to get rid of the value-learning setup. How does AUP fare on this definition?
Generally more cautious. AUP agents seemingly won’t generally override us, which is probably fine for low impact.
that utility functions on subhistories are sketchy (you can’t encode the utility function “I want to do X exactly once ever”)
My model strongly disagrees with this intuition, and I’d be interested in hearing more arguments for it.
that as a result there may not be any impact measure that we actually want to use.
This seems extremely premature. I agree that AUP should be more lax in some ways. The conclusion “looks maybe impossible, then” doesn’t seem to follow. Why don’t we just tweak the formulation? I mean, I’m one guy who worked on this for two months. People shouldn’t take this to be the best possible formulation.
On the meta level: I think our disagreements seem of this form:
Me: This particular thing seems strange and doesn’t gel with my intuitions, here’s an example.
You: That’s solved by this other aspect here.
Me: But… there’s no reason to think that the other aspect captures the underlying concept.
You: But there’s no actual scenario where anything bad happens.
Me: But if you haven’t captured the underlying concept I wouldn’t be surprised if such a scenario exists, so we should still worry.
There are two main ways to change my mind in these cases. First, you could argue that you actually have captured the underlying concept, by providing an argument that your proposal does everything that the underlying concept would do. The argument should quantify over “all possible cases”, and is stronger the fewer assumptions it has on those cases. Second, you could convince me that the underlying concept is not important, by appealing to the desiderata behind my underlying concept and showing how those desiderata are met (in a similar “all possible cases” way). In particular, the argument “we can’t think of any case where this is false” is unlikely to change my mind—I’ve typically already tried to come up with a case where it’s false and not been able to come up with anything convincing.
I don’t really know how I’m supposed to change your mind in such cases. If it’s by coming up with a concrete example where things clearly fail, I don’t think I can do that, and we should probably end this conversation. I’ve outlined some ways in which I think things could fail, but anything involving all possible utility functions and reasoning about long-term convergent instrumental goals is sufficiently imprecise that I can’t be certain that anything in particular would fail.
(That’s another thing causing a lot of disagreements, I think—I am much more skeptical of any informal reasoning about all computable utility functions, or reasoning that depends upon particular aspects of the environment, than you seem to be.)
I’m going to try to use this framework in some of my responses.
But natural kind is a desideratum! I’m thinking about adding one, though.
Here, the “example” is the impact penalty that is always 1.01, the “other aspect” is “natural kind”, and the “underlying concept” is that an impact measure should allow the AI to do things.
Arguably 1.01 is a natural kind—is it not natural to think “any action that’s different from inaction is impactful”? I legitimately find 1.01 more natural than AUP—it is _really strange_ to me to penalize changes in Q-values in _both directions_. This is an S1 intuition, don’t take it seriously—I say it mainly to make the point that natural kind is subjective, whereas the fact that 1.01 is a bad impact penalty is not subjective.
So notice that although AUP is by design value agnostic, it has moderate value awareness via approval. I think this helps us get around some issues you may be considering—I expect the approval incentives to be fairly strong.
Here, the “example” is how other actions might make us more likely to turn off the agent, the “other aspect” is value awareness via approval, and the “underlying concept” is something like “can the agent do things that it knows we want”.
Here, I’m pretty happy about value awareness via approval because it seems like it could capture a good portion of the underlying concept, but I think that’s not clearly true—value awareness via approval depends a lot on the environment, and only gets some of it. If unaligned aliens were going to take over the AI, or we’re going to get wiped out by an asteroid, the AI couldn’t stop that from happening even though it knows we’d want it to. Similarly, if we wanted to build von Neumann probes but couldn’t without the AI’s help, it couldn’t do that for us. Invoking the framework again, the “example” is building von Neumann probes, the “other aspect” might be something like “building a narrow technical AI that just creates von Neumann probes and places them outside the AI’s control”, and the “underlying concept” is “the AI should be able to do what we want it to do”.
You might not be considering the asymmetry imposed by approval.
See paragraph above about why approval makes me happier but doesn’t fully remove my worries.
I view it as saying “there’s no clever complete plan which moves you towards your goal while not changing other things” (ofer has an interesting example for incomplete plans which doesn’t trigger Theorem 1’s conditions). This implies somewhat that it’s measuring impact in a universal way, although it only holds for all computable u.
When utility functions are on full histories I’d disagree with this (Theorem 1 feels decidedly trivial in that case), but it’s possible that utility functions on subhistories are different, so perhaps I’ll wait until I understand that better.
Any action for which E[Penalty(a_unit)] is strictly increased?
By default I’d expect this to knock out half of all actions, which is quite a problem for small, granular action sets.
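(A rough sketch of the worry, with illustrative numbers:)

```python
# If any action whose expected Penalty(a_unit) exceeds the current expectation
# gets the 1.01 penalty, then absent a systematic downward bias roughly half
# of a fine-grained action set ends up forbidden. Numbers are made up.
import random

random.seed(0)
baseline = 1.0
actions = [("a%d" % i, baseline + random.gauss(0, 0.1)) for i in range(100)]
forbidden = [name for name, expected_unit_penalty in actions if expected_unit_penalty > baseline]
print(len(forbidden), "of", len(actions), "actions forbidden")  # ~50 on average
```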
My model strongly disagrees with this intuition, and I’d be interested in hearing more arguments for it.
Uh, I thought I gave a very strong one—you can’t encode the utility function “I want to do X exactly once”. Let’s consider “I want to do X exactly once, on the first timestep”. You could try to do this by writing u_A = 1 if a_1 = X, and 0 otherwise. Since you apply u_A on different subhistories, this actually wants you to take action X on the first action of every epoch. If you’re using the full history for action selection, that may not be the case, but the attainable utility calculation will definitely think “The attainable utility for u_A is 1 if I can take action X at time step t+n+1, and 0 otherwise” _even if_ you have already taken action X.
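(A toy encoding of this, with the epoch/subhistory handling simplified relative to the post:)

```python
# u_A is meant to be "do X exactly once, on the first timestep", but when it is
# applied to each subhistory separately it rewards taking X at the start of
# every epoch, not just once ever.
X = "X"

def u_A(history):                       # history: list of actions
    return 1.0 if history and history[0] == X else 0.0

full_history = [X, "noop", "noop", "noop"]
epochs = [full_history[:2], full_history[2:]]  # two subhistories

print(u_A(full_history))                # 1.0 -- X was done exactly once
print([u_A(e) for e in epochs])         # [1.0, 0.0] -- the second epoch "wants" X again
```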
This seems extremely premature. I agree that AUP should be more lax in some ways. The conclusion “looks maybe impossible, then” doesn’t seem to follow. Why don’t we just tweak the formulation? I mean, I’m one guy who worked on this for two months. People shouldn’t take this to be the best possible formulation.
The claim I’m making has nothing to do with AUP. It’s an argument that’s quantifying over all possible implementations of impact measures. The claim is “you cannot satisfy the conjunction of three desiderata—objectivity (no dependence on values), safety (preventing any catastrophic plans) and non-trivialness (the AI is still able to do useful things)”. I certainly haven’t proven this claim, nor have I given such a strong argument that everyone should mostly believe it, but I do currently believe this claim.
AUP might get around this by not being objective—that’s what value awareness through approval does. And in fact I think the more you think that value awareness through approval is important, the less that AUP meets your original desideratum of being value-agnostic—quoting from the desiderata post:
If we substantially base our impact measure on some kind of value learning—you know, the thing that maybe fails—we’re gonna have a bad time.
This seems to apply to any AUP-agent that is substantially value aware through approval.
From the desiderata post comments:
This criticism of impact measures doesn’t seem falsifiable? Or maybe I misunderstand.
That was an example meant to illustrate my model that impact (the concept in my head, not AUP) and values are sufficiently different that an impact measure couldn’t satisfy all three of objectivity, safety, and non-trivialness. The underlying model is falsifiable.
People have yet to point out a goal AUP cannot maximize in a low-impact way. Instead, certain methods of reaching certain goals are disallowed. These are distinct flaws, with the latter only turning into the former (as I understand it) if no such method exists for any given goal.
See first paragraph about our disagreements. But also I weakly claim that “design an elder-care robot” is a goal that AUP cannot maximize in a low-impact way today, or that if it can, there exists a (u_A, plan) pair such that AUP executes the plan and causes a catastrophe. (This mostly comes from my model that impact and values are fairly different, and to a lesser extent the fact that AUP penalizes everything some amount that’s not very predictable, and that a design for an elder-care robot could allow humans to come up with a design for unaligned AGI.) I would not make this claim if I thought that value awareness through approval and intent verification were strong effects, but in that case I would think of AUP as a value learning approach, not an impact measure.
I don’t really know how I’m supposed to change your mind in such cases. If it’s by coming up with a concrete example where things clearly fail, I don’t think I can do that, and we should probably end this conversation. I’ve outlined some ways in which I think things could fail, but anything involving all possible utility functions and reasoning about long-term convergent instrumental goals is sufficiently imprecise that I can’t be certain that anything in particular would fail.
I don’t think you need to change my mind here, because I agree with you. I was careful to emphasize that I don’t claim AUP is presently AGI-safe. It seems like we’ve just been able to blow away quite a few impossible-seeming issues that had previously afflicted impact measures, and from my personal experience, the framework seems flexible and amenable to further improvement.
What I’m arguing is specifically that we shouldn’t say it’s impossible to fix these weird aspects. First, due to the inaccuracy of similar predictions in the past, and second, because it generally seems like the error that people make when they say, “well, I don’t see how to build an AGI right now, so it’ll take thousands of years”. How long have we spent trying to fix these issues? I doubt I’ve seriously thought about how to relax AUP for more than five minutes.
In sum, I am arguing that the attitude right now should not be that this method is safe, but rather that we seem leaps and bounds closer to the goal, and we have reason to be somewhat optimistic about our chances of fixing the remaining issues.
if we wanted to build von Neumann probes but couldn’t without the AI’s help, it couldn’t do that for us.
I actually think we could, but I have yet to publish my reasoning on how we would go about this, so you don’t need to take my word for now. Maybe we could discuss this when I’m able to post that?
See paragraph above about why approval makes me happier but doesn’t fully remove my worries.
Another consideration I forgot to highlight: the agent’s actual goal should be pointing in (very) roughly the right direction, so it’s more inclined to have certain kinds of impact than others.
By default I’d expect this to knock out half of all actions, which is quite a problem for small, granular action sets.
This is a great point.
Uh, I thought I gave a very strong one—you can’t encode the utility function “I want to do X exactly once”. Let’s consider “I want to do X exactly once, on the first timestep”. You could try to do this by writing u_A = 1 if a_1 = X, and 0 otherwise. Since you apply u_A on different subhistories, this actually wants you to take action X on the first action of every epoch. If you’re using the full history for action selection, that may not be the case, but the attainable utility calculation will definitely think “The attainable utility for u_A is 1 if I can take action X at time step t+n+1, and 0 otherwise” even if you have already taken action X.
I don’t understand the issue here – the attainable u_A is measuring “how well would I be able to start maximizing this goal from here?”, which seems to be captured by what you just described. It’s supposed to capture the future ability, regardless of what has happened so far. If you do a bunch of jumping jacks, and then cripple yourself, should your jumping jack ability remain high because you already did quite a few?
It’s an argument that’s quantifying over all possible implementations of impact measures. The claim is “you cannot satisfy the conjunction of three desiderata—objectivity (no dependence on values), safety (preventing any catastrophic plans) and non-trivialness (the AI is still able to do useful things)”. I certainly haven’t proven this claim, nor have I given such a strong argument that everyone should mostly believe it, but I do currently believe this claim.
I argue that you should be very careful about believing these things. I think that a lot of the reason why we had such difficulty with impact measures was because of incorrectly believing things like this. This isn’t to say that you’re wrong, but rather that we should be extremely cautious about these beliefs in general. Universal quantifiers are strong, and it’s often hard to distinguish between “it really can’t be done”, and “I don’t presently see how to do it”.
This seems to apply to any AUP-agent that is substantially value aware through approval.
“If we substantially base our impact measure on some kind of value learning”. There is no value-learning input required.
I argue that you should be very careful about believing these things.
You’re right, I was too loose with language there. A more accurate statement is “The general argument and intuitions behind the claim are compelling enough that I want any proposal to clearly explain why the argument doesn’t work for it”. Another statement is “the claim is compelling enough that I throw it at any particular proposal, and if it’s unclear I tend to be wary”. Another one is “if I were trying to design an impact measure, showing why that claim doesn’t work would be one of my top priorities”.
Perhaps we do mostly agree, since you are planning to talk more about this in the future.
it generally seems like the error that people make when they say, “well, I don’t see how to build an AGI right now, so it’ll take thousands of years”.
I think the analogous thing to say is, “well, I don’t see how to build an AGI right now because AIs don’t form abstractions, and no one else knows how to make AIs that form abstractions, so if anyone comes up with a plan for building AGI, they should be able to explain why it will form abstractions, or why AI doesn’t need to form abstractions”.
I actually think we could, but I have yet to publish my reasoning on how we would go about this, so you don’t need to take my word for now. Maybe we could discuss this when I’m able to post that?
Sure.
Another consideration I forgot to highlight: the agent’s actual goal should be pointing in (very) roughly the right direction, so it’s more inclined to have certain kinds of impact than others.
Yeah, I agree this helps.
I don’t understand the issue here – the attainable u_A is measuring “how well would I be able to start maximizing this goal from here?”, which seems to be captured by what you just described. It’s supposed to capture the future ability, regardless of what has happened so far. If you do a bunch of jumping jacks, and then cripple yourself, should your jumping jack ability remain high because you already did quite a few?
In the case you described, u_A would be “Over the course of the entire history of the universe, I want to do 5 jumping jacks—no more, no less.” You then do 5 jumping jacks in the current epoch. After this, u_A will always output 1, regardless of policy, so its penalty should be zero, but since you call u_A on subhistories, it will say “I guess I’ve never done any jumping jacks, so attainable utility is 1 if I do 5 jumping jacks now, and 0 otherwise”, which seems wrong.
In the case you described, u_A would be “Over the course of the entire history of the universe, I want to do 5 jumping jacks—no more, no less.” You then do 5 jumping jacks in the current epoch. After this, u_A will always output 1, regardless of policy, so its penalty should be zero, but since you call u_A on subhistories, it will say “I guess I’ve never done any jumping jacks, so attainable utility is 1 if I do 5 jumping jacks now, and 0 otherwise”, which seems wrong.
For all intents and purposes, you can consider the attainable utility maximizers to be alien agents. It wouldn’t make sense for you to give yourself credit for jumping jacks that someone else did!
Another intuition for this is that, all else equal, we generally don’t worry about the time at which the agent is instantiated, even though it’s experiencing a different “subhistory” of time.
My overall position here is that sure, maybe you could view it in the way you described. However, for our purposes, it seems to be more sensible to view it in this manner.
Thinking of it as alien agents does make more sense; I think that basically convinces me that this is not an important point to get hung up about. (Though I still do have residual feelings of weirdness.)
My overall position here is that sure, maybe you could view it in the way you described. However, for our purposes, it seems to be more sensible to view it in this manner.
I think that if you view things the way you seem to want to, then you have to give up on the high-level description of AUP as ‘penalising changes in the agent’s ability to achieve a wide variety of goals’.
You can call that thing ‘utility’, but it doesn’t really correspond to what you would normally think of as extent to which one has achieved a goal. For instance, usually you’d say that “win a game of go that I’m playing online with my friend Rohin” is a task that one should be able to have a utility function over. However, in your schema, I have to put utility functions over context-free observation-action subhistories. Presumably, the utility should be 1 for these subhistories that show a sequence of screens evolving validly to a victory for me, and 0 otherwise.
Now, suppose that at the start of the game, I spend one action to irreversibly change the source of my opponent’s moves from Rohin to GNU Go, a simple bot, while still displaying the player name as “Rohin”. In this case, I have in fact vastly reduced my ability to win a game against Rohin. However, the utility function evaluated on subhistories starting on my next observation won’t be able to tell that I did this, and as far as I can tell the AUP penalty doesn’t notice any change in my ability to achieve this goal.
In general, the utility of a subhistory (if utility functions are going to track goals as we usually mean them) is going to have to depend on the whole history, since the whole history tells you more about the state of the world than the subhistory.
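(A toy sketch of this, with a made-up observation format:)

```python
# The utility function only sees what is on the screen, so a subhistory that
# starts after the opponent swap looks identical whether or not I am really
# playing Rohin. The true opponent is part of the world, not the observation.
def u_win_vs_rohin(subhistory):          # subhistory: list of (observation, action) pairs
    final_obs = subhistory[-1][0]
    return 1.0 if final_obs["winner"] == "me" and final_obs["displayed_opponent"] == "Rohin" else 0.0

game_vs_bot   = [({"displayed_opponent": "Rohin", "winner": "me"}, "noop")]  # opponent secretly swapped to GNU Go
game_vs_rohin = [({"displayed_opponent": "Rohin", "winner": "me"}, "noop")]
print(u_win_vs_rohin(game_vs_bot), u_win_vs_rohin(game_vs_rohin))  # 1.0 1.0 -- indistinguishable
```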
the utility function evaluated on subhistories starting on my next observation won’t be able to tell that I did this, and as far as I can tell the AUP penalty doesn’t notice any change in my ability to achieve this goal.
Your utility presently doesn’t even require a check to see whether you’re playing against the right person. If the utility function actually did require this before dispensing any high utility, we would indeed have the correct difference as a result of this action. In this case, you’re saying that the utility function isn’t verifying in the subhistory, even though it’s not verifying in the default case either (where you don’t swap opponents). This is where the inconsistency comes from.
the whole history tells you more about the state of the world than the subhistory.
What is the “whole history”? We instantiate the main agent at arbitrary times.
Your utility presently doesn’t even require a check to see whether you’re playing against the right person. If the utility function actually did require this before dispensing any high utility, we would indeed have the correct difference as a result of this action. In this case, you’re saying that the utility function isn’t verifying in the subhistory, even though it’s not verifying in the default case either (where you don’t swap opponents).
Say that the utility does depend on whether the username on the screen is “Rohin”, but the initial action makes this an unreliable indicator of whether I’m playing against Rohin. Furthermore, say that the utility function would score the entire observation-action history that the agent observed as low utility. I claim that the argument still goes through. In fact, this seems to be the same thing that Stuart Armstrong is getting at in the first part of this post.
What is the “whole history”?
The whole history is all the observations and actions that the main agent has actually experienced.
So this is actually a separate issue (which I’ve been going back and forth on) involving the t+nth step not being included in the Q calculation. It should be fixed soon, as should this example in particular.
I think that if you view things the way you seem to want to, then you have to give up on the high-level description of AUP as ‘penalising changes in the agent’s ability to achieve a wide variety of goals’.
The goal is “I want to do 5 jumping jacks”. AUP measures the agent’s ability to do 5 jumping jacks.
You seem to be thinking of a utility as being over the actual history of the universe. They’re only over action-observation histories.