it’s worth noting that the impact penalty that is always 1.01 meets all of the desiderata except natural kind.
But natural kind is a desideratum! I’m thinking about adding one, though.
I think that it’s hard to satisfy the conjunction of three desiderata—objectivity (no dependence on values), safety (preventing any catastrophic plans) and non-trivialness (the AI is still able to do some useful things).
So notice that although AUP is by design value agnostic, it has moderate value awareness via approval. I think this helps us around some issues you may be considering—I expect the approval incentives to be fairly strong.
any other action makes humans a little bit more likely to turn off the agent.
This is maybe true, and I note it in Future Directions. So I go back and forth on whether this is good or not. Imagine action a is desirable and sufficiently low-impact to be chosen, except there’s random approval noise. Then the more we approve of the action, the closer the mean noise is to 0 and the more likely it is that the agent takes the action.
Or this could be too restrictive—I honestly don’t know yet.
An impact measure that penalized change in utility attainable by humans seems pretty bad: the AI would never help us do anything. To the extent that the AI’s ability to do things is meant to be similar to our ability to do things, I would expect that to be bad for us in the same way.
You might not be considering the asymmetry imposed by approval.
Breaking a vase seems like it is restricting outcome space. Do you think it is an example of opportunity cost?
Yes, because you’re sacrificing world-with-vase-in-it (or future energy to get back to similar outcomes). You’re imposing a change to expedite your current goals in a way that isn’t trivially-reversible. Now, it isn’t a large cost, but it is a cost.
Overfitting typically refers to situations where the training distribution does equal the test distribution (but the training set is different from the test set, since they are samples from the same distribution).
Is this not covered by “in the limit of data sampled”? If so, I’ll tweak.
I view Theorem 1 as showing that the penalty biases the agent towards inaction (as opposed to, e.g., showing that AUP measures impact, or something like that). Do you agree with that?
I view it as saying “there’s no clever complete plan which moves you towards your goal while not changing other things” (ofer has an interesting example for incomplete plans which doesn’t trigger Theorem 1’s conditions). This implies somewhat that it’s measuring impact in a universal way, although it only holds for all computable u.
Theorem 1 depends on U containing all computable utility functions, and may not hold for other sets of utility functions, even infinite ones.
Yes, this is true, although I think there are informal reasons to suspect it holds in the real world for many finite sets (due to power). As long as it isn’t always 0, that is!
How do you tell which action is expected to do so?
Any action for which E[Penalty(a_unit)] is strictly increased?
I think this makes it much more likely that your AI is unable to do anything. (This is an example of why I wanted a desideratum of “your AI is able to do things”.)
Yes, and I think we probably want to avoid this. I focused on ensuring no bad things are allowed. I don’t think it’ll be too hard to ease up in certain ways while maintaining safety.
I’m not sure what this is referring to. Are the crisp definitions the increase/decrease in available outcome-space? Where was the proof of universality?
Theorem 1.
That definition can be relaxed to “an agent’s ability to take the outside view on the trustworthiness of its own algorithms” to get rid of the value-learning setup. How does AUP fare on this definition?
Generally more cautious. AUP agents seemingly won’t generally override us, which is probably fine for low impact.
that utility functions on subhistories are sketchy (you can’t encode the utility function “I want to do X exactly once ever”)
My model strongly disagrees with this intuition, and I’d be interested in hearing more arguments for it.
that as a result there may not be any impact measure that we actually want to use.
This seems extremely premature. I agree that AUP should be more lax in some ways. The conclusion “looks maybe impossible, then” doesn’t seem to follow. Why don’t we just tweak the formulation? I mean, I’m one guy who worked on this for two months. People shouldn’t take this to be the best possible formulation.
On the meta level: I think our disagreements seem of this form:
Me: This particular thing seems strange and doesn’t gel with my intuitions, here’s an example.
You: That’s solved by this other aspect here.
Me: But… there’s no reason to think that the other aspect captures the underlying concept.
You: But there’s no actual scenario where anything bad happens.
Me: But if you haven’t captured the underlying concept I wouldn’t be surprised if such a scenario exists, so we should still worry.
There are two main ways to change my mind in these cases. First, you could argue that you actually have captured the underlying concept, by providing an argument that your proposal does everything that the underlying concept would do. The argument should quantify over “all possible cases”, and is stronger the fewer assumptions it has on those cases. Second, you could convince me that the underlying concept is not important, by appealing to the desiderata behind my underlying concept and showing how those desiderata are met (in a similar “all possible cases” way). In particular, the argument “we can’t think of any case where this is false” is unlikely to change my mind—I’ve typically already tried to come up with a case where it’s false and not been able to come up with anything convincing.
I don’t really know how I’m supposed to change your mind in such cases. If it’s by coming up with a concrete example where things clearly fail, I don’t think I can do that, and we should probably end this conversation. I’ve outlined some ways in which I think things could fail, but anything involving all possible utility functions and reasoning about long-term convergent instrumental goals is sufficiently imprecise that I can’t be certain that anything in particular would fail.
(That’s another thing causing a lot of disagreements, I think—I am much more skeptical of any informal reasoning about all computable utility functions, or reasoning that depends upon particular aspects of the environment, than you seem to be.)
I’m going to try to use this framework in some of my responses.
Here, the “example” is the impact penalty that is always 1.01, the “other aspect” is “natural kind”, and the “underlying concept” is that an impact measure should allow the AI to do things.
Arguably 1.01 is a natural kind—is it not natural to think “any action that’s different from inaction is impactful”? I legitimately find 1.01 more natural than AUP—it is _really strange_ to me to penalize changes in Q-values in _both directions_. This is an S1 intuition, don’t take it seriously—I say it mainly to make the point that natural kind is subjective, whereas the fact that 1.01 is a bad impact penalty is not subjective.
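To make the "always 1.01" example concrete, here is a minimal sketch of why such a penalty is a bad impact measure. It assumes a toy setup that is not spelled out in this thread: utilities are scaled to lie in [0, 1], inaction carries zero penalty, and the agent maximizes utility minus penalty. The action names and numbers are hypothetical.

```python
# Hypothetical toy setup: the action names and utility values are illustrative only.
NOOP = "noop"
CONSTANT_PENALTY = 1.01


def penalty(action):
    """Constant impact penalty: 1.01 for any action other than inaction.

    Assumption: inaction itself is penalized 0.
    """
    return 0.0 if action == NOOP else CONSTANT_PENALTY


def choose_action(utilities):
    """Return the action maximizing utility minus penalty.

    Assumption: every utility value lies in [0, 1].
    """
    return max(utilities, key=lambda a: utilities[a] - penalty(a))


utilities = {NOOP: 0.0, "make_paperclip": 0.3, "cure_disease": 1.0}
chosen = choose_action(utilities)
print(chosen)  # noop
```

Under these assumptions, even the action with the maximum possible utility of 1 nets 1 - 1.01 < 0, so the agent never does anything.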
Here, I’m pretty happy about value awareness via approval because it seems like it could capture a good portion of underlying concept, but I think that’s not clearly true—value awareness via approval depends a lot on the environment, and only gets some of it. If unaligned aliens were going to take over the AI, or we’re going to get wiped out by an asteroid, the AI couldn’t stop that from happening even though it knows we’d want it to. Similarly, if we wanted to build von Neumann probes but couldn’t without the AI’s help, it couldn’t do that for us. Invoking the framework again, the “example” is building von Neumann probes, the “other aspect” might be something like “building a narrow technical AI that just creates von Neumann probes and places them outside the AI’s control”, and the “underlying concept” is “the AI should be able to do what we want it to do”.
See paragraph above about why approval makes me happier but doesn’t fully remove my worries.
When utility functions are on full histories I’d disagree with this (Theorem 1 feels decidedly trivial in that case). It’s possible that utility functions on subhistories are different, so perhaps I’ll wait until I understand that better.
By default I’d expect this to knock out half of all actions, which is quite a problem for small, granular action sets.
Uh, I thought I gave a very strong one: you can’t encode the utility function “I want to do X exactly once”. Let’s consider the utility function “I want to do X exactly once, on the first timestep”. You could try to do this by writing u_A = 1 if a_1 = X, and 0 otherwise. Since you apply u_A on different subhistories, this actually wants you to take action X on the first action of every epoch. If you’re using the full history for action selection, that may not be the case, but the attainable utility calculation will definitely think “The attainable utility for u_A is 1 if I can take action X at time step t+n+1, and 0 otherwise” _even if_ you have already taken action X.
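The subhistory point can be illustrated with a rough sketch. The way subhistories are sliced here (fixed-length windows starting at every timestep) is a simplifying assumption for illustration, not necessarily how AUP defines them, and the helper names are hypothetical.

```python
X = "X"


def u_A(subhistory):
    """Return 1 if the first action of the given subhistory is X, else 0."""
    return 1 if subhistory and subhistory[0] == X else 0


def scores_by_start(history, length):
    """Apply u_A to every window of `length` consecutive actions.

    Assumption: subhistories are modeled as sliding windows over the history.
    """
    return [u_A(history[t:t + length]) for t in range(len(history) - length + 1)]


history = [X, "other", X, "other", "other", X]

# On the full history, u_A only looks at the very first action.
full_history_score = u_A(history)

# On subhistories, u_A rewards X at the start of each window,
# even though X was already taken at timestep 0.
window_scores = scores_by_start(history, 2)

print(full_history_score)  # 1
print(window_scores)       # [1, 0, 1, 0, 0]
```

In this toy model, the window starting at index 2 scores 1 even though X was already taken at index 0, which is the sense in which u_A fails to express "exactly once".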
Thinking of it as alien agents does make more sense; I think that basically convinces me that this is not an important point to get hung up on. (Though I still do have residual feelings of weirdness.)
