Well, any Pareto-optimal policy w.r.t. a bunch of utility functions must be Bayesian, or a limit of Bayesian policies. So if your policy requires combining utilities across different histories, it must be Pareto dominated. If true human utility is among the u’s, that seems hard to justify.
That suggests that “utility functions” might be a misnomer for the u’s. Maybe we should think of them as a diverse set of metrics about the world, which we don’t want to irrevocably change, because any large drop in human utility will likely be reflected in the metrics? In that case, can we treat them as one high-dimensional vector and describe the idea geometrically?
I think you’re imagining some weird blend of maximizing a mixture of utilities and minimizing changes in realized utility values, instead of minimizing changes in attainable utility values. This difference is quite fundamental.
Minimizing “change in how many dollars and dogs I have” is quite different from minimizing “change in how many additional dollars and dogs I could get within [a year]”.
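Here’s a toy sketch of the difference in code. The function names, the “attainable value” estimates, and the numbers are all made up for illustration; this isn’t taken from any actual implementation.

```python
# Toy illustration of the distinction above; all names and numbers are hypothetical.

def naive_penalty(u_before: dict, u_after: dict) -> float:
    """Penalize change in the *current* value of each metric
    (how many dollars and dogs I have right now)."""
    return sum(abs(u_after[k] - u_before[k]) for k in u_before)

def attainable_penalty(q_before: dict, q_after: dict) -> float:
    """Penalize change in *attainable* value: how much of each metric
    I could still get within some horizon (say, a year), as estimated
    by some value function for that metric."""
    return sum(abs(q_after[k] - q_before[k]) for k in q_before)

# An action that leaves my current holdings untouched but destroys
# most of what I could have attained later:
u_before = {"dollars": 100.0, "dogs": 1.0}
u_after  = {"dollars": 100.0, "dogs": 1.0}    # nothing has changed yet
q_before = {"dollars": 5000.0, "dogs": 3.0}   # attainable within a year
q_after  = {"dollars": 200.0,  "dogs": 1.0}   # options mostly foreclosed

print(naive_penalty(u_before, u_after))       # 0.0    -> looks harmless
print(attainable_penalty(q_before, q_after))  # 4802.0 -> large impact
```

The second kind of penalty notices the lost options even though nothing I currently have has changed.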
I’m still trying to get my head around this. Here’s another possibly dumb question: I have the world’s first strong AI on my laptop, and ask it to download the latest Deadpool movie for me. Unfortunately the first step of that plan requires connecting to the internet, which is also the first step of taking over the world. Will that stop the AI from doing what I ask?
Depends on N. You’re correct that connecting to the internet is instrumentally convergent, but it might be necessary for the task. We can (according to my mental model) N-increment until we get satisfactory performance, stopping well before we reach the “manufacture lots of computers all downloading the Deadpool movie” level of impact. The reason I’m somewhat confident that there is a clear, relatively wide gap between these two levels is the existence and severity of approval incentives.
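For concreteness, here’s roughly what I mean by N-incrementing, as a sketch. I’m assuming N just scales how much penalized impact the agent may trade for reward; run_agent and is_satisfactory are hypothetical stand-ins, not part of any actual AUP codebase.

```python
# Sketch of the N-incrementing procedure described above (hypothetical names).

def tune_impact_budget(run_agent, is_satisfactory, n_start=1, n_max=100):
    """Start with a very restrictive impact budget and loosen it one
    step at a time, stopping at the first N where the agent actually
    completes the task (e.g. downloads the movie) to our satisfaction."""
    for n in range(n_start, n_max + 1):
        outcome = run_agent(impact_budget=n)   # agent acts under budget N
        if is_satisfactory(outcome):
            return n, outcome                  # smallest N that works
    raise RuntimeError("no satisfactory behavior within the tested budgets")
```

The hope is that this smallest satisfactory N sits well below any N that would permit the takeover-style plan.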
I see, so the AI will avoid prefixes of high-impact plans. Can we make it avoid only the high-impact plans themselves?
I don’t see how, if we also want it to be shutdown-safe. After all, its model of us could be incorrect, so we might (to its surprise) want to shut it down, and its interrupted plans shouldn’t then have predictably higher impact than intended. To me, the prefix method seems more desirable in that respect.
What’s the high impact if we shut down the AI while it’s downloading the movie?
There isn’t any in that case; however, from Daniel’s comment (which he was using to make a somewhat different point):
AUP thinks very differently about building a nuclear reactor and then adding safety features than it does about building the safety features and then the dangerous bits of the nuclear reactor
I find this reassuring. If we didn’t have this, we would admit plans which are low impact only if they aren’t interrupted.
Is it possible to draw a boundary between Daniel’s case and mine?
I don’t see why that’s necessary, since we’re still able to do both plans?
Looking at it from another angle, agents which avoid freely putting themselves (even temporarily) in instrumentally convergent positions seem safer with respect to unexpected failures, so it might also be desirable in this case even though it isn’t objectively impactful in the classical sense.
I’m just trying to figure out if things could be neater. Many low-impact plans accidentally share prefixes with high-impact plans, and it feels weird if many of our orders semi-randomly require tweaking N.
That’s a good point, and I definitely welcome further thought along these lines. I’ll think about it more as well!