A fair objection.
I had a quick search online and also flicked through Boyd’s Convex Optimization, and didn’t find Stuart Russell’s claim expounded on. Would you be able to point me in a direction to look further into this?
Nevertheless, let me try to provide more detailed reasoning for my counterclaim. I assume that Russell’s claim is indeed true in the classical optimisation setting, where there is a function f: R^N → R together with some inequality constraints on a subset of the components of x.
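To make sure we’re picturing the same setting, here’s a toy instance of it (entirely my own construction, not anything taken from Russell or Boyd): maximising a linear utility over a box with one extra inequality constraint, where the optimiser duly pushes a coordinate all the way to its extreme permitted value.

```python
# Toy illustration (my construction): in the classical setting, maximising a
# linear utility over a constrained region lands on a vertex of the feasible
# set, i.e. some coordinates get pinned at their extreme allowed values.
import numpy as np
from scipy.optimize import linprog

c = np.array([-1.0, -1.0])        # maximise x1 + x2 (linprog minimises, so negate)
A_ub = np.array([[1.0, 2.0]])     # constraint: x1 + 2*x2 <= 4
b_ub = np.array([4.0])
bounds = [(0, 3), (0, 3)]         # 0 <= x1, x2 <= 3

res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
print(res.x)   # -> [3.  0.5]: x1 is driven all the way to its upper bound
```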
However, I argue that this is not a good model for maximising a utility function in the real world.
First of all, it is not necessarily possible to search freely over x, as x corresponds to an environmental state. All the classical optimisation techniques I know of assume that you may set x to any value regardless of the history of values x has taken. This is not the case in the real world: there are many environmental states which are not accessible from other environmental states. For example, if Earth were swallowed by a black hole, we could never again restore the environment of me typing out this response to you on LW.
In effect, what I’m describing is the difference between optimising in an RL setting and in the classical setting. And whilst I can believe some result on extremal values exists in the classical setting, I’d be very surprised indeed if something similar exists in the RL setting, particularly when the transition matrices are unknown to the agent, i.e. it does not have a perfect model of the environment already.
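To pin down the kind of irreversibility I mean, here’s a minimal toy environment (all states, actions and transitions are invented by me purely for illustration):

```python
# Minimal sketch of the irreversibility I have in mind. The transition table is
# hidden from the agent; it only discovers that 'black_hole' is absorbing by
# entering it, at which point the rest of the state space is gone for good.
import random

TRANSITIONS = {            # (state, action) -> next state; unknown to the agent
    ("earth", "stay"):        "earth",
    ("earth", "experiment"):  "lab",
    ("lab", "stay"):          "earth",
    ("lab", "risky_probe"):   "black_hole",
    ("black_hole", "stay"):       "black_hole",   # absorbing: no way back out
    ("black_hole", "experiment"): "black_hole",
}

def rollout(policy, start="earth", steps=10):
    """Run a policy against the (unknown-to-the-agent) environment."""
    state, visited = start, [start]
    for _ in range(steps):
        action = policy(state)
        state = TRANSITIONS.get((state, action), state)  # undefined actions do nothing
        visited.append(state)
    return visited

random_policy = lambda s: random.choice(["stay", "experiment", "risky_probe"])
print(rollout(random_policy))   # once 'black_hole' appears, nothing else ever does
```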
So I’ve laid out my scepticism about the extremal-values claim in RL, but is there any reason to believe my counterclaim that RL optimisation naturally leads to non-extremal choices? Here I think I’ll have to be handwavy and gesture-y again, for now (afaik, no literature exists on this topic or on what I’m about to say, but please do inform me if that’s not the case).
Any optimisation process requires evaluating f(x) for different values of x. In order to be able to evaluate f(x), the agent has two distinct choices:
1. It can try setting the environment state to x directly; or
2. It can build a model f* of f, and evaluate f*(x) as its estimate.

(Roughly, these correspond to model-free and model-based RL respectively; a toy sketch of the two routes follows below.)
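Here is that sketch, with a stand-in f and a simple polynomial surrogate for f* (my own illustrative choices, not a claim about how either approach is actually implemented):

```python
# Sketch of the two evaluation routes. f stands in for the true utility over
# environment states; in reality the agent cannot call it for free, since
# calling it means actually putting the environment into state x.
import numpy as np

def f(x):                      # hidden "true" utility of environment state x
    return -(x - 2.0) ** 2 + 3.0

# Route 1: evaluate f directly, i.e. physically set the environment to each x.
candidates = np.linspace(-5, 5, 11)
direct_values = [f(x) for x in candidates]          # every call is a real-world intervention

# Route 2: spend a few careful evaluations building a model f*, then query f* freely.
probe_points = np.array([0.0, 1.0, 3.0])            # the only states actually visited
observations = f(probe_points)
coeffs = np.polyfit(probe_points, observations, deg=2)   # quadratic surrogate f*
f_star = np.poly1d(coeffs)

surrogate_values = f_star(candidates)               # cheap, purely internal queries
print(candidates[np.argmax(surrogate_values)])      # estimated argmax without visiting it
```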
Utilising option 1 is likely to be highly suboptimal for finding the global optimum if the environment is highly ‘irreversible’, i.e. there are many states x such that, once you enter them, you are closed off from a large remaining subspace of X. Better is to build the model f* as ‘safely’ as possible: with few evaluations, and with reasonable confidence that those evaluations keep your future choices of x as open as possible. I think this is ‘obvious’ in a worst-case analysis over possible functions f, but it also feels true in the average case under some kind of uniform prior over f.
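To make the ‘build f* safely’ idea slightly less handwavy, here is a very rough sketch of the heuristic I have in mind; the reversibility estimate is a pure placeholder of my own invention, not anything from the literature.

```python
# Rough sketch: only spend a real evaluation on states whose estimated
# reversibility is high, so that probing never closes off a large chunk of
# the state space.
import numpy as np

def estimated_reversibility(x):
    # Placeholder for whatever model the agent has of how hard it is to get
    # back out of state x. Here, extreme |x| is treated as harder to undo.
    return np.exp(-abs(x) / 3.0)

def choose_probes(candidates, budget, threshold=0.5):
    """Pick up to `budget` probe states, skipping ones judged too irreversible."""
    safe = [x for x in candidates if estimated_reversibility(x) >= threshold]
    # Spread the budget across the safe region rather than chasing extremes.
    idx = np.linspace(0, len(safe) - 1, num=min(budget, len(safe))).astype(int)
    return [safe[i] for i in idx]

candidates = np.linspace(-10, 10, 41)
print(choose_probes(candidates, budget=5))   # probes cluster away from extreme values
```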
And now for the most handwavy part: I suspect that, for most elements of the state vector x representing the universe, setting them to extreme values is far more likely to be irreversible than setting them to non-extremal values. But really, this is a bit of a red herring relative to the headline point: extremal or not, I think a sufficiently intelligent agent will be reticent to enter states it is not sure it can reverse back out of, and that, for me, is ‘cautious’ behaviour.
A meta-related comment from someone who’s not deep into alignment (yet) but does work in AI/academia.
My impression on reading LessWrong has been that the people who are deep into alignment research are generally spending a great deal of their time working on their own independent research agendas, which—naturally—they feel are the most fruitful paths to take for alignment.
I’m glad that we seem to be seeing a few more posts of this nature recently (e.g. with Infra-Bayes, etc.) where established researchers spend more of their time both investigating and critiquing others’ approaches. This is one good way to get alignment researchers to stack more, imo.