Hm, I’d probably disagree. A couple thoughts here:
First: To me, it seems that one important characteristic of “planners” is that they can improve their decisions/behavior even without doing additional learning. For example, if I’m playing chess, there might be some move that (based on my previous learning) initially presents itself as the obvious one to make. But I can sit there and keep running mental simulations of different games I haven’t yet played (“What would happen if I moved that piece there…?”) and arrive at better and better decisions.
It doesn’t seem like that’d be true of a deployed version of AlphaGo without MCTS. If you present it with some board state, it seems like it will just take whatever action (or distribution of actions) is already baked into its policy. There’s not a sense, I think, in which it will keep improving its decision. Unlike in the MCTS case, you can’t tweak some simple parameter and give it more ‘time to think’ and allow it to make a better decision. So that’s one sense in which AlphaGo without MCTS doesn’t seem, to me, like it could exhibit planning.
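To make the contrast concrete, here is a toy sketch (nothing here is AlphaGo itself: the game, numbers, and names are made up, and a flat Monte Carlo average over noisy rollouts stands in for MCTS). The point is just that the search-based decision has a single ‘time to think’ parameter you can turn up, while the frozen policy’s answer is fixed:

```python
# Toy illustration (not AlphaGo): a frozen "policy network" answers in one shot,
# while a search-based agent can be given more "time to think" by raising a
# single simulation-budget parameter.
import random

TRUE_ACTION_VALUES = [0.30, 0.50, 0.45]  # hidden ground truth for 3 actions
PRIOR_POLICY = [0.50, 0.20, 0.30]        # what the frozen net "already believes"

def policy_net_decision():
    # No dial to turn: the answer is whatever is baked into the weights.
    return max(range(3), key=lambda a: PRIOR_POLICY[a])

def noisy_rollout(action):
    # One simulated "game": the true value of the action, plus noise.
    return TRUE_ACTION_VALUES[action] + random.gauss(0, 0.3)

def search_decision(n_simulations):
    # More simulations -> better value estimates -> (usually) a better move.
    totals, counts = [0.0] * 3, [0] * 3
    for i in range(n_simulations):
        a = i % 3                        # spread the budget across the actions
        totals[a] += noisy_rollout(a)
        counts[a] += 1
    return max(range(3), key=lambda a: totals[a] / counts[a])

print("policy net picks:", policy_net_decision())     # always action 0
print("search, budget 3:", search_decision(3))        # noisy guess
print("search, budget 3000:", search_decision(3000))  # reliably action 1
```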
However, second: A version of AlphaGo without explicit MCTS might still qualify as a “planner” on a thinner conception of “planning.” In this case, I suppose the hypothesis would be that, when we do a single forward pass through the network, the network carries out computations roughly equivalent to those involved in (e.g.) MCTS. I suppose that can’t be ruled out, although I’m also not entirely sure how to think about it. One thing we could still say, though, is that insofar as planning processes tend to involve a lot of sequential steps, the number of layers in MCTS-less AlphaGo would seriously limit the amount of ‘planning’ it can do. Eight layers don’t seem like nearly enough for a forward pass to correspond to any meaningful amount of planning.
So my overall view is: For a somewhat strict conception of “planning,” it doesn’t seem like feedforward networks can plan. For a somewhat loose conception of “planning,” it actually is conceivable that a feedforward network could plan — but (I think) only if it had a really huge number of layers. I’m also not sure whether there would be a tendency for the system to start engaging in this kind of “planning” as layer count increases; I haven’t thought enough to have a strong take.[1]
Also, to clarify: I think that the question of whether feedforward networks can plan probably isn’t very practically relevant in and of itself — since they’re going to be less important than other kinds of networks. I’m interested in this question mainly as a way of pulling apart different conceptions of “planning,” noticing ambiguities and disagreements, etc.
For a somewhat loose conception of “planning,” it actually is conceivable that a feedforward network could plan — but (I think) only if it had a really huge number of layers.
Search doesn’t buy you that much, remember. After relatively few nodes, you’ve already gotten much of the benefit from finetuning the value estimates (eg. AlphaZero, or the MuZero appendix). And you can do weight-tying to repeat feedforward layers, or just repeat layers/the model. (Is AlphaFold2 recurrent? Is ALBERT recurrent? A VIN? Or a diffusion model? Or a neural ODE?) This is probably why Jones finds that distilling MCTS runtime search into search-less feedforward parameters comes, empirically, at a favorable exchange rate I wouldn’t call ‘really huge’.
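For what it’s worth, here is a minimal sketch of the weight-tying point (a generic illustration, not ALBERT, AlphaFold2, or any of the models named above; the sizes are arbitrary): a single block of parameters applied a variable number of times, so extra ‘depth’ of sequential computation needn’t mean extra weights.

```python
# Minimal weight-tying sketch: one shared block, unrolled as many times as you
# like at inference. The depth of sequential computation is a runtime dial,
# not a fixed property of the parameter count.
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(16, 16))  # the single shared layer's weights
b = np.zeros(16)

def shared_block(x):
    return np.maximum(0.0, W @ x + b)     # ReLU(Wx + b)

def run(x, n_iterations):
    # The same parameters are reused on every step; n_iterations is the dial.
    for _ in range(n_iterations):
        x = shared_block(x)
    return x

x0 = rng.normal(size=16)
print(run(x0, 2)[:3])   # shallow unroll
print(run(x0, 50)[:3])  # much deeper unroll, with zero extra parameters
```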
OK, thanks. Why is it important that they be able to easily improve their performance without learning?
I agree that eight layers doesn’t seem like enough to do some serious sequential pondering. For comparison, humans take multiple seconds—often minutes—of subjective time to do this, at something like 100 sequential steps per second: on the order of hundreds to tens of thousands of sequential steps, versus eight.
Cool, these comments helped me get more clarity about where Ben is coming from.
Ben, I think the conception of planning I’m working with is closest to your “loose” sense. That is, roughly put, I think of planning as happening when (a) something like simulations are happening, and (b) the output is determined (in the right way) at least partly on the basis of those simulations (this definition isn’t ideal, but hopefully it’s close enough for now). Whereas it sounds like you think of (strict) planning as happening when (a) something like simulations are happening, and (c) the agent’s overall policy ends up different (and better) as a result.
What’s the difference between (b) and (c)? One operationalization could be: if you gave an agent input 1, then let it do its simulations thing and produce an output, then gave it input 1 again, could the agent’s performance improve, on this round, in virtue of the simulation-running that it did on the first round? On my model, this isn’t necessary for planning; whereas on yours, it sounds like it is?
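As a rough sketch of that operationalization (the agent/score interface here is hypothetical, not from any real library), the test might look like:

```python
# Hedged sketch of the test described above. The idea: an agent "Ben-plans" if
# its second answer to the very same input is better purely in virtue of the
# thinking it did on the first encounter, with no new training data in between.
def ben_planning_test(agent, input_1, score):
    first = score(agent.act(input_1))   # the agent may run simulations internally
    second = score(agent.act(input_1))  # same input again, no weight updates
    # In practice you'd average this comparison over many inputs and trials.
    return second > first
```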
Let’s say this is indeed a key distinction. If so, let’s call my version “Joe-planning” and your version “Ben-planning.” My main point re: feedforward neural networks was that they could do Joe-planning in principle, which it sounds like you think is at least conceivable. I agree that it seems tough for shallow feedforward networks to do much Joe-planning in practice. I also grant that when humans plan, they are generally doing Ben-planning in addition to Joe-planning (e.g., they’re generally in a position to do better on a given problem in virtue of having planned about that same problem yesterday).
Seems like key questions re: the connection to AI X-risk include:
(1) Is there reason to think a given type of planning especially dangerous and/or relevant to the overall argument for AI X-risk?
(2) Should we expect that type of planning to be necessary for various types of task performance?
Re: (1), I do think Ben-planning poses dangers that Joe-planning doesn’t. Notably, Ben-planning does indeed allow a system to improve/change its policy “on its own” and without new data, whereas Joe-planning need not — and this seems more likely to yield unexpected behavior. This seems continuous, though, with the fact that a Ben-planning agent is learning/improving its capabilities in general, which I flag separately as an important risk factor.
Another answer to (1), suggested by some of your comments, could appeal to the possibility that agents are more dangerous when you can tweak a single simple parameter like “how much time they have to think” or “search depth” and thereby get better performance (this feels related to Eliezer’s worries about “turning up the intelligence dial” by “running it with larger bounds on the for-loops”). I agree that if you can just “turn up the intelligence dial,” that is quite a bit more worrying than if you can’t — but I think this is fairly orthogonal to the Joe-planning vs. Ben-planning distinction. For example, I think you can have Joe-planning agents where you can increase e.g. their search depth by tweaking a single parameter, and you can have Ben-planning agents where the parameters you’d need to tweak aren’t under your control (or the agent’s control), but rather are buried inside some tangled opaque neural network you don’t understand.
The central reason I’m interested in Joe-planning, though, is that I think the instrumental convergence argument makes the most sense if Joe-planning is involved—e.g., if the agent is running simulations that allow it to notice and respond to incentives to seek power (there are versions of the argument that don’t appeal to Joe-planning, but I like these less—see discussion in footnote 87 here). It’s true that a system can end up power-seeking-ish via non-Joe-planning paths (for example, if in training it developed sphex-ish heuristics that favor power-seeking-ish actions); but when I actually imagine AI systems that end up power-seeking, I imagine it happening because they noticed, in the course of modeling the world in order to achieve their goals, that power-seeking (even in ways humans wouldn’t like) would help.
Can this happen without Ben-planning? I think it can. Suppose, for example, that none of your previous Joe-planning models were power-seeking. Then, you train a new Joe-planner, who can run more sophisticated simulations. On some inputs, this Joe-planner realizes that power-seeking is advantageous, and goes for it (or starts deceiving you, or whatever).
Re: (2), for the reasons discussed in section 3.1, I tend to see Joe-planning as pretty key to lots of task performance — though I acknowledge that my intuitions are surprised by how much it looks like you can do via something more intuitively “sphexish.” And I acknowledge that some of those arguments may apply less to Ben-planning. I do think this is some comfort, since agents that learn via planning are indeed scarier. But I am separately worried that ongoing learning will be very useful/incentivized, too.