DaemonicSigil comments on Do the Safety Properties of Powerful AI Systems Need to be Adversarially Robust? Why?

DaemonicSigil 14 Mar 2023 6:16 UTC
Comments on ThermodynamicBot

> IMO, not only is “plug every possible h into U(h)” extremely computationally infeasible

To be clear, I’m not saying ThermodynamicBot does the computation the slow exponential way. I already explained how it could be done in polynomial time, at least for a world model that looks like a tree-structured factor graph. Call this ThermodynamicBot-F. You could also imagine the role of “world model” being filled by a neural network (blob of weights) that approximates the full thermodynamic computation. We can call this ThermodynamicBot-N.
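A minimal sketch of the polynomial-time point, under assumptions not taken from the discussion itself: the world model is a chain (the simplest tree) of discrete variables, and the quantity of interest is a sum over all assignments of a product of local factors. The factor tables `UNARY` and `PAIRWISE` and both function names are hypothetical, made up purely for illustration. The brute-force sum visits k**n assignments, while forward message passing gets the same number in O(n * k**2) work.

```python
import itertools
import math

def brute_force_sum(unary, pairwise):
    """Sum the product of all factors over every assignment (k**n terms)."""
    n, k = len(unary), len(unary[0])
    total = 0.0
    for x in itertools.product(range(k), repeat=n):
        weight = 1.0
        for i in range(n):
            weight *= unary[i][x[i]]
        for i in range(n - 1):
            weight *= pairwise[i][x[i]][x[i + 1]]
        total += weight
    return total

def message_passing_sum(unary, pairwise):
    """Same sum via forward messages along the chain (O(n * k**2) work)."""
    n, k = len(unary), len(unary[0])
    msg = list(unary[0])  # msg[a]: total weight of prefixes ending in state a
    for i in range(1, n):
        msg = [
            unary[i][b] * sum(msg[a] * pairwise[i - 1][a][b] for a in range(k))
            for b in range(k)
        ]
    return sum(msg)

# Hypothetical toy factors: 6 variables with 3 states each.
N, K = 6, 3
UNARY = [[1.0 + 0.5 * ((i + a) % 3) for a in range(K)] for i in range(N)]
PAIRWISE = [
    [[1.0 + 0.25 * ((a * b + i) % 4) for b in range(K)] for a in range(K)]
    for i in range(N - 1)
]

slow = brute_force_sum(UNARY, PAIRWISE)
fast = message_passing_sum(UNARY, PAIRWISE)
print(slow, fast, math.isclose(slow, fast, rel_tol=1e-9))
```

The two results agree up to floating-point rounding, which is the sense in which the efficient computation is not an approximation of the exhaustive one: nothing has been removed from the search space.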

Yes, I understand that running a search that will kill you if it succeeds is dumb. This has been known for many years. The question is how do we actually write a program to do a sane search? You quote TurnTrout:

> It would be far wiser to not consider all possible plans, and instead close off large parts of the search space. You can consider what plans to think about next, and how long to think, and so on. And then you aren’t argmaxing. You’re using resources effectively.

I don’t find this particularly helpful. If we knew which plans were adversarial, so that we could eliminate them from the search space, we’d already be halfway to solving alignment. I don’t think the plans a bounded agent eliminates in order to finish its thinking on time are automatically going to be the adversarial ones. I think this is a problem that is going to take actual effort.

In particular for ThermodynamicBot:
- Case where the world model is implemented in a factor graph (ThermodynamicBot-F): This gives exactly the same result as searching across all inputs, but the computation is efficient, and not really wasteful in any sense. If we imagine trying to “improve” the belief propagation algorithm to simultaneously make it more efficient and also remove some subset of plans it’s searching over that are “adversarial”, I can’t really imagine a way to do that, and it would certainly make the algorithm more complicated and less elegant.
- Case where a neural network world model is being used (ThermodynamicBot-N): In this case there are likely plans that will be missed by ThermodynamicBot-N because of the bounded nature of its world model, even though they would be found by searching across all inputs. But if we imagine training the world model to make it better, I would generally expect this to increase the world model’s ability to find adversarial plans just like it increases its ability to find good plans. In general, I don’t expect there to be any correlation where all the adversarial plans happen to be eliminated due to bounded reasoning. Why should we be so lucky that all the errors we’re making happen to cancel each other out?

> I think the sorts of planning methods that try to approximate in the real world the behavior of “think about all possible plans and pick a good one” are unworkable in the limit, not just from an alignment standpoint but also from a practical capability standpoint, so I don’t expect us to build competent agents that use them, so I don’t worry about them or their attendant need for adversarial robustness.

I agree if we’re literally talking about brute force search here. If we’re talking about the more realistic ThermodynamicBot designs I’ve mentioned, then I’m not sure I agree. In some sense, all methods an agent could use to plan are “picking plans from plan-space that are better than most other plans”. Even ActorCriticBot is “trying” to approximate argmax. If we could train it to minimal loss, it would be an ArgMaxBot. Is there some particular approximation or heuristic that we can adopt, where if we do adopt it we go from dangerously approaching ArgMaxBot to safely searching through only good plans? An approximation used by ActorCriticBot, but not by ThermodynamicBot-N? If so, I have no idea what the crucial approximation is that you could be thinking of.

I also don’t think it’s at all obvious that ThermodynamicBot designs are necessarily capability-limited. It makes a lot of sense to integrate planning very closely with the world model. Might be worth betting on the direction of future RL research here if we can set sufficiently objective resolution criteria? In any case, I do think this counts as some progress in this discussion, since we’ve found an example of an agent that we both agree your argument doesn’t apply to.

Comments on PolicyGradientBot vs ActorCriticBot

In my view, there’s kind of a huge gulf between PolicyGradientBot and ActorCriticBot, where the gradients flowing backwards into ActorCriticBot’s actor end up carrying a lot of information. This allows for much better performance, and in particular much better sample efficiency, at the cost that some of the information is about weaknesses in ActorCriticBot’s critic.

To take a particular example, if the critic overvalues blue diamonds, then gradients flowing into the actor are going to be steeper for actions that obtain blue diamonds. Then in a new environment where there’s a bucket of blue paint sitting in the corner, it seems reasonable to expect that the actor might try to use that bucket to paint diamonds blue, at least assuming it’s sufficiently intelligent and flexible.
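A toy sketch of this failure mode, under loudly stated assumptions: the actor is a softmax policy over two hypothetical actions, it is trained by exact gradient ascent on expected value, and the numbers in `TRUE_VALUE` and `CRITIC_VALUE` are invented for illustration. It does not model generalization to a new environment; it only shows that an actor trained against the critic ends up preferring whatever the critic overvalues.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def train_actor(values, steps=2000, lr=0.5):
    """Gradient ascent on sum_a pi(a) * values[a] for a softmax policy."""
    logits = [0.0] * len(values)
    for _ in range(steps):
        probs = softmax(logits)
        baseline = sum(p * v for p, v in zip(probs, values))
        # Exact gradient w.r.t. logit a is pi(a) * (values[a] - baseline).
        logits = [
            l + lr * p * (v - baseline)
            for l, p, v in zip(logits, probs, values)
        ]
    return softmax(logits)

# Hypothetical actions and values; the critic overvalues anything blue.
ACTIONS = ["collect_white_diamond", "paint_diamond_blue"]
TRUE_VALUE = [1.0, 0.5]
CRITIC_VALUE = [1.0, 2.0]

policy_from_critic = train_actor(CRITIC_VALUE)
policy_from_truth = train_actor(TRUE_VALUE)

for name, pc, pt in zip(ACTIONS, policy_from_critic, policy_from_truth):
    print(f"{name}: trained on critic {pc:.3f}, trained on true value {pt:.3f}")
```

The actor trained on the critic's numbers puts nearly all of its probability on `paint_diamond_blue`, while the same update rule fed the true values settles on `collect_white_diamond`. The only difference between the two runs is the critic's error.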

For PolicyGradientBot, on the other hand, while it could still result in alignment failures, it seems much more like we’re just directly training a policy. But PolicyGradientBot is very sample-inefficient.
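One way to make the sample-efficiency contrast concrete is sketched below, assuming a softmax policy over three actions with deterministic, made-up rewards. This is only one reading of the two bots, and real actor-critic algorithms differ in their details. A one-sample REINFORCE-style estimate sees the reward of a single sampled action, so it is noisy. An actor that can query a critic for every action gets the full expected gradient in one step, with the critic's values standing in for the true rewards.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def score(action, probs):
    """Gradient of log pi(action) w.r.t. the logits of a softmax policy."""
    return [(1.0 if i == action else 0.0) - p for i, p in enumerate(probs)]

def one_sample_estimate(action, probs, rewards):
    """REINFORCE-style estimate from a single sampled action."""
    return [rewards[action] * s for s in score(action, probs)]

def expected_gradient(probs, values):
    """Gradient of sum_a pi(a) * values[a], using a value for every action."""
    grad = [0.0] * len(probs)
    for a, p in enumerate(probs):
        for i, s in enumerate(score(a, probs)):
            grad[i] += p * values[a] * s
    return grad

def estimator_variance(probs, rewards):
    """Total variance of the one-sample estimate, enumerated exactly."""
    mean = expected_gradient(probs, rewards)
    var = 0.0
    for a, p in enumerate(probs):
        est = one_sample_estimate(a, probs, rewards)
        var += p * sum((g - m) ** 2 for g, m in zip(est, mean))
    return var

PROBS = softmax([0.0, 0.0, 0.0])   # uniform starting policy
REWARDS = [1.0, 0.0, 0.5]          # hypothetical true rewards
CRITIC = [1.0, 0.0, 0.5]           # a critic that happens to be accurate here

full_grad = expected_gradient(PROBS, CRITIC)
noise = estimator_variance(PROBS, REWARDS)
print("expected gradient:", full_grad)
print("variance of one-sample estimate:", noise)
```

The one-sample estimate is right on average but has nonzero variance, so many samples are needed to recover what the critic-based gradient delivers at once. The price is the one described above: any error in `CRITIC` flows into the actor's update with no sampling noise to dilute it.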

WRT other algorithms like temporal difference learning that lie kind of in between PolicyGradientBot and ActorCriticBot, I think the question of what happens for ActorCriticBot is already a crux in this discussion, but feel free to add more bot types if you think it would be useful.

Is ActorCriticBot robust?

> the actor in actor-critic doesn’t make its decisions by running an internal CriticEstimator(plan) and doing whatever evaluates best

Again, I’m not saying a brute force search over plans is being done here, but I’d generally expect that what the actor is doing is very strongly linked to what the critic values, and I’d say it’s very likely that the actor has lots of components inside of it roughly related to the question “what is the critic going to think about this situation?” For example, if the critic consistently overvalues blue, then I’d predict that the actor has lots of circuits inside of it related to blueness. Do you disagree with this?

Obviously the actor’s ideas of what’s good aren’t going to be perfectly faithful to the critic: there will exist some adversarial plans that the actor just isn’t going to generate. But again the question is: why should we be so lucky that the errors we’re making exactly cancel out? I don’t see any reason to expect that the actor’s imperfect approximation of the critic and the critic’s imperfect approximation of our true desires should cancel out so well that the actor never generates any adversarial plans at all.
