And so on. If you want to prove something like “the AI will never copy itself to an external computer”, or “the AI will never communicate outside this trusted channel”, or “the AI will never tamper with this sensor”, or something like that, then your world model might not need to be all that detailed.
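To make that concrete, here is a minimal toy sketch (every name, action, and transition below is hypothetical, invented purely for illustration): for a property like "the AI never communicates outside the trusted channel", the world model only needs to track which channel each action uses, and the property can then be checked by brute-force enumeration over that coarse state space.

```python
# Hypothetical toy example (all names and dynamics invented for illustration).
# Coarse world model: the only state tracked is the channel of the last message.

def step(state, action):
    """Transition function of the coarse world model."""
    if action == "send_trusted":
        return {"last_channel": "trusted"}
    if action == "send_untrusted":
        return {"last_channel": "untrusted"}
    return state  # "idle" and anything else leaves the state unchanged

def violates_spec(state):
    """Safety spec: no message ever leaves on an untrusted channel."""
    return state.get("last_channel") == "untrusted"

def check_policy(allowed_actions, horizon=10):
    """Exhaustively enumerate every state reachable within `horizon` steps
    under a policy given as a function state -> set of allowed actions."""
    frontier = [{"last_channel": None}]
    for _ in range(horizon):
        next_frontier = []
        for state in frontier:
            for action in allowed_actions(state):
                new_state = step(state, action)
                if violates_spec(new_state):
                    return False  # counterexample: the spec can be violated
                next_frontier.append(new_state)
        frontier = next_frontier
    return True

# A policy restricted to the trusted channel passes; one that can use the
# untrusted channel is caught immediately.
print(check_policy(lambda state: {"idle", "send_trusted"}))    # True
print(check_policy(lambda state: {"idle", "send_untrusted"}))  # False
```

Exhaustive enumeration obviously does not scale to rich world models, but it illustrates the point: the level of detail the model needs is set by the spec, not by everything the AI knows.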
I agree that you can improve safety by checking outputs in specific ways in cases where we can do so (e.g. requiring the AI to formally verify its code doesn’t have side channels).
The relevant question is whether there are interesting cases where we can’t currently verify the output with conventional tools (e.g. formal software analysis or bridge design), but we can using a world model we’ve constructed.
One story for why we might be able to do this is that the world model is able to generally predict the world as well as the AI.
Suppose you had a world model which was as smart as GPT-3 (but magically interpretable). Do you think this would be useful for something? IMO no, and besides, we don't need the interpretability, since we can just train GPT-3 to do stuff.
So, it either has to be smarter than GPT-3 or have a wildly different capability profile.
I can't really imagine any plausible stories where this world model doesn't end up having to actually be smart while also being able to massively improve the frontier of what we can check.
I'm not so convinced of this. Yes, for some complex safety properties, the world model will probably have to be very smart. However, this does not mean that you have to model everything: depending on your safety specification and use case, you may be able to factor out a huge amount of complexity. We know from existing cases that this is true on a small scale, so why should it not also be true on a medium or large scale?
For example, with a detailed model of the human body, you may be able to prove whether or not a given chemical could be harmful to ingest. This cannot be done with current tools, because we don't have a detailed computational model of the human body (and even if we did, we would not be able to use it for scalable inference). However, this seems like the kind of thing that could plausibly be created in the not-so-long term using AI tools. And if we had such a model, we could prove many interesting safety properties for, e.g., pharmaceutical development AIs (even if these AIs know many things that are not covered by this world model).
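To illustrate the shape of that argument with a deliberately simplified sketch (every function, number, and threshold below is a placeholder I made up, not a claim about any real body model): if the interpretable body model exposed a dose-response prediction together with a certified error bound, a harmfulness property could be checked against the model alone, independently of whatever else the pharmaceutical AI knows.

```python
# Purely illustrative placeholder (not a real body model or real numbers).

TOXICITY_THRESHOLD = 1.0  # hypothetical safe upper limit on some biomarker

def predicted_response(dose_mg):
    """Stand-in for an interpretable body model: returns the predicted
    biomarker level and a certified error bound for a given dose."""
    prediction = 0.002 * dose_mg        # made-up dose-response curve
    certified_error = 0.0001 * dose_mg  # made-up error bound
    return prediction, certified_error

def dose_is_provably_safe(max_dose_mg, step_mg=1.0):
    """Conservative check on a dose grid: the worst-case prediction
    (prediction + error) must stay below the threshold everywhere.
    A real proof would also need continuity or monotonicity bounds
    between grid points."""
    dose = 0.0
    while dose <= max_dose_mg:
        prediction, error = predicted_response(dose)
        if prediction + error >= TOXICITY_THRESHOLD:
            return False
        dose += step_mg
    return True

print(dose_is_provably_safe(400.0))  # True for these made-up numbers
```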
Suppose you had a world model which was as smart as GPT-3 (but magically interpretable). Do you think this would be useful for something?
I think that would be extremely useful, because it would tell us many things about how to implement cognitive algorithms. But I don't think it would be very useful for proving safety properties (which I assume was your actual question). GPT-3's capabilities are wide but shallow, whereas in most cases, what we would need are capabilities that are narrow but deep.