In practice, I think you are unlikely to end up with a schemer unless you train your model to solve some agentic task (or train it to model a system that may itself be a schemer, such as a human). However, in order to guarantee that, I agree we need some additional property (such as interpretability, or some learning-theoretic guarantee).
(I think most of the hard-to-handle risk from scheming comes from cases where we can’t easily make smarter AIs which we know aren’t schemers. If we can get another copy of the AI which is just as smart but which has been “de-agentified”, then I don’t think scheming poses a substantial threat. (Because e.g. we can just deploy this second model as a monitor for the first.) My guess is that a “world-model” vs “agent” distinction isn’t going to be very real in practice. (And in order to make an AI good at reasoning about the world, it will need to actively be an agent in the same way that your reasoning is agentic.) Of course, there are risks other than scheming.)
Worth noting here that I'm mostly talking about Bengio's proposal wrt the Bayes-related arguments.
And I agree that the world model isn’t meant to be a schemer, but it’s not as though we can guarantee that without some additional property...
(Such as ensuring that the world model is interpretable.)