Is the field advanced enough that it would be feasible to have a guaranteed no-zero-day evaluation and deployment codebase that is competitive with a regular codebase?
As far as I know (I'm not an expert), such absolute guarantees are out of reach right now, especially if the AI you're trying to verify is arbitrarily complex. However, the training process ought to yield an AI with specific properties. I'm not entirely sure what you meant by "a guaranteed no-zero-day evaluation and deployment codebase". Would you mind explaining more?
“Or is the claim that it’s feasible to build a conservative world model that tells you “maybe a zero-day” very quickly once you start doing things not explicitly within a dumb world model?”
I think that's closer to the idea: during training, you reject and penalize as soon as the AI tries something that might be "exploiting a zero-day", in the sense that the world-model can't rule out this possibility with high confidence[1]. That way, the training process is expected to reward simpler, more easily verified actions.
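To make the reject-and-penalize idea concrete, here is a minimal toy sketch (all names and numbers are hypothetical, not from any real system): a conservative world model returns an upper bound on the probability that a proposed action exploits a vulnerability, and any action the model cannot clear with high confidence is blocked and given a negative reward, so optimization pressure favors actions that are easy to verify.

```python
def exploit_probability_bound(action: str) -> float:
    """Stand-in for a conservative world model: returns an upper bound on
    P(action exploits a zero-day). A real model would reason about the
    environment; this toy version just treats opaque actions as hard to
    rule out."""
    return 0.9 if "obfuscated" in action else 0.01

def shaped_reward(action: str, task_reward: float,
                  threshold: float = 0.05, penalty: float = -10.0) -> float:
    """Reject-and-penalize during training: if the bound can't rule out
    exploitation, the action never reaches the environment and the agent
    receives a penalty instead of the task reward."""
    if exploit_probability_bound(action) > threshold:
        return penalty  # rejected: the world model couldn't rule it out
    return task_reward

print(shaped_reward("submit obfuscated patch", task_reward=1.0))  # -10.0
print(shaped_reward("submit documented patch", task_reward=1.0))  # 1.0
```

The key property is that the filter is conservative: it penalizes anything it *cannot clear*, not just things it positively identifies as exploits.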
Then, a key question is "what else do you want from your AI?": of course, it is supposed to perform critical tasks, not just "let you see what program is running"[2], so there is tension between the various specifications you enter. The question of how far you can actually go, how much you can actually ask for, is both crucial and wide open, as far as I can tell.
Some of the uncertainty lies in how accurate and how conservative the world-model is; you won't get a "100% guarantee" anyway, especially since you're only aiming for probabilistic bounds within the model. Otherwise, a sponge would do.
I was thinking that the formal guarantees would be about state evaluations (i.e. state → badness bounds), which would require something like "showing there is no zero-day" (since "a codebase with a zero-day" might be catastrophically bad if no constraints are put on actions). Thanks for pointing out that they can be about actions (i.e. (state, action) → badness bounds), which seems intuitively easier to get good bounds for: you don't need to show there are no zero-days, just that the currently considered action is extremely unlikely to exploit a potential zero-day.
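The difference between the two kinds of bounds can be sketched in a few lines (a toy illustration under invented numbers, not a real verifier): a state-level bound has to cover the worst case over *every* action available from the state, so a single unverifiable exploit path ruins it, while a (state, action)-level bound only has to cover the one action under review.

```python
# Hypothetical per-action badness upper bounds from some conservative
# world model, for actions available in the current state.
ACTION_BADNESS_BOUNDS = {
    "read_logs": 0.01,
    "deploy_reviewed_patch": 0.02,
    "run_opaque_binary": 0.95,  # can't be ruled out: possible exploit
}

def state_badness_bound() -> float:
    """state -> badness: must hold for whatever action is taken next,
    so it is dominated by the worst available action."""
    return max(ACTION_BADNESS_BOUNDS.values())

def action_badness_bound(action: str) -> float:
    """(state, action) -> badness: only the considered action matters."""
    return ACTION_BADNESS_BOUNDS[action]

print(state_badness_bound())              # 0.95 -- ruined by the exploit path
print(action_badness_bound("read_logs"))  # 0.01 -- easy to certify
```

This is why action-level bounds look more tractable: you can refuse the one bad action instead of having to prove the whole state (e.g. the whole codebase) safe.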
I'd be curious to know what kind of formal process could certify that (codebase, codebase-interaction) pairs are not-bad (with high probability, and with a false-positive rate low enough if you trained an AI to minimize it). My guess is that nothing like that is on the horizon (nothing that could become competitive at all), but I could be wrong.
(“let you see what program is running” was an example of a very minimal safety guarantee I would like to have, not a representative safety guarantee. My point is that I’d be surprised if people got even such a simple and easy safety guarantee anytime soon, using formal methods to check AI actions that actually do useful stuff.)