I’m definitely not claiming that it is likely we will manage to catch AIs doing such egregiously bad actions, but I think if we did catch them, this would provide an adequate warning shot.
I think that if an effective control approach is employed, early transformatively useful AIs are dangerously misaligned, and these early misaligned AIs are unwilling or unable to punt to a later generation of AIs, then catching AIs red-handed is pretty likely relative to other ways of ensuring safety. All of the assumptions I stated as conditions here seem pretty plausible to me.
I’ll admit I have only been loosely following the control stuff, but FWIW I would be excited about a potential @peterbarnett & @ryan_greenblatt dialogue in which you two try to identify & analyze any potential disagreements. Example questions:
What is the most capable system that you think we are likely to be able to control?
What kind of value do you think we could get out of such a system?
To what extent do you expect that system to be able to produce insights that help us escape the acute risk period (i.e., get out of a scenario where someone else can come along and build a catastrophe-capable system without implementing control procedures, or where someone else scales up to the point where the control procedures are no longer sufficient)?
(You might be interested in the Dialogue I already did with habryka.)