Suppose you monitor the CoT and the model is rederiving something it's not supposed to know. You could halt that particular CoT, but each halt-and-retry selects for instances where the model manages to escape notice, possibly through luck, possibly through steganography. See "Nearest Unblocked Strategy".
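To make the selection-pressure point concrete, here's a toy Python sketch (the `sample_cot` / `monitor_flags` functions are made-up stand-ins, not any real API): halting and resampling flagged CoTs amounts to conditioning on "passes the monitor", so the rare traces that evade the monitor end up overrepresented among the survivors.

```python
import random

# Toy stand-ins: in practice these would be a model sampler and a CoT monitor
# (e.g. a classifier over the reasoning trace). Names and numbers are illustrative.

def sample_cot():
    """Pretend model: most traces openly rederive the forbidden fact,
    a few hide it (luck / steganography), a few genuinely avoid it."""
    return random.choices(
        ["openly_rederives", "hidden_rederivation", "genuinely_avoids"],
        weights=[0.90, 0.05, 0.05],
    )[0]

def monitor_flags(cot):
    """The monitor only catches the overt case."""
    return cot == "openly_rederives"

def run_with_halting(max_retries=50):
    """Halt-and-resample loop: keep discarding flagged CoTs."""
    for _ in range(max_retries):
        cot = sample_cot()
        if not monitor_flags(cot):
            return cot  # whatever survives the filter is what actually runs
    return None

# Before filtering, hidden rederivations were 5% of traces; among survivors
# they're roughly as common as genuine avoidance (~50/50).
survivors = [run_with_halting() for _ in range(10_000)]
hidden = sum(s == "hidden_rederivation" for s in survivors)
print(f"fraction of survivors that evade the monitor: {hidden / len(survivors):.2f}")
```

Obviously the real dynamics are worse than this i.i.d. toy, since training against the monitor actively optimizes the model toward the evading traces rather than just selecting among fixed ones.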
That’s why I think CoT monitoring alone is insufficient, and people should be thinking about additional (runtime?) defenses. Curious to hear if you have any ideas!
A few months ago, someone here suggested that more x-risk advocacy should go through comedians and podcasts.
YouTube just recommended this Joe Rogan clip from a few days ago: "The Worst Case Scenario for AI". Joe Rogan legitimately seemed pretty freaked out.
@So8res, maybe you could get Yampolskiy to refer you to Rogan for a podcast appearance promoting your book?