Some years ago we wrote that “[AI] systems will monitor for destructive behavior, and these monitoring systems need to be robust to adversaries” and discussed monitoring systems that can serve as tripwires: “AI tripwires could help uncover early misaligned systems before they can cause damage.” https://www.lesswrong.com/posts/5HtDzRAk7ePWsiL2L/open-problems-in-ai-x-risk-pais-5#Adversarial_Robustness

Since then, I’ve updated that adversarial robustness for LLMs is much more tractable (preview of paper out very soon). In vision settings, progress is extraordinarily slow, but that is not necessarily the case for LLMs.
Thanks for the link. I don’t actually think that either of the sentences you quoted is closely related to what I’m talking about. You write “[AI] systems will monitor for destructive behavior, and these monitoring systems need to be robust to adversaries”; I think this is you making a version of the argument that I linked to Adam Gleave and Euan McLean making, which I think is wrong. You wrote “AI tripwires could help uncover early misaligned systems before they can cause damage” in a section on anomaly detection, which isn’t a central case of what I’m talking about.
I’m not sure I agree as far as the first sentence you mention is concerned. I agree that the rest of the paragraph is talking about something similar to the point that Adam Gleave and Euan McLean make, but the exact sentence is:
“Separately, humans and systems will monitor for destructive behavior, and these monitoring systems need to be robust to adversaries.”
“Separately” is quite key here.
I assume this is intended to include AI adversaries and high-stakes monitoring.
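To make the “tripwire” idea above concrete, here is a minimal sketch of such a monitoring loop, assuming some trusted classifier scores proposed actions before they execute. Everything here is hypothetical: `score_suspicion`, `Verdict`, and the threshold are illustrative stand-ins, not anything from the linked post, and the keyword heuristic is a placeholder for a real classifier. The adversarial-robustness concern in this thread is precisely that a capable adversary may craft actions that such a classifier misses.

```python
# Sketch of a "tripwire" monitor: score each proposed action with a trusted
# classifier and escalate anything past a threshold instead of executing it.
from dataclasses import dataclass


@dataclass
class Verdict:
    allowed: bool
    score: float
    reason: str


THRESHOLD = 0.8  # illustrative; a real system would calibrate this


def score_suspicion(action: str) -> float:
    """Hypothetical classifier; a keyword heuristic stands in for a real model."""
    red_flags = ("exfiltrate", "disable monitoring", "rm -rf")
    return 1.0 if any(flag in action.lower() for flag in red_flags) else 0.0


def monitor(action: str) -> Verdict:
    score = score_suspicion(action)
    if score >= THRESHOLD:
        # High-stakes case: halt and escalate to human review rather than execute.
        return Verdict(False, score, "escalated to human review")
    return Verdict(True, score, "ok")


if __name__ == "__main__":
    print(monitor("summarize the quarterly report"))
    print(monitor("disable monitoring before deploying"))
```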