ryan_greenblatt comments on Vote on Interesting Disagreements

ryan_greenblatt 8 Nov 2023 4:52 UTC
65 points
0
Ambitious mechanistic interpretability is quite unlikely^[1] to be able to confidently assess^[2] whether AIs^[3] are deceptively aligned (or otherwise have dangerous propensities) in the next 10 years.
1. ↩︎
  Greater than 90% chance of failure.
2. ↩︎
  likelihood ratio of 10
3. ↩︎
  I’m referring to whichever AIs are pivotal or cruxy for things to go well prior to human obsolescence.
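
To make the "likelihood ratio of 10" threshold in footnote 2 concrete, here is a minimal sketch of how evidence of that strength would shift a credence under Bayes' rule in odds form. The function name and the example priors are illustrative assumptions, not something stated in the comment, and this is only one reading of what "confidently assess" is meant to require.

```python
def posterior_probability(prior: float, likelihood_ratio: float) -> float:
    """Update a prior probability using Bayes' rule in odds form.

    posterior_odds = prior_odds * likelihood_ratio
    """
    if not 0.0 < prior < 1.0:
        raise ValueError("prior must be strictly between 0 and 1")
    prior_odds = prior / (1.0 - prior)
    posterior_odds = prior_odds * likelihood_ratio
    return posterior_odds / (1.0 + posterior_odds)


# Illustrative priors (assumed, not from the comment) that an AI is
# deceptively aligned, updated on evidence with a likelihood ratio of 10.
for prior in (0.1, 0.3, 0.5):
    updated = posterior_probability(prior, 10.0)
    print(f"prior={prior:.2f} -> posterior={updated:.3f}")
```

Under this reading, an assessment that clears the bar would move a 50% prior to roughly 91%, and the same ratio applied in the other direction (1/10) would move it to roughly 9%.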