We think that strategic AI deception – where a model outwardly seems aligned but is in fact misaligned – is a crucial step in many major catastrophic AI risk scenarios and that detecting deception in real-world models is the most important and tractable step to addressing this problem.
Thanks for posting the announcement.
Can you elaborate on why the team believes it’s the most important and most tractable step?