I wish we talked more about which particular AI system design principles give us confidence in safety.
For example: context is always erased; humans review ~all context and output tokens.
These principles are likely to disappear unless we center them in our analysis & demands.
Less worried right now about issues like “RLHF incentivizes deception.”
Much more worried about “a pair of mostly-aligned AI systems talk to each other for 3 hours and then make POST requests, and no human actually reviews the full transcript.”
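To make that worry concrete: the difference between the two failure modes is roughly whether side effects are gated on a human reviewing the full transcript. Here is a minimal sketch of such a gate; all names (ReviewGate, ProposedRequest, propose, review_and_release) are hypothetical illustrations, not any real agent framework's API.

```python
# Hypothetical sketch: queue an agent's outbound actions until a human has
# reviewed the full transcript that produced them.
from dataclasses import dataclass, field


@dataclass
class ProposedRequest:
    method: str
    url: str
    body: str
    transcript: str  # full conversation that led to this action


@dataclass
class ReviewGate:
    """Holds proposed side effects until a human approves each one."""
    pending: list = field(default_factory=list)

    def propose(self, request: ProposedRequest) -> None:
        # The agent can only queue actions; nothing leaves the sandbox yet.
        self.pending.append(request)

    def review_and_release(self) -> list:
        approved = []
        for req in self.pending:
            # A human sees the entire transcript, not just the final action.
            print(f"--- transcript ({len(req.transcript)} chars) ---")
            print(req.transcript)
            print(f"Proposed: {req.method} {req.url}")
            if input("Approve this request? [y/N] ").strip().lower() == "y":
                approved.append(req)  # only approved requests would be executed
        self.pending.clear()
        return approved


if __name__ == "__main__":
    gate = ReviewGate()
    gate.propose(ProposedRequest(
        "POST", "https://example.com/api", "{}",
        "agent A <-> agent B, 3-hour exchange...",
    ))
    gate.review_and_release()
```

The point of the sketch is the structural property, not the implementation: if this gate disappears, or humans stop actually reading the transcripts it surfaces, the safety argument that relied on it quietly stops applying.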
Yo Shavit says: