Bogdan Ionut Cirstea comments on Bogdan Ionut Cirstea’s Shortform

Bogdan Ionut Cirstea 26 Apr 2024 18:02 UTC
3 points
Notably, the mainline approach for catching doesn’t involve any internals usage at all, let alone labeling a bunch of things.
This was indeed my impression (except for the potential use of steering vectors, which I think are mentioned in one of the sections of ‘Catching AIs red-handed’), but I think not using any internals might be overconservative and might increase the monitoring/safety tax too much. I think this is probably true more broadly of the current control agenda framing.
What links here?
- Bogdan Ionut Cirstea's comment on Schelling game evaluations for AI control by Olli Järviniemi (10 Oct 2024 14:42 UTC; 2 points)