I directionally agree with this (and think it’s good to write about this more, strongly upvoted!)
For clarity, I would distinguish between two control-related ideas more explicitly when talking about how much work should go into what area:
1. “ensuring that if the AIs are not aligned [...], then you are still OK” (which I think is the main meaning of “AI control”)
2. Making ~worst-case assumptions about things like neural representations or inductive biases (which in practice means you likely rely on black-box methods, as in Redwood’s existing work on control).
I think 2. is arguably the most promising strategy for 1., but I’ve occasionally noticed myself conflating them more than I should.
1. gives you the naive 50/50 equilibrium, i.e. naively 50% of people should work on this broad notion of control. But I think the other reasons in favor apply more strongly to 2. (e.g. the tractability arguments are significantly weaker for model internals-based approaches to 1.)
I also think (non-confidently) that 2. is what’s genuinely different from most existing research. For control in the first, broad sense, some existing research doesn’t fall clearly on either the control or alignment side.
But I do agree that safety-motivated researchers should evaluate approaches from a control perspective (in the broad sense) more on the margin. And I also really like the narrower black-box approach to control!
I agree, and regret focusing as much as we did on 2 in the past; I’m excited for work on “white box control” (there’s some underway, and I’m excited for more).
We focused on black box control because it’s much more obvious how you’d make a legible and conservative safety case for it. But in hindsight this was a mistake, I think: people probably aren’t going to actually make legible or conservative safety arguments for their deployments, so there’s less of an advantage than I’d thought.
I plan to write more about white box control soon (but I plan to write about a lot of stuff…)