We can make scheming legible at least to sophisticated-scheming-skeptics like Sam Altman and Dario.
Especially if the AI is sandbagging even on simple coding when it thinks the work is for safety research. And if it’s not doing that, we can get some useful safety work out of it.
@Adam Kaufman @Tyler Tracy @David Matolcsi see Ryan’s comments.
If it were enough evidence that I was strongly convinced, sure. But IDK if I would be convinced, because the evidence might actually be unclear.
I agree you’ll be able to get some work out, but you might be taking a big productivity hit.
Also, TBC, I’m not generally that worried about generic sandbagging on safety research relative to other problems.