AI control is useful to corporations even if it doesn’t result in more capabilities, which is why so much money is invested in it. Customers want predictable and reliable AI. There is a great post here about AIs aligning to “Do What I Want and Double-Check” in the short term. There’s your motive.
Also, in a world where we stop safety research, it’s not obvious to me why capabilities research would stop or even slow down. I can imagine the models being slightly less economically valuable, but not much less capable. If anything, without reliability, devs might be pushed to extract value from these models by making them more capable.
As for the claim that fixing them will push the failure modes beyond our ability to understand and anticipate, let alone fix: given the above, that point isn’t very obvious to me. It seems like we can have both failures we can understand and failures we can’t. They aren’t mutually exclusive.[1]
Also, if we can’t understand why something is bad even after a long time, is it really bad?