Non-deceptive failures are easy to notice, but they’re not necessarily easy to eliminate—and if you don’t eliminate them, they’ll keep happening until some do slip through. I think I take them more seriously than you.
Non-deceptive failures are easy to notice, but they’re not necessarily easy to eliminate
I agree; I was trying to note this in my second paragraph, but I guess it was insufficiently clear.
I added the sentence “Being easy-to-study doesn’t imply easy-to-solve”.
I think I take them more seriously than you.
Seems too hard to tell based on this limited context. I think non-scheming failures are about 50% of the risk and probably should be about 50% of the effort of the AI safety-from-misalignment community. (I can see some arguments for scheming/deceptive alignment being more important to work on in advance, but it also might be that non-scheming is more tractable and a higher fraction of the risk in short timelines, so IDK overall.)