Intuitively, this involves two components: the ability to robustly steer high-level structures like objectives, and something good to target them at.
I agree.
But if we solve these two problems then I think you could go further and say we don’t really need to care about deceptiveness at all. Our AI will just be aligned.
I agree, but one idea behind deep deception is that it's an easy-to-miss failure mode. Specifically, after a talk on high-level interpretability, someone came up to tell me the agenda didn't solve deep deception, and, well, I disagreed. But I don't frame the agenda in terms of deceptiveness, and the talk glosses over a few inferential steps relating to deception that are easy to stumble over, so the claim wasn't without merit, especially since I think many other agendas miss that insight.
P.S.
This made me laugh.