“we don’t know if deceptive alignment is real at all (I maintain it isn’t, on the mainline).”
Do you think it isn’t a substantial risk of LLMs as they are trained today, or that it isn’t a risk of any plausible training regime for any plausible deep learning system? (I would agree with the first, but not the second.)
See TurnTrout’s shortform here for some more discussion.