I think I’m mostly right, but using a somewhat confused frame.
It makes more sense to think of MAD approaches as flagging every abnormal reason for a behavior (including deceptive alignment) by default. Then, if we get that working, we can try to reduce false positives by doing something like comparing the least common ancestor of the measurements in a novel mechanism to the least common ancestor of the measurements on trusted mechanisms.