I have a pretty fundamental concern with these sorts of techniques as a mechanism for eventually assessing alignment,
namely that they would lead to a safety or alignment Goodharting problem.