It may be worth noting the risk that a deceptive model could create false or misleading labels for features. More generally, I think developing better and more robust automated labeling of features is an important direction.
At a recent hackathon, I worked in a group on demonstrating that it is feasible to create bad labels with Bills et al.'s method: https://www.lesswrong.com/posts/PyzZ6gcB7BaGAgcQ7/deceptive-agents-can-collude-to-hide-dangerous-features-in