Are you mostly hoping for people to come up with new alignment schemes that incorporate this (e.g. coming up with proposals like these that include a meta-level adversarial training step), or are you also hoping that people start actually doing meta-level adversarial evaluation of their existing alignment schemes (e.g. Anthropic tries to find a failure mode for whatever scheme they used to align Claude)?
I’m interested in both of those.