I have also seen some partially formalized stabs at something like “interpretability challenges”, somewhat inspired by Auditing Games, with multiple tiers (i.e. bronze, silver, and gold awards). The bronze challenge is meant to be something achievable by interpretability researchers within a couple of years, while the gold challenge is meant to be something like “you can actually reliably detect deceptive adversaries”, along with the other key properties a competent civilization would have before running dangerously powerful AGI.
This isn’t the same as an “alignment prize”, but might be easier to specify.