I’ve seen proposals of the form “Eliezer and Paul both have to agree the researchers have solved alignment to get the grand prize”, which seems better than not-that, but still seems insufficiently legibly fair to make it work with 1000 researchers.
I have also seen some partially formalized stabs at something like “interpretability challenges”, somewhat inspired by Auditing Games, where there are multiple tiers of challenge (e.g. bronze, silver, and gold awards). The bronze challenge is meant to be something achievable by interpretability researchers within a couple of years, and the gold challenge is meant to be something like “you can actually reliably detect deceptive adversaries, and other key properties a competent civilization would have before running dangerously powerful AGI.”
This isn’t the same as an “alignment prize”, but might be easier to specify.