I agree. I imagine a disagreement where the person says “you can’t prove my proposal won’t work” and the alignment judge says “you can’t prove your proposal will work”, and then there’s some nasty, very-hard-to-resolve debate about whether AGI is dangerous by default or not, involving demands for more detail, and then the person says “I can’t provide more detail because we don’t have AGI yet”, etc.
I think of the disagreement between Paul and Eliezer about whether IDA was a promising path to safe AGI (back when Paul was more optimistic about it); both parties there were exceptionally smart and knowledgeable about alignment, and they still couldn’t reconcile.
Giving people a more narrow problem (e.g. ELK) would help in some ways, but there could still be disagreement over whether solving that particular problem in advance (or at all) is in fact necessary to avert AGI catastrophe.
I’ve seen proposals of the form “Eliezer and Paul both have to agree the researchers have solved alignment to get the grand prize”, which seems better than the alternative, but it still seems insufficiently legibly fair to work at the scale of 1000 researchers.
I have also seen some partially formalized stabs at something like “interpretability challenges”, somewhat inspired by Auditing Games, where there are multiple tiers of challenge (e.g. bronze, silver, and gold awards): the bronze challenge is meant to be something achievable by interpretability researchers within a couple of years, and the gold challenge is meant to be something like “you can actually reliably detect deceptive adversaries, along with other key properties a competent civilization would want before running dangerously powerful AGI.”
This isn’t the same as an “alignment prize”, but might be easier to specify.