I’m not sure I see the point of awarding an already-in-the-works 67-page paper that happened to be released at the time of the competition, if the goal of the prize is to stimulate AI work that would not otherwise have happened.
Personally, my long-term goal is a world where high-quality work on alignment is consistently funded, and where people doing high-quality work on alignment have plenty of money. I think that an effort to restrict to counterfactually-additional alignment work would “save” some money (in the sense that I’d have the money rather than some researcher who is doing alignment work) but wouldn’t be great for that long-term goal.
Also, if you actually think about the dynamics, they are pretty crappy, even if you only avoid “obvious” cases. For example, it would become really hard for anyone to actually assess counterfactual impact, since every winner would need to make it look like there was at least a plausible counterfactual impact. (I already wish there was less implicit social pressure in that direction.)
On reflection, I strongly agree that social pressure around counterfactualness is a net harm to motivation.
I think you want to reward output, rather than only rewarding output that would not otherwise have happened.
This is similar to how, if you want to train calibration, you have to optimize your log score and just treat your lack of calibration as an opportunity to increase your log score.
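For concreteness, here is a minimal sketch of why that works (the numbers and simulation are illustrative, not from the thread): the log score is a proper scoring rule, so in expectation it is maximized by reporting your true probabilities, and any miscalibration shows up directly as lost score.

```python
import numpy as np

def log_score(probs, outcomes):
    # Mean log-likelihood of the realized binary outcomes (higher is better).
    probs = np.asarray(probs, dtype=float)
    outcomes = np.asarray(outcomes, dtype=float)
    return np.mean(outcomes * np.log(probs) + (1 - outcomes) * np.log(1 - probs))

rng = np.random.default_rng(0)
true_rate = 0.7                         # hypothetical: events occur 70% of the time
outcomes = rng.random(100_000) < true_rate

# A calibrated forecast scores better in expectation than an overconfident one,
# so optimizing log score pushes you toward calibration as a byproduct.
print(log_score(np.full(outcomes.shape, 0.7), outcomes))  # ≈ -0.611 (calibrated)
print(log_score(np.full(outcomes.shape, 0.9), outcomes))  # ≈ -0.764 (overconfident)
```

The analogy to the prize: score the thing you ultimately care about (useful output, log score) and let the secondary property (counterfactualness, calibration) improve as a side effect, rather than scoring the secondary property directly.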
If I understand correctly, one of the goals of this initiative is to increase the prestige associated with making useful contributions to AI safety. For that purpose, it doesn’t matter whether the prize incentivized the winning authors or not. But it is important that enough people trust that the main criterion for selecting the winning works is usefulness.
My take on this is that the ideal version of this prize selects for both usefulness and counterfactualness, but selecting for counterfactualness without producing weird side effects seems hard. (I do think it’s worth spending an hour or two thinking about how to properly incentivize or reward counterfactualness; it’s just that, if you haven’t come up with anything, strictly rewarding quality/usefulness seems better.)
> selecting for counterfactualness without producing weird side effects seems hard
Agreed, I just thought the winner in this case was over the top enough to fall not in the fuzzy boundary but clearly on the other side of it.
Our rules don’t draw that boundary at the moment, and I’m not even sure how it could be phrased. Do you have any suggestions?
I wouldn’t be in favor of adding explicit rules, for Goodhart-related reasons. I think prizes and grants should have the minimum rules needed to handle basic logistics, and the rest should be illegible.
Ah, yeah that makes sense.