I do like the core concept here, but I think for it to work you need to have a pretty well specified problem that people can’t weasel out of. (I expect the default result of this to be “1000 researchers all come up with reasons they think they’ve solved alignment, without really understanding what was supposed to be hard in the first place.”)
You touch upon this in your post but I think it’s kinda the main blocker.
I do think it might be a surmountable obstacle, though.
I agree. I imagine a disagreement where the person says “you can’t prove my proposal won’t work” and the alignment judge says “you can’t prove your proposal will work”, and then there’s some nasty, very-hard-to-resolve debate about whether AGI is dangerous by default or not, involving demands for more detail, and then the person says “I can’t provide more detail because we don’t have AGI yet”, etc.
I think of the disagreement between Paul and Eliezer about whether IDA was a promising path to safe AGI (back when Paul was more optimistic about it); both parties there were exceptionally smart and knowledgeable about alignment, and they still couldn’t reconcile.
Giving people a more narrow problem (e.g. ELK) would help in some ways, but there could still be disagreement over whether solving that particular problem in advance (or at all) is in fact necessary to avert AGI catastrophe.
I’ve seen proposals of the form “Eliezer and Paul both have to agree the researchers have solved alignment to get the grand prize”, which seems better than not-that, but still seems insufficiently legibly fair to make it work with 1000 researchers.
I have also seen some partially formalized stabs at something like “interpretability challenges”, somewhat inspired by Auditing Games, where there are multiple challenges (i.e. bronze, silver and gold awards), and the bronze challenge is meant to be something achievable by interpretability researchers within a couple of years, and the gold challenge is meant to be something like “you can actually reliably detect deceptive adversaries, and other key properties a competent civilization would have before running dangerously powerful AGI.”
This isn’t the same as an “alignment prize”, but might be easier to specify.
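To illustrate the structure (not the content), here is a minimal sketch of how such a tiered challenge spec might be written down. Everything in it is a hypothetical placeholder: the criteria strings are loose paraphrases, the silver tier in particular is made up, and `highest_tier_passed` is just an invented helper.

```python
# Purely illustrative sketch of a tiered "interpretability challenge" spec.
# Tier names, criteria, and the helper below are hypothetical placeholders,
# not taken from any actual proposal.
from dataclasses import dataclass

@dataclass
class ChallengeTier:
    name: str
    criterion: str          # informal statement of what counts as passing
    intended_horizon: str   # rough difficulty / timeline the tier is aimed at

TIERS = [
    ChallengeTier(
        name="bronze",
        criterion="pass a scoped auditing-game-style task on a small model",
        intended_horizon="achievable by interpretability researchers within a couple of years",
    ),
    ChallengeTier(  # made-up intermediate tier, just to show the escalation
        name="silver",
        criterion="same, but on larger models with adversarially hidden behaviors",
        intended_horizon="intermediate milestone",
    ),
    ChallengeTier(
        name="gold",
        criterion="reliably detect deceptive adversaries and other key safety properties",
        intended_horizon="what you'd want before running dangerously powerful AGI",
    ),
]

def highest_tier_passed(results: dict[str, bool]) -> str | None:
    """Return the name of the highest tier marked as passed, in TIERS order."""
    passed = None
    for tier in TIERS:
        if results.get(tier.name, False):
            passed = tier.name
    return passed

# Example: a team that has passed bronze and silver but not gold.
print(highest_tier_passed({"bronze": True, "silver": True, "gold": False}))  # -> silver
```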
I don’t think a well-specified problem is the way to go.
Instead, choose a group of judges, and if a majority of them think something is progress, then that counts as a win. Those are the rules, and no arguing afterwards.
But to make this fair, the judges should be available to discuss possible research directions, and what they would and would not consider progress, before and during the 3 months of research. Or, if that’s not possible due to the scale, maybe a group of people who are good at predicting what the real judges would say could do this instead.
Besides, giving out the money and not staying in touch during the 3 months seems like a bad idea anyway.
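To be concrete, a toy sketch of the decision rule I have in mind is below. The panel size, judge names, and verdict format are placeholders, not part of any actual proposal.

```python
# Toy sketch of the judging rule: fix a panel in advance, a strict majority of
# "this is progress" verdicts counts as a win, and there is no appeals step.
# Panel size and judge names are hypothetical placeholders.

JUDGES = ["judge_a", "judge_b", "judge_c", "judge_d", "judge_e"]  # chosen before the program starts

def is_winning_submission(verdicts: dict[str, bool]) -> bool:
    """verdicts maps each judge's name to True iff they consider the work progress."""
    votes_for = sum(1 for judge in JUDGES if verdicts.get(judge, False))
    return votes_for > len(JUDGES) / 2  # strict majority; a missing verdict counts as "no"

# Example: 3 of 5 judges call it progress -> counts as a win, no further argument.
example = {"judge_a": True, "judge_b": True, "judge_c": True, "judge_d": False, "judge_e": False}
print(is_winning_submission(example))  # True
```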
I agree that actual judges are better for actually solving the problem; the question is whether you can legibly scale the reputation of those judges to 1000 researchers who aren’t already bought into the judges’ paradigm.
I don’t know. I would start small, and worry about scaling later.
If this is worth doing at all, it is also worth doing at a smaller scale, so it would not be a waste if the project doesn’t scale.
Also, I suspect that quality is more important than quantity here. Doing a really good job running this program with the 50 most influential AI researchers might have a better payoff than doing a half-hearted job with the top 1000 researchers.
Oh, maybe, but it seemed like the specific proposal here was the scale. It seemed like hiring a few people to do AI research was… sort of just the same as all the hiring and independent research that’s already been happening. The novel suggestion here is hiring less-aligned people at a scale that would demonstrate to the rest of the world that the problem was really hard. (Not sure exactly what the OP meant, and not sure what specific new element you were interested in.)
I guess a question here is what the returns-to-scale curve looks like? I’d be surprised if the 501st-1000th researchers were more valuable than the 1st-500th, suggesting there is a smaller version that’s still worth doing.
I don’t know where this guess comes from, but my guess is that the curve is increasing up to somewhere between 10 and 100 researchers, and decreasing after that. But also there are likely to be threshold effects at round/newsworthy numbers?
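To illustrate the shape I’m gesturing at, here’s a toy curve with entirely made-up numbers; the peak location, decay rate, and bonus size are arbitrary, just to show “rising early, falling later, with bumps at newsworthy totals”.

```python
# Toy returns-to-scale curve with made-up numbers, just to illustrate the guessed shape:
# marginal value rises up to a peak (~30 researchers here), declines afterwards,
# and round/newsworthy totals (100, 500, 1000) get an extra "threshold" bump.
import math

PEAK = 30                      # hypothetical point where marginal value starts declining
NEWSWORTHY = {100, 500, 1000}  # totals assumed to carry extra signalling value

def marginal_value(n: int) -> float:
    """Value added by the n-th researcher (arbitrary units)."""
    if n <= PEAK:
        base = n / PEAK                      # increasing returns early on
    else:
        base = math.exp(-(n - PEAK) / 300)   # slow decay after the peak
    bonus = 0.5 if n in NEWSWORTHY else 0.0  # threshold effect at round numbers
    return base + bonus

def total_value(n: int) -> float:
    return sum(marginal_value(i) for i in range(1, n + 1))

# Under these made-up numbers the first 500 researchers contribute several times
# more than researchers 501-1000, i.e. a smaller version is still worth doing.
print(round(total_value(500), 1), round(total_value(1000) - total_value(500), 1))
```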
That does sound right-ish.
I’m not going to speculate about who meant what. But for me the new and interesting idea was to pay people to do research in order to change their minds, as opposed to paying people to do research in order to produce research.
As far as I know, all the current hiring and paying of independent researchers is directed towards people who already believe that AI Safety research is difficult and important. Paying people who are not yet convinced is a new move (as far as I know), even at a small scale.
I guess it is currently possible for an AI risk sceptic to get an AI Safety research grant. But none of the grants are designed for this purpose, right? I think the format of a very highly paid (more than they would earn otherwise), very short (not a major interruption to ongoing research) offer, with the possibility of a prize at the end, is more optimised to get sceptics on board.
In short, the design of a funding program will be very different when you have a different goal in mind.