[Edit: I initially meant this purely tongue in cheek, but maybe there is something here worth examining further?]
You have cognitively powerful agents (highly competent researchers) who have incentives ($250k+ salaries) to do things that you don’t want them to do (create AGIs that are likely unaligned), and you want them instead to do things that benefit humanity (work on alignment).
It seems to me that offering $100k salaries to work for you instead is not an effective solution to this alignment problem. It relies on the agents being already aligned to the extent that a $150k/yr loss is outweighed by other incentives.
If money were not a tight constraint, it seems to me that offering $250k/yr would be worthwhile, if for no other reason than to keep them from working on racing to AGI.
The “pay to not race to AGI” approach would only make sense if there were a smallish pool of replacements ready to step in and work on racing to AGI. That doesn’t seem to be the case. The difference it’d make might not be zero, but it would be close enough to zero to be obviously inefficient.
In particular, there are presumably effective ways to use money to create a greater number of aligned AI safety researchers; it’s just that [give people a lot of money to work on it] probably isn’t one of them.
In those terms the point is that paying $250k+ does not align them—it simply hides the problem of their misalignment.
[work on alignment] does not necessarily benefit humanity, even in expectation. That requires [the right kind of work on alignment], and we don’t have good tests for that. Aiming for the right kind of work is no guarantee you’re doing it, but it beats not aiming for it. (again, the argument is admittedly less clear for engineers)
Paying $100k doesn’t solve this alignment problem; it just allows us to see it. We want defection to be obvious. [to be clear: I don’t have any problem with people who would work for $100k happening to get $250k, if that seems efficient]
Worth noting that we don’t require a “benefit humanity” motivation—intellectual curiosity will do fine (and I imagine this is already a major motivation for most researchers: how many would be working on the problem if it were painfully dull?). We only require that they’re actually solving the problem. If we knew how to get them to do that for other reasons that’d be fine—but I don’t think money or status are good levers here. (or at the very least, they’re levers with large downsides)