How do you account for the fact that the impact of a particular contribution to object-level alignment research can compound over time?
Let’s say I have a technical alignment idea now that is both hard to learn and very useful, such that everyone who learns it does alignment research a little more efficiently. But it takes time before the idea disseminates across the community.
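One rough way to cash out “compounds over time” (a toy formalisation with made-up notation: let $f$ be the fractional efficiency gain for each researcher who knows the idea, and $n(t)$ the number of researchers who have learned it by time $t$): the extra research output accumulated by some horizon $T$ is roughly $\int_0^T f \cdot n(t)\,dt$. Since $n(t)$ only grows as the idea spreads, getting the idea out earlier means it accrues benefit over a longer window.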
At first, only a few people bother to learn it well enough to see that it’s valuable. But every person who does adds to the total strength of the signal telling the rest of the community that they should prioritise learning it.
Not sure if this is the right framework, but let’s say that researchers will only bother learning it if the strength of the signal hits their person-specific threshold for prioritising it.
Researchers are normally distributed (or something like that) over threshold height, and the strength of the signal starts out below the peak of the distribution.
Then (under some assumptions about how much signal each learner adds and about the distribution of threshold heights), every new learner will at first attract more than one further learner, until the signal passes the peak of the distribution and the idea reaches saturation/fixation in the community.
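Here is a minimal simulation sketch of that threshold dynamic (all the specific numbers, like community size, threshold distribution, signal per learner, and initial signal, are made-up assumptions just to illustrate the shape of the cascade):

```python
# Toy simulation of the threshold model sketched above.
# Every parameter here is an arbitrary assumption for illustration.
import numpy as np

rng = np.random.default_rng(0)

N = 1000                                            # researchers in the community
thresholds = rng.normal(loc=50, scale=15, size=N)   # person-specific signal thresholds
signal_per_learner = 1.0                            # signal each learner adds
signal = 10.0                                       # initial signal, starting below the peak at 50

learned = np.zeros(N, dtype=bool)
adopters_per_step = []

while True:
    # Everyone whose threshold the current signal reaches now learns the idea.
    newly = (~learned) & (thresholds <= signal)
    if not newly.any():
        break
    learned |= newly
    signal += signal_per_learner * newly.sum()
    adopters_per_step.append(int(newly.sum()))

print("adopters per step:", adopters_per_step)
print("final fraction of the community that learned it:", learned.mean())
```

With these made-up numbers, the count of new learners per step grows while the signal is still below the bulk of the threshold distribution, then collapses once it passes the peak, which is the cascade-then-saturation dynamic described above.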
If something like the above model is correct, then the marginal impact of a contribution to alignment research plausibly goes down over time, since later contributions have less time to compound.
But the same is true of a lot of time-buying work (like outreach). I don’t know how to balance this, but I am now a little more skeptical of the relative value of buying time.
Importantly, this is not the same as “outreach”: strong technical alignment ideas are most likely inaccessible to almost everyone outside the community, so spreading the idea doesn’t increase the number of people working on alignment.