Suppose you think that both capabilities and alignment behave like abstract quantities, i.e., real numbers.
And suppose you think there is a threshold amount of alignment and a threshold amount of capabilities, making a race to see which threshold is reached first.
If you also assume that the contribution of your research is fairly small, and that our uncertainty about the threshold locations is high,
then we get the heuristic: only publish your research if the alignment : capability ratio it produces is higher than the ratio over all future research.
(note that research on how to make better chips counts as capabilities research in this model)
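One minimal way to write the heuristic down (my notation; none of these symbols come from the original argument):

```latex
% Toy formalization of the threshold race; all symbols are my own.
% C(t), A(t): accumulated capabilities and alignment at time t.
% C^*, A^*:   the unknown threshold locations.
\[
  \text{success} \iff A(t) \ge A^* \text{ happens before } C(t) \ge C^* .
\]
% Your result contributes a small (\delta_A, \delta_C); the rest of
% future research contributes (\bar{A}, \bar{C}) in expectation.
\[
  \text{publish} \iff \frac{\delta_A}{\delta_C} > \frac{\bar{A}}{\bar{C}} .
\]
```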
Another way to think about it: the problem is created by research. If you think that "another new piece of AI research has been produced" is not, by itself, a reason to shift the probability of success up or down (it just moves timelines forward), then the average piece of research is neither good nor bad.
Hmm, so in this model we assume that (i) the research output of the rest of the world is known, (ii) we are deciding about one result only, and (iii) the thresholds are unknown. In this case you are right that we need to compare our alignment : capability ratio to the rest of the world's alignment : capability ratio.
First, assume that instead of just one result overall, you produce a single result every year. Most of the results in the sequence have an alignment : capability ratio well above the rest of the world's, but then there is a year in which the ratio is only barely above it. In that year you are better off not publishing the irregular result, even though the naive ratio criterion says to publish. We can reconcile this with the previous model by including your own research in the reference, but that creates a somewhat confusing self-reference.
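Here is a toy numeric sketch of that effect (all numbers hypothetical): the irregular result beats the rest-of-world ratio, yet publishing it still lowers the aggregate alignment : capability ratio once your own research is included in the reference.

```python
# Toy numbers (hypothetical) for the self-reference problem. Assume the
# chance of success improves with the aggregate alignment : capability
# ratio of everything published.

world = (10.0, 100.0)      # rest of the world: ratio 0.10
my_regular = (10.0, 10.0)  # my other results: ratio 1.00
marginal = (0.12, 1.0)     # the irregular result: ratio 0.12 > 0.10

def aggregate_ratio(*contributions):
    alignment = sum(a for a, _ in contributions)
    capability = sum(c for _, c in contributions)
    return alignment / capability

print(aggregate_ratio(world, my_regular))            # ~0.182 without it
print(aggregate_ratio(world, my_regular, marginal))  # ~0.181 with it
# The naive criterion (0.12 > 0.10) says publish, but the aggregate
# ratio drops, because the right reference includes my own research.
```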
Second, we can switch to modeling the research output of the rest of the world as a random walk. In this case, if the average direction of progress points toward failure, then moving along that direction is net negative, since it reduces the chance of reaching success by luck.
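And a toy Monte Carlo of the random-walk version (thresholds, drift, and noise are all made up): the mean step points mostly at the capabilities threshold, so taking a few extra steps along that average direction measurably lowers the chance of a lucky alignment-first crossing.

```python
import random

# Toy model (assumptions mine): world research is a random walk in the
# (alignment, capabilities) plane. Success iff alignment crosses A_STAR
# before capabilities crosses C_STAR. The mean step points mostly at the
# capabilities threshold, i.e. the average direction points at failure.

A_STAR, C_STAR = 50.0, 50.0
DRIFT = (0.2, 1.0)  # mean step per unit of research
NOISE = 3.0         # per-step standard deviation

def race(extra_drift_steps=0):
    # "Average" research moves us deterministically along the drift.
    a = extra_drift_steps * DRIFT[0]
    c = extra_drift_steps * DRIFT[1]
    while True:
        a += random.gauss(DRIFT[0], NOISE)
        c += random.gauss(DRIFT[1], NOISE)
        if a >= A_STAR:
            return True   # alignment threshold reached first
        if c >= C_STAR:
            return False  # capabilities threshold reached first

def p_success(extra_drift_steps, trials=20_000):
    return sum(race(extra_drift_steps) for _ in range(trials)) / trials

print("baseline:           ", p_success(0))
print("after 5 drift steps:", p_success(5))  # lower: less room for luck
```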