The use of “Differential Progress” (“does this advance safety more or capabilities more?”) by the AI safety community to evaluate the value of research is ill-motivated.
Most capabilities advancements are not very counterfactual (“some similar advancement would have happened anyway”), whereas safety research is. In other words: differential progress measures absolute rather than comparative advantage / disregards the impact of supply on value / measures value as the y-intercept of the demand curve rather than the intersection of the demand and supply curves.
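The counterfactual point can be made concrete with a toy model (all numbers below are hypothetical illustrations, not estimates): the counterfactual value of a contribution is the gap between the world where you do the work and the world where you don't, which heavily discounts easily replaceable work.

```python
# Toy model of counterfactual value. All numbers are hypothetical
# illustrations chosen only to show the shape of the argument.

def counterfactual_value(raw_value, p_counterfactual):
    """raw_value: benefit of the advance existing at all.
    p_counterfactual: probability a similar advance happens soon anyway.
    Your marginal contribution is only the part that would not have
    happened without you."""
    return raw_value * (1 - p_counterfactual)

# A capabilities advance may look large in raw terms but be highly
# replaceable, because many well-resourced groups are pushing on it...
capabilities = counterfactual_value(raw_value=100, p_counterfactual=0.95)

# ...while a safety insight with few people working on it is much less so.
safety = counterfactual_value(raw_value=20, p_counterfactual=0.2)

# Despite the smaller raw value, the safety work dominates counterfactually.
print(capabilities, safety)
```

On these (made-up) numbers the capabilities work contributes ~5 units counterfactually and the safety work 16, which is the sense in which differential progress, by comparing raw values, measures the wrong quantity.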
Even if you looked at actual market value, just p_safety > p_capabilities isn’t a principled condition.
Concretely, I think that harping on differential progress risks AI safety getting crowded out by harmless but useless work—most obviously “AI bias” and “AI disinformation” work, and, in my more controversial opinion, overtly prosaic AI safety research that will not give us any insights generalizable beyond current architectures. A serious solution to AI alignment will in all likelihood involve risky things like imagining more powerful architectures and revealing some deeper insights about intelligence.
I think there are two important insights here. One is that counterfactual differential progress is the right metric for weighing whether ideas or work should be published. This seems clearly true once stated, but it is not obvious, so it is well worth stating, and frequently.
The second important idea is that doing detailed work on alignment requires talking about specific AGI designs. This also seems clearly true, but I think it goes unnoticed and unappreciated a lot of the time. How an AGI arrives at decisions, beliefs, and values is going to depend on its specific architecture.
Putting these two concepts together makes the publication decision much more difficult. Should we cripple alignment work in the interest of having more time before AGI? One pat answer I see is “discuss those ideas privately, not publicly”. But in practice, this severely limits the number of eyes on each idea, making it vastly more likely that good ideas in alignment aren’t spread or worked on quickly.
I don’t have any good solutions here, but want to note that this issue seems critically important for alignment work. I’ve personally been roadblocked in substantial ways by this dilemma.
My background gives me a relatively large store of knowledge and theories about how the human mind works. I have specific ideas about several possible routes from current AI to x-risk AGI. Each of these routes also has associated alignment plans. But I can’t discuss those plans in detail without discussing the AGI designs in detail; they sound vague and unconvincing without the design forms they fit into. This is a sharp limitation on how much progress I can make on these ideas. I have a handful of people who can and will engage in detail in private; in public, where the ideas must remain vague, engagement is limited and vague; and largely I am working on my own. Private feedback indicates that these AGI designs and alignment schemes might well be viable and relevant, although of course a debunking is always one conversation away.
This is not ideal, nor do I know of a route around it.