Hmm, this might apply to some hypothetical situation, but it’s pretty hard for me to believe that it applies in practice in the present day.
First, you can share your solution while also writing about its flaws.
Second, I think “some TAI will be built whether there is any solution or not” is more likely than “TAI will be built iff a solution is available, even if the solution is flawed”.
Third, I just don’t see the path to success that goes through secrecy. If you don’t publish partial solutions, then you’re staking everything on being able to produce a complete solution by yourself, without any collaboration or external critique. This seems highly dubious, unless perhaps you literally gathered the brightest minds on the planet, Manhattan Project style, but no AI safety org is anywhere near that.
I can see secrecy being valuable in the endgame, when it does become plausible to create a full solution by yourself, or even build TAI by yourself, but I don’t see it in the present day.
Again, I broadly agree—usually I expect sharing to be best. My point is mostly that there’s more to account for in general than an [alignment : capability] ratio.
Some thoughts on your specific points:
First, you can share your solution while also writing about its flaws.
Sure, but if the ‘flaw’ is of the form [doesn’t address a problem various people don’t believe really matters/exists], then it’s not clear that this helps. E.g. an outer alignment solution that doesn’t deal with deceptive alignment.
Second, I think “some TAI will be built whether there is any solution or not” is more likely than “TAI will be built iff a solution is available, even if the solution is flawed”.
Here I probably agree with you, on reflection. My initial thought was that the [plausible-flawed-solution] makes things considerably worse, but I don’t think I was accounting for the scenario where people believe a robust alignment solution will be needed at some point, but just not yet, because, you see, this system isn’t a dangerous one...
Does this make up the majority of your TAI-catastrophe-probability? I.e. it’s mostly “we don’t need to worry yet… Foom” rather than “we don’t ever need to worry about (e.g.) deceptive alignment… Foom”.
Third, I just don’t see the path to success that goes through secrecy...
I definitely agree, but I do think it’s important not to approach the question in black-and-white terms. For some information it may make sense to share it with, say, 5 or 20 people (at least at first).
I might consider narrower sharing if:
I have wide error bars on the [alignment : capabilities] ratio of my work. (In particular, approaches that only boost average-case performance may fit better as “capabilities” here, even if they’re boosting performance through better alignment [essentially this is one of Critch/Krueger’s points in ARCHES].)
I have high confidence my work solves [some narrow problem], high confidence it won’t help in solving the harder alignment problems, and an expectation that it may be mistaken for a sufficient solution.
Personally, I find it implausible I’d ever decide to share tangible progress with no-one, but I can easily imagine being cautious in sharing publicly.
On the other hand, doomy default expectations argue for ramping up the variance in the hope of bumping into ‘miracles’. I’d guess that increased sharing of alignment work boosts the miracle rate significantly more than it boosts capabilities.
Does this make up the majority of your TAI-catastrophe-probability? I.e. it’s mostly “we don’t need to worry yet… Foom” rather than “we don’t ever need to worry about (e.g.) deceptive alignment… Foom”.
Sort of. Basically, I just think that a lot of actors are likely to floor the gas pedal no matter what (including all leading labs that currently exist). We will be lucky if they adopt any serious precautions even when such precautions are public knowledge, let alone exhibit the security mindset to implement them properly. An actor that truly internalized the risk would not push the SOTA in AI capabilities until the alignment problem had been solved in advance, as thoroughly as conceivably possible.