If there’s no difference between capability work and alignment work, then how is it possible to influence anything at all? If capability and alignment go hand in hand, then either transformative capability corresponds to sufficient alignment (in which case there is no technical problem) or it doesn’t (in which case we’re doomed).
The only world in which secrecy makes sense, AFAICT, is if you’re going to solve alignment and capability all by yourself. I am extremely skeptical of this approach.
One view could be that to do good alignment work, you need a good grasp of how to get to superhuman capabilities, and then you add constraints to the training process in an attempt to produce alignment. On this view, it doesn’t make much sense to think about alignment without also thinking about capabilities, because you don’t know what you’d be putting the “constraints” on top of, or what sorts of constraints you’re going to need.
In this scenario we are double-doomed, since you can’t make progress on alignment before reaching the point where you already teeter on the precipice. I don’t think this is the case, and I would be surprised if Yudkowsky endorses this view. On the other hand, I am already confused, so who knows.
I think both “capabilities” and “alignment” are categories too broad to be useful here, and the discussion may benefit from getting specific.
For example, Chris Olah’s interpretability work seems great for understanding NNs and how they make decisions. It could be used to get NNs to do what we specifically want and to verify their behavior. It seems like a stretch to call this capabilities.
On the other hand, finding architectures that are more sample-efficient doesn’t directly help us avoid typical alignment issues.
It may be useful to write a post on this topic that works through many specific examples, because I’m still confused about how the two relate, and I’m seeing that confusion crop up in this thread.
I’m not sure what you’re arguing exactly. Would we somehow benefit if Olah’s work was done in secret?
A lot of work contributes towards both capability and alignment, and we should only publish work whose alignment : capability ratio is high enough s.t. it improves the trajectory in expectation. But usually, if the alignment : capability ratio is not high enough, then we shouldn’t be working on it in the first place.
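A minimal sketch of that criterion, with purely made-up numbers and a hypothetical scoring scale (not a real model, just an illustration of the decision rule):

```python
# Toy publish/don't-publish rule, purely illustrative (made-up scale and numbers).
# alignment_gain and capability_gain are hypothetical scores for how much a piece
# of work advances each axis, expressed on a common arbitrary scale.

def expected_trajectory_change(alignment_gain: float, capability_gain: float) -> float:
    """Positive means publishing improves the trajectory in expectation."""
    return alignment_gain - capability_gain

def should_publish(alignment_gain: float, capability_gain: float) -> bool:
    # On this scale, equivalent to requiring an alignment : capability ratio above 1.
    return expected_trajectory_change(alignment_gain, capability_gain) > 0

print(should_publish(alignment_gain=5.0, capability_gain=1.0))  # True: ratio 5 : 1
print(should_publish(alignment_gain=0.5, capability_gain=1.0))  # False: ratio 1 : 2
```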
I don’t think Olah’s work should be secret.
I’m saying I partially agree with Vaniver that the “capabilities and alignment as independent” frame is wrong. Your alignment : capability ratio framing is a better fit, and I partially agree with it.
If the alignment : capability ratio is low enough to warrant the secrecy EY wants, then they should scrap those projects and work on ones with a much higher ratio.
I say “partially” for both because I notice I’m confused and predict that after working through examples and talking with someone/writing a post, I will have a different opinion.
and we should only publish work whose alignment : capability ratio is high enough s.t. it improves the trajectory in expectation
Broadly I’d agree, but I think there are cases where this framing doesn’t work. Specifically, it doesn’t account for others’ inaccurate assessments of what constitutes a robust alignment solution. [and of course here I’m not disagreeing with “publish work [that] improves the trajectory in expectation”, but rather with the idea that a high enough alignment : capability ratio ensures this]
Suppose I have a fully implementable alignment approach which solves a large part of the problem (but boosts capability not at all). Suppose also that I expect various well-resourced organisations to conclude (incorrectly) that my approach is sufficient as a full alignment solution. If I expect capabilities to reach a critical threshold before the rest of the alignment problem will be solved, or before I’m able to persuade all such organisations that my approach isn’t sufficient, it can make sense to hide my partial solution (or at least to be very careful about sharing it).
For example, take a full outer alignment solution that doesn’t address deceptive alignment. It’s far from clear to me that it’d make sense to publish such work openly.
While I’m generally all for the research-productivity benefits of sharing, I think there’s a very real danger that the last parts of the problem to be solved may be deep, highly non-obvious problems. Until those parts are solved, the wide availability of plausible-but-tragically-flawed alignment approaches might be a huge negative (unless you already assume that some organisation will launch an AGI even with no plausible alignment solution).
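To make the worry concrete, here is a toy expected-value sketch (all numbers made up, the terms hypothetical): once you add a term for the chance that others mistake the partial solution for a sufficient one, the expected effect of publishing can be negative even when the alignment : capability ratio is extremely high.

```python
# Toy expected-value sketch, purely illustrative (all numbers made up).
# Adds a penalty for the chance that a well-resourced org mistakes a partial
# solution for a full one and deploys with false confidence.

def expected_trajectory_change(alignment_gain: float,
                               capability_gain: float,
                               p_mistaken_sufficiency: float,
                               false_confidence_cost: float) -> float:
    """Positive means publishing improves the trajectory in expectation."""
    return (alignment_gain
            - capability_gain
            - p_mistaken_sufficiency * false_confidence_cost)

# The alignment : capability ratio is as high as it gets (capability gain ~0),
# yet publishing comes out negative once mistaken adoption is priced in.
print(expected_trajectory_change(alignment_gain=5.0,
                                 capability_gain=0.0,
                                 p_mistaken_sufficiency=0.5,
                                 false_confidence_cost=20.0))  # prints -5.0
```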
Hmm, this might apply to some hypothetical situation, but it’s pretty hard for me to believe that it applies in practice in the present day.
First, you can share your solution while also writing about its flaws.
Second, I think “some TAI will be built whether there is any solution or not” is more likely than “TAI will be built iff a solution is available, even if the solution is flawed”.
Third, I just don’t see the path to success that goes through secrecy. If you don’t publish partial solutions, then you’re staking everything on being able to produce a complete solution by yourself, without any collaboration or external critique. This seems highly dubious, unless perhaps you literally gathered the brightest minds on the planet, Manhattan Project style, but no AI safety org is anywhere near that.
I can see secrecy being valuable in the endgame, when it does become plausible to create a full solution by yourself, or even build TAI by yourself, but I don’t see it in the present day.
Again, I broadly agree—usually I expect sharing to be best. My point is mostly that there’s more to account for in general than an [alignment : capability] ratio.
Some thoughts on your specific points:
First, you can share your solution while also writing about its flaws.
Sure, but if the ‘flaw’ is of the form [doesn’t address a problem various people don’t believe really matters/exists], then it’s not clear that this helps. E.g. an outer alignment solution that doesn’t deal with deceptive alignment.
Second, I think “some TAI will be built whether there is any solution or not” is more likely than “TAI will be built iff a solution is available, even if the solution is flawed”.
Here I probably agree with you, on reflection. My initial thought was that the [plausible-flawed-solution] makes things considerably worse, but I don’t think I was accounting for the scenario where people believe a robust alignment solution will be needed at some point, but just not yet—because you see this system isn’t a dangerous one...
Does this make up the majority of your TAI-catastrophe-probability? I.e. it’s mostly “we don’t need to worry yet… Foom” rather than “we don’t ever need to worry about (e.g.) deceptive alignment… Foom”.
Third, I just don’t see the path to success that goes through secrecy...
I definitely agree, but I do think it’s important not to approach the question in black-and-white terms. For some information it may make sense to share with say 5 or 20 people (at least at first).
I might consider narrower sharing if:
I have wide error bars on the [alignment : capabilities] ratio of my work. (in particular, approaches that only boost average-case performance may fit better as “capabilities” here, even if they’re boosting performance through better alignment [essentially this is one of Critch/Krueger’s points in ARCHES])
I have high confidence my work solves [some narrow problem], high confidence it won’t help in solving the harder alignment problems, and an expectation that it may be mistaken for a sufficient solution.
Personally, I find it implausible I’d ever decide to share tangible progress with no-one, but I can easily imagine being cautious in sharing publicly.
On the other hand, doomy default expectations argue for ramping up the variance in the hope of bumping into ‘miracles’. I’d guess that increased sharing of alignment work boosts the miracle rate significantly more than it boosts capabilities.
Does this make up the majority of your TAI-catastrophe-probability? I.e. it’s mostly “we don’t need to worry yet… Foom” rather than “we don’t ever need to worry about (e.g.) deceptive alignment… Foom”.
Sort of. Basically, I just think that a lot of actors are likely to floor the gas pedal no matter what (including all leading labs that currently exist). We will be lucky if they adopt any serious precautions even when such precautions are public knowledge, let alone exhibit the security mindset to implement them properly. An actor that truly internalized the risk would not be pushing the SOTA in AI capabilities until the alignment problem is as solved as conceivably possible in advance.