Thanks for writing this, it’s great to see people’s reasons for optimism/pessimism!
My views on alignment are similar to (my understanding of) Nate Soares’.
I’m surprised by this sentence in conjunction with the rest of the post: the views here seem very different from my Nate model. This is based only on what I’ve read on LessWrong, so it feels a bit weird to write about what I think Nate thinks, but it still seems important to mention. If someone more qualified wants to jump in, all the better. Non-comprehensive list:
I think the key differences are because I don’t think there’s enough evidence to confidently predict the difficulty of future problems, and I do think it’s possible for careful labs to avoid active commission of catastrophe.
Not as important as the other points, but I’m not even sure how much you disagree here. E.g. Nate on difficulty, from the sharp left turn post:
Although these look hard, I’m a believer in the capacity of humanity to solve technical problems at this level of difficulty when we put our minds to it. My concern is that I currently don’t think the field is trying to solve this problem.
And on the point of labs, I would have guessed Nate agrees with the literal statement but thinks current labs aren’t careful enough, and won’t be?
Language model interventions work pretty well
My Nate model doesn’t view this as especially informative about how AGI will go. In particular:
HHH omits several vital pieces of the full alignment problem, but if it leads to AI that always shuts down on command and never causes a catastrophe I’ll be pretty happy.
If I understand you correctly, the “vital pieces” that are missing are not ones that make it shut down and never cause catastrophe? (I’m not entirely sure what they are instead.) My Nate model agrees that vital pieces are missing, and that never causing a catastrophe would be great, but crucially thinks that the missing pieces are needed to never cause a catastrophe.
Few attempts to align ML systems
In my Nate model, empirical work with pre-AGI/pre-sharp-left-turn systems can only get you so far. Even if we can do more empirical alignment work now, it still won’t help with what are probably the deadliest problems. And once we can empirically work on those, there’s very little time left.
Interpretability is promising!
Nate has said he’s in favor of interpretability research, and I have no idea if he’s been positively surprised by the rate of progress. But I would still guess you are way more optimistic in absolute terms about how helpful interpretability is going to be (see his comments here).
If a major lab saw something which really scared them, I think others would in fact agree to a moratorium on further capabilities until it could be thoroughly investigated.
Nate wrote a post which I understand to argue against more or less this claim.
I don’t expect a ‘sharp left turn’
Nate has of course written about how he does expect one. My impression is that this isn’t just some minor difference in what you think AGI will look like, but points at some pretty deep and important disagreements (that are upstream of some other ones).
Maybe you’re aware of all those disagreements and would still call your views “similar”, or maybe you have a better Nate model, in which case great! But otherwise, it seems pretty important to at least be aware there are big disagreements, even if that doesn’t end up changing your position much.
I’m basing my impression here on having read much of Nate’s public writing on AI, and a conversation over a shared lunch at a conference a few months ago. His central estimate for P(doom) is certainly substantially higher than mine, but as I remember it we have pretty similar views of the underlying dynamics to date, diverge somewhat about the likelihood of catastrophe with very capable systems, and both hope that future evidence favors the less-doom view.
Unfortunately I agree that “shut down” and “no catastrophe” are still missing pieces. I’m more optimistic than my model of Nate that the HHH research agenda constitutes any progress towards this goal, though.
I think labs correctly assess that they’re not working with, or at non-trivial immediate risk of creating, x-risky models, and that they’re not yet cautious enough to do so safely. If labs invested in this, I think they could probably avoid accidentally creating an x-risky system without abandoning ML research before seeing warning signs.
I agree that pre-AGI empirical alignment work only gets you so far, and that you probably get very little time for direct empirical work on the deadliest problems (two years if very fortunate, days to seconds if you’re really not). But I’d guess my estimate of “only so far” is substantially further than Nate’s, largely due to different credences in a sharp left turn.
I was struck by how similarly we assessed the current situation and the evidence available so far, but that is a big difference, and maybe I shouldn’t describe our views as similar.
I generally agree with Nate’s warning shots post, and with some comments (e.g.), but the “others” I thought would likely agree to a moratorium were other labs, not governments (cf.), which could buy, say, 6–24 vital months.