Yeah, getting specific unpause requirements seems high value for convincing people who would not otherwise want a pause, but I can’t imagine specifying them in terms of time in any reasonable way; instead they would need to look like a technical specification, a “once we have developed x, y, and z, then it is safe to unpause” kind of thing. We’d just need to figure out what the x, y, and z requirements are. Then we could estimate how long it will take to develop x, y, and z, and that estimate would get more refined and accurate as progress is made. But since the requirements are likely to involve unknown unknowns in theory building, any estimate would probably be more of a wild guess, and it seems better to be honest about that rather than saying “yeah, sure, ten years” and then, after ten years when the progress hasn’t been made, saying “whoops, looks like it’s going to take a little longer!”
As for odds of survival, my personal estimates feel more like a 1% chance of some kind of “alignment by default / human in the loop with prosaic scaling” scheme working, as opposed to maybe more like 50% if we took the time to try to get an “aligned before you turn it on” scheme set up, so that would be improving our odds by about 5000%. Though I think you were thinking of adding rather than scaling odds with your 25%, which would make my number 49 percentage points; I don’t think that’s a good habit for thinking about probability. Also, I feel hopelessly uncalibrated for this kind of question… I doubt I would trust anyone’s estimates, and that’s part of what makes the situation so spooky.
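To make the arithmetic behind those figures explicit (just a quick sketch using my own rough 1% and 50% guesses):

$$\text{scaling: } \frac{0.50}{0.01} = 50\times \;(\approx 5000\%), \qquad \text{adding: } 0.50 - 0.01 = 0.49 \;(\text{49 percentage points})$$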
How do you think public acceptance would compare between a “pause until we meet target x, and you’re allowed to help us reach target x as much as you want” framing and a “pause for some set period of time” framing?
Agreed that scaling rather than addition is usually the better way to think about probabilities. In this case we’ve done so little work on alignment that I think it might actually be more like additive, from 1% to 26% or 50% to 75% with ten extra years relative to the real current odds if we press ahead—which nobody knows.
I’m pretty sure it would be an error to trust anyone’s estimate at this time, because people with roughly equal expertise and wisdom (e.g., Yudkowsky and Christiano) give such wildly different odds. And the discussions between those viewpoints always trail off into differing intuitions.
I also give very poor odds to alignment by default, and to prosaic alignment as it’s usually discussed. But there are some pretty obvious techniques that are so low-tax that I think they’ll be implemented even by orgs that don’t take safety very seriously.
I’m curious if you’ve read my “Instruction-following AGI is easier and more likely than value aligned AGI” and/or “Internal independent review for language model agent alignment” posts. Instruction-following is human-in-the-loop, so that may already be what you’re referring to. But some of the techniques in the independent review post (which is also a review of multiple methods) go beyond prosaic alignment to apply specifically to foundation model agents. And wisely used instruction-following gives corrigibility with a flexible level of oversight.
I’m curious what you think about those techniques if you’ve got time to look.
I think public acceptance of a pause is only part of the issue. The Chinese might actually not pursue AGI if they didn’t have to race the US. But Russia and North Korea will most certainly pursue it. They have very limited resources and technical chops for making much progress on new foundation models, but they still might get to real AGI by turning next-gen foundation models (which there’s no time to pause) into scaffolded cognitive architectures.
But yes, I do think there’s a chance we could get the US and European public to support a pause using some of the framings you suggest. But we’d better be sure that’s a good idea. Lots of people, notably Russians and North Koreans, are genuinely way less cautious even than Americans—and absolutely will not honor agreements to pause.
Those are some specifics; in general I think it’s only useful to talk about what “we” “should” do in the context of what particular actors actually are likely to do in different scenarios. Humanity is far from aligned, and that’s a problem.
“we’ve done so little work on alignment that I think it might actually be more like additive, from 1% to 26% or 50% to 75% with ten extra years relative to the real current odds if we press ahead—which nobody knows.”
😭🤣 I really want “We’ve done so little work the probabilities are additive” to be a meme. I feel like I do get where you’re coming from.
I agree about the pause concern. I also really feel that any delay to friendly SI represents an enormous amount of suffering that could be prevented if we got to friendly SI sooner; it should not be taken lightly. And being realistic about how difficult it is to align humans seems worthwhile.

When I talk to math people about the work I think we need to do to solve this, though, “impossible” or “hundreds of years of work” seems to be the vibe. I think math is a cool field because, more than in other fields, work from hundreds of years ago is still very relevant. Problems are hard and progress is slow in a way that I don’t know if people involved in other things really “get”. I feel like in math crowds I’m saying “no, don’t give up, maybe with a hundred years we can do it!”, and in other crowds I’m like “c’mon guys, could we have at least 10 years, maybe?”

Anyway, I’m rambling a bit, but the point is that my vibe is very much “if the Russians defect, everyone dies”, “if the North Koreans defect, everyone dies”, “if Americans can’t bring themselves to trust other countries and don’t even try themselves, everyone dies”. So I’m currently feeling very “everyone slightly sane should commit, and signal commitment, as hard as they can”, because I know it will be hard to get humanity on the same page about something. Basically impossible; never been done before. But so is ASI alignment.
I haven’t read those links. I’ll check ’em out, thanks : ) I’ve read a few things by Drexler about, like, automated plan generation where humans then audit and enact the plan. It makes me feel better about the situation. I think we could go farther, more safely, with careful techniques like that, but that is both empowering us and bringing us closer to danger. I don’t think it scales to SI, and unless we are really serious about using it to map RSI boundaries, it doesn’t even prevent misaligned decision systems from going RSI and killing us.
Yes, the math crowd is saying something like “give us a hundred years and we can do it!”. And nobody is going to give them that in the world we live in.
Fortunately, math isn’t the best tool to solve alignment. Foundation models are already trained to follow instructions given in natural language. If we make sure this is the dominant factor in foundation model agents, and use it carefully (don’t say dumb things like “go solve cancer, don’t bug me with the hows and whys, just git ’er done as you see fit”, etc.), this could work.
We can probably achieve technical intent alignment if we’re even modestly careful and pay a modest alignment tax. You’ve now read my other posts making those arguments.
Unfortunately, it’s not even clear the relevant actors are willing to be reasonably cautious or pay a modest alignment tax.
The other threads are addressed in responses to your comments on my linked posts.
Yes, you’ve written more extensively on this than I realized. Thanks for pointing out the other relevant posts, and sorry for not having taken the time to find them myself; I’m trying to err more on the side of communication than I have in the past.
I think math is the best tool to solve alignment. That might be emotional: I’ve been manipulated and hurt by natural language and the people who prefer it to math, and I’ve always found engaging with math to be soothing, or at least sobering. It could also be that I truly believe the engineering rigor that comes with understanding something well enough to do math to it is extremely worthwhile for building a thing of the importance we are discussing.
Part of me wants to die on this hill and tell everyone who will listen, “I know it’s impossible, but we need to find ways to make it possible to give the math people the hundred years they need, because if we don’t then everyone dies, so there’s no point in aiming for anything less. It’s unfortunate, because it means it’s likely we are doomed, but that’s the truth as I see it.” I just wonder how much of that part of me is my oppositional defiant disorder and how much is my strategizing for the best outcome.
I’ll be reading your other posts. Thanks for engaging with me : )
I certainly don’t expect people to read a bunch of stuff before engaging! I’m really pleased that you’ve read so much of my stuff. I’ll get back to these conversations soon, hopefully; I’ve had to focus on new posts.
I think your feelings about math are shared by a lot of the alignment community. I like the way you’ve expressed those intuitions.
I think math might be the best tool to solve alignment if we had unlimited time—but it looks like we very much do not.