"we've done so little work on alignment that I think it might actually be more like additive, from 1% to 26% or 50% to 75% with ten extra years, relative to the real current odds if we press ahead, which nobody knows."
😂🤣 I really want "We've done so little work the probabilities are additive" to be a meme. I feel like I do get where you're coming from.
I agree about the pause concern. I also really feel that any delay to friendly SI represents an enormous amount of suffering that could be prevented if we got to friendly SI sooner. It should not be taken lightly. And being realistic about how difficult it is to align humans seems worthwhile. When I talk to math people about the work I think we need to do to solve this, though, "impossible" or "hundreds of years of work" seems to be the vibe. I think math is a cool field because, more than in other fields, work from hundreds of years ago is still very relevant. Problems are hard and progress is slow in a way that I don't know if people involved in other things really "get". I feel like in math crowds I'm saying "no, don't give up, maybe with a hundred years we can do it!" And in other crowds I'm like "c'mon guys, could we have at least 10 years, maybe?" Anyway, I'm rambling a bit, but the point is that my vibe is very much "if the Russians defect, everyone dies"; "if the North Koreans defect, everyone dies"; "if Americans can't bring themselves to trust other countries and don't even try themselves, everyone dies". So I'm currently feeling very "everyone slightly sane should commit, and signal commitment, as hard as they can", because I know it will be hard to get humanity on the same page about something. Basically impossible; it's never been done before. But so is ASI alignment.
I haven't read those links. I'll check them out, thanks : ) I've read a few things by Drexler about, like, automated plan generation where humans then audit and enact the plan. It makes me feel better about the situation. I think we could go farther, safer, with careful techniques like that, but that both empowers us and brings us closer to danger. I don't think it scales to SI, and unless we are really serious about using it to map RSI boundaries, it doesn't even prevent misaligned decision systems from going RSI and killing us.
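To make the shape of that Drexler-style workflow concrete, here's a minimal sketch of the plan-generate / human-audit / enact loop as I understand it. Everything in it (the planner, the goal string, the reviewer) is a hypothetical placeholder, not anything Drexler actually specified:

```python
# A minimal sketch of the Drexler-style loop: the planner only proposes,
# a human audits every step, and humans (not the planner) enact the plan.
# `generate_plan`, the goal string, and the reviewer are all hypothetical.

def generate_plan(goal: str) -> list[str]:
    """Stand-in for an automated planner: proposes steps, executes nothing."""
    return [f"draft step 1 toward {goal!r}", f"draft step 2 toward {goal!r}"]

def audit(plan: list[str], approve) -> bool:
    """Every step passes a human reviewer before anything happens."""
    return all(approve(step) for step in plan)

def run(goal: str, approve) -> None:
    plan = generate_plan(goal)
    if audit(plan, approve):
        print("Approved: humans enact the plan; the planner stays a tool.")
    else:
        print("Rejected: the planner proposed, but nothing was enacted.")

# A human reviewer is just a function from step to yes/no in this sketch.
run("map RSI boundaries", approve=lambda step: "draft" in step)
```

The safety property here is structural: nothing in the loop gives the planner an actuator, which is exactly the part I worry doesn't survive scaling to SI.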
Yes, the math crowd is saying something like "give us a hundred years and we can do it!" And nobody is going to give them that in the world we live in.
Fortunately, math isn't the best tool to solve alignment. Foundation models are already trained to follow instructions given in natural language. If we make sure this is the dominant factor in foundation model agents, and use it carefully (don't say dumb things like "go solve cancer, don't bug me with the hows and whys, just git 'er done as you see fit", etc.), this could work.
We can probably achieve technical intent alignment if we're even modestly careful and pay a modest alignment tax. You've now read my other posts making those arguments.
Unfortunately, it's not even clear the relevant actors are willing to be reasonably cautious or pay a modest alignment tax.
The other threads are addressed in responses to your comments on my linked posts.
Yes, you've written more extensively on this than I realized. Thanks for pointing out other relevant posts, and sorry for not having taken the time to find them myself; I'm trying to err more on the side of communication than I have in the past.
I think math is the best tool to solve alignment. It might be emotional: I've been manipulated and hurt by natural language, and by the people who prefer it to math, and I have always found engaging with math to be soothing, or at least sobering. It could also be that I truly believe the engineering rigor that comes with understanding something well enough to do math to it is extremely worthwhile for building a thing of the importance we are discussing.
Part of me wants to die on this hill and tell everyone who will listen: "I know it's impossible, but we need to find ways to make it possible to give the math people the hundred years they need, because if we don't then everyone dies, so there's no point in aiming for anything less. It's unfortunate, because it means it's likely we are doomed, but that's the truth as I see it." I just wonder how much of that part of me is my oppositional defiant disorder and how much is my strategizing for the best outcome.
I'll be reading your other posts. Thanks for engaging with me : )
I certainly don't expect people to read a bunch of stuff before engaging! I'm really pleased that you've read so much of my stuff. I'll get back to these conversations soon, hopefully; I've had to focus on new posts.
I think your feelings about math are shared by a lot of the alignment community. I like the way you've expressed those intuitions.
I think math might be the best tool to solve alignment if we had unlimited time, but it looks like we very much do not.
Let me throw in a third viewpoint alongside math and psychology/neuroscience: physics. Or more specifically, calculus and nonlinear systems. Let me give you an example: Value Learning. Human values are complex, and even though LLMs are good at understanding human complexity, alignment is hard and we're unlikely to get it perfect on the first shot. But AGI, by definition, isn't dumb, so it will understand that. If it is sufficiently close to aligned, it will want to do what we want, so it will regard not being perfectly aligned as a flaw in itself, and will want to get better, or to create a better-aligned successor. If it's capable enough, it can improve its alignment, or help us do so. Now you have an iterative system that wants to converge, and you can apply the approach of calculus and nonlinear systems (albeit in a very high-dimensional space whose important latent structure is a collection of abstractions) to figuring out whether it will converge, to what, how large the region of convergence is, and so forth.

With this approach, we don't need to get alignment perfect on the first try; we just need to get it good enough that we're confident we're inside the region of convergence of Value Learning. And here, the extremely high x-risk stakes of alignment actually help: to a first approximation, all we need for convergence is an understanding of the importance of not-kill-everyoneism, plus sufficient capabilities that when the AI tries to make progress in Value Learning, it makes progress in the forward direction. Even GPT-3.5 had enough moral sense to know that killing everyone is a bad thing, and pretty much by definition, if an AI doesn't have sufficient capabilities for this, it's unlikely to be a Transformative Artificial General Intelligence.
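Here's a minimal numerical sketch of that convergence picture, purely as an illustration: the value space, dynamics, and constants are all invented, and the real question is what the analogous basin of attraction looks like in the actual high-dimensional space of abstractions.

```python
import numpy as np

# Toy model: values live in R^8 and "true" human values are a fixed point.
# Each round of value learning contracts the agent's values toward the
# target, plus some noise, but only if the agent is already close enough
# to aligned that it wants to correct itself at all. All numbers invented.

rng = np.random.default_rng(0)
TRUE_VALUES = np.zeros(8)     # stand-in for "what humans actually want"
BASIN_RADIUS = 1.0            # outside this, the agent never self-corrects
LEARNING_RATE = 0.3           # how much each round of value learning helps
NOISE = 0.01                  # imperfection in each round

def value_learning_step(values: np.ndarray) -> np.ndarray:
    """One round of toy value learning: contract toward the target."""
    if np.linalg.norm(values - TRUE_VALUES) > BASIN_RADIUS:
        return values         # too misaligned to want improvement
    correction = LEARNING_RATE * (TRUE_VALUES - values)
    return values + correction + NOISE * rng.standard_normal(values.shape)

def converges(initial: np.ndarray, steps: int = 200, tol: float = 0.1) -> bool:
    v = initial
    for _ in range(steps):
        v = value_learning_step(v)
    return np.linalg.norm(v - TRUE_VALUES) < tol

print(converges(rng.standard_normal(8) * 0.1))  # inside the basin: True
print(converges(rng.standard_normal(8) * 5.0))  # outside the basin: False
```

The sketch only shows what kind of question this is; whether the real region of convergence is roomy or vanishingly small is exactly the open problem.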
So I actually see this as my biggest crux with many people in the MIRI school: not whether AI will be an (approximate) utility maximizer, but whether it will be able to understand that it is flawed, that its utility function is flawed, and that it can and should be improved. Value Learning is not a new idea: I understand it was first suggested in 2011 by Daniel Dewey of MIRI in "Learning What to Value", before Nick Bostrom popularized it. So it's well over a decade old, and from MIRI itself, and I'm rather puzzled that a lot of the MIRI school still don't seem to have updated their thinking in light of it. Yes, mathematically, studying simple systems subject to precise axioms is easy and elegant, but such systems require unlimited computational resources to actually create. Any real physical instantiation of Bayesianism is going to be a resource-constrained approximation, and if it's even slightly smart, it's going to know and understand that it's a resource-constrained approximation and be able to reason accordingly. That includes reasoning about the possibility that its utility estimates are wrong, and could and should be improved.
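As a toy illustration of that last point (not anyone's actual proposal; every name and threshold here is invented): an agent that tracks its own uncertainty about its utility estimates can defer to oversight whenever that uncertainty swamps the comparison it is trying to make, rather than maximizing a point estimate it knows is flawed.

```python
from dataclasses import dataclass

@dataclass
class UtilityEstimate:
    mean: float    # the agent's best guess of an action's utility
    stderr: float  # the agent's own uncertainty about that guess

def choose(actions: dict[str, UtilityEstimate], caution: float = 2.0) -> str:
    """Pick the best action, but defer to humans when the agent's own
    uncertainty dominates the gap between the top two candidates."""
    best = max(actions, key=lambda a: actions[a].mean)
    runner_up = max((a for a in actions if a != best),
                    key=lambda a: actions[a].mean)
    gap = actions[best].mean - actions[runner_up].mean
    uncertainty = actions[best].stderr + actions[runner_up].stderr
    if gap < caution * uncertainty:
        return "ask_humans"  # its estimates are too flawed to act alone
    return best

print(choose({"plan_a": UtilityEstimate(10.0, 0.5),
              "plan_b": UtilityEstimate(3.0, 0.5)}))  # plan_a
print(choose({"plan_a": UtilityEstimate(10.0, 4.0),
              "plan_b": UtilityEstimate(9.0, 4.0)}))  # ask_humans
```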
That then leaves the question "improved by what criterion, or on what basis?", which is where I think biology comes into this. Or specifically, evolutionary theory, evolutionary psychology, and evolutionary ethics. Humans are living, evolved organisms: their values, psychology, and ethics are molded (imperfectly, as Yudkowsky has explored in detail) by evolution. Not-kill-everyoneism is trivially derivable from evolutionary theory: driving a species extinct is disastrous for all members of that species. AI is not alive, nor evolved; its status in evolutionary theory is comparable to that of a spider's web or a beaver's dam. So clearly, in evolutionary terms, its intended purpose is to help its living creators: the utility it should be maximizing is our human utility. That still leaves a lot of details to be defined, along the lines of Coherent Extrapolated Volition, as well as questions about exactly which set of humans the AI is maximizing utility on behalf of, and weighted how. But the theoretical basis of a criterion for improvement is clear here, and the thorny "philosophical" questions of ethics and morality (things like ought-from-is, and moral realism versus relativism) have a clear, biological answer in evolutionary ethics.
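The "which humans, weighted how" part at least has a clean formal skeleton, even though filling it in is the entire hard part. A toy sketch, with every name and number a placeholder:

```python
# Toy social-welfare skeleton: a weighted sum of individual utilities.
# The open problem lives entirely in choosing the people, the weights,
# and each person's utility function, not in the sum itself.

def social_welfare(outcome: str,
                   utilities: dict[str, dict[str, float]],
                   weights: dict[str, float]) -> float:
    """Weighted sum of individual utilities for one outcome."""
    return sum(weights[person] * utilities[person][outcome]
               for person in weights)

utilities = {"alice": {"status_quo": 0.2, "cure_disease": 0.9},
             "bob":   {"status_quo": 0.5, "cure_disease": 0.8}}
weights = {"alice": 0.5, "bob": 0.5}

print(social_welfare("cure_disease", utilities, weights))  # 0.85
```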