I certainly don’t expect people to read a bunch of stuff before engaging! I’m really pleased that you’ve read so much of my stuff. I’ll hopefully get back to these conversations soon; I’ve had to focus on new posts.
I think your feelings about math are shared by a lot of the alignment community. I like the way you’ve expressed those intuitions.
I think math might be the best tool to solve alignment if we had unlimited time—but it looks like we very much do not.
Let me throw in a third viewpoint alongside math and psychology/neuroscience: physics. Or more specifically, calculus and non-linear systems. Let me give you an example: Value Learning. Human values are complex, and even though LLMs are good at understanding human complexity, alignment is hard and we’re unlikely to get it perfect on the first shot. But AGI, by definition, isn’t dumb, so it will understand that. If it is sufficiently close to aligned, it will want to do what we want, so it will regard not being perfectly aligned as a flaw in itself, and want to get better, or create a better successor. If it’s capable enough, it can improve alignment, or help us do so.

Now you have an iterative system that wants to converge, and you can apply the approach of calculus and nonlinear systems (albeit in a very high-dimensional space whose important latent space is a collection of abstractions) to figuring out whether it will converge, to what, how large the region of convergence is, and so forth. With this approach, we don’t need to get alignment perfect on the first try; we just need to get it good enough that we’re confident we’re inside the region of convergence of value learning. And here, the extremely high x-risk stakes of alignment actually help: to a first approximation, all we need for convergence is understanding the importance of not-kill-everyoneism, plus sufficient capabilities that if the AI tries to make progress in Value Learning, it makes progress in a forward direction. Even GPT-3.5 had enough moral sense to know that killing everyone is a bad thing, and pretty much by definition, if an AI doesn’t have sufficient capabilities for this, it’s unlikely to be a Transformative Artificial General Intelligence.
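To make the “region of convergence” framing a bit more concrete, here’s the standard contraction-mapping picture it borrows from (purely illustrative: the real space of values is high-dimensional and we can’t actually write down the update map). Model one round of value learning as

$$v_{n+1} = F(v_n),$$

where $v_n$ is the AI’s current values/utility estimate and $F$ is one round of self-correction (or of building a better-aligned successor). Suppose intended alignment $v^*$ is a fixed point, $F(v^*) = v^*$, and that on some neighborhood $B$ of $v^*$ the update is a contraction:

$$\|F(x) - F(y)\| \le L\,\|x - y\| \quad \text{for all } x, y \in B, \text{ with } L < 1.$$

Then for any starting point $v_0 \in B$ the iterates converge:

$$\|v_n - v^*\| \le L^n\,\|v_0 - v^*\| \to 0.$$

On this framing, the engineering problem isn’t hitting $v^*$ exactly on the first try; it’s landing $v_0$ inside $B$ and keeping each round of improvement contraction-like, i.e. making sure that when the AI tries to make progress in Value Learning, it actually moves closer to $v^*$.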
So I actually see this as my biggest crux with many people in the MIRI school — not that AI will be an (approximate) utility maximizer, but the assumption that it won’t be able to understand that it is flawed, that its utility function is flawed, and that it can and should be improved. Value Learning is not a new idea: I understand it was first suggested in 2011 by Daniel Dewey of MIRI in ‘Learning What to Value’, before Nick Bostrom popularized it. So it’s well over a decade old, from MIRI, and I’m rather puzzled that a lot of the MIRI school still don’t seem to have updated their thinking in light of it. Yes, mathematically, studying simple systems subject to precise axioms is easy and elegant: but such systems require unlimited computational resources to actually create. Any real physical instantiation of Bayesianism is going to be a resource-constrained approximation, and if it’s even slightly smart, it’s going to know and understand that it’s a resource-constrained approximation and be able to reason accordingly. That includes reasoning about the possibility that its utility estimates are wrong, and could and should be improved.
That then leaves the question of “improved by what criteria or basis?” — which is where I think biology comes into this. Or specifically, evolutionary theory, evolutionary psychology, and evolutionary ethics. Humans are living, evolved organisms: their values, psychology, and ethics are molded (imperfectly, as Yudkowsky has explored in detail) by evolution. Not-kill-everyoneism is trivially derivable from evolutionary theory — driving a species extinct is disastrous for all members of that species. AI is not alive, nor evolved: its status in evolutionary theory is comparable to that of a spider’s web or a beaver’s dam. So clearly, in evolutionary terms, its intended purpose is to help its living creators: the utility it should be maximizing is our human utility. That still leaves a lot of details to be defined, along the lines of Coherent Extrapolated Volition, as well as questions about exactly which set of humans the AI is maximizing utility on behalf of, and weighted how. But the theoretical basis of a criterion for improvement here is clear, and the thorny “philosophical” questions of ethics (things like ought-from-is, and moral realism versus relativism) have a clear, biological answer in evolutionary ethics.