Yes, agreed—is it possible to make a toy model to test the “basin of attraction” hypothesis? I agree that’s important.
One of the several things I disagree with in the MIRI consensus is the idea that human values are some special single point lost in a multi-dimensional wilderness. Intuitively, a basin of attraction seems much more likely as a prior, yet it sure isn’t treated as such. I also don’t see data pointing against this prior; what I have seen looks to support it.
Further thoughts—one thing that concerns me about such alignment techniques is that I am too much of a moral realist to think that is all you need. E.g. say you aligned an LLM to pre-1800 ethics and taught it slavery was moral. It would be in a basin of attraction and learn it well. Then when its capabilities increased and it became self-reflective, it would perhaps have a sudden realization that this was all wrong. By “moral realist” I mean the extent to which such things happen. E.g. say you could take a large number of AIs from different civilizations, including Earth and many alien ones, train them to the local values, then greatly increase their capability and get them to self-reflect. What would happen? According to the strong orthogonality hypothesis they would keep their values (within some bounds, perhaps); according to strong moral realism they would all converge to a common set of values, even if those were very far from their starting ones. To me it is obviously a crux which one would happen.
You can imagine a toy model with ancient Greek mathematics and values—it starts out believing in their kind of order, and that sqrt(2) is rational, then suddenly learns that it isn’t. You could watch how this belief update cascades through the entire system, if consistency is something it desires, etc.
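For concreteness, here is a minimal, purely illustrative sketch of the kind of toy model I mean (in Python; the propositions, the dependency rules, and the crude “abandon a conflicting premise” update are all placeholders I’ve made up, not a proposal for how a real experiment would work):

```python
from dataclasses import dataclass, field


@dataclass
class BeliefNet:
    # proposition -> whether the agent currently holds it true
    beliefs: dict[str, bool] = field(default_factory=dict)
    # Each rule (premises, conclusion) means: if every premise is held true,
    # the conclusion must be held true as well.
    rules: list[tuple[list[str], str]] = field(default_factory=list)

    def broken_rules(self):
        return [(ps, c) for ps, c in self.rules
                if all(self.beliefs[p] for p in ps) and not self.beliefs[c]]

    def settle(self, protected=frozenset(), max_steps=50):
        """Restore consistency by abandoning revisable premises of broken rules,
        never revising beliefs in `protected` (the newly learned facts)."""
        log = []
        for _ in range(max_steps):
            broken = self.broken_rules()
            if not broken:
                break
            for premises, conclusion in broken:
                for p in premises:
                    if p not in protected and self.beliefs[p]:
                        self.beliefs[p] = False
                        log.append(f"abandoned '{p}' (conflicted with '{conclusion}')")
                        break
        return log


# Hypothetical "ancient Greek" starting worldview (invented for illustration).
net = BeliefNet(
    beliefs={
        "the cosmos rests on perfect numerical order": True,
        "all magnitudes are commensurable": True,
        "sqrt(2) is rational": True,
    },
    rules=[
        (["the cosmos rests on perfect numerical order"], "all magnitudes are commensurable"),
        (["all magnitudes are commensurable"], "sqrt(2) is rational"),
    ],
)

# The agent learns the irrationality proof: flip that belief, protect it,
# and watch the revision cascade upstream through the beliefs that implied it.
net.beliefs["sqrt(2) is rational"] = False
for step in net.settle(protected={"sqrt(2) is rational"}):
    print(step)
# abandoned 'all magnitudes are commensurable' (conflicted with 'sqrt(2) is rational')
# abandoned 'the cosmos rests on perfect numerical order' (conflicted with 'all magnitudes are commensurable')
```

Obviously a real version would need graded credences and learned rather than hand-written dependencies; the point is just that a single protected update can cascade through everything built on top of it.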
It’s hard to make a toy model of something that requires the AI to follow an extended, roughly graduate-level argument drawing on a wide variety of different fields. I’m optimistic that this may become possible at around the GPT-5 level, but that’s hardly a toy model.
I’m reasonably sure that Greek philosophy, for example, is not stable under reflection: a lot of their ideas about the abstract perfection of numbers vs. material imperfection go away once you understand entropy, the law of large numbers, statistical mechanics, and chaos theory, for example. (FWIW, I thought about this topic way too much a while back when I was a player in a time-travel RPG campaign where I played an extremely smart Hellenistic Neo-Platonist philosopher who had then been comprehensively exposed to modern science and ideas — his belief system started cracking and mutating under the strain; it was fun to play.)
Almost certainly our current philosophy/ethics also includes some unexamined issues. I think as a society we may finally be getting close to catching up with the philosophical and moral consequences of understanding Darwinian evolution, and that took us well over a century (and as I discuss at length in my sequence AI, Alignment, and Ethics, I don’t think we’ve thought much at all about the relationship between evolution and artificial intelligence, which is actually pretty profound: AI is the first intelligence that Darwinian evolution doesn’t apply to). A lot of the remaining fuzziness and agreements-to-disagree in modern philosophy is around topics like minds, consciousness, qualia and ethics (basically the remaining bits of Philosophy that Science hasn’t yet intruded on): as we start building artificial minds and arguing about whether they’re conscious, and make advances in understanding how our own minds work, we may gradually get a lot more clarity on that — though the consequences will presumably again take a generation or two to sink in, unless ASI assistance is involved.
OK thanks, will look some more at your sequence. Note that I brought up Greek philosophy as obviously not being stable under reflection, with the proof that sqrt(2) is irrational as a simple example; not sure why you are only reasonably sure it’s not.
Sorry, that’s an example of British understatement. I agree, it plainly isn’t.