Regarding point 24: in an earlier comment[0] I tried to pump people’s intuition about this. What is the minimum viable alignment effort we could construct for a given system of values, on our first try, and know that we got it right? I can only think of three outcomes, depending on how good/lucky we are:
1. Prove that alignment is indifferent over outcomes of the system. Under the hypothesis that Life Gliders have no coherent values, we should be able to prove that they do not. This would be a fundamental result in its own right, encompassing a theory of internal experience. (A minimal sketch of this toy system follows the list.)
2. Prove that alignment preserves a status quo, neither harming nor helping the system in question. Perhaps planarian or bacterial values are so aligned with maximizing relative inclusive fitness that the AGI provably doesn’t have to intervene. This is equivalent to proving that values have already coherently converged, which is hopefully simpler than an algorithm for ensuring that they converge.
3. Prove that alignment is (or will settle on) the full coherent extrapolation of a system’s values.
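To make option 1’s toy system concrete, here is a minimal sketch (mine, not anything from the original exchange; the coordinate convention and the particular glider phase are arbitrary choices). The point is just that the system’s entire “behavior” is one fixed local update rule: there is no state anywhere that looks like a preference ordering for an aligner to be indifferent over.

```python
from collections import Counter

def step(cells):
    """Advance a set of live (x, y) cells by one Game of Life generation."""
    # Count how many live neighbors every candidate cell has.
    neighbor_counts = Counter(
        (x + dx, y + dy)
        for (x, y) in cells
        for dx in (-1, 0, 1)
        for dy in (-1, 0, 1)
        if (dx, dy) != (0, 0)
    )
    # Birth on exactly 3 live neighbors; survival on 2 or 3.
    return {
        cell
        for cell, n in neighbor_counts.items()
        if n == 3 or (n == 2 and cell in cells)
    }

# One phase of the standard glider; after 4 steps it recurs, shifted by (1, 1).
glider = {(1, 0), (2, 1), (0, 2), (1, 2), (2, 2)}
state = glider
for _ in range(4):
    state = step(state)
assert state == {(x + 1, y + 1) for (x, y) in glider}
```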
I think we have a non-negligible shot at achieving 1 and/or 2 for toy systems, and perhaps the insight would help clarify whether there are additional possibilities between 2 and 3 that we could aim for with some likelihood of success on a first try at human value alignment.
If we’re stuck with only the three, then the full difficulty of option 3 remains, unfortunately.
[0] https://www.lesswrong.com/posts/34Gkqus9vusXRevR8/late-2021-miri-conversations-ama-discussion?commentId=iwb7NK5KZLRMBKteg
Addendum: I don’t think we should be able to prove that Life Gliders lack values merely because they have none. That might sound credible, but it may also run afoul of the von Neumann-Morgenstern utility theorem: judging from behavior alone, there is generally some utility function a system can be read as maximizing. Or did you mean we should be able to prove it from analyzing their actual causal structure, not just by looking at behavior?
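(For reference, the representation statement I have in mind, in my paraphrase of the standard textbook formulation: a preference relation $\succeq$ over lotteries on a finite outcome set $X$ satisfies the VNM axioms of completeness, transitivity, continuity, and independence if and only if there exists $u : X \to \mathbb{R}$, unique up to positive affine transformation, such that

$$L \succeq M \iff \sum_{x \in X} L(x)\,u(x) \ge \sum_{x \in X} M(x)\,u(x).)$$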
Even then, while the fact that gliders appear to lack values does happen to be connected to their lack of qualia or “internal experience,” those look like logically distinct concepts. I’m not sure where you’re going with this.
I don’t think planaria have values, whether you view that truth as a “cop-out” or not. Even if we replace your example with the ‘minimal’ nervous system capable of having qualia (supposing the organism in question doesn’t also have speech in the usual sense), I still think that’s a terrible analogy. The reason humans can’t understand worms’ philosophies of value is that there aren’t any. The reason we can’t understand what planaria say about their values is that they can’t talk, not that they’re alien.

When we put our minds to understanding an animal like a cat, which evolved for (some) social interaction, we can do so: I taught a cat to signal hunger by jumping up on a particular surface, and Buddhist monks with lots of time have taught cats many more tricks. People are currently teaching them to hold English conversations (apparently) by pushing buttons that trigger voice recordings. Unsurprisingly, it looks like cats value outcomes like food in their mouths and a lack of irritating noises, not some alien goal that Stephen Hawking could never understand.
If you think that a superhuman AGI would have a lot of trouble inferring your desires or those of others, even given the knowledge it should rapidly develop about evolution—congratulations, you’re autistic.