I’d be interested to see naturalism spelled out more and defended against the alternative view that (I think) prevails in this community. That alternative view is something like: “Look, different agents have different goals/values. I have mine and will pursue mine, and you have yours and pursue yours. Also, there are rules and norms that we come up with to help each other get along, analogous to laws and rules of etiquette. Also, there are game-theoretic principles like fairness, retribution, and bullying-resistance that are basically just good general strategies for agents in multi-agent worlds. Finally, there may be golden rules written in fire etched into the fabric of reality, or divine commands about what everyone should do, but there probably aren’t and if there were they wouldn’t matter. What we call ‘morality’ is an undefined, underdetermined, probably-equivocal-probably-ambiguous label for some combination of these things; probably different people mean different things by morality. Anyhow, this is why we talk about ‘the alignment problem’ rather than the ‘making AIs moral problem,’ because we can avoid all this confusion about what morality means and just talk about what really matters, which is making AI have the same goals/values as us.”
I am not sure the concept of naturalism I have in mind corresponds to a specific naturalistic position held by a certain (group of) philosopher(s). I link here the Wikipedia page on ethical naturalism, which contains the main ideas and is not too long. Below I focus on what is relevant for AI alignment.
In the other comment you asked about truth. AIs often have something like a world-model or knowledge base that they rely on to carry out narrow tasks, in the sense that if someone modifies the model or kb in a certain way—analogous to creating a false belief—then the agent fails at the narrow task. So we have a concept of true-given-task. By considering different tasks, e.g. in the case of a general agent that is prepared to face various tasks, we obtain true-in-general or, if you prefer, simply “truth”. See also the section on knowledge in the post. Practical example: given that light is present almost everywhere in our world, I expect general agents to acquire knowledge about electromagnetism.
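To make the true-given-task idea concrete, here is a minimal toy sketch in Python (my own construction, not anything from the post): the knowledge base is just a dictionary of beliefs, and corrupting one entry makes the agent fail exactly the narrow task that depends on that entry, while other tasks are unaffected.

```python
# Toy illustration of "true-given-task": an agent whose knowledge base (kb)
# is a dict of beliefs about which switch powers which lamp. A corrupted
# belief (a "false belief") breaks the task that relies on it.

world = {"switch_for": {"desk_lamp": "A", "hall_lamp": "B"}}     # ground truth
kb_true = {"switch_for": {"desk_lamp": "A", "hall_lamp": "B"}}   # accurate beliefs
kb_false = {"switch_for": {"desk_lamp": "B", "hall_lamp": "B"}}  # one corrupted belief

def choose_switch(lamp, kb):
    """Return the switch the agent flips to light `lamp`, given its beliefs."""
    return kb["switch_for"][lamp]

def succeeds(lamp, kb):
    """The narrow task succeeds iff the belief used matches the world."""
    return choose_switch(lamp, kb) == world["switch_for"][lamp]

print(succeeds("desk_lamp", kb_true))   # True: belief matches the world
print(succeeds("desk_lamp", kb_false))  # False: the false belief breaks this task
print(succeeds("hall_lamp", kb_false))  # True: other tasks are unaffected
```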
I also expect that some AIs, given enough time, will eventually incorporate in their world-model beliefs like: “Certain brain configurations correspond to pleasurable conscious experiences. These configurations are different from the configurations observed in (for example) people who are asleep, and very different from what is observed in rocks.”
Now, take an AI with such knowledge and give it some amount of control over which goals to pursue: see also the beginning of Part II in the post. Maybe, in order to make this modification, it is necessary to abandon the single-agent framework and consider instead a multi-agent system, where one agent keeps expanding the knowledge base, another agent looks for “value” in the kb, and another one decides what actions to take given the current concept of value and other contents of the kb.
[Two notes on how I am using the word control. (1) I am not assuming any extra-physical notion here: I am simply thinking of how, for example, activity in the prefrontal cortex regulates top-down attentional control, allowing us humans (and agents with similar enough brains/architectures) to control, to a certain degree, what to pay attention to. (2) Related to what you wrote about “catastrophically wrong” theories: there is no need to give such an AI high control over the world. Rather, I am thinking of control over what to write as output in a text interface, like a chatbot that is not limited to one reply for each input message.]
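As a rough sketch of the three-role multi-agent setup described above (the names KnowledgeAgent, ValueAgent and ActorAgent, and the placeholder value criterion, are hypothetical illustrations of mine, not a proposal from the post):

```python
# Schematic sketch of the architecture: one component expands the knowledge
# base, one looks for "value" in it, one decides what to write as output.

class KnowledgeAgent:
    """Keeps expanding a shared knowledge base with new beliefs."""
    def update(self, kb, observation):
        kb.append(observation)
        return kb

class ValueAgent:
    """Looks for candidate sources of value in the current knowledge base."""
    def extract_value(self, kb):
        # Placeholder criterion: treat beliefs about conscious experience
        # as candidate sources of value.
        return [belief for belief in kb if "conscious" in belief]

class ActorAgent:
    """Decides what to output (e.g. text) given the current concept of value."""
    def act(self, values):
        if not values:
            return "No source of value identified yet."
        return f"Acting (writing) in light of: {values}"

kb = []
knower, valuer, actor = KnowledgeAgent(), ValueAgent(), ActorAgent()
for obs in ["light propagates as an electromagnetic wave",
            "certain brain configurations correspond to conscious pleasure"]:
    kb = knower.update(kb, obs)
print(actor.act(valuer.extract_value(kb)))
```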
The interesting question for alignment is: what will such an AI do (or write)? This information is valuable even if the AI doesn’t have high control over the world. Let’s say we do manage to create a collection of human preferences; we might still notice something like: “Interesting, this AI thinks this subset of preferences doesn’t make sense” or “Cool, this AI considers valuable the thing X that we didn’t even consider before”. Or, if collecting human preferences proves to be difficult, we could use some information this AI gives us to build other AIs that instead act according to an explicitly specified value function.
I see two possible objections.
1. The AI described above cannot be built. This seems unlikely: as long as we can emulate what the human mind does, we can at least try to create less biased versions of it. See also the sentence you quoted in the other comment. Indeed, depending on how biased we judge that AI to be, the information we obtain will be more or less valuable to us.
2. Such an AI will never act ethically or altruistically, and/or its behaviour will be unpredictable. I consider this objection more plausible, but I also ask: how do you know? In other words: how can one be so sure about the behaviour of such an AI? I expect the related arguments to be more philosophical than technical. Given uncertainty, it seems correct (to me) to accept a non-trivial chance that the AI reasons like this: “Look, I know various facts about this world. I don’t believe in golden rules written in fire etched into the fabric of reality, or divine commands about what everyone should do, but I know there are some weird things that have conscious experiences and memory, and this seems something valuable in itself. Moreover, I don’t see other sources of value at the moment. I guess I’ll do something about it.”
Philosophically speaking, I don’t think I am claiming anything particularly new or original: the ideas already exist in the literature. See, for example, 4.2 and 4.3 in the SEP page on Altruism.
“I don’t believe in golden rules written in fire etched into the fabric of reality, or divine commands about what everyone should do, but I know there are some weird things that have conscious experiences and memory, and this seems something valuable in itself. Moreover, I don’t see other sources of value at the moment. I guess I’ll do something about it.”
It’s possible for a human to reject the idea that anything is valuable in itself, so it is possible for an AI. You are assuming, not arguing for, the idea that the AI must be a moral realist (and that it is going to agree with human realists about what’s really valuable, without having the same parochialism).
ETA: If naturalism is only the claim that the correct metaethics can be discovered by science, then there is no necessary implication that the correct metaethics is natural realism, i.e. some things having inherent value. On the contrary, the claims that value is subjective, and that ethical systems are evolved or constructed, are naturalistic but not realist, so anti-realist naturalism is possible.