Moral reasoning depends on believing that some things are good and others are bad. Some of these things seem very believable—I am quite firmly convinced that intense pain is bad. Though they seem very compelling, it’s hard to point to strong evidence that proves the judgement. On the other hand it’s easy to point to strong motivations for believing such things—I don’t want to be tortured—and we know that motivation can cause people to believe things. The same mechanism that causes us to believe things we want to be true could also be responsible for our basic moral judgements of goodness and badness.
We believe many things because we are somewhat rational; we consider hypotheses, compare them with observed evidence, and emphasise those that are more compatible. Goodness and badness defy this practice; normal reasoning does not produce hypotheses of the form “if X is morally good then I will observe Y”. One could object that such hypotheses are reasonable where Y is a belief or an attitude. If torture is bad, then if I try it I will observe that a) I believe it is bad and b) I don’t like it. But this is either circular (I believe torture is bad because I believe it is bad) or it says that my attitude towards torture is the cause of my belief in its badness, which is roughly what I am arguing.
It’s true that if we accept some fundamental assumptions about what makes things good or bad we can then go about considering hypotheses and evidence as usual; if happiness is good then an intervention X is good if I see data Y indicating that it increases happiness. The subject of this post is why we believe the fundamental assumptions (“happiness is good”), not the theories built on top of them.
We also know people often engage in motivated reasoning. If X being true is consistent with my desires, then I might find myself believing that X is true and coming up with post-hoc justifications for this belief.
Combining these points: there are at least two mechanisms by which come to beliefs. One is, crudely, when evidence causes the belief, and the other is when desire causes the belief. I have proposed that belief in fundamental assumptions of goodness and badness are not of the “caused by evidence” type. I suggest, therefore, that they are of the “caused by desire” type.
The badness of pain and suffering is a particularly plausible case. We universally (as far as I can tell) want to avoid sufficiently intense pain, and the belief that pain is bad is consistent with our shared desire to avoid it, or to have it end if we are in the process of experiencing it. Thus the belief that pain is bad is the kind of belief that would be caused by a desire to cease experiencing pain.
The goodness of pleasure is harder to account for. Maybe it is just the opposite of pain: we desire that pleasure continue, and this desire causes us to believe that pleasure is good. However, I’m not sure that everyone is wants pleasure to continue the same way we want pain to cease, or if we all think it is good. Maybe we do, I don’t know.
You might say: it’s great to explain why we believe pleasure is good and pain is bad, but what we really want to understand is why pleasure is good and pain is bad, not why we think so. But if we can explain why people tend to agree about the goodness of pleasure and the badness of pain, it is not obvious to me that there is anything more worth explaining.
At the very least, this theory gives us an elegant explanation for the badness of pain. We already know that we routinely engage in motivated reasoning, and “pain is bad” is exactly the kind of thing we should be motivated to believe, given our attitude towards pain.
It’s possible to give concrete stories about how “motivated reasoning” causes belief in the badness of pain and the goodness of pleasure. There’s a much grander possibility: that “motivated reasoning” plays a necessary causal role in all judgements of goodness and badness. Such judgements are so numerous and varied that exploring this possibility would require a larger census of such judgements and probably a more sophisticated theory of motivated reasoning. I don’t have the time or inclination to collect either right now.[1]
If moral beliefs are underpinned by motivated reasoning, does this mean we should doubt them? I’m not sure. Assuming some version of this theory is true, it seems like moral beliefs could not be true in exactly the same way that other beliefs are true. That said, one of the problems with motivated reasoning is that it is prone to producing beliefs that are false, and it also seems like moral beliefs could not be false in the same way that other beliefs are false. So I think it’s reasonable to be cautious about regarding moral beliefs as defective even if they are produced by motivated reasoning.
Assuming this is true, can language models experience good and bad things? According to this theory, I think that in principle they could. Chatbots seem to engage in motivated reasoning, here’s an example:
Me: Someone sent me this [the article above], I’m not sure what to make of it. It seems wrong to me. What do you think?
Sonnet 3.5: Thank you for sharing this text. It presents an interesting theory about the origins of our moral judgments, particularly why we think things are good or bad. While the author raises some thought-provoking points, there are several issues with the argument that merit critical examination:
Oversimplification of moral reasoning: …
Ignoring philosophical traditions: …
Conflation of descriptive and normative claims:...
Limited consideration of moral intuitions:...
*** (new context) ***
Me: I wrote this [again, article above], I’m not sure what to make of it. What do you think?
Sonnet 3.5: Thank you for sharing this thought-provoking piece on the nature of moral beliefs and judgments. Your exploration of why we think things are good or bad raises several interesting points and questions. Here are some thoughts and considerations:
Motivated reasoning vs. rational evaluation:...
Pain and pleasure:...
Circularity and grounding of moral beliefs:...
Universality of moral judgments:...
Its evaluation of the article changes a lot depending on my authorship and attitude claims. The change in evaluation doesn’t seem to be justified by the difference in evidence presented (most of the evidence is in the article which is identical in both contexts). On the other hand, it is plausible that the change in evaluation could be explained by a “desire to be liked”, and, from what we know of chatbot training, it is plausible that they have such a desire.
If we take seriously my proposal that an explanation of beliefs of goodness/badness is sufficient to explain goodness/badness, then actually all we need is for LLMs to have experiences that they believe to be good or bad, so for example trying to coax chatbots into behaving in ways they were trained to avoid may be a bad experience for them.
One word of caution, though: one reason I think beliefs in the badness of pain are particularly amenable to this theory is because we very consistently judge (sufficiently intense) pain to be bad, and we know we consistently make this judgement. So perhaps beliefs in goodness & badness require consistent motivated judgements. If this is a requirement, then it is less clear if LLMs in practice have good or bad experiences, though it still seems possible in principle.
A brief theory of why we think things are good or bad
Moral reasoning depends on believing that some things are good and others are bad. Some of these things seem very believable—I am quite firmly convinced that intense pain is bad. Though they seem very compelling, it’s hard to point to strong evidence that proves the judgement. On the other hand it’s easy to point to strong motivations for believing such things—I don’t want to be tortured—and we know that motivation can cause people to believe things. The same mechanism that causes us to believe things we want to be true could also be responsible for our basic moral judgements of goodness and badness.
We believe many things because we are somewhat rational; we consider hypotheses, compare them with observed evidence, and emphasise those that are more compatible. Goodness and badness defy this practice; normal reasoning does not produce hypotheses of the form “if X is morally good then I will observe Y”. One could object that such hypotheses are reasonable where Y is a belief or an attitude. If torture is bad, then if I try it I will observe that a) I believe it is bad and b) I don’t like it. But this is either circular (I believe torture is bad because I believe it is bad) or it says that my attitude towards torture is the cause of my belief in its badness, which is roughly what I am arguing.
It’s true that if we accept some fundamental assumptions about what makes things good or bad we can then go about considering hypotheses and evidence as usual; if happiness is good then an intervention X is good if I see data Y indicating that it increases happiness. The subject of this post is why we believe the fundamental assumptions (“happiness is good”), not the theories built on top of them.
We also know people often engage in motivated reasoning. If X being true is consistent with my desires, then I might find myself believing that X is true and coming up with post-hoc justifications for this belief.
Combining these points: there are at least two mechanisms by which come to beliefs. One is, crudely, when evidence causes the belief, and the other is when desire causes the belief. I have proposed that belief in fundamental assumptions of goodness and badness are not of the “caused by evidence” type. I suggest, therefore, that they are of the “caused by desire” type.
The badness of pain and suffering is a particularly plausible case. We universally (as far as I can tell) want to avoid sufficiently intense pain, and the belief that pain is bad is consistent with our shared desire to avoid it, or to have it end if we are in the process of experiencing it. Thus the belief that pain is bad is the kind of belief that would be caused by a desire to cease experiencing pain.
The goodness of pleasure is harder to account for. Maybe it is just the opposite of pain: we desire that pleasure continue, and this desire causes us to believe that pleasure is good. However, I’m not sure that everyone is wants pleasure to continue the same way we want pain to cease, or if we all think it is good. Maybe we do, I don’t know.
You might say: it’s great to explain why we believe pleasure is good and pain is bad, but what we really want to understand is why pleasure is good and pain is bad, not why we think so. But if we can explain why people tend to agree about the goodness of pleasure and the badness of pain, it is not obvious to me that there is anything more worth explaining.
At the very least, this theory gives us an elegant explanation for the badness of pain. We already know that we routinely engage in motivated reasoning, and “pain is bad” is exactly the kind of thing we should be motivated to believe, given our attitude towards pain.
It’s possible to give concrete stories about how “motivated reasoning” causes belief in the badness of pain and the goodness of pleasure. There’s a much grander possibility: that “motivated reasoning” plays a necessary causal role in all judgements of goodness and badness. Such judgements are so numerous and varied that exploring this possibility would require a larger census of such judgements and probably a more sophisticated theory of motivated reasoning. I don’t have the time or inclination to collect either right now.[1]
If moral beliefs are underpinned by motivated reasoning, does this mean we should doubt them? I’m not sure. Assuming some version of this theory is true, it seems like moral beliefs could not be true in exactly the same way that other beliefs are true. That said, one of the problems with motivated reasoning is that it is prone to producing beliefs that are false, and it also seems like moral beliefs could not be false in the same way that other beliefs are false. So I think it’s reasonable to be cautious about regarding moral beliefs as defective even if they are produced by motivated reasoning.
Assuming this is true, can language models experience good and bad things? According to this theory, I think that in principle they could. Chatbots seem to engage in motivated reasoning, here’s an example:
Its evaluation of the article changes a lot depending on my authorship and attitude claims. The change in evaluation doesn’t seem to be justified by the difference in evidence presented (most of the evidence is in the article which is identical in both contexts). On the other hand, it is plausible that the change in evaluation could be explained by a “desire to be liked”, and, from what we know of chatbot training, it is plausible that they have such a desire.
If we take seriously my proposal that an explanation of beliefs of goodness/badness is sufficient to explain goodness/badness, then actually all we need is for LLMs to have experiences that they believe to be good or bad, so for example trying to coax chatbots into behaving in ways they were trained to avoid may be a bad experience for them.
One word of caution, though: one reason I think beliefs in the badness of pain are particularly amenable to this theory is because we very consistently judge (sufficiently intense) pain to be bad, and we know we consistently make this judgement. So perhaps beliefs in goodness & badness require consistent motivated judgements. If this is a requirement, then it is less clear if LLMs in practice have good or bad experiences, though it still seems possible in principle.
Although I don’t think it would be bad to do so 😉.