This runs headfirst into the problem of radical translation (which in AI is called “AI interpretability.” Only slightly joking.)
Inside our Scientist AI it’s not going to say “murder is bad,” it’s going to say “feature 1000 1101 1111 1101 is connected to feature 0000 1110 1110 1101.” At first you might think this isn’t so bad: after all, AI interpretability is a flourishing field, so let’s just look at some examples and visualizations and try to figure out what these things are. But there’s no guarantee that these features correspond neatly to their closest single-word English equivalent, especially once you try to generalize them to new situations. See also Kaj’s posts on concept safety. Nor are we guaranteed uniqueness: the Scientist AI doesn’t have to form one single feature for “murder” that has no neighbors; there might be a couple hundred near-equivalents that are difficult for us to tease apart.
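To make the non-uniqueness worry concrete, here is a minimal illustrative sketch (the feature vectors, dimensionality, and similarity threshold are all invented for the example, not taken from the post or any real model): even if we fit a probe direction for “murder” in a model’s activation space, hundreds of distinct learned features can sit nearly as close to it, and the geometry alone doesn’t tell us which one, if any, generalizes the way the English word does.

```python
# Illustrative sketch only: synthetic feature directions standing in for whatever
# the Scientist AI actually learns. Nothing here comes from a real model.
import numpy as np

rng = np.random.default_rng(0)
d = 512  # assumed activation dimensionality

# A direction we (the interpreters) have labeled "murder".
probe = rng.normal(size=d)
probe /= np.linalg.norm(probe)

# A couple hundred near-equivalent features: small perturbations of the probe,
# plus a couple hundred unrelated features for contrast.
near_equivalents = probe + 0.3 * rng.normal(size=(200, d)) / np.sqrt(d)
unrelated = rng.normal(size=(200, d))
features = np.vstack([near_equivalents, unrelated])
features /= np.linalg.norm(features, axis=1, keepdims=True)

# Cosine similarity of every feature to the "murder" probe.
cosine = features @ probe
close = int(np.sum(cosine > 0.5))
print(f"{close} of {len(features)} features lie close to the 'murder' direction")
# All 200 perturbed features clear the threshold, and nothing in the similarity
# scores distinguishes which (if any) would generalize like the English concept.
```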
Edit: reading comprehension is hard, see Michele’s reply.
I am aware of interpretability issues. This is why, for AI alignment, I am more interested in the agent described at the beginning of Part II than Scientist AI.
Thanks for the link to the sequence on concepts, I found it interesting!
Wow, I’m really sorry for my bad reading comprehension.
Anyhow, I’m skeptical that the Part II agent would end up doing the right thing (regardless of our ability to interpret it). I’m curious whether you think this could be settled without building a superintelligent AI of uncertain goals, or if you’d really want to see the “full scale” test.
If there is a superintelligent AI that ends up being aligned as I’ve written, probably there is also a less intelligent agent that does the same thing. Something comparable to human-level might be enough.
From another point of view: some philosophers are convinced that caring about conscious experiences is the rational thing to do. If it’s possible to write an algorithm that works in a similar way to how their mind works, we already have an (imperfect, biased, etc.) agent that is somewhat aligned, and is likely to stay aligned after further reflection.
One could argue that these philosophers are fooling themselves, that no really intelligent agent will end up with such weird beliefs. So far, I haven’t seen convincing arguments in favour of this; it goes back to the metaethical discussion. I quote a sentence I wrote in the post:
Depending on one’s background knowledge of philosophy and AI, the idea that rationality plays a role in reasoning about goals and can lead to disinterested (not game-theoretic or instrumental) altruism may seem plain wrong or highly speculative to some, and straightforward to others.
From another point of view: some philosophers are convinced that caring about conscious experiences is the rational thing to do. If it’s possible to write an algorithm that works in a similar way to how their mind works, we already have an (imperfect, biased, etc.) agent that is somewhat aligned, and is likely to stay aligned after further reflection.
I think this is an interesting point—but I don’t conclude optimism from it as you do. Humans engage in explicit reasoning about what they should do, and they theorize and systematize, and some of them really enjoy doing this and become philosophers so they can do it a lot, and some of them conclude things like “The thing to do is maximize total happiness” or “You can do whatever you want, subject to the constraint that you obey the categorical imperative” or as you say “everyone should care about conscious experiences.”
The problem is that every single one of those theories developed so far has either been (1) catastrophically wrong, (2) too vague, or (3) relative to the speaker’s intuitions somehow (e.g. intuitionism).
By “catastrophically wrong” I mean that if an AI with control of the whole world actually followed through on the theory, it would kill everyone or do something similarly bad. (Classical utilitarianism is the classic example of this.)
Basically… I think you are totally right that some of our early AI systems will do philosophy and come to all sorts of interesting conclusions, but I don’t expect them to be the correct conclusions. (My metaethical views may be lurking in the background here, driving my intuitions about this… see Eliezer’s comment)
Do you have an account of how philosophical reasoning in general, or about morality in particular, is truth-tracking? Can we ensure that the AIs we build reason in a truth-tracking way? If truth isn’t the right concept for thinking about morality, and instead we need to think about e.g. “human values” or “my values,” then this is basically a version of the alignment problem.