The question then is, what would it mean for such an AI to pursue our values?
Why isn’t the answer just that the AI should:
1. Figure out what concepts we have;
2. Adjust those concepts in ways that we’d reflectively endorse;
3. Use those concepts?
The idea that almost none of the things we care about could be adjusted to fit into a more accurate worldview seems like a very strong skeptical hypothesis. Tables (or happiness) don’t need to be “real in a reductionist sense” for me to want more of them.
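To make the three-step proposal above a bit more concrete, here is a minimal toy sketch of it as a pipeline. Every name and data structure in it is a hypothetical placeholder, not a claim about how a real system would carry out any of the steps:

```python
# Toy sketch of the three-step proposal above. All names here are hypothetical
# placeholders, not a claim about how a real system would implement each step.

def learn_human_concepts(human_data):
    # Step 1: infer the concepts humans actually use, modeled crudely as
    # functions from a world-state dict to a concept value.
    return {
        "table": lambda world: world.get("flat_surfaces_with_legs", 0),
        "happiness": lambda world: world.get("reported_wellbeing", 0.0),
    }

def refine_concepts(concepts, reflectively_endorsed):
    # Step 2: keep or adjust concepts only in ways the human would endorse
    # on reflection (here: a simple filter by an endorsement predicate).
    return {name: c for name, c in concepts.items() if reflectively_endorsed(name)}

def choose_action(actions, predict_world, concepts, human_utility):
    # Step 3: plan in terms of the (refined) human concepts,
    # not the AI's own ontology.
    def value(action):
        world = predict_world(action)
        return human_utility({name: c(world) for name, c in concepts.items()})
    return max(actions, key=value)
```

The load-bearing choice in the sketch is that the utility function in step 3 is evaluated over the human concepts rather than the AI’s native ontology, which is exactly the point the pushback discussed next disputes.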
Agreed. The problem is with AI designs which don’t do that, and it seems to me that this perspective is quite rare. For example, my post Policy Alignment was about something similar to this, but I got a ton of pushback in the comments: a lot of people really seem to think the AI should use better AI concepts, not human concepts. At least they did back in 2018.
As you mention, this is partly due to overly reductionist worldviews. If tables/happiness aren’t reductively real, the fact that the AI is using those concepts is evidence that it’s dumb/insane, right?
Illustrative excerpt from a comment there:
From an “engineering perspective”, if I were forced to choose something right now, it would be an AI “optimizing human utility according to AI beliefs” but asking for clarification when such a choice diverges too much from the “policy-approval”.
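Read as a decision rule, that fallback might look something like the sketch below, where the utility estimate under AI beliefs, the policy-approval score, and the divergence threshold are all assumed to be given. Every name here is hypothetical:

```python
def pick_action(actions, eu_under_ai_beliefs, policy_approval_score,
                divergence_threshold, ask_human):
    # The action the AI itself prefers: maximize its estimate of human
    # utility under the AI's own beliefs.
    ai_choice = max(actions, key=eu_under_ai_beliefs)
    # The action the human's own policy would score highest.
    approved_choice = max(actions, key=policy_approval_score)

    # How much worse does the AI's favorite look by the human-policy standard?
    divergence = policy_approval_score(approved_choice) - policy_approval_score(ai_choice)

    # Act on the AI's beliefs when the two standards roughly agree; otherwise
    # ask the human for clarification instead of acting unilaterally.
    if divergence > divergence_threshold:
        return ask_human(ai_choice, approved_choice)
    return ai_choice
```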
Probably most of the problem was that my post didn’t frame things that well—I was mainly talking in terms of “beliefs”, rather than emphasizing ontology, which makes it easy to imagine AI beliefs are about the same concepts but just more accurate. John’s description of the pointers problem might be enough to re-frame things to the point where “you need to start from human concepts, and improve them in ways humans endorse” is bordering on obvious.
(Plus I arguably was too focused on giving a specific mathematical proposal rather than the general idea.)