That said, this does seem to be the value learning approach I am most optimistic about right now.
Thanks! I’m not sure I fully get all your concerns, but I’ll try and answer to the best of my understanding.
1-4 (and a little bit of 6): this is why I started looking at semantics vs syntax. Consider the small model “If someone is drowning, I should help them (if it’s an easy thing to do)”. Then “someone”, “drowning”, “I”, and “help them” are vague labels for complex categories (as are most of the rest of the terms, really). The semantics of these categories need to be established before the AI can do anything. And the central examples of the categories will be clearer than the fuzzy edges. Therefore the AI can model me as having strong preferences for the central examples of the categories, preferences which become much weaker as we move to the edges (where the meta-preferences will start to become very relevant). I expect that “I should help them” further decomposes into “they should be helped” and “I should get the credit for helping them”.
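To make this concrete, here is a purely illustrative toy sketch (the `Situation` fields and the idea of a scalar “centrality” score are inventions for this comment, not part of any formal proposal): the inferred preference is strong when every category is near its central example, and decays toward the fuzzy edges, which is exactly where meta-preferences would need to take over.

```python
from dataclasses import dataclass

@dataclass
class Situation:
    # Hypothetical scores in [0, 1] for how central this situation is
    # to each vague category; 1.0 means a clear central example.
    someone: float   # is the entity clearly "someone"?
    drowning: float  # is this clearly "drowning"?
    easy: float      # is helping clearly "an easy thing to do"?

def preference_strength(s: Situation) -> float:
    """Toy strength of the inferred preference "I should help them".

    Strong near the central examples of every category; it decays
    toward the fuzzy edges, where the object-level model should
    defer to meta-preferences instead.
    """
    return s.someone * s.drowning * s.easy

# Central example: a child face-down in a shallow pond beside the path.
print(preference_strength(Situation(1.0, 1.0, 0.9)))  # 0.9: strong
# Edge case: an ambiguous entity, ambiguously in distress, hard to reach.
print(preference_strength(Situation(0.3, 0.5, 0.2)))  # 0.03: weak
```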
Therefore, it seems to me that an AI should be able to establish that if someone is drowning, it should try to enable me to save them, and if it can’t do that, it should save them itself (using nanotechnology or anything else). It doesn’t seem that it would be seeing the issue from my narrow perspective, because I don’t see the issue just from my narrow perspective.
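As a toy sketch of that decision ordering (again, illustrative only; the two boolean inputs stand in for whatever world-model the AI actually has): enabling me to do the saving satisfies both sub-preferences, while a direct rescue satisfies “they should be helped” but sacrifices the credit term, so it is the fallback rather than the first choice.

```python
def respond_to_drowning(can_enable_me: bool, i_could_then_save_them: bool) -> str:
    """Toy ordering over actions for the drowning case.

    "Enable me" satisfies both "they should be helped" and "I should
    get the credit for helping them"; a direct rescue satisfies only
    the first, so it is the fallback when enabling me isn't feasible
    or wouldn't actually result in a rescue.
    """
    if can_enable_me and i_could_then_save_them:
        return "enable me to save them"
    return "save them itself (nanotechnology or anything else)"
```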
5: I am pretty sure that we could use neuroscience to establish that, for example, people are truthful when they say that they see the anchoring bias as a bias. But I might have been a bit glib when mentioning neuroscience; for the moment that sits mainly at the “science fiction superpowers” end of the spectrum.
What I’m hoping, with this technique, is that if we end up using indirect normativity or stated preferences, then by keeping this model of what proto-preferences are in mind, we can better automate working around the limitations of those techniques (e.g. detecting when we expect lying), rather than putting the exceptions in by hand.
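For example (a minimal sketch, with a made-up `p_lying` standing in for whatever the AI’s model of the person actually outputs): a stated preference gets discarded, or down-weighted, because the model of the speaker’s proto-preferences and incentives predicts insincerity, not because a programmer enumerated the exceptions by hand.

```python
from typing import Optional

def usable_stated_preference(stated: str, p_lying: float,
                             threshold: float = 0.5) -> Optional[str]:
    """Keep a stated preference only if the speaker model says it is
    probably sincere.

    p_lying is assumed to come from the AI's model of the person
    (their proto-preferences and incentives in this situation), not
    from a hand-coded list of situations where people lie.
    """
    return None if p_lying > threshold else stated
```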
6: Currently I don’t see reflexes as embodying values at all. However, people’s attitudes towards their own reflexes are valid meta-preferences.