Ok, this seems usefully specific. A few concerns:
1. It seems that, according to your description, my proto-preferences are my current map of the situation I am in (or ones I have already imagined), along with valence tags. However, the AI is going to be in a different location, so I actually want it to form a different map (otherwise, it would act as if it were in my location, not its location). So what I actually want to get copied is more like a map-building and valence-tagging procedure that can be applied to different contexts, which will take different information into account. (This distinction is sketched in code below the list.)
2. It seems hard for the AI to do significantly better than I could do by, say, controlling the robot. For example, if my ontology about engineering is wrong (in a way that prevents me from inventing nanotech), then the AI is going to be wrong about engineering in the same way, whether it copies my map-building and valence-tagging algorithms or just my maps and valence tags. (If it doesn’t copy my maps, then how does it translate my values about my maps to its values about its maps?)
3. Relatedly, if the AI uses my models in ways that subject them to more weird edge cases than I would (e.g. by searching over more actions), then they’re going to give bad answers pretty often.
4. Also relatedly, these models are embedded in reality; they don’t have all that much meaning except relative to the process that builds and interprets them, which includes my senses, my pattern-recognizers, my reflexes, my tools, my social context, etc. Presumably the AI is going to replace my infrastructure with different infrastructure, but then why would we expect my models to keep working? I’m not sure what would happen if someone with my models woke up with very different sense inputs, actuators, and environment.
5. Perhaps most concerningly, if you asked a few neuroscientists and cognitive scientists “can we do this / will we be able to do this in 10 years”, I predict they would mostly say “no, our models and data-gathering procedures aren’t actually good enough to do this, and aren’t improving super fast either”. (Note that you haven’t yet named specific neuroscience techniques for identifying humans’ models, so the statement that neuroscience has things to say about this seems empty.) So a bunch of original cognitive science/neuroscience research is going to have to get done here, in addition to much better data-gathering and inference procedures for actually looking inside humans’ algorithms.
6. There’s still an unidentifiability issue, in that you need assumptions about which things are “my models” and “my valence tags”; these do not, at the moment, have rigorous definitions. For example, if I am modelling you (and therefore running a small copy of you in my brain), then probably my model of you also has models and valence tags, yet these aren’t my models and valence tags (for the purposes of inferring my preferences). You’d also need to make decisions about the extent to which e.g. reflexes embody values. So there are a bunch of modelling choices required, which could be made with cognitive science models that are much, much better than those available right now. (This is also sketched below the list.)
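Below is a minimal, purely illustrative Python sketch of the distinction drawn in concern 1: copying a single tagged map versus copying the map-building and valence-tagging procedure. Every name and representation here (TaggedMap, my_map_builder, the feature dictionaries) is a hypothetical stand-in, since nothing in the discussion specifies how proto-preferences would actually be represented.

```python
# Purely illustrative; every name and representation here is a hypothetical
# stand-in, not anything specified in the discussion above.
from dataclasses import dataclass
from typing import Any, Callable, Dict


@dataclass
class TaggedMap:
    """A snapshot: my current model of one situation, plus valence tags."""
    situation: Dict[str, Any]   # features of the situation as I model it
    valence: Dict[str, float]   # how good or bad I tag each modelled feature


# What concern 1 says the AI would actually need to copy: not one snapshot,
# but the procedure that produces snapshots, so it can be re-run on the AI's
# own observations in the AI's own location.
MapBuilder = Callable[[Dict[str, Any]], TaggedMap]


def my_map_builder(observations: Dict[str, Any]) -> TaggedMap:
    """Toy stand-in for my map-building and valence-tagging process."""
    drowning = bool(observations.get("person_struggling_in_water", False))
    situation = {"someone_drowning": drowning}
    valence = {"someone_drowning": -1.0 if drowning else 0.0}
    return TaggedMap(situation, valence)


# Copying my_map_builder and running it on the AI's observations is the
# "procedure" reading; copying one TaggedMap built from *my* location is the
# "snapshot" reading that the concern argues is not what we want.
```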
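And a similarly rough sketch of the identifiability worry in concern 6: if my world-model contains a sub-model of another agent, that sub-model can carry its own valence tags, and some assumed rule is needed to avoid reading those off as my preferences. The data structure and the “top level only” rule below are assumptions made purely for illustration.

```python
# Illustrative only: the data structure and the "top level only" rule are
# assumptions, not something the discussion above pins down.
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class AgentModel:
    owner: str
    valence: Dict[str, float] = field(default_factory=dict)
    # Models of other agents nested inside this model; they can carry their
    # own valence tags, which are *not* the owner's preferences.
    sub_models: List["AgentModel"] = field(default_factory=list)


def my_valence_tags(me: AgentModel) -> Dict[str, float]:
    """One possible identification rule: count only the top-level tags."""
    return dict(me.valence)   # deliberately ignores me.sub_models[...].valence


my_model = AgentModel(
    owner="me",
    valence={"friend_is_safe": 1.0},
    sub_models=[
        AgentModel(owner="my model of you",
                   valence={"you_win_the_argument": 1.0}),
    ],
)

# my_valence_tags(my_model) == {"friend_is_safe": 1.0}; choosing to stop at
# the top level is exactly the kind of modelling assumption concern 6 points at.
```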
That said, this does seem to be the value learning approach I am most optimistic about right now.
Thanks! I’m not sure I fully get all your concerns, but I’ll try and answer to the best of my understanding.
1-4 (and a little bit of 6): this is why I started looking at semantics vs syntax. Consider the small model “If someone is drowning, I should help them (if it’s an easy thing to do)”. Then “someone”, “drowning”, “I”, and “help them” are vague labels for complex categories (as are most of the rest of the terms, really). The semantics of these categories need to be established before the AI can do anything. And the central examples of the categories will be clearer than the fuzzy edges. Therefore the AI can model me as having strong preferences in the central examples of the categories, which become much weaker as we move to the edges (the meta-preferences will start to become very relevant in the edge cases). I expect that “I should help them” further decomposes into “they should be helped” and “I should get the credit for helping them”.
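As a toy illustration of the “strong at central examples, weak at the fuzzy edges” idea, here is a sketch in which the strength of a modelled preference is scaled by how central an instance is to the relevant categories; the membership scores and the quadratic fall-off are invented for illustration, not part of the proposal.

```python
# Toy illustration only: the membership scores and the quadratic fall-off are
# arbitrary choices, not part of the proposal above.
def preference_strength(category_membership: float, base_strength: float = 1.0) -> float:
    """
    category_membership: 1.0 for a clear central example (an obviously drowning
    person who is easy to help), approaching 0.0 at the fuzzy edges of the
    categories ("someone", "drowning", "help").
    Returns how strongly the modelled preference applies; where it is weak,
    meta-preferences would take over.
    """
    return base_strength * category_membership ** 2


print(preference_strength(0.95))  # central example: ~0.90, preference applies strongly
print(preference_strength(0.20))  # edge case: 0.04, so meta-preferences dominate
```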
Therefore, it seems to me that an AI should be able to establish that if someone is drowning, it should try to enable me to save them, and if it can’t do that, it should save them itself (using nanotechnology or anything else). It doesn’t seem that it would be seeing the issue from my narrow perspective, because I don’t see the issue just from my narrow perspective.
5: I am pretty sure that we could use neuroscience to establish that, for example, people are truthful when they say that they see the anchoring bias as a bias. But I might have been a bit glib when mentioning neuroscience; for the moment, that is mainly at the “science fiction superpowers” end of the spectrum.
What I’m hoping, with this technique, is that if we end up using indirect normativity or stated preferences, then by keeping in mind this model of what proto-preferences are, we can automate the handling of the limitations of these techniques (e.g. when we expect lying), rather than putting that handling in by hand.
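One very rough way to picture “automating the handling of the limitations of these techniques” (an interpretation, not a method described above) is to compare a stated preference against the valence assigned by the proto-preference model and flag large mismatches automatically; the numeric scores and threshold below are invented for illustration.

```python
# Purely illustrative: the numeric scores and the threshold are invented, and
# this is one possible reading of "automating the handling of the limitations",
# not a method described above.
def flag_possible_misreport(stated: float, modelled_valence: float,
                            tolerance: float = 0.5) -> bool:
    """Return True when a stated preference diverges from the valence the
    proto-preference model assigns by more than the tolerance, so the
    statement should not be taken at face value (e.g. suspected lying)."""
    return abs(stated - modelled_valence) > tolerance


# Someone states strong approval (+0.9) while the proto-preference model
# reads mild aversion (-0.4): the mismatch gets flagged automatically.
print(flag_possible_misreport(0.9, -0.4))  # True
```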
6: Currently I don’t see reflexes as embodying values at all. However, people’s attitudes towards their own reflexes are valid meta-preferences.