I’m pretty confused by what you mean by proto-preferences. I thought by proto-preferences you meant something like “preferences in the moment, not subject to reflection etc.” But you also said there’s a definition. What’s the definition? (The concept is pre-formal; I don’t think you’ll be able to provide a satisfactory definition.)
You have written a paper about how preferences are not identifiable. Why, then, do you say that proto-preferences are identifiable, if they are just preferences in the moment? The impossibility results apply word-for-word to this case. If you have an algorithm for identifying them, what is it?
What, specifically, has neuroscience said about this that would let anyone even define what it means for a given brain to have a given set of proto-preferences?
(I don’t know what you mean by “previous Alice post”; regardless, if you’re claiming to have worked out an algorithm that infers people’s proto-preferences pretty well given empirical data, I don’t believe you. The posts on semantics and symbol grounding seem like gesturing in the direction of something that could someday form a solution, with multiple reformulations being necessary along the way; this is nowhere close to an actual solution.)
Oh, I don’t claim to have a full definition yet, but I believe it’s better than pre-formal. Here would be my current definition:
Humans are partially model-based agents. We often generate models (or at least partial models) of situations (real or hypothetical), and, within those models, label certain actions/outcomes/possibilities as better or worse than others (or sometimes just generically “good” or “bad”). This model, along with the label, is what I’d call a proto-preference (or pre-preference).
That’s why neuroscience is relevant: for identifying the mental models humans use. The “previous Alice post” I mentioned is here, and was a toy version of this, in the case of an algorithm rather than a human. The reason these get around the No Free Lunch theorem is that they look inside the algorithm (so different algorithms with the same policy can be seen to have different preferences, which breaks NFL), and make the “normative assumption” that these modelled proto-preferences correspond (modulo preference synthesis) to the agent’s actual preferences.
Note that that definition puts preferences and meta-preferences into the same type, the only difference being the sort of model being considered.
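As a toy illustration of the definition above (a sketch only; the class names, fields, and the two example agents are hypothetical and not taken from the Alice post): a proto-preference is stored as a model plus a label, and two agents with the same policy can still carry different internal labels, which is what looking inside the algorithm is meant to pick up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProtoPreference:
    """A (partial) model of a situation plus a valence label on one option."""
    model: str     # stand-in for the agent's model of the situation
    option: str    # an action/outcome/possibility within that model
    label: float   # > 0 means "better", < 0 means "worse"


@dataclass
class Agent:
    """Hypothetical agent: internal labels plus a rule for acting on them."""
    proto_preferences: list
    maximise: bool  # True: act on the highest label; False: on the lowest

    def labels(self, model):
        return {p.option: p.label for p in self.proto_preferences if p.model == model}

    def policy(self, model):
        """Observable behaviour: the option the agent actually picks."""
        scores = self.labels(model)
        chooser = max if self.maximise else min
        return chooser(scores, key=scores.get)

    def read_off_preference(self, model):
        """Normative assumption: the labels themselves are the preferences."""
        scores = self.labels(model)
        return max(scores, key=scores.get)


helper = Agent([ProtoPreference("drowning", "help", 1.0),
                ProtoPreference("drowning", "ignore", -1.0)], maximise=True)
contrarian = Agent([ProtoPreference("drowning", "help", -1.0),
                    ProtoPreference("drowning", "ignore", 1.0)], maximise=False)

# Same policy, so behaviour alone cannot separate the two agents...
same_policy = helper.policy("drowning") == contrarian.policy("drowning")
# ...but their internal labels differ once we look inside.
same_labels = (helper.read_off_preference("drowning")
               == contrarian.read_off_preference("drowning"))
```

Both agents pick the same action, so a policy-only observer cannot tell them apart; the difference only shows up under the assumption that the internal labels are what count as preferences.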
Ok, this seems usefully specific. A few concerns:
1. It seems that, according to your description, my proto-preferences are my current map of the situation I am in (or ones I have already imagined) along with valence tags. However, the AI is going to be in a different location, so I actually want it to form a different map (otherwise, it would act as if it were in my location, not its location). So what I actually want to get copied is more like a map-building and valence-tagging procedure that can be applied to different contexts, which will take different information into account.
2. It seems hard for the AI to do significantly better than I could do by, say, controlling the robot. For example, if my ontology about engineering is wrong (in a way that prevents me from inventing nanotech), then the AI is going to also be wrong about engineering in the same way, if it copies my map-building and valence-tagging algorithms, or just my maps and valence tags. (If it doesn’t copy my maps, then how does it translate my values about my maps to its values about its maps?)
3. Related, if the AI uses my models in ways that subject them to more weird edge cases than I would (e.g. by searching over more actions), then they’re going to give bad answers pretty often.
4. Also related, these models are embedded in reality; they don’t have all that much meaning except relative to the process that builds and interprets them, which includes my senses, my pattern-recognizers, my reflexes, my tools, my social context, etc. Presumably the AI is going to replace my infrastructure with different infrastructure, but then why would we expect my models to keep working? I’m not sure what would happen if someone with my models woke up with very different sense inputs, actuators, and environment.
5. Perhaps most concerningly, if you asked a few neuroscientists and cognitive scientists “can we do this / will we be able to do this in 10 years”, I predict they would mostly say “no, our models and data gathering procedures aren’t actually good enough to do this, and aren’t improving super fast either”. (Note that you haven’t yet named specific neuroscience techniques for identifying humans’ models, so the statement that neuroscience has things to say about this seems empty). So a bunch of original cognitive science/neuroscience research is going to have to get done here, in addition to much better data gathering and inference procedures for actually looking inside humans’ algorithms.
6. There’s still an unidentifiability issue in that you need assumptions about which things are “my models” and “my valence tags”. These things, at the moment, do not have rigorous definitions. For example, if I am modelling you (and therefore running a small copy of you in my brain), then probably my model of you also has models and valence tags, yet these aren’t my models and valence tags (for the purposes of inferring my preferences). You’d also need to make decisions about the extent to which e.g. reflexes are embodying values. So there are a bunch of modelling choices required, which could be made with cognitive science models that are much, much better than those available right now.
That said, this does seem to be the value learning approach I am most optimistic about right now.
Thanks! I’m not sure I fully get all your concerns, but I’ll try and answer to the best of my understanding.
1-4 (and a little bit of 6): this is why I started looking at semantics vs syntax. Consider the small model “If someone is drowning, I should help them (if it’s an easy thing to do)”. Then “someone”, “drowning”, “I”, and “help them” are vague labels for complex categories (as are most of the rest of the terms, really). The semantics of these categories need to be established before the AI can do anything. And the central examples of the categories will be clearer than the fuzzy edges. Therefore the AI can model me as having strong preferences in the central examples of the categories, which become much weaker as we move to the edges (the meta-preferences will start to become very relevant in the edge cases). I expect that “I should help them” further decomposes into “they should be helped” and “I should get the credit for helping them”.
Therefore, it seems to me that an AI should be able to establish that if someone is drowning, it should try and enable me to save them, and if it can’t do that, then it should save them itself (using nanotechnology or anything else). It doesn’t seem that it would be seeing the issue from my narrow perspective, because I don’t see the issue just from my narrow perspective.
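The central-versus-edge point in the answer to 1-4 can be sketched numerically. This is only an illustration: the membership scores, the multiplicative rule, and the function name are assumptions made here, not part of the proposal itself.

```python
def preference_strength(base_strength, memberships):
    """Scale a preference by how central the situation is to each category.

    `memberships` maps each vague label in the small model (e.g. "someone",
    "drowning") to a score in [0, 1], where 1.0 means a central example and
    values near 0 mean the fuzzy edge of the category.
    """
    strength = base_strength
    for label, degree in memberships.items():
        if not 0.0 <= degree <= 1.0:
            raise ValueError(f"membership for {label!r} must be in [0, 1]")
        strength *= degree
    return strength


# A clear-cut case: an obvious person, obviously drowning.
central = preference_strength(
    1.0, {"someone": 1.0, "drowning": 1.0, "help them": 1.0})

# An edge case: unclear whether it is a person, unclear whether they are drowning.
edge = preference_strength(
    1.0, {"someone": 0.5, "drowning": 0.5, "help them": 1.0})
```

Under these made-up numbers the edge case carries a quarter of the strength of the central case, which is the regime where the meta-preferences would have to do most of the work.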
5: I am pretty sure that we could use neuroscience to establish that, for example, people are truthful when they say that they see the anchoring bias as a bias. But I might have been a bit glib when mentioning neuroscience; that is mainly the “science fiction superpowers” end of the spectrum for the moment.
What I’m hoping, with this technique, is that if we end up using indirect normativity or stated preferences, then by keeping in mind this model of what proto-preferences are, we can better automate the limitations of these techniques (e.g. when we expect lying), rather than putting them in by hand.
6: Currently I don’t see reflexes as embodying values at all. However, people’s attitudes towards their own reflexes are valid meta-preferences.
So here’s an alternative explanation of what proto-preferences and preferences are, which is to say, of the process that produces something we might meaningfully reify using the “preference” construct.
Preferences are a model for answering questions about “why do this and not that?”. There’s a lot going on in this model, though, because in order to choose what to do we have to even be able to form a this and that to choose between. If we strip away the this and that (the ontological), we are left with not what is (the ontic), but instead the liminal ontology naturally implied by sense contact and the production of phenomena and experience prior to understanding it (e.g. the way you perceive color already creates a separation between what is and what you perceive by encoding interactions with what is in fewer bits than it would take to express an exact simulation of it). This process is mostly beyond conscious control in humans, so we tend to think of it as automatic, outside the locus-of-control, not part of the self, and thus not part of our felt sense of preference, but it’s important because it’s the first time we “make” a “choice”, and choice is what preference is all about.
So how do these choices get made? There are many principles we might derive to explain why we perceive things one way or another, but the one that seems to me most parsimonious and maximally descriptive is minimization of uncertainty. To really cash this out at this level probably requires some additional effort to deconstruct what it means in a sensible way, one that doesn’t fall apart the way “minimize description length” seems to: that principle ignores the way minimizing uncertainty over the long term sometimes requires not minimizing it over the short term (avoiding local minima), along with other caveats that make too simple an explanation incomplete. Although I mostly draw on philosophy that I’m not explaining here to come to this point, see Friston’s free energy, perceptual control theory, etc. for related notions and support.
This gives us a kind of low-level operation that can power preferences. These get built up at the next level of ontological abstraction (what we might call feeling or sensation), which is the encoding of a judgement about success or failure at minimizing uncertainty, and which could be positive (below some threshold of minimization), negative (over some threshold), or neutral (within error bounds and unable to rule either way). From here we can build up to more complex sorts of preferences over additional levels of abstraction, but they will all be rooted in judgements about whether or not uncertainty was minimized at the perceptual level, keeping in mind that the brain senses itself through circular networks of neurons allowing it to perceive itself and thus apply this same process to perceptions we reify as “thoughts”.
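One way to read the three-way judgement above as code, assuming (purely for illustration) that residual uncertainty can be summarised as a single number and compared against a target with an error band. The names `valence`, `target`, and `tolerance` are hypothetical, and this is one reading of the thresholds, not a claim about how the brain computes them.

```python
def valence(residual_uncertainty, target, tolerance):
    """Classify how well uncertainty was minimized at the perceptual level.

    Returns "positive" when residual uncertainty is clearly below the target,
    "negative" when it is clearly above, and "neutral" when it falls within
    the error band and no ruling can be made either way.
    """
    if tolerance < 0:
        raise ValueError("tolerance must be non-negative")
    if residual_uncertainty < target - tolerance:
        return "positive"
    if residual_uncertainty > target + tolerance:
        return "negative"
    return "neutral"


clearly_good = valence(0.1, target=0.5, tolerance=0.1)
clearly_bad = valence(0.9, target=0.5, tolerance=0.1)
undecided = valence(0.55, target=0.5, tolerance=0.1)
```

More complex preferences would then be built from many such judgements, including ones applied to the brain’s perceptions of its own activity.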
What does this suggest for this discussion? I think it offers a way to dissolve many of the confusions arising from trying to work with our normally reified notions of “preference” or even the simpler but less cleanly bounded notion of “proto-preference”.
(This was a convenient opportunity to work out some of these ideas in writing since this conversation provided a nice germ to build around. I’ll probably refine and expand on this idea elsewhere later.)