Great post! I agree with everything up to the recursive quantilizer stuff (not that I disagree with that, I just don’t feel that I get it enough to voice an opinion). I think it’s a very useful post, and I’ll definitely go back to it and try to work out more details soon.
In general, it’s possible to get very rich types of feedback, but very sparse: humans get all sorts of feedback, including not only instruction on how to act, but also how to think.
I suppose there is a typo and the sentence should read “but also very sparse”?
In other words, the overriding feature of normativity which I’m trying to point at is that nothing is ever 100%. Correct grammar is not defined by any (known) rules or set of text, nor is it (quite) just whatever humans judge it is.
I think you’re onto something, but I wonder how much of this actually comes from the fact that language usage evolves? If the language stayed static, I think rules would work better. For an example outside English, in French we have the Académie Française, which is the official authority on the usage of French. If the usage never changed, they would probably have a pretty nice set of rules (although not really that easily programmable) for French. But as things go, French, like any language, changes, and so they must adapt to it and try to rein it in.
That being said, this changing nature of language is probably a part of normativity. It just felt implicit in your post.
Wireheading and human manipulation can’t be eliminated through object-level feedback, but we could point out examples of the wrong and right types of reasoning.
You don’t put any citation for that. Is this an actual result, or just what you think really strongly?
You don’t put any citation for that. Is this an actual result, or just what you think really strongly?
Yeah, sorry, I thought there might be an appropriate citation but I didn’t find one. My thinking here is: in model-based RL, the best model you can have to fit the data is one which correctly identifies the reward signal as coming from the reward button (or whatever the actual physical reward source is). Whereas the desired model (what we want the system to learn) is one which, while perhaps being less predictively accurate, models reward as coming from some kind of values. If you couple RL with process-level feedback, you could directly discourage modeling reward as coming from the actual reward system, and encourage identifying it with other things—overcoming the incentive to model it accurately.
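To make that incentive shift concrete, here is a minimal sketch (all names hypothetical, not from the post or any particular library) of how process-level feedback could enter model selection: instead of ranking candidate reward models purely by predictive accuracy, a human judgment on the model itself is added as a penalty term.

```python
# Toy sketch (all names hypothetical): scoring candidate reward models when
# process-level feedback is combined with predictive accuracy in model-based RL.

def model_score(model, data, process_feedback, penalty_weight=1.0):
    """Score a candidate model of where reward comes from.

    log_likelihood rewards predictive accuracy (which alone favors the
    "reward comes from the button" model); process_feedback is a human
    judgment on the model itself, e.g. a large penalty when the model
    attributes reward to the physical reward channel rather than to values.
    """
    log_likelihood = sum(model.log_prob(obs) for obs in data)
    process_penalty = process_feedback(model)
    return log_likelihood - penalty_weight * process_penalty

def select_model(candidate_models, data, process_feedback, penalty_weight=1.0):
    # With penalty_weight = 0 this reduces to ordinary maximum-likelihood model
    # selection, i.e. exactly the wireheading-prone incentive described above.
    return max(
        candidate_models,
        key=lambda m: model_score(m, data, process_feedback, penalty_weight),
    )
```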
Similarly, human manipulation comes from a “mistaken” (but predictively accurate) model which says that human values are precisely whatever-the-human-feedback-says-they-are (i.e., that humans are in some sense the final judge of their values, so that any “manipulation” still reveals legitimate preferences by definition). Humans can provide feedback against this model, favoring models in which the human feedback can be corrupted by various means, including manipulation.
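A correspondingly minimal sketch of that alternative model (hypothetical names, binary feedback for simplicity): human feedback is treated as evidence about values that can be corrupted with some probability, rather than as defining the values outright.

```python
# Toy sketch (hypothetical, binary-feedback case): a likelihood in which human
# feedback is corruptible evidence about values, rather than their definition.
# Setting p_corrupted = 0 recovers the "feedback is the final judge" model that
# makes manipulation look legitimate by definition.
import math

def feedback_log_likelihood(feedback, hypothesized_value, p_corrupted=0.1):
    """Log-probability of observed binary feedback given a hypothesized value.

    With probability 1 - p_corrupted the feedback reflects the hypothesized
    value; with probability p_corrupted it comes from a corrupting process
    (manipulation, error) and is uninformative (uniform over the two options).
    """
    p_match = 1.0 if feedback == hypothesized_value else 0.0
    return math.log((1 - p_corrupted) * p_match + p_corrupted * 0.5)
```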
That being said, this changing nature of language is probably a part of normativity. It just felt implicit in your post.
This is true. I wasn’t thinking about this. My initial reaction to your point was to think: no, even if we froze English usage today, we’d still have a “normativity” phenomenon, where we (1) can’t perfectly represent the rules via statistical occurrence, (2) can say more about the rules, but can’t state all of them, and can make mistakes, (3) can say more about what good reasoning-about-the-rules would look like, … etc.
But if we apply all the meta-meta-meta reasoning, what we ultimately get is evolution of the language, at least in the very straightforward sense of changing object-level usage and changing first-meta-level opinions about proper usage (and so on), even if we think of it as merely correcting imperfections rather than really changing. And the meta-meta-meta consensus would probably include provisions that the language should be adaptive!