Thanks for your detailed response. Before I dive in, I’ll just mention I added a bullet point about Goodhart because somehow when I wrote this up initially I forgot to include it.
> But our real problem is on the meta-level: we want to understand value learning so that we can build an AI that learns human values even without starting with a precise model waiting to be filled in.
> Nor can we trust AI to discover that structure for us when we couldn’t verify the result, because the reason we can’t just write out human values is that we need to give the AI a precise instruction based on a very vague human concept. The structure is vague for the same reasons as the content.
I don’t exactly disagree with you, other than to say that if we don’t understand enough about human values (for some yet-undetermined amount that counts as “enough”) we’d fail to build something we could trust, though I also don’t expect we have to solve the whole problem. So I think we need to know enough about the structure to get there, but I don’t know how much is enough, so for now I work on the assumption that we have to know it all; maybe we’ll get lucky and can get there with less. But if we don’t at least know something of the structure, such as at the fairly abstract level I consider here, I don’t think we can specify precisely enough what we mean by “alignment” to avoid failing to build aligned AI.
So it’s perhaps best to understand my position as a conservative one that tries to solve problems I think might be issues but are not guaranteed to be issues, because I don’t want us to find ourselves in a world where we wish we had solved a problem, didn’t, and then suffer the negative consequences.
> Merely causing events (in the physical level of description) is not sufficient to say we’re acting (in the agent level of description). We need some notion of “could have done something else,” which is an abstraction about agents, not something fundamentally physical.
> Similar quibbles apply to the other parts—there is no physically special decision process, we can only find one by changing our level of description of the world to one where we posit such a structure.
> The point: Everything in the basic model is a statistical regularity we can observe over the behavior of a physical system. You need a bit more nuanced way to place preferences and meta-preferences.
I don’t think I have any specific response other than to say that you’re right: this is a first pass and there’s a lot of hand-waving going on still. One difficulty is that we want to build models of the world that will usefully help us work with it, while the world itself doesn’t contain the modeled things as such; it just contains a soup of stuff interacting with other stuff. What’s exciting to me is getting more specific about where my new model breaks down, because I expect that to lead the way to becoming yet less confused.
> But I think if one applies this patch, then it’s a big mistake to use loaded words like “values” to describe the inputs (all inputs?) to the decision-generation process, which are, after all, at a level of description below the level where we can talk about preferences. I think this conflicts with the extensive definitions from earlier.
So this is a common difficulty in this kind of work. There is a category we sort of see in the world; we give it a name, and then we look to understand how that category shows up at different levels of abstraction in our models, because it’s typically expressed both at a very high level of abstraction and made up of gears moving at lower levels of abstraction. I’m sympathetic to the argument that using “values” or any other word in common use is a mistake because it invites confusion, but when I’ve done the opposite and used technical terminology it’s equally confusing, just in a different direction, so I no longer think word choice is really the issue here. People are going to be confused because I’m confused, and we’re on this ride of being confused together as we try to unknot our tangled models.
> If we recognize that we’re talking about different levels of description, then preferences are not either causally after or causally before decisions-on-the-basic-model-level-of-abstraction. They’re regular patterns that we can use to model decisions at a slightly higher level of abstraction.
This is probably correct, so in my effort to make clear what I see as the problem with preference models maybe I claim too much. There’s a lot to be confused about here.
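To make concrete what “regular patterns we can use to model decisions” could look like, here’s a minimal toy sketch (my own illustration, assuming a Boltzmann-rational choice model purely for the sake of the example): we watch a system make choices, fit a weight vector that summarizes its behavior, and that fitted vector is the thing we’d call “preferences” at the higher level of abstraction, even though nothing inside the system stores it as such.

```python
# Toy sketch: "preferences" as a statistical regularity fit to observed behavior.
# Assumes a Boltzmann-rational chooser purely for illustration:
# P(pick option o) is proportional to exp(w . features(o)).
import numpy as np

rng = np.random.default_rng(0)
TRUE_W = np.array([2.0, -1.0])              # hidden weights driving the behavior
N_TRIALS, N_OPTIONS, N_FEATURES = 500, 3, 2

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

# Observed behavior: on each trial the system faces options with random
# features and picks one with Boltzmann probabilities.
feats = rng.normal(size=(N_TRIALS, N_OPTIONS, N_FEATURES))
choices = np.array([rng.choice(N_OPTIONS, p=softmax(f @ TRUE_W)) for f in feats])

# "Preference learning": maximize the log-likelihood of the observed choices
# by gradient ascent. Nothing here inspects the chooser's internals; the
# recovered w is just a compact summary of regularities in its behavior.
w = np.zeros(N_FEATURES)
for _ in range(1000):
    grad = np.zeros(N_FEATURES)
    for f, c in zip(feats, choices):
        p = softmax(f @ w)
        grad += f[c] - p @ f                # d log P(c | f, w) / dw
    w += 0.05 * grad / N_TRIALS

print("weights that generated behavior:", TRUE_W)
print("weights recovered from behavior:", np.round(w, 2))
```

The recovered weights live in the modeler’s description of the system, not inside the system itself, which is the sense of “regular patterns at a slightly higher level of abstraction” I take the quoted point to be making.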
> But I still don’t agree that this makes valence human values. I mean values in the sense of “the cluster we sometimes also point at with words like value, preference, affinity, taste, aesthetic, intention, and axiology.” So I don’t think we’re left with a neuroscience problem; I still think what we want the AI to learn is on that higher level of abstraction where preferences live.
I don’t know how to make the best case for valence. To me it seems like a good model because it fits with a lot of other models I have of the world, like the idea that the interesting thing about consciousness is feedback and so lots of things are conscious (in the sense of having the fundamental feature that separates things with subjective experience from those without).
Also, to be clear, I don’t think we’re left with only a neuroscience problem, but I do think a neuroscience problem is part of it. What happens at higher levels of abstraction is meaningful, but it’s insufficient on its own; we additionally have to address questions of how neurons behave to generate what we recognize at a human level as “value”.