I’m gonna try to summarize and then you can tell me what I’m missing:
In weird out-of-distribution situations, my preferences / values are ill-defined.
We can operationalize that by having an ensemble of models of my preferences / values, and seeing that they give different, mutually-incompatible predictions in these weird out-of-distribution situations (see the toy sketch below).
One thing we can do to help is set up our AI to avoid taking us into weird out-of-distribution situations where my preferences are ill-defined.
Another thing we can do to help is have meta-preferences about how to deal with situations where my preferences are ill-defined, and have the AI learn those meta-preferences.
Another thing is, we implicitly trust our own future preferences in weird out-of-distribution situations, because what else can we do? So we can build an AI that we trust for a similar reason: either (A) it’s transparent, and we train it to do human-like things for human-like reasons, or (B) it’s trained to imitate human cognition.
Is that fair? I’m not agreeing or disagreeing, just parsing.
I’d also be interested in a compare/contrast with, say, this Stuart Armstrong post.
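To make the ensemble point in the summary concrete, here is a minimal sketch, assuming made-up linear preference models, toy feature vectors, and an arbitrary disagreement threshold; none of these specifics come from the discussion itself. The idea it illustrates: models fit to the same preference data agree on ordinary options but disagree sharply on a far-out-of-distribution one, and that disagreement can double as a signal an AI could use to avoid taking us there.

```python
# Toy sketch (illustrative assumptions only): ensemble disagreement as a measure of
# where "my preferences are ill-defined", and as a veto signal for weird OOD options.
import numpy as np

rng = np.random.default_rng(0)

# Toy "situations" are feature vectors; the training distribution covers a narrow region.
def sample_training_situations(n, dim=4):
    return rng.normal(loc=0.0, scale=1.0, size=(n, dim))

# Stand-in for the human's actual (noisy, partially observed) value judgments.
def human_rating(x):
    return x[:, 0] - 0.5 * x[:, 1] + rng.normal(scale=0.1, size=len(x))

# Fit an ensemble of simple preference models on bootstrap resamples of the same data.
def fit_ensemble(X, y, n_models=10):
    models = []
    for _ in range(n_models):
        idx = rng.integers(0, len(X), size=len(X))        # bootstrap resample
        Xb = np.hstack([X[idx], np.ones((len(idx), 1))])  # add bias column
        w, *_ = np.linalg.lstsq(Xb, y[idx], rcond=None)   # least-squares fit
        models.append(w)
    return models

def ensemble_predictions(models, X):
    Xb = np.hstack([X, np.ones((len(X), 1))])
    return np.stack([Xb @ w for w in models])             # (n_models, n_situations)

# "Ill-definedness" score: how much the ensemble members disagree about a situation.
def disagreement(models, X):
    return ensemble_predictions(models, X).std(axis=0)

X_train = sample_training_situations(500)
y_train = human_rating(X_train)
models = fit_ensemble(X_train, y_train)

# An in-distribution option vs. a weird out-of-distribution one.
ordinary_option = np.array([[0.3, -0.2, 0.1, 0.0]])
weird_option = np.array([[30.0, -40.0, 25.0, -10.0]])

for name, option in [("ordinary", ordinary_option), ("weird OOD", weird_option)]:
    value = ensemble_predictions(models, option).mean()
    spread = disagreement(models, option)[0]
    print(f"{name}: mean predicted value {value:+.2f}, ensemble disagreement {spread:.2f}")

# One way an AI could "avoid taking us there": refuse (or heavily penalize) options
# whose disagreement exceeds a threshold calibrated on the training distribution.
threshold = np.quantile(disagreement(models, X_train), 0.99)
print("veto weird option?", disagreement(models, weird_option)[0] > threshold)
```

The linear ensemble is only the simplest thing that shows the qualitative behavior; the same disagreement score could instead come from bootstrapped reward networks or a posterior over preference models.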
On the meta-preferences point: “We don’t actually want to go to the sort of extreme value that you can coax a model of us into outputting in weird out-of-distribution situations” is itself a meta-preference, and so we might expect something that does a good job of learning about my meta-preferences either to learn that, or to find it consistent with its starting meta-preferences.
The last point, about implicitly trusting our own future preferences, is the only bit I’d disagree with.
I wouldn’t trust my own evaluations in weird out-of-distribution situations, certainly not if “weird” means “chosen specifically so that Charlie’s evaluations are really weird.” If we build an AI that we trust, I’m going to trust it to take a look at those weird OOD situations and then not go there.
If it’s supervised by humans, humans need to notice when it’s, e.g., trying to change the environment in ways that break down the concept of human agency, and stop it. If it’s imitating human reasoning, it needs to imitate the same sort of reasoning I’ve used just now.
This is super similar to a lot of Stuart Armstrong’s stuff. Human preferences are under-defined; there’s a “non-obvious” part of what we think of as Goodhart’s law that’s related to this under-definition; but it’s okay, we can just pick something that seems good to us. These are all Stuart Armstrong ideas more than Charlie Steiner ideas.
The biggest contrast is suggested by the fact that I didn’t use the word “utility” anywhere in the sequence (iirc). In general, I think I’m less interested than he is in jumping straight into constructing imperfect models of humans with the tools at hand, and more interested in (or at least more focused on) new technologies and insights that would enable learning the entire structure of those models. I think we also have different ideas about how to do more a priori thinking to get better at evaluating proposals for value learning, but that’s hard to articulate.
Thanks!
I guess I was just thinking that sometimes every option is out-of-distribution, because the future is different from the past, especially when we want AGIs to invent new technologies, etc.
I agree that adversarially-chosen OOD hypotheticals are very problematic.
I think Stuart Armstrong thinks the end goal has to be a utility function because utility-maximizers are in reflective equilibrium in a way that other systems aren’t; he talks about that here.