This seems interesting but I don’t really understand what you’re proposing.
The last section is more aspirational and underdeveloped; the main point is noticing that Goodhart can be defeated in certain circumstances, and speculating how that could be extended. I’ll get back to this at a later date (or others can work on it!)
Also, Jessica Taylor’s “A first look at the hard problem of corrigibility” went over a few different ways that an AI could formalize the fact that humans are uncertain about their utility functions, and concluded that none of them would solve the problem of corrigibility.
This is not a design for corrigible agents (if anything, it’s more a design for low impact agents). The aim of this approach is not to have an AI that puts together the best U, but one that doesn’t go maximising a narrow V, that has wide enough uncertainty to include a decent U among the possible utility functions, and that doesn’t behave too badly.
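As a rough toy illustration of that aim (my own sketch, not the actual design in the post; all numbers and names below are invented), “not maximising a narrow V” can be pictured as scoring actions conservatively across a wide set of candidate utility functions rather than picking whatever looks best on one proxy:

```python
# Toy sketch (illustrative only): act conservatively over many candidate
# utility functions instead of maximising a single narrow proxy V.
import numpy as np

rng = np.random.default_rng(0)

n_actions, n_candidates = 5, 20
# Hypothetical scores: rows are actions, columns are candidate utility
# functions (the "wide uncertainty" that should contain a decent U somewhere).
scores = rng.normal(size=(n_actions, n_candidates))

narrow_V = scores[:, 0]                    # treat one candidate as the narrow proxy V
best_for_V = int(np.argmax(narrow_V))      # classic Goodhart-prone choice

# Conservative choice: maximise a low quantile across candidates, so an action
# can't win just by looking great on one narrow V while being awful elsewhere.
conservative_score = np.percentile(scores, 10, axis=1)
best_conservative = int(np.argmax(conservative_score))

print("action chosen by narrow V:", best_for_V)
print("action chosen conservatively:", best_conservative)
```

The low-quantile rule here is just a stand-in for “doesn’t behave too badly”; any suitably conservative aggregation would make the same point.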
Ok, understood, but I think this approach might run into similar problems as the attempts to formalize value uncertainty in Jessica’s post. Have you read it to see if one of those ways to formalize value uncertainty would work for your purposes, and if not, what would you do instead?
I did read it. The main difference is that I don’t assume that humans know their utility function, or that “observing it over time” will converge on a single point. The AI is expected to draw boundaries between concepts; boundaries that humans don’t know and can’t know (just as image recognition neural nets do).
What I term uncertainty might better be phrased as “known (or learnt) fuzziness of a concept or statement”. It differs from uncertainty in Jessica’s sense in that knowing absolutely everything about the universe, about logic, and about human brains doesn’t resolve it.
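A toy contrast may make the distinction concrete (purely illustrative, with made-up numbers): Bayesian uncertainty about a crisp fact washes out as evidence accumulates, while a learnt fuzzy membership degree stays graded even for an agent with complete information:

```python
# Toy contrast (illustrative only): Bayesian uncertainty vs learnt fuzziness.
import numpy as np

# 1) Bayesian uncertainty about a crisp binary fact: the posterior is driven
#    towards 0 or 1 as evidence accumulates, and full information resolves it.
posterior = 0.5            # made-up prior
likelihood_ratio = 4.0     # made-up evidence strength per observation
for _ in range(10):
    odds = posterior / (1 - posterior) * likelihood_ratio
    posterior = odds / (1 + odds)
print(f"posterior after repeated evidence: {posterior:.4f}")   # close to 1

# 2) Fuzziness: a graded membership degree for a borderline case. Even an
#    agent that knows everything about the world still assigns a value
#    strictly between 0 and 1 to a genuinely borderline instance of the
#    concept; more evidence doesn't push it to either extreme.
def acceptability(feature: float) -> float:
    # made-up smooth boundary between unacceptable (0) and acceptable (1) worlds
    return float(1 / (1 + np.exp(-4 * (feature - 0.5))))

print(f"membership of a borderline world: {acceptability(0.55):.2f}")
```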
In this approach, does the uncertainty/fuzziness ever get resolved (if so how?), or is the AI stuck with a “fuzzy” utility function forever? If the latter, why should we not expect that to incur an astronomically high opportunity cost (due to the AI wasting resources optimizing for values that we might have but actually don’t) from the perspective of our real values?
Or is this meant to be a temporary solution, i.e., at some point we shut this AI down and create a new one that is able to resolve the uncertainty/fuzziness?
The fuzziness will never get fully resolved. This approach is to deal with Goodhart-style problems, so that optimising doesn’t lead to disaster; I’m working on other approaches that could allow the synthesis of the actual values.
I’m saying this isn’t clear, because optimizing for a fuzzy utility function instead of the true utility function could lead to astronomical waste or be a form of x-risk, unless you also had a solution to corrigibility such that you could shut down the AI before it used up much of the resources of the universe trying to optimize for the fuzzy utility function. But then the corrigibility solution seems to be doing most of the work of making the AI safe. For example, without a corrigibility solution, it seems like the AI would not try to help you resolve your own uncertainty/fuzziness about values, and would actually impede your own efforts to do so (because then your values would diverge from its values and you’d want to shut it down later or change its utility function).
Ok, so I’m trying to figure out how these approaches fit together. Are they meant to both go into the same AI (if so how?), or is it more like, “I’m not sure which of these approaches will work out so let’s research them simultaneously and then implement whichever one seems most promising later”?
That, plus “this approach has progressed as far as it can, there remains uncertainty/fuzziness, so we can now choose to accept the known loss to avoid the likely failure of maximising our current candidate without fuzziness”. This is especially the case if, like me, you feel that human values have diminishing marginal utility in resources. Even without that, the fuzziness can be an acceptable cost, if we assign a high probability to losing out to Goodhart-like effects when we maximise the wrong thing without fuzziness.
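As a purely illustrative calculation (every number invented), the trade-off can be put like this: maximising the sharp best-guess candidate can win in expectation when utility is linear in resources, but loses once marginal utility diminishes and Goodhart failures are likely enough:

```python
# Toy expected-value comparison (every number here is an assumption).
import numpy as np

p_goodhart = 0.3                 # assumed chance the sharp proxy is badly wrong
resources_sharp_good = 100.0     # resources-worth of value if the sharp proxy is right
resources_sharp_bad = 1.0        # near-total loss when Goodhart bites
resources_fuzzy = 60.0           # fuzzy optimisation: reliable, but leaves value on the table

def utility(resources: float, diminishing: bool) -> float:
    # Diminishing marginal utility in resources (log), vs linear for comparison.
    return float(np.log(resources)) if diminishing else resources

for diminishing in (False, True):
    ev_sharp = ((1 - p_goodhart) * utility(resources_sharp_good, diminishing)
                + p_goodhart * utility(resources_sharp_bad, diminishing))
    ev_fuzzy = utility(resources_fuzzy, diminishing)
    print(f"diminishing={diminishing}: sharp={ev_sharp:.2f}, fuzzy={ev_fuzzy:.2f}")
```

With these made-up numbers the sharp candidate wins under linear utility (70.3 vs 60) and loses under diminishing utility (about 3.2 vs 4.1), which is the sense in which the known loss from fuzziness can be an acceptable cost.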
There’s one other aspect I should emphasise: AIs drawing boundaries we have no clue about (as they do now between pictures of cats and dogs). When an AI draws boundaries between acceptable and unacceptable worlds, we can’t describe this as reducing human uncertainty: the AI is constructing its own concepts, finding patterns in human examples. Trying to make those boundaries work well is, to my eyes, not well described in any Bayesian framework.
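A toy version of that picture (my own sketch, not a concrete proposal): fit an off-the-shelf classifier to human-labelled examples of acceptable and unacceptable worlds, and keep its graded output as the fuzziness instead of collapsing it to a hard boundary:

```python
# Toy sketch (illustrative only): an AI "drawing a boundary" between acceptable
# and unacceptable worlds from human-labelled examples, keeping graded outputs.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Made-up world features, with noisy human labels: acceptable (1) / unacceptable (0).
X = rng.normal(size=(200, 3))
y = (X[:, 0] + 0.5 * X[:, 1] + 0.3 * rng.normal(size=200) > 0).astype(int)

clf = LogisticRegression().fit(X, y)

# The boundary the model has drawn is not one any human wrote down; the
# predicted probabilities serve as graded acceptability, not a crisp verdict.
new_worlds = rng.normal(size=(3, 3))
print(clf.predict_proba(new_worlds)[:, 1])
```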
It’s very possible that we might get to a point where we could say “we expect that this AI will synthesise a good measure of human preferences. The measure itself has some light fuzziness/uncertainty, but our knowledge of it has a lot of uncertainty”.
So I’m not sure that uncertainty or even fuzziness are necessarily the best ways of describing this.
More on this issue here: https://www.lesswrong.com/posts/QJwnPRBBvgaeFeiLR/uncertainty-versus-fuzziness-versus-extrapolation-desiderata