In this approach, does the uncertainty/fuzziness ever get resolved (if so how?), or is the AI stuck with a “fuzzy” utility function forever? If the latter, why should we not expect that to incur an astronomically high opportunity cost (due to the AI wasting resources optimizing for values that we might have but actually don’t) from the perspective of our real values?
The fuzziness will never get fully resolved. This approach is meant to deal with Goodhart-style problems without optimisation leading to disaster; I’m working on other approaches that could allow the synthesis of the actual values.
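To make the Goodhart point concrete, here is a toy sketch of my own (everything in it is invented for illustration: a soft-minimum standing in for the fuzzy aggregate, log-shaped candidate utilities, a fixed resource budget, and a crude random-search optimiser; none of this is the actual synthesis machinery):

```python
import numpy as np

rng = np.random.default_rng(0)
BUDGET, DIM = 10.0, 5   # total resources to allocate across DIM features

def candidate_utilities(x):
    # One candidate utility per feature, each with diminishing returns.
    return np.log1p(x)

def true_utility(x):
    # Pretend the real values care about all the candidates roughly equally.
    return candidate_utilities(x).sum()

def proxy_utility(x):
    # A sharp proxy: only the first candidate counts.
    return candidate_utilities(x)[0]

def fuzzy_utility(x, temp=0.5):
    # A "fuzzy" aggregate: a soft-minimum over the candidates, so no single
    # candidate can be pushed to an extreme at the others' expense.
    vals = candidate_utilities(x)
    return -temp * np.log(np.exp(-vals / temp).sum())

def maximise(score_fn, n=20_000):
    # Crude optimiser: sample random allocations of the budget, keep the best.
    allocations = rng.dirichlet(np.ones(DIM), size=n) * BUDGET
    scores = np.array([score_fn(a) for a in allocations])
    return allocations[scores.argmax()]

x_proxy = maximise(proxy_utility)
x_fuzzy = maximise(fuzzy_utility)

print("true utility when hard-maximising the proxy:", round(true_utility(x_proxy), 2))
print("true utility when maximising with fuzziness:", round(true_utility(x_fuzzy), 2))
```

On this toy, the proxy maximiser dumps nearly the whole budget into the first feature and does much worse on the pretend true utility than the fuzzy maximiser, which spreads effort across all the candidates; that concentrate-on-the-proxy failure is exactly what keeping the fuzziness is meant to blunt.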
> The fuzziness will never get fully resolved. This approach is meant to deal with Goodhart-style problems without optimisation leading to disaster;
I’m saying this isn’t clear, because optimizing for a fuzzy utility function instead of the true utility function could lead to astronomical waste or be a form of x-risk, unless you also had a solution to corrigibility such that you could shut down the AI before it used up much of the resources of the universe trying to optimize for the fuzzy utility function. But then the corrigibility solution seems to be doing most of the work of making the AI safe. For example, without a corrigibility solution, it seems like the AI would not try to help you resolve your own uncertainty/fuzziness about values, and would actually impede your own efforts to do so (because then your values would diverge from its values and you’d want to shut it down later or change its utility function).
> I’m working on other approaches that could allow the synthesis of the actual values.
Ok, so I’m trying to figure out how these approaches fit together. Are they meant to both go into the same AI (if so how?), or is it more like, “I’m not sure which of these approaches will work out so let’s research them simultaneously and then implement whichever one seems most promising later”?
> “I’m not sure which of these approaches will work out so let’s research them simultaneously and then implement whichever one seems most promising later”
That, plus “this approach has progressed as far as it can, there remains uncertainty/fuzziness, so we can now choose to accept the known loss to avoid the likely failure of maximising our current candidate without fuzziness”. This is especially the case if, like me, you feel that human values have diminishing marginal utility in resources. Even without that, the fuzziness can be an acceptable cost, if we assign a high probability to losing value to Goodhart-like effects if we maximise the wrong thing without fuzziness.
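As an illustrative back-of-the-envelope version of that trade-off (the numbers and the two-outcome model are mine, purely for illustration): suppose hard-maximising our best sharp candidate captures essentially all the attainable value \(V\) if it happens to be right, but Goodharts to roughly zero with probability \(p\), while maximising with fuzziness reliably captures some fraction \(f\) of \(V\). Then

\[
\mathbb{E}[U_{\text{sharp}}] = (1-p)\,V + p \cdot 0, \qquad
\mathbb{E}[U_{\text{fuzzy}}] = f\,V,
\]

so accepting the fuzziness is the better bet whenever \(f > 1-p\); with \(p = 0.5\), reliably keeping even 60% of the value beats the gamble. And if value is roughly logarithmic in resources, giving up a sizeable share of the resources costs only a modest share of the value, which pushes the effective \(f\) higher still.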
There’s one other aspect I should emphasise: AIs drawing boundaries we have no clue about (as they do now between pictures of cats and dogs). When an AI draws boundaries between acceptable and unacceptable worlds, we can’t describe this as reducing human uncertainty: the AI is constructing its own concepts, finding patterns in human examples. Trying to make those boundaries work well is, to my eyes, not well described in any Bayesian framework.
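A minimal sketch of what I mean by the AI drawing its own boundary (toy data and a stock logistic-regression classifier, both invented here purely for illustration, not part of the actual proposal):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Hypothetical "world features" (names and numbers invented for this sketch).
acceptable   = rng.normal(loc=[1.0, 1.0, 0.0], scale=0.5, size=(30, 3))
unacceptable = rng.normal(loc=[-1.0, 0.0, 1.0], scale=0.5, size=(30, 3))

X = np.vstack([acceptable, unacceptable])
y = np.array([1] * 30 + [0] * 30)   # human labels: acceptable (1) / unacceptable (0)

clf = LogisticRegression().fit(X, y)

# The boundary the AI has drawn: a weight vector no human ever wrote down.
print("learned boundary:", clf.coef_.round(2), "bias:", clf.intercept_.round(2))

# Judging a new, unlabelled world uses that constructed boundary.
new_world = np.array([[0.2, 0.8, 0.4]])
print("acceptable?", bool(clf.predict(new_world)[0]))
```

The humans only supplied the labelled examples; the boundary itself is a concept the system constructed, and there was never a pre-specified set of human hypotheses for a posterior to concentrate on, which is why “reducing human uncertainty” is a poor description of what happened.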
It’s very possible that we might get to a point where we could say “we expect that this AI will synthesise a good measure of human preferences. The measure itself has some light fuzziness/uncertainty, but our knowledge of it has a lot of uncertainty”.
So I’m not sure that uncertainty, or even fuzziness, is necessarily the best way of describing this.