I think this depends on how hard the AIs are optimising, and how complicated the objectives are. I think that sufficiently moderate optimisation for goals sufficiently close to human values will probably turn out well.
I also think that optimisation is likely to end up at the physical limits, unless we know how to program an AI that doesn’t want to improve itself, and everyone makes AIs like that.
A sufficiently moderate AI is just dumb, which is safe. An AI smart enough to stop people from producing more AIs, yet dumb enough to be safe, seems harder.
There is also the question of what “better capturing what humans want” means. A utility function that, when restricted to the space of worlds roughly similar to this one, produces utilities close to the true human utility function seems easy enough. Suppose we have defined something close to human well-being, and that the definition is in terms of the levels of various neurotransmitters near human DNA. Let’s suppose this definition would be highly accurate over all of history, and would make the right decision on nearly all current political issues. It could still fail completely in a future containing uploaded minds and neurochemical vats.
Either your approximate utility function needs to be pretty close on all possible futures (even adversarially chosen ones), or you need to know that the AI won’t guide the future towards the places where the utility functions differ.
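To make that concrete, here is a toy sketch (my own illustration, nothing more): `true_utility` and `proxy_utility` are made-up functions that agree closely on a “familiar” range of states, yet an optimiser searching a much wider space of states ends up exactly where the proxy scores highly and the true utility is terrible.

```python
# Toy illustration (made up for this comment, not a real alignment method):
# a proxy utility that matches the "true" utility on familiar states but
# diverges far outside them, so hard optimisation of the proxy lands
# exactly where the two functions disagree.

import numpy as np

rng = np.random.default_rng(0)

def true_utility(x):
    # Hypothetical "true" human utility: prefers x near 1, falls off fast.
    return -(x - 1.0) ** 2

def proxy_utility(x):
    # Proxy fitted to the familiar range [-2, 2]: nearly identical there,
    # but the small cubic term keeps growing and dominates far away.
    return -(x - 1.0) ** 2 + 0.01 * x ** 3

familiar = rng.uniform(-2, 2, size=1000)
print("max |proxy - true| on familiar states:",
      np.max(np.abs(proxy_utility(familiar) - true_utility(familiar))))

# Moderate optimisation: pick the best option among familiar states.
weak_choice = familiar[np.argmax(proxy_utility(familiar))]

# Hard optimisation: search a vastly larger space of possible states.
extreme = np.linspace(-100, 100, 200_001)
strong_choice = extreme[np.argmax(proxy_utility(extreme))]

for label, x in [("moderate optimiser", weak_choice),
                 ("hard optimiser", strong_choice)]:
    print(f"{label}: x = {x:6.2f}, proxy = {proxy_utility(x):8.1f}, "
          f"true = {true_utility(x):8.1f}")
```

The point of the sketch is only that closeness on the familiar distribution says nothing about closeness at the extremes a powerful optimiser will actually reach.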