I think there is good reason to believe that an actual VNM representation of human preferences would not be a very good approximation. On the other hand, as long as you don’t program an AI that way, with an explicit utility function, I think it is unlikely to be dangerous even if it does not have exactly human values. This is why I said the most important thing is to make sure the AI does not have a utility function. I’m trying to write a discussion post on that now, but something has gone wrong with the posting.
I thought you could map an unbounded utility function to a bounded one and produce the same behavior, but you may be right that this is not really possible, since you have to multiply your utilities by probabilities. I would have to think about that more.
It’s awfully suspicious to say that the one goal architecture coherent enough to analyse easily is dangerous, but that all the others are safe. More concretely, humans are not VNM-rational (as you pointed out), and they often pose threats to other agents anyway. Also, an AI does not have to be programmed with an explicit utility function in order to be VNM-rational, and thus to behave as if it has a utility function.
I thought you could map an unbounded function to a bounded one to produce the same behavior, but actually you may be right that this is not really possible since you have to multiply your utilities by probabilities.
You can rescale an unbounded utility function to a bounded one that will have the same preferences over known outcomes, but this will change its preferences over gambles; in particular, agents with bounded utility functions cannot be made to care about arbitrarily small probabilities of arbitrarily good/bad outcomes.
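To make that concrete, here is a minimal numeric sketch in Python. The particular functions and numbers (an unbounded u(x) = x, a bounded rescaling v(x) = x/(1+x), a one-in-a-million shot at a huge prize) are illustrative choices of mine, not anything from the thread; they just show that the two agents rank sure outcomes identically but come apart on a long-shot gamble.

```python
# A minimal numeric sketch of the point above. The functions u and v are
# illustrative choices, not anything from the original discussion.

def u(x):
    """Unbounded utility: just the payoff itself."""
    return x

def v(x):
    """Bounded rescaling of u: same ranking of sure outcomes, but capped below 1."""
    return x / (1 + x)  # assumes x >= 0

def expected_utility(utility, p, payoff, fallback=0.0):
    """Expected utility of a gamble: `payoff` with probability p, else `fallback`."""
    return p * utility(payoff) + (1 - p) * utility(fallback)

sure_thing = 10
p, payoff = 1e-6, 1e9  # tiny chance of a huge prize

# The unbounded agent takes the gamble; the bounded one never will, because
# v is capped at 1, so the gamble's expected utility can never exceed
# p * 1 = 1e-6, no matter how large the payoff gets.
print(expected_utility(u, p, payoff) > u(sure_thing))  # True
print(expected_utility(v, p, payoff) > v(sure_thing))  # False
```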
Yes, you’re right about the effect of rescaling an unbounded function.
I don’t see why it’s suspicious that less coherent goal systems are safer. Being less coherent means being closer to having no goals at all, and a thing without goals is not particularly dangerous. Take a rock, for example. We could theoretically say that the path a rock takes when it falls is determined by a goal system, but it would not be particularly helpful to describe it as maximizing a utility function, and likewise it is not especially dangerous. True, it can kill you if it hits you on the head, but it is not going to take over the world.
I described in my top-level post what kind of behavior I would expect from an intelligent goal system that was not programmed with an explicit utility function. You might be able to describe its behavior with a utility function in theory, but that is not the most helpful description. For example, suppose we program a chess-playing AI. As long as it is programmed to choose its moves in a deterministic fashion, optimizing based solely on the present game (i.e. not on anything it has learned about the current player, but only on the current position), then no matter how intelligent it becomes it will never try to take over the universe. In fact, it will never try to do anything except play chess moves, since it is physically impossible for it to do anything else, just as a rock will never do anything except fall.
Notice that this is also closer to having no goals, since such a chess-playing AI cannot try to affect the universe in any particular way. (That is why I said based on the game alone: if it could base its moves on the person playing, then in theory it could secretly have various goals, such as driving someone insane by making them lose chess games, even if no one programmed those goals explicitly.) But as long as its moves are generated deterministically from the current position alone, it cannot have any grand destructive goal, just as a rock does not.
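For what it’s worth, here is a minimal sketch of the kind of architecture I mean, in Python using the python-chess library. The move chooser is a pure, deterministic function of the current position and nothing else: no opponent model, no memory of past games, no other channel to the outside world. The one-ply material evaluation is just a placeholder for whatever stronger evaluation such a program might actually use.

```python
# A minimal sketch of the architecture described above: the move chooser is a
# pure, deterministic function of the current position and nothing else.
# The material-count evaluation is an illustrative stand-in.
import chess

PIECE_VALUES = {chess.PAWN: 1, chess.KNIGHT: 3, chess.BISHOP: 3,
                chess.ROOK: 5, chess.QUEEN: 9, chess.KING: 0}

def material_balance(board: chess.Board, color: chess.Color) -> int:
    """Material count from `color`'s point of view."""
    total = 0
    for piece_type, value in PIECE_VALUES.items():
        total += value * len(board.pieces(piece_type, color))
        total -= value * len(board.pieces(piece_type, not color))
    return total

def choose_move(board: chess.Board) -> chess.Move:
    """Deterministically pick a move as a function of the position alone.
    Ties are broken by the move's UCI string, so the same position always
    yields the same move."""
    side = board.turn
    def score(move: chess.Move):
        board.push(move)          # look one ply ahead
        value = material_balance(board, side)
        board.pop()
        return (value, move.uci())
    return max(board.legal_moves, key=score)

# Usage: the output depends only on the position, never on who is playing.
board = chess.Board()
print(choose_move(board))
```

However clever the evaluation inside choose_move becomes, the function’s only output is a legal move for the position it was handed, which is the whole point.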