If it’s too hard to make AI systems in this way and we need to have them learn goals from humans, we could at least have them learn from idealized humans rather than real ones.
My interpretation of how the term is used here and elsewhere is that idealized humans are usually, in themselves and when we ignore costs, worse than real ones. For example, they could be based on predictions of human behavior that are not quite accurate, or they may only remain sane for an hour of continuous operation from some initial state. They are only better because they can be used in situations where real humans can’t be used, such as in an infinite HCH, an indirect-normativity-style definition of AI goals, or a simulation of how a human develops when exposed to a certain environment (training). Their nature as inaccurate predictions may make them much more computationally tractable, and actually available in situations where real humans aren’t, and so more useful when we can compensate for the errors. So a better term might be “abstract humans” or “models of humans”.
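As a rough illustration of the HCH point (a toy sketch, not a concrete proposal; all names in it are made up), the reason the consulted “human” in something like HCH has to be a model rather than a real person is that the recursion consults it far more often than any real human could be consulted:

```python
# Toy sketch of HCH with a *model* of a human in place of a real one.
# Every name here is hypothetical; the only point is that the recursion
# consults the "human" at every node of the question tree, which is why a
# computational model is needed in its place.

from typing import Callable, List, Optional


class Step:
    """What the modeled human does with a question: either answer it,
    or delegate subquestions to further copies of the model."""
    def __init__(self, answer: Optional[str] = None, subquestions: Optional[List[str]] = None):
        self.answer = answer
        self.subquestions = subquestions or []


# A "model of a human": maps a question plus answers-so-far to a Step.
# It may be an inaccurate prediction of what a real human would do.
HumanModel = Callable[[str, List[str]], Step]


def hch_answer(question: str, model: HumanModel, depth: int = 0, max_depth: int = 3) -> str:
    """Answer a question by consulting the model, recursing on any
    subquestions it delegates (a depth cap stands in for 'infinite')."""
    step = model(question, [])
    if step.answer is not None or depth >= max_depth:
        return step.answer or "(no answer at depth limit)"
    sub_answers = [hch_answer(q, model, depth + 1, max_depth) for q in step.subquestions]
    return model(question, sub_answers).answer or "(no answer)"


# Toy stand-in model: splits a compound question once, then combines answers.
def toy_model(question: str, sub_answers: List[str]) -> Step:
    if not sub_answers and " and " in question:
        return Step(subquestions=question.split(" and ", 1))
    return Step(answer=f"answer({question!r} given {sub_answers})")


print(hch_answer("what is safe and what is useful", toy_model))
```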
If these artificial environments with models of humans are good enough, they may also be able to bootstrap more accurate models of humans and put them into environments that produce better decisions, so that the initial errors in prediction won’t affect the eventual outcomes.
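A toy picture of that bootstrapping claim (again only an illustrative sketch, with made-up names and numbers): if each round in a good-enough environment closes some fraction of the gap between the current model and an accurate one, then runs started from different initial errors converge to the same place, so the initial prediction errors wash out of the eventual outcome:

```python
# Purely illustrative sketch of the bootstrapping idea. All names and numbers
# are hypothetical; the point is only that iterates started from different
# initial errors converge to (approximately) the same fidelity.

def bootstrap_step(fidelity: float, environment_quality: float = 0.9) -> float:
    """One round: the current model, working in a controlled environment,
    closes a fraction of the remaining gap to an accurate model."""
    return fidelity + environment_quality * (1.0 - fidelity) * 0.5

for initial in (0.5, 0.7, 0.9):      # different amounts of initial prediction error
    fidelity = initial
    for _ in range(20):              # repeated bootstrapping rounds
        fidelity = bootstrap_step(fidelity)
    print(f"start={initial:.1f} -> final fidelity ~ {fidelity:.4f}")
# All three runs end up essentially at the same fidelity near 1.0.
```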
Perhaps Wei Dai could clarify, but I thought the point of idealized humans was to avoid problems of value corruption or manipulation, which makes them better than real ones.
I agree that idealized humans have the benefit of making things like infinite HCH possible, but that doesn’t seem to be a main point of this post.
I thought the point of idealized humans was to avoid problems of value corruption or manipulation
Among other things, yes.
which makes them better than real ones
This framing loses the distinction I’m making. They are more useful when taken together with their environment, but not necessarily better in themselves. These are essentially real humans who behave better because of the environments in which they operate and the lack of direct influence from the outside world (which in some settings could also apply to the environments in which they were raised). But they share the same vulnerabilities (to outside influence or unusual situations) as real humans, and those vulnerabilities can affect them if they are taken outside their safe environments. And in themselves, abstracted from their environment, they may be worse than real humans, in the sense that they make less aligned or less correct decisions, if the idealized humans are inaccurate predictions of the hypothetical behavior of real humans.
Yeah, I agree with all of this. How would you rewrite my sentence/paragraph to be clearer, without making it too much longer?