I actually think this is pretty wrong (posts forthcoming, but see here for the starting point). You make a separation between the modeled human values and the real human values, but “real human values” are a theoretical abstraction, not a basic part of the world. In other words, real human values were always a subset of modeled human values.
In the example of designing a transit system, there is an unusually straightforward division between things that actually make the transit system good (by concise human-free metrics like reliability or travel time), and things that make human evaluators wrongly think it’s good. But there’s not such a concise human-free way to write down general human values.
The pitfall of optimization here happens when the AI is searching for an output that has a specific effect on humans. If you can’t remove the fact that there is a model of humans involved, then avoiding the pitfall requires the AI to evaluate its output in some way other than by modeling the human’s reaction to it.
As far as I understand the post, a system that wouldn’t contain human values but would still be sufficient to drastically reduce existential risk from AI would not need to execute an action that has a specific effect on humans. If I’m getting the context right, it refers to something like task-directed AGI that would allow its owner to execute a pivotal act – in other words, this is not yet the singleton we want to (maybe) finally build that CEVs us out into the universe, but something that enables us to think long & carefully enough to actually build CEV safely (e.g. by giving us molecular nanotechnology or uploading, which perhaps doesn’t depend on human values, modeled or otherwise).
Or have I misunderstood your comment?