The counter-examples are of that type because the examples are often of that type—presented formally, so vulnerable to a formal solution.
If you’re saying that “−10^(10^10) utility on something like turning on a yellow light” is not a reasonable utility function, then I agree with you, and that’s the very point of this post—we need to define what a “reasonable” utility function is, at least to some extent (“partial preferences...”), to get anywhere with these ideas.
“The counter-examples are of that type because the examples are often of that type—presented formally, so vulnerable to a formal solution.”
It does not seem to me that the cluster of concepts in corrigibility, Clarifying AI Alignment, and my comment on it is presented formally. They feel very, very informal (to the point that I think we should try to make them more formal, though I’m not optimistic about getting them to the level of formality you typically use).
(I still need to get a handle on ascription universality, which might make these concepts more formal, but from what I understand of it so far, it’s still much less formal than what you usually work with.)
“we need to define what a ‘reasonable’ utility function is...”
My argument is that we don’t need to define this formally; we can reason about it informally and still get justified confidence that we will get good outcomes, though not justified confidence that the chance of failure is below 1 in a billion.