The counter-examples are of that type because the examples are often of that type: presented formally, and therefore vulnerable to a formal solution.
It does not seem to me that the cluster of concepts in corrigibility, Clarifying AI Alignment, and my comment on it is presented formally. These concepts feel very, very informal (to the point that I think we should try to make them more formal, though I’m not optimistic about getting them to the level of formality you typically use).
(I still need to get a handle on ascription universality, which might be making these concepts more formal, but from what I understand of it so far, it’s still much less formal than what you usually work with.)
we need to define what a “reasonable” utility function is
My argument is that we don’t need to define this formally; we can reason about it informally and still get justified confidence that we will get good outcomes, though not justified confidence that the chance of failure is below one in a billion.