I guess my point is that your counterexamples/problems all have this very formal no-free-lunch theorem aspect to them, and we solve problems that have no-free-lunch theorems all the time—in fact a lot of the programming languages community is tackling such problems and getting decent results in most cases.
For this reason you could say “okay, while there is a no-free-lunch theorem here, probably when the AI system carves reality at the joints, it ends up with features / cognition that doesn’t consider the −10^(10^10) utility on something like turning on a yellow light to be a reasonable utility function”. You seem to be opposed to any reasoning of this sort, and I don’t know why.
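To make the PL analogy concrete: whole-program termination checking is undecidable, a no-free-lunch result if there ever was one, yet conservative checkers still certify the cases that actually come up. Here is a toy sketch of that sound-but-incomplete style (the function and the whitelist are my own invention for illustration, not any real tool):

```python
# Toy illustration of living with a no-free-lunch theorem: termination
# checking is undecidable in general, but a conservative checker can
# certify the easy cases and refuse to answer on everything else.
import ast

# Hypothetical whitelist of calls allowed in "obviously terminating" code.
SAFE_CALLS = {"print", "range", "len"}

def obviously_terminates(src: str) -> bool:
    """Return True only when termination is syntactically evident.

    Sound but incomplete: a True answer is always correct, but many
    terminating programs get rejected. That trade-off is how static
    analysis gets decent results despite the impossibility result.
    """
    tree = ast.parse(src)
    for node in ast.walk(tree):
        # Reject anything that could hide unbounded computation:
        # while-loops, user-defined functions (recursion), lambdas.
        if isinstance(node, (ast.While, ast.FunctionDef,
                             ast.AsyncFunctionDef, ast.Lambda)):
            return False
        # for-loops are fine only when iterating over range(...).
        if isinstance(node, ast.For):
            it = node.iter
            if not (isinstance(it, ast.Call)
                    and isinstance(it.func, ast.Name)
                    and it.func.id == "range"):
                return False
        # Calls are fine only to the whitelisted builtins.
        if isinstance(node, ast.Call):
            if not (isinstance(node.func, ast.Name)
                    and node.func.id in SAFE_CALLS):
                return False
    return True  # straight-line code plus range() loops always halts

print(obviously_terminates("for i in range(10):\n    print(i)"))  # True
print(obviously_terminates("while True:\n    pass"))              # False
print(obviously_terminates("f()"))  # False: unknown call, play it safe
```

The shape of the move is the point: accept a “don’t know” region instead of demanding a decision procedure that works on every adversarially chosen program.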
The counter-examples are of that type because the examples are often of that type—presented formally, so vulnerable to a formal solution.
If you’re saying that “−10^(10^10) utility on something like turning on a yellow light” is not a reasonable utility function, then I agree with you, and that’s the very point of this post: we need to define what a “reasonable” utility function is, at least to some extent (“partial preferences...”), to get anywhere with these ideas.
The counter-examples are of that type because the examples are often of that type—presented formally, so vulnerable to a formal solution.
It does not seem to me that the cluster of concepts in corrigibility, Clarifying AI Alignment, and my comment on it is presented formally. These concepts feel very, very informal (to the point that I think we should try to make them more formal, though I’m not optimistic about getting them to the level of formality you typically use).
(I still need to get a handle on ascription universality, which might be making these concepts more formal, but from what I understand of it so far it’s still much less formal than you usually work with.)
we need to define what a “reasonable” utility function
My argument is that we don’t need to define this formally; we can reason about it informally and still get justified confidence that we will get good outcomes, though not justified confidence in a less-than-1-in-a-billion chance of failure.
My judgements come mainly from trying to make corrigibility, impact measures, etc., work, and from running into similar problems in all cases.