Now, people working in these areas don’t often disagree with this formal argument; they just think it isn’t that important. They feel that getting the right formalism is most of the work, and finding the right U is easier, or just a separate bolt-on that can be added later.
My intuition, formed mainly by my many failures in this area, is that defining U is absolutely critical, and is much harder than the rest of the problem. Others have different intuitions, and I hope they’re right.
I’m curious if you’re aiming for justified 99.9999999% confidence in having a friendly AI on the first try (i.e. justified belief that there’s no more than a 1 in a billion chance of not-a-friendly-AI-on-the-first-try). I would agree that defining U is necessary to hit that sort of confidence, and that it’s much harder than the rest of the problem.
ETA: The reason I ask is that the argument in this post seems very similar to the problem I have with impact measures (briefly: either you fail to prevent catastrophes, or you never do anything useful), but I wouldn’t apply that argument to corrigibility. I think the difference might be that I’m thinking of “natural” things that agents might want, whereas you’re considering the entire space of possible utility functions. I’m trying to figure out why we have this difference.
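One toy way to see the shape of that dilemma, assuming the usual form in which an impact penalty is subtracted from task reward with some weight λ (the action names and numbers below are invented purely for illustration):

```python
# Toy model of the dilemma: the agent maximises task_reward - lam * measured_impact.
# The numbers are contrived so that the catastrophe is under-rated by the
# impact measure relative to its nominal reward.

actions = {
    # name: (task_reward, impact score under some imperfect impact measure)
    "do_nothing":    (0.0, 0.0),
    "useful_action": (1.0, 0.5),     # mildly irreversible, clearly worth doing
    "catastrophe":   (100.0, 10.0),  # huge nominal reward, impact score too low
}

def best_action(lam):
    """Return the action maximising penalised reward for penalty weight lam."""
    return max(actions, key=lambda a: actions[a][0] - lam * actions[a][1])

for lam in (0.5, 5.0, 50.0):
    print(lam, best_action(lam))
# lam = 0.5  -> catastrophe  (100 - 5 beats everything: the penalty is too weak)
# lam = 5.0  -> catastrophe  (100 - 50 still beats 1 - 2.5 and 0)
# lam = 50.0 -> do_nothing   (the penalty now also swamps the useful action)
# No value of lam makes "useful_action" win, because the catastrophe's
# reward-to-measured-impact ratio (10) exceeds the useful action's (2).
```

This only bites when the impact measure under-rates the catastrophe relative to its reward, as it does here by construction; whether realistic impact measures have that failure mode is a further question.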
I guess my point is that your counterexamples/problems all have this very formal no-free-lunch theorem aspect to them, and we solve problems that have no-free-lunch theorems all the time—in fact a lot of the programming languages community is tackling such problems and getting decent results in most cases.
For this reason you could say “okay, while there is a no-free-lunch theorem here, probably when the AI system carves reality at the joints, it ends up with features / cognition that doesn’t consider the −10^{10^{10}} utility on something like turning on a yellow light to be a reasonable utility function”. You seem to be opposed to any reasoning of this sort, and I don’t know why.
The counter-examples are of that type because the examples are often of that type—presented formally, so vulnerable to a formal solution.
If you’re saying that “−10^{10^{10}} utility on something like turning on a yellow light” is not a reasonable utility function, then I agree with you, and that’s the very point of this post—we need to define what a “reasonable” utility function is, at least to some extent (“partial preferences...”), to get anywhere with these ideas.
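For concreteness, a minimal sketch of one crude way a “reasonable” class of utility functions could be carved out (bounded, sparse, linear utilities over named features); this is not the “partial preferences” proposal itself, and the feature names and thresholds are purely illustrative:

```python
# Crude sketch: admit only bounded, sparse, linear utilities over named features.
# The bounds, feature names, and stand-in penalty value are illustrative choices,
# not anything fixed by the discussion above.

def is_reasonable(weights, max_magnitude=1e3, max_features=10):
    """Filter over linear feature-weight utility functions."""
    nonzero = {f: w for f, w in weights.items() if w != 0}
    return (len(nonzero) <= max_features
            and all(abs(w) <= max_magnitude for w in nonzero.values()))

# Stand-in for the -10^(10^10) penalty on "a yellow light turns on";
# the literal number is far too large to write down as a float.
pathological = {"yellow_light_on": -1e300}
sensible = {"humans_flourishing": 100.0, "paperclips_produced": 0.1}

print(is_reasonable(pathological))  # False: the weight's magnitude is absurd
print(is_reasonable(sensible))      # True
```

The hard part, of course, is choosing the feature space and the bounds so that such a filter tracks what humans would actually consider reasonable, rather than an arbitrary cut-off.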
It does not seem to me that the cluster of concepts in corrigibility, Clarifying AI Alignment, and my comment on it is presented formally. These concepts feel very, very informal (to the point that I think we should try to make them more formal, though I’m not optimistic about getting them to the level of formality you typically use).
(I still need to get a handle on ascription universality, which might be making these concepts more formal, but from what I understand of it so far, it’s still much less formal than what you usually work with.)
My argument is that we don’t need to define what a “reasonable” utility function is formally; we can reason about it informally and still get justified confidence that we will get good outcomes, though not justified confidence in a less-than-1-in-a-billion chance of failure.
My judgements come mainly from trying to make corrigibility, impact measures, etc. work, and from having similar problems in all cases.