I am confused what you think I was trying to do with that intuition pump.
I think I’m confused about the intuition pump too! Like, here are some options I thought up:
The ‘alignment problem’ is really the ‘not enough oversight’ problem. [But then if we solve the ‘enough oversight’ problem, we still have to solve the ‘what we want’ problem, the ‘coordination’ problem, the ‘construct competitively’ problem, etc.]
Bits of the alignment problem can be traded off against each other, most obviously coordination and ‘alignment tax’ (i.e. the additional work you need to do to make a system aligned, or the opposite of ‘competitiveness’, a term I didn’t want to use here for ease-of-understanding-by-newbies reasons). [But that’s basically just coordination and competitiveness; like, you could imagine that oversight gives you a rejection-sampling story for trading off time and understanding, but I think this is basically not true, because you’re also optimizing for finding holes in your transparency regime; see the toy sketch after this list.]
Like, by analogy, I could imagine someone who uses an intuition pump of “if you had sufficient money, you could solve any problem”, but I wouldn’t use that intuition pump because I don’t believe it. [Sure, ‘by definition’ if the amount of money doesn’t solve the problem, it’s not sufficient. But why are we implicitly positing that there exists a sufficient amount of money instead of thinking about what money cannot buy?]
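To make the rejection-sampling worry in the second bullet concrete, here’s a toy numerical sketch (the names, the Gaussian noise model, and the scoring are all illustrative assumptions of mine, not anything from the actual proposal): selecting the best of n candidates on what oversight can see also selects on the part of the score that oversight gets wrong.

```python
import random

def propose(rng):
    # Hypothetical candidate plan: 'true_value' is what we actually care about;
    # 'blind_spot' is apparent value that only shows up because the transparency
    # tools can't see the flaw. Oversight can only score their sum.
    true_value = rng.gauss(0, 1)
    blind_spot = rng.gauss(0, 1)
    return true_value, blind_spot, true_value + blind_spot

def best_of_n(n, rng):
    """Rejection-sampling / best-of-n selection against the oversight score only."""
    return max((propose(rng) for _ in range(n)), key=lambda plan: plan[2])

rng = random.Random(0)
for n in (1, 10, 1000):
    picked = [best_of_n(n, rng) for _ in range(2000)]
    avg_true = sum(p[0] for p in picked) / len(picked)
    avg_blind = sum(p[1] for p in picked) / len(picked)
    print(f"n={n:5d}: true value {avg_true:+.2f}, blind-spot component {avg_blind:+.2f}")
```

In this toy model, more samples do buy a genuinely better plan on average, but an equal share of the gain goes into the blind-spot term: the extra selection pressure is also pressure toward plans whose problems the transparency regime can’t see.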
(After reading the rest of your comment, it seems pretty clear to me that you mean the first bullet, as you say here:)
in which case I’d change it to “powerful oversight would deal with the particular technical problems that we call outer and inner alignment”, but was it really so non-obvious that I was talking about the technical problems
I both 1) didn’t think it was obvious (sorry if I’m being slow on following the change in usage of ‘alignment’ here) and 2) don’t think realistically powerful oversight solves either of those two on its own (outer alignment because of the “rejection sampling can get you siren worlds” problem; inner alignment because of the “rejection sampling isn’t competitive” problem, though I find that one not very compelling and suspect I’ll eventually develop a better objection).
[EDIT: I note that I also might be making another unfavorable assumption here, where I’m assuming “unlimited oversight capacity” is something like “perfect transparency” (and so we might not choose to spend all of our oversight capacity), but you might be including things here like “actually it takes no time to understand what the model is doing” or “the oversight capacity is of humans too”, which I think weakens the outer alignment objection pretty substantially.]
If so, I totally agree that it does not in fact include all the things needed for a good future, and it was not meant to be saying that.
Cool! I’m glad we agree on that, and I will try to do more “did you mean limited statement X, which we agree about more?” in the future.
This just doesn’t seem plausible to me. Where did the information come from? Did the AI system optimize the information to be convincing? If yes, why didn’t we notice that the AI system was doing that? Can we solve this by ensuring that we do due diligence, even if it doesn’t seem necessary?
It came from where we decided to look. While I think it’s possible to have an AI that’s out to deceive us, putting information we want to see where we’re going to look and information we don’t want to see where we’re not going to look, I think this is going to happen by default, because the human operators will have a smaller checklist than they should have: “Will the AI cure cancer? Yes? Cool, press the button.” instead of “Will the AI cure cancer? Yes? Cool. Will it preserve our ability to generate more AIs in the future to solve additional problems? No? Hmm, let’s take a look at that.”
Like, this is the sort of normal software development story where bugs that cause the system to visibly not work get noticed and fixed, while bugs that cause the system to do things the programmers don’t intend only get noticed if the programmers anticipated them and wrote a test for them, or a user discovered them in action and reported them, or an adversary discovered they were possible by reading the code / experimenting with the system and deliberately triggered them.
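A minimal sketch of that failure mode (all names hypothetical, assumed by me for illustration): the only check that exists is the one the operators thought to write, so the intended behavior is verified and the unintended side effect ships silently.

```python
def execute_plan(world):
    # Pretend AI-produced plan: it achieves the stated goal,
    # and also does something the operators never asked about.
    world["cancer_cured"] = True
    world["future_ai_options_preserved"] = False  # unintended, and never checked
    return world

def operator_checklist(world):
    # The only test the operators anticipated needing.
    assert world["cancer_cured"], "reject: doesn't cure cancer"

world = {"cancer_cured": False, "future_ai_options_preserved": True}
world = execute_plan(world)
operator_checklist(world)  # passes: the visible goal is met, so the button gets pressed
print("checklist passed; nobody tested whether future options were preserved")
```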
I mean, maybe we should just drop this point about the intuition pump; it was a throwaway reference in the original comment. I normally use it to argue against a specific mentality I sometimes see in people, and I guess it doesn’t make sense outside of that context.
(The mentality is “it doesn’t matter what oversight process you use, there’s always a malicious superintelligence that can game it, therefore everyone dies”.)