I can of course imagine a reasonable response to that from you: “ah, resolving philosophical difficulties is the user’s problem, and not one of the things that I mean by alignment.”
That is in fact my response. (Though one of the ways in which the intuition pump isn’t fully compelling to me is that, even after understanding the exact program that the AGI implements and its causal history, maybe the overseers can’t correctly predict the consequences of running that program for a long time. Still feels like they’d do fine.)
I do agree that if you go as far as “logical omniscience” then there are “cheating” ways of solving the problem that don’t really tell us much about how hard alignment is in practice.
But like, the thing this reminds me of is something like extrapolating tangents, instead of operating the production function? “If we had an infinitely good engine, we could make the perfect car”, which seems sensible when you’re used to thinking of engine improvements linearly increasing car quality and doesn’t seem sensible when you’re used to thinking of car quality as a product of sigmoids of the input variables.
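(A minimal sketch of the production-function picture described above; the product-of-sigmoids form and the specific inputs are illustrative assumptions, not anything from the original discussion.)

```python
import math

def sigmoid(x):
    """Standard logistic function; saturates at 1 no matter how large x gets."""
    return 1 / (1 + math.exp(-x))

def car_quality_linear(engine, steering, brakes):
    # Tangent-extrapolation picture: engine improvements keep adding quality forever.
    return engine + steering + brakes

def car_quality_production(engine, steering, brakes):
    # Production-function picture: quality is a product of saturating terms,
    # so one bad input caps quality no matter how good the others are.
    return sigmoid(engine) * sigmoid(steering) * sigmoid(brakes)

# An "infinitely good" engine paired with broken steering:
print(car_quality_linear(1e9, -5.0, 1.0))      # ~1e9: the linear picture says "great car"
print(car_quality_production(1e9, -5.0, 1.0))  # ~0.005: the product picture says "doom"
```

Under the product form, pushing one input toward infinity just saturates its own term, and overall quality stays bounded by the worst input, which is the sense in which “infinitely good engine, therefore perfect car” stops looking sensible.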
The car analogy just doesn’t seem sensible. I can tell stories of car doom even if you have infinitely good engines (e.g. the steering breaks). My point is that we struggle to tell stories of doom when imagining a very powerful oversight process that knows everything the model knows.
I’m not thinking “more oversight quality --> more alignment” and then concluding “infinite oversight quality --> alignment solved”. I’m starting with the intuition pump, noticing I can no longer tell a good story of doom, and concluding “infinite oversight quality --> alignment solved”. So I don’t think this has much to do with extrapolating tangents vs. production functions, except inasmuch as production functions encourage you to think about complements to your inputs that you can then posit don’t exist in order to tell a story of doom.
I’m starting with the intuition pump, noticing I can no longer tell a good story of doom, and concluding “infinite oversight quality --> alignment solved”.
I think some of my more alignment-flavored counterexamples look like:
The ‘reengineer it to be safe’ step breaks down / isn’t implemented thru oversight. Like, if we’re positing we spin up a whole Great Reflection to evaluate every action the AI takes, this seems like it’s probably not going to be competitive!
The oversight gives us as much info as we ask for, but the world is a siren world (like what Stuart points to, but a little different), where the initial information we discover about the plans from oversight is so convincing that we decide to go ahead with the AI before discovering the gotchas.
Related to the previous point, the oversight is sufficient to reveal features of the plan that are terrible, but before the ‘reengineer to make it more safe’ step is executed, the code is stolen and executed by a subset of humanity which thinks the terrible plan is ‘good enough’, for them at least.
That is, it feels to me like we benefit a lot from having 1) a constructive approach to alignment instead of rejection sampling, 2) sufficient security focus that we don’t proceed on the expected value of known information, but actually do the ‘due diligence’, and 3) sufficient coordination among humans that we don’t leave behind substantial swaths of current human preferences, and I don’t see how we get those thru having arbitrary transparency.
[I also would like to solve the problem of “AI has good outcomes” instead of the smaller problem of “AI isn’t out to get us”, because accidental deaths are deaths too! But I do think it makes sense to focus on that capability problem separately, at least sometimes.]
I obviously do not think this is at all competitive, and I also wanted to ignore the “other people steal your code” case. I am confused what you think I was trying to do with that intuition pump.
I guess I said “powerful oversight would solve alignment” which could be construed to mean that powerful oversight ⇒ great future, in which case I’d change it to “powerful oversight would deal with the particular technical problems that we call outer and inner alignment”, but was it really so non-obvious that I was talking about the technical problems?
Maybe your point is that there are lots of things required for a good future, just as a car needs both steering and an engine, and so the intuition pump is not interesting because it doesn’t talk about all the things needed for a good future? If so, I totally agree that it does not in fact include all the things needed for a good future, and it was not meant to be saying that.
where the initial information we discover about the plans from oversight is so convincing that we decide to go ahead with the AI before discovering the gotchas.
This just doesn’t seem plausible to me. Where did the information come from? Did the AI system optimize the information to be convincing? If yes, why didn’t we notice that the AI system was doing that? Can we solve this by ensuring that we do due diligence, even if it doesn’t seem necessary?
I am confused what you think I was trying to do with that intuition pump.
I think I’m confused about the intuition pump too! Like, here are some options I thought up:
The ‘alignment problem’ is really the ‘not enough oversight’ problem. [But then if we solve the ‘enough oversight’ problem, we still have to solve the ‘what we want’ problem, the ‘coordination’ problem, the ‘construct competitively’ problem, etc.]
Bits of the alignment problem can be traded off against each other, most obviously coordination and ‘alignment tax’ (i.e. the additional amount of work you need to do to make a system aligned; roughly the opposite of ‘competitiveness’, a term I didn’t want to use here for ease-of-understanding-by-newbies reasons). [But it’s basically just coordination and competitiveness; like, you could imagine that oversight gives you a rejection sampling story for trading off time and understanding, but I think this is basically not true because you’re also optimizing for finding holes in your transparency regime.]
Like, by analogy, I could imagine someone who uses an intuition pump of “if you had sufficient money, you could solve any problem”, but I wouldn’t use that intuition pump because I don’t believe it. [Sure, ‘by definition’ if the amount of money doesn’t solve the problem, it’s not sufficient. But why are we implicitly positing that there exists a sufficient amount of money instead of thinking about what money cannot buy?]
(After reading the rest of your comment, it seems pretty clear to me that you mean the first bullet, as you say here:)
in which case I’d change it to “powerful oversight would deal with the particular technical problems that we call outer and inner alignment”, but was it really so non-obvious that I was talking about the technical problems
I both 1) didn’t think it was obvious (sorry if I’m being slow on following the change in usage of ‘alignment’ here) and 2) don’t think realistically powerful oversight solves either of those two on its own (outer alignment because of the “rejection sampling can get you siren worlds” problem, inner alignment because of the “rejection sampling isn’t competitive” problem, but I find that one not very compelling and suspect I’ll eventually develop a better objection).
[EDIT: I note that I also might be making another unfavorable assumption here, where I’m assuming “unlimited oversight capacity” is something like “perfect transparency”, and so we might not choose to spend all of our oversight capacity; but you might be including things here like “actually it takes no time to understand what the model is doing” or “the oversight capacity is of humans too,” which I think weakens the outer alignment objection pretty substantially.]
If so, I totally agree that it does not in fact include all the things needed for a good future, and it was not meant to be saying that.
Cool! I’m glad we agree on that, and will try to do more of “did you mean limited statement X that we agree about more?” in the future.
This just doesn’t seem plausible to me. Where did the information come from? Did the AI system optimize the information to be convincing? If yes, why didn’t we notice that the AI system was doing that? Can we solve this by ensuring that we do due diligence, even if it doesn’t seem necessary?
It came from where we decided to look. While I think it’s possible to have an AI that’s out to deceive us, by putting information we want to see where we’re going to look and information we don’t want to see where we’re not going to look, I think this failure happens by default, even without deception, because the human operators will have a smaller checklist than they should: “Will the AI cure cancer? Yes? Cool, press the button.” instead of “Will the AI cure cancer? Yes? Cool. Will it preserve our ability to generate more AIs in the future to solve additional problems? No? Hmm, let’s take a look at that.”
Like, this is the sort of normal software development story: bugs that cause the system to visibly not work get noticed and fixed, while bugs that cause the system to do things the programmers don’t intend only get noticed if the programmers anticipated them and wrote a test for them, if a user discovered them in action and reported them to the programmers, or if an adversary discovered they were possible (by reading the code / experimenting with the system) and deliberately caused them to happen.
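(A toy illustration of that software story; the function and the numbers are hypothetical, not drawn from any real system.)

```python
def apply_discount(price, percent):
    """Intended behavior: return the discounted price, never below zero."""
    return price * (1 - percent / 100)

# A bug that made the system visibly not work would be caught immediately:
print(apply_discount(100, 20))   # 80.0 -- looks right, ship it.

# A bug that makes the system do something unintended stays silent:
print(apply_discount(100, 150))  # -50.0 -- a negative price, noticed only if someone
                                 # anticipated it and wrote a test, a user hit it and
                                 # reported it, or an adversary went looking for it.
```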
I mean, maybe we should just drop this point about the intuition pump, it was a throwaway reference in the original comment. I normally use it to argue against a specific mentality I sometimes see in people, and I guess it doesn’t make sense outside of that context.
(The mentality is “it doesn’t matter what oversight process you use, there’s always a malicious superintelligence that can game it, therefore everyone dies”.)