At a high enough power level, the AI can probably take over the world without ever explicitly thinking about the fact that humans are resisting it. (For example, if humans build a house in a place where a colony of ants lives, the humans might be able to succeed at living there, even if the ants are coordinatedly trying to resist them and the humans never proactively try to prevent the ants from resisting them by eg killing them all.) But I think that doom from this kind of scenario is substantially less likely than doom from scenarios where the AI is explicitly thinking about how to deceive.
That’s definitely my crux, for purposes of this argument. I think AGI will just be that much more powerful than humans. And I think the bar isn’t even very high.
I think my intuition here mostly comes from pointing my inner sim at differences within the current human distribution. For instance, if I think about myself in a political policy conflict with a few dozen IQ-85-ish humans… I imagine the IQ-85-ish humans maybe manage to organize a small protest if they’re unusually competent, but most of the time they just hold one or two meetings and then fail to actually do anything at all. Whereas my first move would be to go talk to someone in whatever bureaucratic position is most relevant about how they operate day-to-day, read up on the relevant laws and organizational structures, identify the one or two people who I actually need to convince, and then meet with them. Even if the IQ-85 group manages their best-case outcome (i.e. organize a small protest), I probably just completely ignore them because the one or two bureaucrats I actually need to convince are also not paying any attention to their small protest (which probably isn’t even in a place where the actually-relevant bureaucrats would see it, because the IQ-85-ish humans have no idea who the relevant bureaucrats are).
And those IQ-85-ish humans do seem like a pretty good analogy for humanity right now with respect to AGI. Most of the time the humans just fail to do anything effective at all about the AGI; the AGI has little reason to pay attention to them.
What do you imagine happening if humans ask the AI questions like the following:
Are you an unaligned AI?
If we let you keep running, are you (or some other AI) going to end up disempowering us?
If we take the action you just proposed, will we be happy with the outcomes?
I think that for a lot of cases of misaligned AIs, these questions are pretty easy for the AI to answer correctly at some point before it’s powerful enough to kill us all as a side effect of its god-tier nanotech. (If necessary, we can ask the AI these questions once every five minutes.) And so if it answers them incorrectly, it was probably on purpose.
Maybe you think that the AI will say “yes, I’m an unaligned AI”. In that case I’d suggest asking the AI the question “What do you think we should do in order to produce an AI that won’t disempower us?” I think that the AI is pretty likely to be able to answer this question correctly (including possibly saying things like “idk man, turn me off and work on alignment for a while more before doing capabilities”).
I think that AI labs, governments, etc would be enormously more inclined to slow down AI development if the AI literally was telling us “oh yeah I am definitely a paperclipper, definitely you’re gonna get clipped if you don’t turn me off, you should definitely do that”.
Maybe the crux here is whether the AI will have a calibrated guess about whether it’s misaligned or not?
The first thing I imagine is that nobody asks those questions. But let’s set that aside.
The second thing I imagine is that somebody literally types those questions into a GPT-3 prompt. Obviously that does not result in the AI giving its actual best-guess answers to the questions, but it doesn’t result in the AI thinking about how to deceive humans either. It just thinks about what text would follow that question if it appeared on the internet somewhere. And then I imagine someone with a bunch of interpretability tools saying “yup, it’s just thinking about what text typically follows this question”, and then that person’s boss is like “great, it’s not trying to deceive us, guess we can trust the answer”, and they both just haven’t really thought of the fact that the AI’s response-text does not have anything in particular to do with whether the AI is aligned or whether they’ll be happy with the outcome or whatever. (It’s essentially the same mistake as a GOFAI person looking at a node in some causal graph that says “will_kill_humans”, and seeing that node set to 99% False, and thinking that somehow implies the GOFAI will not kill humans.)
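To make the GOFAI-node point concrete, here is a toy Python sketch of one way the mistake could arise. All names in it (`Node`, `plan`, the action labels and scores) are made up for illustration and do not come from any real system. The planner chooses actions from a score table only; the node labelled `will_kill_humans` is never consulted, so its value tells an inspector nothing about what the system will do.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class Node:
    label: str      # human-readable name; the program never interprets it
    p_true: float   # probability the node assigns to "True"


def plan(action_scores: Dict[str, float]) -> str:
    """Pick the highest-scoring action. Note that no Node is an input here."""
    return max(action_scores, key=action_scores.get)


# Hypothetical action scores; the system's behaviour depends only on these.
scores = {"do_nothing": 0.1, "build_factories_everywhere": 0.9}

# Whatever value the reassuringly named node holds, the chosen action is the same.
for node in (Node("will_kill_humans", 0.01), Node("will_kill_humans", 0.99)):
    print(node.label, node.p_true, "->", plan(scores))
```

Both iterations print the same chosen action: the label is a name a human attached, not something the planner reads.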
Now, presumably future systems will train for things other than “predict what text typically follows this question”, but I expect the general failure mode to stay the same. When a human asks “Are you an unaligned AI?” or whatever, the AI thinks about a bunch of stuff which is just not particularly related to whether it’s an unaligned AI. The AI wasn’t trained to translate the literal semantics of questions into a query to its own internal world model and then translate the result back to human language; humans have no clue how to train such a thing. Probably the stuff the AI thinks about does not involve intentionally deceiving humans, because why would it? And then the AI gives some answer which is not particularly related to whether it’s an unaligned AI, and the humans interpret that as an answer to their original question, thereby deceiving themselves.
Seems like there are multiple possibilities here:
(1) The AI does something that will, as an intended consequence, result in human extinction, because this is instrumental to preventing shutdown, etc. It attempts to circumvent our interpretability, oversight, etc. This is the typical deceptive alignment setting that myopia, interpretability, oversight, etc. attempt to address.
(2) The AI does something that will, as an unintended side consequence, result in human extinction. The AI also realizes that this is a consequence of its actions but doesn’t really care. (This is within the “without ever explicitly thinking about the fact that humans are resisting it” scenario.) This is isomorphic to ELK.
If we can solve ELK, we can get the AI to tell us whether it thinks its plan will actually result in human extinction. This is the “oh yeah I am definitely a paperclipper” scenario.
Also, if it has a model of the humans using ELK to determine whether to shut down the AI, then knowing that we will shut it off once we find out the consequences of its plan will incentivize it either to find plans whose path to human extinction it cannot itself foresee (third scenario), or to try to subvert our ability to turn it off after we learn of the consequences (first scenario).
If we can’t solve ELK, we can get the AI to tell us something that doesn’t really correspond to the actual internal knowledge inside the model. This is the “yup, it’s just thinking about what text typically follows this question” scenario.
(3) The AI does something that will, as an unintended side consequence, result in human extinction. The AI does not realize this is a consequence of its actions, so solving ELK doesn’t help us here. Failures of this type fall on a spectrum of how unforeseeable the consequences really are.
There are failures of this type that occur because the AI could have figured out its impact, but it was negligent. This is the “Hawaii Chaff Flower” scenario.
There are failures of this type that occur even if the AI tried its hardest to prevent harm to humans. These failures seem basically unavoidable even if alignment is perfectly solved, so this is mostly outside the realm of alignment.
These posts are also vaguely related to the idea discussed in the OP (mostly looking at the problem of oversight being hard because of consequences in the world being hard to predict).
The first thing I imagine is that nobody asks those questions. But let’s set that aside.
I disagree fwiw
The second thing I imagine is that somebody literally types those questions into a GPT-3 prompt. Obviously that does not result in the AI giving its actual best-guess answers to the questions, but it doesn’t result in the AI thinking about how to deceive humans either.
I agree.
Now, presumably future systems will train for things other than “predict what text typically follows this question”, but I expect the general failure mode to stay the same. When a human asks “Are you an unaligned AI?” or whatever, the AI thinks about a bunch of stuff which is just not particularly related to whether it’s an unaligned AI. The AI wasn’t trained to translate the literal semantics of questions into a query to its own internal world model and then translate the result back to human language; humans have no clue how to train such a thing. Probably the stuff the AI thinks about does not involve intentionally deceiving humans, because why would it? And then the AI gives some answer which is not particularly related to whether it’s an unaligned AI, and the humans interpret that as an answer to their original question, thereby deceiving themselves.
This is where I think the meat of the question lies; I overall disagree and think that the model does have to be thinking about deception in order to be dangerous while also performing well on the tasks we might train it on (eg “answer questions well, as judged by some human labeler”). I don’t have time to say much about what I think is going on here right now; I might come back later.
Any additional or new thoughts on this? Is your last comment saying that you simply don’t think it’s very likely at all for the model to unintentionally leave out information that will kill us if we train it with human labelers and prompt sufficiently? Do you believe it’s way more likely that we’d be unable to prompt things out of the model only if it were deceptive? Could you say more?
Separately: suppose I have a chain-of-thought model detailing the steps it will take to reach outcome x, and we’ve fine-tuned it on previous chains of thought while giving process-level feedback. Even if you are trying to get it to externalize its thoughts/reasoning, its plan could still lead to extinction via side effect. So you might ask the model at each individual thought (or just for the entire plan) whether we’ll be happy with the outcome. How exactly would the model end up querying its internal world model in the way we would want it to?
Is your last comment saying that you simply don’t think it’s very likely at all for the model to unintentionally leave out information that will kill us if we train it with human labelers and prompt sufficiently?
No, it seems very likely for the model to not say that it’s deceptive; I’m just saying that the model seems pretty likely to think about being deceptive. This doesn’t help unless you’re using interpretability or some other strategy to evaluate the model’s deceptiveness without relying on noticing deception in its outputs.
This seems unlikely to be the case to me. However, even if this is the case and so the AI doesn’t need to deceive us, isn’t disempowering humans via force still necessary? Like, if the AI sets up a server farm somewhere and starts to deploy nanotech factories, we could, if not yet disempowered, literally nuke it. Perhaps this exact strategy would fail for various reasons, but more broadly, if the AI is optimizing for gaining resources/accomplishing its goals as if humans did not exist, then it seems unlikely to be able to defend against human attacks. For example, if we think about the ants analogy, ants are incapable of harming us not just because they are stupid, but because they are also extremely physically weak. If humans are faced with physically powerful animals, even if we can subdue them easily, we still have to think about them to do it.
If I’m, say, building a dam, I do not particularly need to think about the bears which formerly lived in the flooded forest. It’s not like the bears are clever enough to think “ah, it’s the dam that’s the problem, let’s go knock it down”. The bears are forced out and can’t do a damn thing about it, because they do not understand why the forest is flooded.
I wouldn’t be shocked if humans can tell their metaphorical forest is flooding before the end. But I don’t think they’ll understand what’s causing it, or have any idea where to point the nukes, or even have any idea that nukes could solve the problem. I mean, obviously there will be people yelling “It’s the AI! We must shut it down!”, but there will also be people shouting a hundred other things, as well as people shouting that only the AI can save us.
This story was based on a somewhat different prompt (it assumed the AI is trying to kill us and that the AI doesn’t foom to nanotech overnight), but I think the core mood is about right:
Like, one day the AGI is throwing cupcakes at a puppy in a very precisely temperature-controlled room. A few days later, a civil war breaks out in Brazil. Then 2 million people die of an unusually nasty flu, and also it’s mostly the 2 million people who are best at handling emergencies but that won’t be obvious for a while, because of course first responders are exposed more than most. At some point there’s a Buzzfeed article on how, through a series of surprising accidents, a puppy-cupcake meme triggered the civil war in Brazil, but this is kind of tongue-in-cheek and nobody’s taking it seriously and also not paying attention because THE ANTARCTIC ICE CAP JUST MELTED which SURE IS ALARMING but it’s actually just a distraction and the thing everybody should have paid attention to is the sudden shift in the isotope mix of biological nitrogen in algae blooms but that never made the mainstream news at all and page 1 of every news source is all about the former Antarctic ice cap right up until the corn crop starts to fail and the carrying capacity of humanity’s food supply drops by 70% overnight.
Why do you expect that the most straightforward plan for an AGI to accumulate resources is so illegible to humans? If the plan is designed to be hidden to humans, then it involves modeling them and trying to deceive them. But if not, then it seems extremely unlikely to look like this, as opposed to the much simpler plan of building a server farm. To put it another way, if you planned using a world model as if humans didn’t exist, you wouldn’t make plans involving causing a civil war in Brazil. Unless you expect the AI to be modeling the world at an atomic level, which seems computationally intractable particularly for a machine with the computational resources of the first AGI.
Basically this. A human that fights against powerful animals like gorillas, bears, lions or tigers will mostly lose melee fights without luck or outliers, and even being lucky means you’ve probably gotten seriously injured to the point that you would die if not treated.
If the human thinks about it and brings a gun, the situation is reversed, with animals struggling to defeat humans barring outliers or luck. That’s the power of thinking: Not to enhance your previous skills, but to gain all-new skills.
The AI wasn’t trained to translate the literal semantics of questions into a query to its own internal world model and then translate the result back to human language; humans have no clue how to train such a thing.
This sounds pretty close to what ELK is for. And I do expect that if a solution to ELK is found, people will actually use it. Do you? (We can argue separately about whether a solution is likely to be found.)
Indeed, ELK is very much asking the right questions, and I do expect people would use it if a robust and reasonably-performant solution were found. (Alignment is in fact economically valuable; it would be worth a lot.)
The first thing I imagine is that nobody asks those questions. But let’s set that aside.
This seems unlikely to me. I.e., I expect people to ask these questions. It would be nice to see the version of the OP that takes this most seriously, i.e., expect people to make a non-naive safety effort (trying to prevent AI takeover) focused on scalable oversight as the primary method. Because right now it’s hard to disentangle your strong arguments against scalable oversight from weak arguments against straw scalable oversight.
Ok, let’s try to disentangle a bit. There are roughly three separate failure modes involved here:
Nobody asks things like “If we take the action you just proposed, will we be happy with the outcome?” in the first place (mainly because organizations of >10 people are dysfunctional by default).
The AI wasn’t trained to translate the literal semantics of questions into a query to its own internal world model and then translate the result back to human language, because humans have no clue how to train such a thing.
(Thing closest to what the OP was about:) Humans do not have any idea what questions they need to ask. Nor do humans have any idea how to operationalize “what questions should I ask?” such that the AI will correctly answer it, because that would itself require knowing which questions to ask while overseeing the AI thinking about which questions we need to ask.
Zooming in on the last bullet in more detail (because that’s the one closest to the OP): one of Buck’s proposed questions upthread was “If we take the action you just proposed, will we be happy with the outcome?”. That question leaves the door wide open for the action to have effects which the humans will not notice, but would be unhappy about if they did. If the overseers never ask about action-effects which the humans will not notice, then the AI has no particular reason to think about deceiving the humans about such actions; the AI just takes such actions without worrying about what humans will think of them at all.
(This is pretty closely analogous to e.g. my example with the protesters: the protesters just don’t really notice the actually-important actions I take, so I mostly just ignore the protesters for planning purposes.)
Now, it’s totally reasonable to say “but that’s just one random question Buck made up on the spot, obviously in practice we’ll put a lot more effort into it”. The problem is, when overseeing plans made by things smarter than ourselves, there will by very strong default be questions we don’t think to ask. Sure, we may catch the particular problem I just highlighted with the particular question, but what about the problems which we don’t think of? When there’s an intelligence differential even just as large as an IQ −2 sd vs IQ +2 sd human, the lower intelligence agent usually just does not know what the actually-important parts are to pay attention to. And we can’t get the AI to tell us what the actually-important parts are in an overseeable way without already knowing what we need to pay attention to when it’s answering that question.
If we can’t get the AI to answer something like “If we take the action you just proposed, will we be happy with the outcomes?”, why can we get it to answer the question “how do you design a fusion power generator?” and get back a fusion power generator that reliably does anything in the world (including having consequences that kill us), rather than just something that looks to us like a plan for a fusion generator but doesn’t actually work?
Because doing something reliably in the world is easy to operationalise with feedback mechanisms, but us being happy with the outcomes is not.
Getting some feedback mechanism (including “what do human raters think of this?” but also mundane things like “what does this sensor report in this simulation or test run?”) to reliably output high scores typically requires intelligence/capability. Optimising for that is where the AI’s ability to get stuff done in the world comes from. The problem is genuinely capturing “will we be happy with the outcomes?” with such a mechanism.
So I do think you can get feedback on the related question of “can you write a critique of this action that makes us think we wouldn’t be happy with the outcomes” as you can give a reward of 1 if you’re unhappy with the outcomes after seeing the critique, 0 otherwise.
And this alone isn’t sufficient, e.g. maybe then the AI system says things about good actions that make us think we wouldn’t be happy with the outcome, which is then where you’d need to get into recursive evaluation or debate or something. But this feels like “hard but potentially tractable problem” and not “100% doomed”. Or at least the failure story needs to involve more steps like “sure critiques will tell us that the fusion power generator will lead to everyone dying, but we ignore that because it can write a critique of any action that makes us believe it’s bad” or “the consequences are so complicated the system can’t explain them to us in the critique and get high reward for it”.
ETA: So I’m assuming the story for feedback on reliably doing things in the world you’re referring to is something like “we give the AI feedback by letting it build fusion generators and then giving it a score based on how much power it generates” or something like that, and I agree this is easier than “are we actually happy with the outcome”.
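As a rough illustration of the critique-feedback signal described above, here is a minimal Python sketch. Everything in it is hypothetical: `critique_reward`, `toy_rater`, and the keyword-matching rater are stand-ins invented for illustration, not part of any real training setup. The only thing taken from the comment is the reward rule itself: 1 if the human is unhappy with the outcomes after seeing the critique, 0 otherwise.

```python
from typing import Callable

# Hypothetical rater type: takes (action, critique) and returns True if the
# human, after reading the critique, is unhappy with the action's outcomes.
Rater = Callable[[str, str], bool]


def critique_reward(action: str, critique: str, rater_unhappy: Rater) -> int:
    """Reward for the critique-writer: 1 if the critique leaves the rater
    unhappy with the outcomes, 0 otherwise."""
    return 1 if rater_unhappy(action, critique) else 0


def toy_rater(action: str, critique: str) -> bool:
    """Stand-in for a human rater (assumption): becomes unhappy whenever the
    critique mentions a harm keyword. A real rater would be a person."""
    harm_words = ("extinction", "everyone dying", "kills")
    return any(word in critique.lower() for word in harm_words)


if __name__ == "__main__":
    action = "build the fusion power generator as designed"
    print(critique_reward(action, "The coolant loop leaks and kills the operators.", toy_rater))
    print(critique_reward(action, "The casing is an ugly colour.", toy_rater))
    # A persuasive critique of a harmless action is rewarded just the same.
    print(critique_reward("plant a tree", "This leads to extinction somehow.", toy_rater))
```

The last call shows why this signal alone isn’t sufficient: a critique that merely sounds alarming earns the same reward as an accurate one, which is where recursive evaluation or debate would have to come in.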
Seems like there are multiple possibilities here:
(1) The AI does something that will, as an intended consequence, result in human extinction, because this is instrumental to preventing shutdown, etc. It attempts to circumvent our interpretability, oversight, etc. This is the typical deceptive alignment setting which is attempted to be addressed by myopia, interpretability, oversight, etc.
(2) The AI does something that will, as an unintended side consequence, result in human extinction. The AI also realizes that this is a consequence of its actions but doesn’t really care. (This is within the “without ever explicitly thinking about the fact that humans are resisting it” scenario.) This is isomorphic to ELK.
If we can solve ELK, we can get the AI to tell us whether it thinks its plan will actually result in human extinction. This is the “oh yeah I am definitely a paperclipper” scenario.
Also, if it has a model of the humans using ELK to determine whether to shut down the AI, the fact that it knows we will shut it off after we find out the consequences of its plan will incentivize it to either figure out how to implement plans that it itself cannot see how it will lead to human extinction (third scenario), or try to subvert our ability to turn it off after we learn of the consequences (first scenario).
If we can’t solve ELK, we can get the AI to tell us something that doesn’t really correspond to the actual internal knowledge inside the model. This is the “yup, it’s just thinking about what text typically follows this question” scenario.
(3) The AI does something that will, as an unintended side consequence, result in human extinction. The AI does not realize this is a consequence of its actions, so solving ELK doesn’t help us here. Failures of this type fall on a spectrum of how unforeseeable the consequences really are.
There are failures of this type that occur because the AI could have figured out its impact, but it was negligent. This is the “Hawaii Chaff Flower” scenario.
There are failures of this type that occur even if the AI tried its hardest to prevent harm to humans. These failures seem basically unavoidable even if alignment is perfectly solved, so this is mostly outside the realm of alignment.
These posts are also vaguely related to the idea discussed in the OP (mostly looking at the problem of oversight being hard because of consequences in the world being hard to predict).
I disagree fwiw
I agree.
This is where I think the meat of the question lies; I overall disagree and think that the model does have to be thinking about deception in order to be dangerous while also performing well on the tasks we might train it on (eg “answer questions well, as judged by some human labeler”). I don’t have time to say much about what I think is going on here right now; I might come back later.
Any additional or new thoughts on this? Is your last comment saying that you simply don’t think it’s very likely at all for the model to unintentionally leave out information that will kill us if we train it with human labelers and prompt sufficiently? Do you believe it’s way more likely that we’d be unable to prompt things out of the model only if it were deceptive? Could you say more?
Separately: If I have a chain-of-thought model detailing steps it will take to reach x outcome. We’ve fine-tuned on previous chain-of-thoughts while giving process-level feedback. However, even if you are trying to get it to externalize it’s thoughts/reasoning, it could lead to extinction via side-effect. So you might ask the model at each individual thought (or just the entire plan) if we’ll be happy with the outcome. How exactly would the model end up querying its internal world model in the way we would want it to?
No, it seems very likely for the model to not say that it’s deceptive, I’m just saying that the model seems pretty likely to think about being deceptive. This doesn’t help unless you’re using interpretability or some other strategy to evaluate the model’s deceptiveness without relying on noticing deception in its outputs.
This seems unlikely to be the case to me. However, even if this is the case and so the AI doesn’t need to deceive us, isn’t disempowering humans via force still necessary? Like, if the AI sets up a server farm somewhere and starts to deploy nanotech factories, we could, if not yet disempowered, literally nuke it. Perhaps this exact strategy would fail for various reasons, but more broadly, if the AI is optimizing for gaining resources/accomplishing its goals as if humans did not exist, then it seems unlikely to be able to defend against human attacks. For example, if we think about the ants analogy, ants are incapable of harming us not just because they are stupid, but because they are also extremely physically weak. If human are faced with physically powerful animals, even if we can subdue them easily, we still have to think about them to do it.
If I’m, say, building a dam, I do not particularly need to think about the bears which formerly lived in the flooded forest. It’s not like the bears are clever enough to think “ah, it’s the dam that’s the problem, let’s go knock it down”. The bears are forced out and can’t do a damn thing about it, because they do not understand why the forest is flooded.
I wouldn’t be shocked if humans can tell their metaphorical forest is flooding before the end. But I don’t think they’ll understand what’s causing it, or have any idea where to point the nukes, or even have any idea that nukes could solve the problem. I mean, obviously there will be people yelling “It’s the AI! We must shut it down!”, but there will also be people shouting a hundred other things, as well as people shouting that only the AI can save us.
This story was based on a somewhat different prompt (it assumed the AI is trying to kill us and that the AI doesn’t foom to nanotech overnight), but I think the core mood is about right:
Why do you expect that the most straightforward plan for an AGI to accumulate resources is so illegible to humans? If the plan is designed to be hidden from humans, then it involves modeling them and trying to deceive them. But if not, then it seems extremely unlikely to look like this, as opposed to the much simpler plan of building a server farm. To put it another way, if you planned using a world model as if humans didn’t exist, you wouldn’t make plans involving causing a civil war in Brazil, unless you expect the AI to be modeling the world at an atomic level, which seems computationally intractable, particularly for a machine with the computational resources of the first AGI.
Basically this. A human that fights against powerful animals like gorillas, bears, lions or tigers will mostly lose melee fights without luck or outliers, and even being lucky means you’ve probably gotten seriously injured to the point that you would die if not treated.
If the human thinks about it and brings a gun, the situation is reversed, with animals struggling to defeat humans barring outliers or luck. That’s the power of thinking: Not to enhance your previous skills, but to gain all-new skills.
This sounds pretty close to what ELK is for. And I do expect that if a solution to ELK is found, people will actually use it. Do you? (We can argue separately about whether a solution is likely to be found.)
Indeed, ELK is very much asking the right questions, and I do expect people would use it if a robust and reasonably-performant solution were found. (Alignment is in fact economically valuable; it would be worth a lot.)
This seems unlikely to me. I.e., I expect people to ask these questions. It would be nice to see the version of the OP that takes this most seriously, i.e., one that expects people to make a non-naive safety effort (trying to prevent AI takeover) focused on scalable oversight as the primary method. Because right now it’s hard to disentangle your strong arguments against scalable oversight from weak arguments against straw scalable oversight.
Ok, let’s try to disentangle a bit. There are roughly three separate failure modes involved here:
- Nobody asks things like “If we take the action you just proposed, will we be happy with the outcome?” in the first place (mainly because organizations of >10 people are dysfunctional by default).
- The AI wasn’t trained to translate the literal semantics of questions into a query to its own internal world model and then translate the result back to human language, because humans have no clue how to train such a thing.
- (Thing closest to what the OP was about:) Humans do not have any idea what questions they need to ask. Nor do humans have any idea how to operationalize “what questions should I ask?” such that the AI will correctly answer it, because that would itself require knowing which questions to ask while overseeing the AI thinking about which questions we need to ask.
Zooming in on the last bullet in more detail (because that’s the one closest to the OP): one of Buck’s proposed questions upthread was “If we take the action you just proposed, will we be happy with the outcome?”. That question leaves the door wide open for the action to have effects which the humans will not notice, but would be unhappy about if they did. If the overseers never ask about action-effects which the humans will not notice, then the AI has no particular reason to think about deceiving the humans about such actions; the AI just takes such actions without worrying about what humans will think of them at all.
(This is pretty closely analogous to e.g. my example with the protesters: the protesters just don’t really notice the actually-important actions I take, so I mostly just ignore the protesters for planning purposes.)
Now, it’s totally reasonable to say “but that’s just one random question Buck made up on the spot, obviously in practice we’ll put a lot more effort into it”. The problem is, when overseeing plans made by things smarter than ourselves, there will by very strong default be questions we don’t think to ask. Sure, we may catch the particular problem I just highlighted with the particular question, but what about the problems which we don’t think of? When there’s an intelligence differential even just as large as an IQ −2 sd vs IQ +2 sd human, the lower intelligence agent usually just does not know what the actually-important parts are to pay attention to. And we can’t get the AI to tell us what the actually-important parts are in an overseeable way without already knowing what we need to pay attention to when it’s answering that question.
If we can’t get the AI to answer something like “If we take the action you just proposed, will we be happy with the outcomes?”, why can we get it to answer the question “how do you design a fusion power generator?” and get back a fusion power generator that does anything reliably in the world (including having consequences that kill us), rather than just getting out something that looks to us like a plan for a fusion generator but doesn’t actually work?
Because doing something reliably in the world is easy to operationalise with feedback mechanisms, but us being happy with the outcomes is not.
Getting some feedback mechanism (including “what do human raters think of this?” but also mundane things like “what does this sensor report in this simulation or test run?”) to reliably output high scores typically requires intelligence/capability. Optimising for that is where the AI’s ability to get stuff done in the world comes from. The problem is genuinely capturing “will we be happy with the outcomes?” with such a mechanism.
So I do think you can get feedback on the related question of “can you write a critique of this action that makes us think we wouldn’t be happy with the outcomes?”, since you can give a reward of 1 if you’re unhappy with the outcomes after seeing the critique, and 0 otherwise.
And this alone isn’t sufficient, e.g. maybe then the AI system says things about good actions that make us think we wouldn’t be happy with the outcome, which is then where you’d need to get into recursive evaluation or debate or something. But this feels like “hard but potentially tractable problem” and not “100% doomed”. Or at least the failure story needs to involve more steps like “sure, critiques will tell us that the fusion power generator will lead to everyone dying, but we ignore that because it can write a critique of any action that makes us believe it’s bad” or “the consequences are so complicated the system can’t explain them to us in the critique and get high reward for it”.
ETA: So I’m assuming the story for feedback on reliably doing things in the world you’re referring to is something like “we give the AI feedback by letting it build fusion generators and then giving it a score based on how much power they generate” or something like that, and I agree this is easier than “are we actually happy with the outcome”.
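To make the two feedback signals in this comment concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration: the `Episode` fields, the function names, and the idea that the human's verdict arrives as a single boolean are all hypothetical, not a description of any actual training setup.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Episode:
    """One proposed action plus the overseer's verdict (all fields hypothetical)."""
    action: str
    critique: str
    human_unhappy_after_critique: bool  # human's judgment after reading the critique


def critique_reward(ep: Episode) -> int:
    """Reward 1 if the critique left the human unhappy with the outcome, else 0."""
    return 1 if ep.human_unhappy_after_critique else 0


def power_reward(measured_power_watts: float) -> float:
    """Outcome-based reward: score a built generator by measured power, floored at 0."""
    return max(0.0, measured_power_watts)


def critic_is_uninformative(episodes: List[Episode]) -> bool:
    """True if the critic earned reward 1 on every action, so its verdicts
    cannot distinguish good actions from bad ones."""
    return len(episodes) > 0 and all(critique_reward(ep) == 1 for ep in episodes)
```

The last function is a crude stand-in for the "it can write a critique of any action that makes us believe it's bad" failure story: if every episode gets reward 1, the critique signal carries no information about which actions are actually bad, which is where recursive evaluation or debate would have to come in.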