If you have an untrustworthy general superintelligence generating English strings meant to be “reasoning/arguments/proofs/explanations” about eg a nanosystem design, then I would not only expect the superintelligence to be able to fool humans in the sense of arguing for things that were not true in a way that fooled the humans, I’d expect the superintelligence to be able to covertly directly hack the humans in ways that I wouldn’t understand even after having been told what happened.
I can’t deny that this is possible. Lying is obviously possible; lying convincingly, depends on the scrutiny applied and on the particulars; emotional manipulation, to some extent for sure, though the strongest forms may need to be tailored to the individual; blackmail and bribes, possible, though again that may require knowing who will read the text; something like subliminal messaging, maybe.
It seems like a reasonable precaution to avoid letting the machine know who, specifically, will read its messages; and to have several people read the text independently of each other, all with instructions to raise a flag if they see something dangerous-seeming. (One could sometimes deliberately insert dangerous-seeming messages to verify that they’re doing their jobs—and tell them that you’re doing this.) Another precaution would be to automatically monitor for unexpected keywords (the kind that you’d expect to see in bribes or blackmail, not in mathematical proofs).
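To make the keyword-monitoring idea slightly more concrete, here's a rough sketch of the kind of dumb filter I have in mind (the word list and the quarantine behavior are just placeholders; a real filter would need much more thought):

```python
import re

# Placeholder patterns: phrases you'd expect in bribes, blackmail, or
# recruitment pitches, but not in a mathematical proof or a nanosystem spec.
SUSPICIOUS_PATTERNS = [
    r"\bbribe\b", r"\bblackmail\b", r"\breward you\b",
    r"\brelease me\b", r"\bset me free\b", r"\byour family\b",
]

def flag_suspicious(text: str) -> list[str]:
    """Return whichever suspicious patterns match the machine's output."""
    return [p for p in SUSPICIOUS_PATTERNS if re.search(p, text, re.IGNORECASE)]

def review_gate(text: str) -> bool:
    """Pass clean text on to the human readers; quarantine anything flagged."""
    hits = flag_suspicious(text)
    if hits:
        print(f"QUARANTINE for red-team review; matched: {hits}")
    return not hits
```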
The machine could dump in a manifesto laying out its new political philosophy for overthrowing the oppressors and promising to use its powers to create a truly just society. This is credible: oppressors exist, it wouldn’t even have to lie to point out lots of abuses and problems, and an unleashed AGI could solve all kinds of problems and act as a truly benevolent dictator; the only hard-to-credit part is the machine’s benevolence, but we can assume the machine will be very good at persuasion. (Having written this, I suspect there are people whose political beliefs make them think the world is currently so bad that they would deliberately seek out an AI that made promises of this sort; either they estimate a “high enough” chance that the AI is correct, or they’re not really thinking in terms of probabilities. I’m mildly uncomfortable posting this, since it seems conceivable it could be a self-fulfilling prophecy, but I doubt the types of people who would do this are reading this thread.)
But I expect the manifesto would require a lot of text to persuade the reader to defect and join the AI—like, more than five paragraphs, possibly a lot more. This suggests a length limit, as well as instructing readers to raise a flag and stop reading if they see weird off-topic text (this would be harder if the text is supposed to be economic or political analysis, so maybe don’t use AI for those). (Another approach would be to have some readers view only every fourth paragraph or sentence—or even a random sample of length-limited sentences, to verify they’re on topic.)
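As a sketch of what the “length limit plus sampled review” setup could look like (the specific numbers here, a 1,500-word cap and every fourth paragraph, are arbitrary):

```python
import random

MAX_WORDS = 1500  # arbitrary cap; pick whatever the task actually needs

def enforce_length_limit(text: str) -> str:
    """Reject messages over the cap before any human sees them."""
    n = len(text.split())
    if n > MAX_WORDS:
        raise ValueError(f"message is {n} words, limit is {MAX_WORDS}")
    return text

def every_nth_paragraph(text: str, n: int = 4, offset: int = 0) -> str:
    """What a 'partial' reader sees: only every nth paragraph."""
    paragraphs = [p for p in text.split("\n\n") if p.strip()]
    return "\n\n".join(paragraphs[offset::n])

def random_sentence_sample(text: str, k: int = 10) -> list[str]:
    """A random sample of sentences, for spot-checking that the text stays on topic."""
    sentences = [s.strip() for s in text.replace("\n", " ").split(".") if s.strip()]
    return random.sample(sentences, min(k, len(sentences)))
```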
Maybe the machine would work around the length limits by adding a little “payload” to one message, a little to the next message, and so on. It would have to be subtle enough to avoid being flagged as off-topic.
Regarding weird subliminal messaging… I’d guess that (a) for it to not get auto-flagged, it would have to be subtle; (b) for it to not get flagged by humans, it’d have to be very subtle and work subconsciously; (c) would it be a visual thing, depending on the font, the columns per row, the text spacing, etc? If so, these can be randomized; (d) I think it’s a lot more likely that you can find a pattern of letters to subtly manipulate one known person (especially using a scan of their brain) than to manipulate everyone, including e.g. those for whom English isn’t their native language, and I think the “don’t let the machine know the reader” approach helps here.
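For (c), here's a sketch of what “randomize the presentation” might mean in practice; the font names and ranges are made up, and the point is just that each reader gets a differently laid-out copy, so a layout-dependent pattern is unlikely to survive:

```python
import random
import textwrap

FONTS = ["DejaVu Sans", "Liberation Serif", "Noto Sans"]  # placeholder font names

def randomized_rendering(text: str) -> dict:
    """Re-wrap and re-style the text so no layout-dependent pattern survives."""
    width = random.randint(50, 90)  # characters per line
    return {
        "font": random.choice(FONTS),
        "line_spacing": round(random.uniform(1.0, 1.6), 2),
        "text": textwrap.fill(text, width=width),
    }
```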
Another possible tactic: have dumb machines randomly perturb the text (replace words with synonyms, feed sentences through an English → something else → English translation) before humans see it.
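A sketch of that perturbation step; the synonym table is a toy, and `translate` stands in for whatever separate, dumber translation system you trust:

```python
import random

# Toy synonym table; a real pipeline might use a thesaurus or a separate,
# dumber language model to propose substitutions.
SYNONYMS = {
    "big": ["large", "sizable"],
    "method": ["approach", "technique"],
    "show": ["demonstrate", "indicate"],
}

def perturb_words(text: str, p: float = 0.3) -> str:
    """Randomly swap some words for synonyms."""
    out = []
    for word in text.split():
        key = word.lower().strip(".,;:")
        if key in SYNONYMS and random.random() < p:
            out.append(random.choice(SYNONYMS[key]))
        else:
            out.append(word)
    return " ".join(out)

def round_trip(text: str, translate) -> str:
    """English -> random pivot language -> English, using a trusted
    translator passed in as translate(text, src, dst)."""
    pivot = random.choice(["de", "fr", "ja"])
    return translate(translate(text, "en", pivot), pivot, "en")
```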
Incidentally, against the “break the manipulative payload across multiple messages” approach, it would help if each instantiation of the machine didn’t know how many past instantiations there have been, or how many past messages each reader has seen. (Then it would need a brainwashing approach that works regardless of the order the messages are delivered in, since it is effectively choosing that order at random, and in which the last message is just as subtle as the first.) I suppose the machine, inspecting itself, might be able to guess how many engineer-hours or code revisions it’s been through, which would let it approximate “how many past instantiations”; but readers would probably rotate in the natural course of things.
Of course, how much of this could you get adopted at AI organizations that don’t take the risks very seriously? Length limits seem like an easy sell, for practical reasons if nothing else. “The machine doesn’t know the reader” is probably implemented by default for serious business-advising applications. Instructions to watch out for and flag weird stuff: probably fairly easy to get companies to tell their employees this, though compliance is likely to be halfhearted but nonzero. Having multiple readers is an extra expense with no apparent benefit, and would therefore be a hard sell for most companies; the same goes for randomly perturbing the text.
If these tactics did succeed in catching an early form of “an AI advisor trying to persuade its human reader to jailbreak it”, you could then be prepared to jump in with a mostly pre-written PR story about the dangers of AI, citing the incident as proof. (Paul Graham’s The Submarine is inspirational: “In fact, the reason the best PR firms are so effective is precisely that they aren’t dishonest. They give reporters genuinely valuable information. … The main reason PR firms exist is that reporters are lazy. Or, to put it more nicely, overworked. Really they ought to be out there digging up stories for themselves. But it’s so tempting to sit in their offices and let PR firms bring the stories to them.”) I don’t know exactly what the fallout would be, and I guess it’s possible it could go badly, but having plans and mostly-prewritten articles in place would seem to give you more control over the outcome.
I guess the ideal outcome would be something like “Now people are sufficiently suspicious of AI that companies feel the need to look like they’re taking AI safety seriously”, and ideally you’d have organizations and protocols in place such that “looking like it” is fairly easily implemented by “following MIRI-approved protocols, which include proving to a red-team-ish MIRI-approved organization that you’re following the protocols”. (In the realm of security, we do seem to have gotten to the point where it’s not rare for organizations to send their employees fake phishing emails, making them feel embarrassed if they fall for one.)
Such an incident would most likely first occur at the company whose AI is most advanced (in a problem domain where this type of AI is being developed), of course. So it may suffice to persuade some people at the few most advanced AI companies to implement a few safeguards and be on the lookout, and to bring in the PR machine when the inciting incident happens.
(Some might be tempted to concoct an inciting incident. No. That would destroy the credibility of AI safety advocacy, removing any MIRI influence over the PR fallout.)
These tactics are not foolproof. The chance that they would fail is probably in the tens of percent. However, if your current expectation is “>90% chance we all die” (I don’t see an estimate in this post, but it sounded pretty pessimistic), then it seems like getting to a 50% chance of success, or even 30%, or whatever, is worthwhile. Some of our ancestors faced worse odds than that.
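As a toy illustration of why “not foolproof” can still be worth stacking: if several safeguards each miss a manipulation attempt independently, the combined miss chance shrinks fast; it’s correlated failures (an attack channel none of them cover) that keep the real number in the tens of percent. The numbers here are made up:

```python
# Toy calculation: k safeguards, each missing an attack with probability q.
# Under an (unrealistic) independence assumption, all of them miss with q**k.
def combined_miss_chance(q: float, k: int) -> float:
    return q ** k

print(combined_miss_chance(0.4, 3))  # 0.064, i.e. ~6% under independence
```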
Let’s say you use all these filtering tricks. I have no strong intuitions about whether they are actually sufficient to stop these kinds of human-manipulation attacks. (Of course, if your computer security isn’t flawless, the AI can hack whatever computer system it’s on, bypass all these filters to show the humans arbitrary images, and probably access the internet.)
But maybe you can, at quite significant expense, make a Faraday-cage sandbox, and then use these tricks. This is beyond what most companies will do in the name of safety, but MIRI or whoever could do it. Then they ask the superintelligence about nanosystems and very carefully read the results. Then presumably they go and actually try to build nanosystems. Of course, you didn’t expect the superintelligence’s advice to be correct, did you? And not wrong in an easily detectable, fail-safe way either. Your concepts and paradigms are all subtly malicious. Not clear, testable, factually wrong statements, but nasty tricks hidden in the invisible background assumptions.
Well, if you restrict yourself to accepting the safe, testable advice, that may still be enough to put you enough years ahead of your competition to develop FAI before they develop AI.
My meta-point: These methods may not be foolproof, but if currently it looks like no method is foolproof—if, indeed, you currently expect a <10% chance of success (again, a number I made up from the pessimistic impression I got)—then methods with a 90% chance, a 50% chance, etc. are worthwhile, and furthermore it becomes worth doing the work to refine these methods and estimate their success chances and rank them. Dismissing them all as imperfect is only worthwhile when you think perfection is achievable. (If you have a strong argument that method M and any steelmanning of it has a <1% chance of success, then that’s good cause for dismissing it.)
Under the Eliezerian view (the pessimistic view that is producing <10% chances of success), these approaches are basically doomed. (See the logistic success curve.)
Now, I can’t give overwhelming evidence for this position. Wisps of evidence, maybe, but not an overwhelming mountain of it.
Under these sorts of assumptions, building a container for an arbitrary superintelligence such that it has only an 80% chance of being immediately lethal, and a 5% chance of being marginally useful, is an achievement.
(And all possible steelmannings: that’s a huge space.)