(Apologies for the late reply. I’ve been generally distracted by trying to take advantage of perhaps fleeting opportunities in the equities markets, and occasionally by my own mistakes while trying to do that.)
It seems like the AI described in this story is still aligned enough to defend against AI-powered persuasion (i.e. by the time that AI is sophisticated enough to cause that kind of trouble, most people are not ever coming into contact with adversarial content)
How are people going to avoid contact with adversarial content, aside from “go into an info bubble with trusted AIs and humans and block off any communications from the outside”? (If that is happening a lot, it seems worthwhile to say so explicitly in the story since that might be surprising/unexpected to a lot of readers?)
I think they do, but it’s not clear whether any of them change the main dynamic described in the post.
Ok, in that case I think it would be useful to say a few words in the OP about why in this story, they don’t have the desired effect, like, what happened when the safety researchers tried this?
I’d like to have a human society that is free to grow up in a way that looks good to humans, and which retains enough control to do whatever they decide is right down the line (while remaining safe and gradually expanding the resources available to them for continued growth). When push comes to shove I expect most people to strongly prefer that kind of hope (vs one that builds a kind of AI that will reach the right conclusions about everything), not on the basis of sophisticated explicit reasoning but because that’s the only path that can really grow out of the current trajectory in a way that’s not locally super objectionable to lots of people, and so I’m focusing on people’s attempts and failures to construct such an AI.
I can empathize with this motivation, but argue that “a kind of AI that will reach the right conclusions about everything” isn’t necessarily incompatible with “humans retain enough control to do whatever they decide is right down the line” since such an AI could allow humans to retain control (and merely act as an assistant/advisor, for example) instead of forcibly imposing its decisions on everyone.
I don’t know exactly what kind of failure you are imagining is locked in, that pre-empts or avoids the kind of failure described here.
For example, all or most humans lose their abilities for doing philosophical reasoning that will eventually converge to philosophical truths, because they go crazy from AI-powered memetic warfare, or come under undue influence of AI advisors who lack such abilities themselves but are extremely convincing. Or humans lock in what they currently think are their values/philosophies in some form (e.g., as utility functions in AI, or asking their AIs to help protect the humans themselves from value drift while unable to effectively differentiate between “drift” and “philosophical progress”) to try to protect them from a highly volatile and unpredictable world.
How are people going to avoid contact with adversarial content, aside from “go into an info bubble with trusted AIs and humans and block off any communications from the outside”? (If that is happening a lot, it seems worthwhile to say so explicitly in the story since that might be surprising/unexpected to a lot of readers?)
I don’t have a short answer, nor do I think this kind of question has a short answer. I don’t know what an “info bubble” is and the world I’m imagining may fit your definition of that term (but the quoted description makes it sound like I might be disagreeing with some background assumptions). Here are some of the things that I imagine happening:
I don’t pay attention to random messages from strangers. The volume of communication is much larger than in our world and so e.g. I’m not going to be able to post an email address and then spend time looking at every message that is sent to that address (we are already roughly at this point today). The value of human attention is naturally much higher in this world.
Attackers (usually) can’t force information in front of my face—the situation is basically the same as with bullets. (I don’t know if you would say that modern humans are in a physical bubble. It is the case that in this future, and to a significant extent in the modern world, humans just don’t interact with totally unregulated physical systems. Every time I interact with a physical system I want to have some kind of legal liability or deposit for behavior of that system, and some kind of monitoring that gives me confidence that if something bad happens they will in fact be held liable. Today this usually falls to the state that governs the physical space I’m in, or sometimes private security.)
I don’t think there is likely to be any absolute notion of trust and I don’t currently think it would be necessary to make things OK. The big fork in the road to me is whether you trust humans because they aren’t smart enough to implement attacks (e.g. because they have no good AI advisors). That doesn’t sound like a good idea to me and it’s not what I’m imagining.
Most of the time when I look at a message, a bunch of automated systems have looked at it first and will inform me about the intended effect of the message so that I can respond to it appropriately or decide whether to read it. This is itself super complicated. I think the first order consideration is that if a manipulator can tell P(Paul does X | Paul reads message Y) is high for some non-endorsed reason, such that sending message Y would be a good manipulative strategy, then a similarly-smart defender can also tell that probability is high and this gives a reason to be concerned about the message (or adjust for it); a toy sketch of this symmetry follows this list. The big caveat is that the defender occupies the position of the interior. The caveat to the caveat is that the defender has various structural advantages (some discussed in the next bullet points).
People who want to get me to read messages are normally operating via some combination of deposits and/or legal liability for some kinds of actions (e.g. that I would judge as harmful given reflection). So even if a defender realizes a problem only after the fact, or only with small probability for any given resource, an attacker will still burn resources with the attack.
There may be considerable variation in how careful people are (just as there is considerable variation in how secure computer systems are), though ideally we’ll try to make it as easy as possible for people to have good hygiene. People who are most easily manipulated in some sense “drop out” of the future-influencing game during this transitional period (and I expect will often come back in later via e.g. states that continue to care about their welfare).
My current view is that this problem may be too hard for civilization to cope with, but it doesn’t seem particularly hard in principle. It feels pretty analogous to the cybersecurity situation.
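To make the symmetry claim in the filtering bullet concrete, here is a toy sketch (not a real design: the functions, the lookup-table “model,” and the numbers are all hypothetical). The point is just that whatever predictor lets an attacker select the most manipulative message also lets an equally-sophisticated defender score that same message as suspicious:

```python
# Toy sketch of the attacker/defender symmetry. The "model" is a made-up
# lookup table standing in for whatever predictor of
# P(target does X | target reads message) both sides have access to.

def predicted_effect(message, model):
    return model[message]

def attacker_pick(candidates, model):
    # The attacker sends whichever candidate is predicted to be most manipulative.
    return max(candidates, key=lambda m: predicted_effect(m, model))

def defender_flags(message, model, baseline=0.10, margin=0.20):
    # A similarly-smart defender sees the same elevated probability and can
    # flag the message, caveat it, or discount the update it would cause.
    return predicted_effect(message, model) > baseline + margin

model = {"msg_a": 0.12, "msg_b": 0.55, "msg_c": 0.08}  # hypothetical candidates
chosen = attacker_pick(model, model)
print(chosen, defender_flags(chosen, model))  # msg_b True
```

The interesting failure modes are the ones where the two predictors aren’t actually on par, which is what the caveats above are about.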
Ok, in that case I think it would be useful to say a few words in the OP about why in this story, they don’t have the desired effect, like, what happened when the safety researchers tried this?
The AIs in the story are trained using methods of this kind (or, more likely, better methods that people thought of at the time).
I can empathize with this motivation, but argue that “a kind of AI that will reach the right conclusions about everything” isn’t necessarily incompatible with “humans retain enough control to do whatever they decide is right down the line” since such an AI could allow humans to retain control (and merely act as an assistant/advisor, for example) instead of forcibly imposing its decisions on everyone.
I don’t think it’s incompatible; it’s just a path which we seem likely to take and which appears to run a lower risk of locking in such wrong solutions (and which feels quite different from your description even if it goes wrong). For example, in this scenario it still seems like a protected civilization may fail to evolve in the way that it “wants,” and that instrumentally such a civilization may need to e.g. make calls about what kind of change feels like undesirable drift. But in this scenario those issues seem pretty decoupled from alignment. (To the extent that the rest of the comment is an argument for a coupling, or for pessimism on this point, I didn’t find it compelling.)
Most of the time when I look at a message, a bunch of automated systems have looked at it first and will inform me about the intended effect of the message so that I can respond to it appropriately or decide whether to read it.
This seems like the most important part so I’ll just focus on this for now. I’m having trouble seeing how this can work. Suppose that I, as an attacker, tell my AI assistant, “interact with Paul in my name (possibly over a very long period of time) so as to maximize the chances that Paul eventually ends up believing in religion/ideology/moral theory X and then starts spreading X to his friends” (while implicitly minimizing the chances of these messages/interactions being flagged by your automated systems as adversarial). How would your automation distinguish between me doing this, versus me trying to have a normal human conversation with you about various topics, including what’s moral/normative? Or if the automation isn’t trying to directly make this judgment, what is it telling you to allow you to make this judgment? Can you give a concrete example of a sentence that it might say to you, upon seeing some element of the series of messages/interactions?
Trying to imagine for myself how an automated filter might work, here’s a possible “solution” I came up with. Perhaps your AI maintains a model / probability distribution of things that an uncompromised Wei might naturally say, and flags anything outside or on the fringes of that distribution as potential evidence that I’ve been compromised by an AI-powered attack and am now trying to attack you. (I’m talking in binary terms of “compromised” and “uncompromised” for simplicity but of course it will be more complicated than that in reality.)
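To make that guess concrete, here is a minimal sketch of the kind of filter I have in mind (the scoring function and threshold are invented placeholders, not a real model):

```python
def uncompromised_wei_logprob(message, wei_model):
    # Stand-in for a model of what an uncompromised Wei would plausibly say;
    # a real system would presumably use something like a learned language model.
    return wei_model.get(message, -20.0)  # unfamiliar messages score very low

def flag_possible_attack(message, wei_model, threshold=-10.0):
    # Messages on the fringes of the "uncompromised Wei" distribution get
    # flagged as possible evidence of an AI-powered attack.
    return uncompromised_wei_logprob(message, wei_model) < threshold

wei_model = {
    "here are my thoughts on decision theory": -3.0,   # typical of Wei
    "you must adopt ideology X immediately": -16.0,    # atypical of Wei
}
print(flag_possible_attack("you must adopt ideology X immediately", wei_model))  # True
```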
Is this close to what you’re thinking? (If not, apologies for going off on a tangent.) If so, given that I would “naturally” change my mind over time (i.e., based on my own thinking or talking with other uncompromised humans), it seems that your AI has to model that as well. I can imagine that in such a scenario, if I ever changed my mind in an unexpected (by the AI model) direction and wanted to talk to you about that, my own AI might say something like “If you say this to Paul, his AI will become more suspicious that you’ve been compromised by an AI-powered attack and your risk of getting blocked now or in the future increases by Y. Are you sure you still want to say this to Paul?” So at this point, collective human philosophical/moral progress would be driven more by what AI filters expect and let pass than by what physical human brains actually compute, so we’d better get those models really right, but that faces the seemingly difficult problems I mentioned at “Replicate the trajectory with ML?”, and it doesn’t seem like anyone is working on such problems.
If we fail to get such models good enough early on, that could lock in failure as it becomes impossible to meaningfully collaborate with other humans (or human-AI systems) to try to improve such models, as you can’t distinguish whether they’re genuinely trying to make better models with you, or just trying to change your models as part of an attack.
Trying to imagine for myself how an automated filter might work, here’s a possible “solution” I came up with. Perhaps your AI maintains a model / probability distribution of things that an uncompromised Wei might naturally say, and flags anything outside or on the fringes of that distribution as potential evidence that I’ve been compromised by an AI-powered attack and am now trying to attack you. (I’m talking in binary terms of “compromised” and “uncompromised” for simplicity but of course it will be more complicated than that in reality.)
This seems like the most important part so I’ll just focus on this for now
I’m not sure if it’s the most important part. If you are including filtering (and not updates about whether people are good to talk to / legal liability / etc.) then I think it’s a minority of the story. But it still seems fine to talk about (and it’s not like the other steps are easier).
Suppose that I, as an attacker, tell my AI assistant, “interact with Paul in my name (possibly over a very long period of time) so as to maximize the chances that Paul eventually ends up believing in religion/ideology/moral theory X and then starts spreading X to his friends”
Suppose your AI chooses some message M which is calculated to lead to Paul making (what Paul would or should regard as) an error. It sounds like your main question is how an AI could recognize M as problematic (i.e. such that Paul ought to expect to be worse off after reading M, such that it can either be filtered or caveated, or such that this information can be provided to reputation systems or arbiters, or so on).
My current view is that the sophistication required to recognize M as problematic is similar to the sophistication required to generate M as a manipulative action. This is clearest if the attacker just generates a lot of messages and then picks M that they think will most successfully manipulate the target—then an equally-sophisticated defender will have the same view about the likely impacts of M.
This is fuzzier if you can’t tell the difference between deliberation and manipulation. If I define idealized deliberation as an individual activity then I can talk about the extent to which M leads to deviation from idealized deliberation, but it’s probably more accurate to think of idealized deliberation as a collective activity. But as far as I can tell the basic story is still intact (and e.g. I still have the intuition that knowing how to manipulate the process is roughly the same as recognizing manipulation, just in a fuzzier form).
Can you give a concrete example of a sentence that it might say to you, upon seeing some element of the series of messages/interactions?
It’s probably helpful to get more concrete about the kind of attack you are imagining (which is presumably easier than getting concrete about defenses—both depend on future technology but defenses also depend on what the attack is).
If your attack involves convincing me of a false claim, or making a statement from which I will predictably make a false inference, then the ideal remedy would be explaining the possible error; if your attack involves threatening me, then an ideal remedy would be to help me implement my preferred policy with respect to threats. And so on.
I suspect you have none of these examples in mind, but it will be easier to talk about if we zoom in.
This is fuzzier if you can’t tell the difference between deliberation and manipulation. If I define idealized deliberation as an individual activity then I can talk about the extent to which M leads to deviation from idealized deliberation, but it’s probably more accurate to think of idealized deliberation as a collective activity.
How will your AI compute “the extent to which M leads to deviation from idealized deliberation”? (I’m particularly confused because this seems pretty close to what I guessed earlier and seems to face similar problems, but you said that’s not the kind of approach you’re imagining.)
If your attack involves convincing me of a false claim, or making a statement from which I will predictably make a false inference, then the ideal remedy would be explaining the possible error; if your attack involves threatening me, then an ideal remedy would be to help me implement my preferred policy with respect to threats. And so on.
The attack I have in mind is to imitate a normal human conversation about philosophy or about what’s normative (what one should do), but AI-optimized with a goal of convincing you to adopt a particular conclusion. This may well involve convincing you of a false claim, but one of a philosophical nature such that you and your AI can’t detect the error (unless you’ve solved the problem of metaphilosophy and know what kinds of reasoning reliably lead to true or false conclusions about philosophical problems).
I think I misunderstood what kind of attack you were talking about. I thought you were imagining humans being subject to attack while going about their ordinary business (i.e. while trying to satisfy goals other than moral reflection), but it sounds like in the recent comments you are imagining cases where humans are trying to collaboratively answer hard questions (e.g. about what’s right), some of them may sabotage the process, and none of them are able to answer the question on their own and so can’t avoid relying on untrusted data from other humans.
I don’t feel like this is going to overlap too much with the story in the OP, since it takes place over a very small amount of calendar time—we’re not trying to do lots of moral deliberation during the story itself, we’re trying to defer moral deliberation until after the singularity (by decoupling it from rapid physical/technological progress), and so the action you are wondering about would have happened after the story ended happily. There are still kinds of attacks that are important (namely those that prevent humans from surviving through to the singularity).
Similarly it seems like your description of “go in an info bubble” is not really appropriate for this kind of attack—wouldn’t it be more natural to say “tell your AI not to treat untrusted data as evidence about what is good, and try to rely on carefully chosen data for making novel moral progress”?
So in that light, I basically want to decouple your concern into two parts:
1. Will collaborative moral deliberation actually “freeze” during this scary phase, or will people e.g. keep arguing on the internet and instruct their AI that it shouldn’t protect them from potential manipulation driven by those interactions?
2. Will human communities be able to recover mutual trust after the singularity in this story?
I feel more concerned about #1. I’m not sure where you are at.
(I’m particularly confused because this seems pretty close to what I guessed earlier and seems to face similar problems, but you said that’s not the kind of approach you’re imagining.)
I was saying that I think it’s better to directly look at the effects of what is said rather than trying to model the speaker and estimate whether they are malicious (or have been compromised). I left a short comment, though, as a placeholder until writing the grandparent. Also I agree that in the case you seem to have had in mind my proposal is going to look a lot like what you wrote (see below).
How will your AI compute “the extent to which M leads to deviation from idealized deliberation”?
Here’s a simple case to start with:
My AI cares about some judgment X that I’d reach after some idealized deliberative process.
We may not be able to implement that process, and at any rate I have other preferences, so instead the AI observes the output X’ of some realistic deliberative process embedded in society.
After observing my estimate X’ the AI acts on its best guess X″ about X.
An attacker wants to influence X″, so they send me a message M designed to distort X’ (which they hope will in turn shift X″)
In this case I think it’s true and easy to derive that:
If my AI knows what the attacker knows, then updating on X’ and on the fact that the attacker sent me M can’t push X″ in any direction that’s predictable to the attacker (sketched below).
Moreover, if my reading M changes X’ in some predictable-to-the-attacker direction, then my AI knows that reading M makes X’ less informative about X.
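To spell out the first of these (a sketch, under the simplifying assumption that my AI’s conditioning includes everything the attacker’s choice of M depends on): write σ for the attacker’s strategy for choosing M, so that X″ = E[X | X’, M, σ]. Then for any σ,

$$\mathbb{E}_{\sigma}\big[X''\big] \;=\; \mathbb{E}_{\sigma}\big[\,\mathbb{E}[X \mid X',\, M,\, \sigma]\,\big] \;=\; \mathbb{E}[X]$$

by the tower property, so no choice of σ moves my AI’s best guess in a direction the attacker can predict in advance. And if reading M does predictably move X’, the same conditioning tells my AI that X’ has become a noisier signal of X (the second bullet).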
I’d guess you are on board with the simple case. Some complications in reality:
1. We don’t have any definition of the real idealized deliberative process
2. The nature of my AI’s deference/corrigibility is quite different from simply regarding my judgment as evidence about X
3. The behavior of other people also provides evidence about X, and attackers may have information about other people (that the defender lacks)
My best guess had been that you were worried about 1 or 2, but from the recent comments it sounds like you may be actually thinking about 3.
The attack I have in mind is to imitate a normal human conversation about philosophy or about what’s normative (what one should do), but AI-optimized with a goal of convincing you to adopt a particular conclusion
Filling in some details in a simple way: let’s suppose that the attacker just samples a few plausible things for a human to say, then outputs the one that leads me to make the highest estimate for X. We believe that using “natural” samples from the distribution would yield an endorsed outcome, but that if you consistently pick X-inducing samples then the errors will compound and lead to something bad.
Then in my proposal a defender would observe X-inducing samples, and could tell that they are X-inducing (since the attacker could tell that and I think we’re discussing the case where the defender has parity—if not then I think we need to return to some of the issues we set aside earlier). They would not initially know whether they are chance or manipulation. But after a few instances of this they will notice that errors tend to push in a surprisingly X-inducing direction and that the upcoming X-inducing samples are therefore particularly harmful.
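As a toy version of that last step (the numbers are invented for illustration, including the best-of-5 attacker): if a natural message nudges my estimate of X by roughly symmetric noise while the attacker keeps the most X-inducing of five natural samples, the stream of realized nudges has a clearly positive mean, which the defender can notice after a modest number of messages even though any single X-inducing message could be chance.

```python
import random
import statistics

random.seed(0)

def natural_nudge():
    # How much one "natural" sample would move my estimate of X
    # (symmetric around zero by assumption).
    return random.gauss(0.0, 1.0)

def attacked_nudge(k=5):
    # The attacker draws k plausible samples and sends the most X-inducing one.
    return max(natural_nudge() for _ in range(k))

nudges = [attacked_nudge() for _ in range(20)]  # what the defender observes
mean = statistics.mean(nudges)
stderr = statistics.stdev(nudges) / len(nudges) ** 0.5
# Natural messages would average ~0; best-of-5 averages ~1.16, which is many
# standard errors away, so the drift toward X shows up in the stream itself.
print(f"mean nudge = {mean:.2f}, z ~ {mean / stderr:.1f}")
```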
This is basically what you proposed in this comment, though I feel the defender can judge based on the object-level reason that the X-inducing outputs are actually bad rather than explicitly flagging corruption.
In the context of my argument above, a way to view the concern is that a competitive defender can tell which of two samples is more X-inducing, but can’t tell whether an output is surprisingly X-inducing or whether deliberation is just rationally X-inducing, because (unlike the attacker) they aren’t able to observe the several natural samples from which the attack was sampled.
This kind of thing seems like it can only happen when the right conclusion depends on stuff that other humans know but you and your AI do not (or where for alignment reasons you want to defer to a process that involves the other humans).