Risks from AI persuasion
A case for why persuasive AI might pose risks somewhat distinct from the normal power-seeking alignment failure scenarios.
Where I’m currently at: I feel moderately confident that powerful persuasion is useful to think about for understanding AI x-risk, but unsure whether it’s best regarded as its own threat, as a particular example of alignment difficulty, or just as a factor in how the world might change over the next decade or two. I think this doc is too focused on whether we’ll get dangerous persuasion before strategic misaligned AI, whereas the bigger risks from persuasive technology may be situations where we solve ‘alignment’ according to a narrow definition, but we still aren’t ‘philosophically competent’ enough to avoid persuasive capabilities having bad effects on our reflection procedure.
This doc is based heavily on ideas from Carl Shulman, but doesn’t necessarily represent his views. Thanks to Richard Ngo for lots of help also. Others have written great things on this topic, e.g. here.
Introduction
Persuasion and manipulation are natural, profitable, easy-to-train-for applications of hard-to-align ML models. The impacts of existing social-media-based persuasion are probably overblown, and an evolutionary argument tells us that there shouldn’t be easy ways for a human to be manipulated by an untrusted party. However, it’s plausible that pre-AGI ML progress in things like text and video generation could dramatically improve the efficacy of short-interaction persuasion. It’s also plausible that people will spend significant amounts of time interacting with AI companions and assistants, creating new avenues for effective manipulation. In the worst case, highly effective persuasion could lead to very high-fidelity transmission of ideologies, and more robust selection pressure for expansionary ideologies. This could lead to stable authoritarianism, or isolated ideological clades with poor ability to cooperate. Even in the best case, if we try to carefully ensure truthfulness, it will be hard to do this without locking in our existing biases and assumptions.
2-page summary
Feasibility
The evidence for the efficacy of existing persuasion techniques is mixed. There aren’t clear examples of easy and scalable ways to influence people. It’s not clear whether social media makes people more right-wing or left-wing—there’s evidence in both directions. Based on an evolutionary argument, we shouldn’t expect people to be easily persuaded to change their actions in important ways based on short interactions with untrusted parties.
However, existing persuasion is very bottlenecked on personalized interaction time. The impact of friends and partners on people’s views is likely much larger (although still hard to get data on). This implies that even if we don’t get superhuman persuasion, AIs influencing opinions could have a very large effect, if people spend a lot of time interacting with AIs. Some plausible avenues are romantic/sexual companions, assistants, tutors, and therapists, or personas created by some brand or group. On the other hand, the diffusion and impact of these technologies will likely take several years, meaning this is only relevant in relatively slow-takeoff scenarios.
There are many convergent incentives to develop technologies relevant to persuasion—steerable, realistic, attractive avatars seem profitable for the entertainment industry more generally. There’s plausibly a lot of demand for persuasive AI from e.g. the digital advertising industry ($100s of billions/yr), propaganda ($10s of billions/yr), and ideological groups.
It’s a very natural application of ML—language models are great at mimicking identity markers and sounding superficially plausible and wise. Marketing/ad copy/SEO, porn, and romantic companions are leading use cases for current LMs. In the future, new capabilities will unlock other important applications, but it seems likely that ML fundamentally favors these types of applications. Engagement and persuasion are tasks that can be done with a short horizon, and where it’s easy to get large volumes of feedback, making them very suited to ML optimisation.
The difficulty of training a system to persuade vs to correctly explain is a special case of the alignment problem. Even if no actor is deliberately trying to build persuasive systems, we may train AI systems on naive customer feedback signals, which will tend to create systems that tell people what they want to hear, reinforce their current beliefs, and lock in their existing misconceptions and biases.
Consequences
People generally have a desire to lock in their ideologies and impose them on others. The ideologies (e.g. religions) that emphasize this tend to grow. Currently there are many bottlenecks on the retention of ideologies and the fidelity of ideological transmission. Highly persuasive AI may eliminate many of these, leading to more reliable selection for ideologies that aggressively spread themselves. People would then have further incentives to ensure they and their children are only exposed to content that matches their ideology, due to fear of being manipulated by a different AI. In an extreme scenario, we might end up with completely isolated ideological clades, or stable authoritarianism.
In general this pattern leads to a lack of moral progress in good directions, inability to have collective moral reflection and cooperation, and general poor societal decision-making. This increases the risk of poorly handling x-risk-capable technology, or pursuing uncoordinated expansion rather than a good reflective procedure.
What we can do
Overall I think this threat is significantly smaller than more standard alignment failure scenarios (maybe 10x smaller), but comparable enough that interventions could be well worthwhile if they’re fairly tractable. The problem is also sufficiently linked with alignment failure that I expect most interventions for one to be fairly positive for the other. It seems highly likely that progress in alignment is required for protecting against manipulative systems. Further, it seems robustly beneficial to steer towards a world where AI systems are more truthful and less manipulative.
To prevent society being significantly manipulated by persuasive AI, there are various intervention points:
Prevent prevalence of the sort of AIs that might be highly persuasive (don’t build anything too competent at persuasion; don’t let people spend too much time interacting with AI)
Become capable of distinguishing between systems that manipulate and ones that usefully inform, and have society ban or add disclaimers to the manipulative systems
Build ML systems capable of scalably identifying content that is manipulative vs usefully informative, and have individuals use these systems to filter their content consumption
Give people some other tools to help them be resistant to AI persuasion—e.g. mechanisms for verifying that they’re talking to humans, or critical thinking techniques
Some specific things scaling labs could do that might be helpful include:
Set a norm of aligning models to truthfulness/neutrality/factualness/calibration (e.g. as in Evans et al) rather than to specific sets of values
Scale up WebGPT and/or other projects to build truthful systems, especially ones that allow people to filter content.
Support Ought and other customers whose products aim to help users better understand the world.
Prohibit persuasive or manipulative uses of deployed products.
Avoid finetuning models to naive customer feedback.
Note on risk comparison
How to divide the space is a bit confusing here; I’d say something like ‘the persuasion problem as distinct from the alignment problem’ is 10x smaller, but in fact there’s some overlap, so it might also be reasonable to say something like ‘¼ of alignment-failure-esque x-risk scenarios will have a significant societal-scale persuasion component’, and almost all will have some deception component (and the fact that it’s hard to train your AI to be honest with you will be a key problem).
Main document
There are two broad factors relevant to whether AI persuasion is a threat we should worry about: technological feasibility, and societal response.
Will it be technologically possible (with something like $100m of effort over ‘generic’ ML progress) to develop highly persuasive AI early enough to be relevant? To be relevant, either these capabilities need to come before we have smart power-seeking systems, or it needs to be the case that we solve alignment enough so that there are no misaligned agents, but we still aren’t ‘philosophically competent’ enough to avoid persuasive capabilities having weird effects on our reflection procedure.
If this is technologically possible sufficiently early, will effort be made to develop it, and how will society react? How much will be invested in improving the technology? Who will use it, for what ends? Will there be effective mitigations?
One thing we care about here is whether this happens significantly in advance of when AIs are ‘capable enough that how things go mostly depends on whether we succeed at alignment’. Let’s say that this is the point when AIs can make general plans involving different domains of action over timescales of months (e.g., can automate 90% of the job of a CEO), and are either superintelligent in some strategically important domain (e.g. hacking, persuasion, syn bio) or are deployed widely.
Technological feasibility
Here’s a possible operationalisation of ‘highly competent persuasion’:
Take a person from current industrialised society, and tell them they’re going to be talking to an AI with a simulated avatar, and that it can generate fake but realistic videos and images, and that it may be lying or trying to manipulate them.
They talk to a charismatic AI avatar (who can show them fake sources) for however long they want to engage.
With p~0.5, they now feel like they trust this AI more than other sources of information—right now, and for the next few days (unless they are convincingly persuaded by some other system) they would choose to get information from this AI over other sources.
Here’s a possible operationalisation of ‘moderately competent persuasion’ (companion bot):
Create an AI persona tailored to a particular individual. Allow them to freely interact with it as much as they want.
With p~0.5, after some months, they have developed an emotional bond with the AI, and want to continue interacting with it. It has a similar effect on their opinions to having a partner who’s fairly charismatic and opinionated, and the opinions conveyed are fully controllable.
Here’s another possibility for ‘moderately competent persuasion’ (assistant bot):
Create an AI assistant tailored to a particular individual. Allow them to use it as much as they want.
The AI assistant is highly competent at providing the person with correct and relevant information for their daily life where the person knows the ground truth, and generally sounds knowledgeable and wise. Due to this, with p~0.5 the person feels inclined to turn to it for advice, and expect it to be more knowledgeable/reliable than their human friends, on questions where they don’t know the ground truth. They allow it to strongly filter what information they receive (e.g. they read personalized summaries of the news generated by the assistant). They become locked in to this particular product.
Reasons to believe this will be possible in 5-10 years, and significantly before AGI:
The basic underpinning technologies (adept conversational AI, as well as the ability to create realistic, customizable, attractive avatars, and more general steerable realistic video generation) seem likely to be pretty well developed in 5 years’ time, and very hard to distinguish from the real thing in 10 years.
Many of these capabilities seem like they should be profitable for the entertainment industry, so I expect there to be high investment in these areas
It’s already the case that it’s hard for OpenAI researchers to distinguish GPT-3 written short news articles from human-written ones, and we can generate better-than-random-human-on-the-internet summaries with current models. The quality of AI conversation has improved substantially in the last 5-10 years, and I think another improvement of that size would lead to models where it’s hard to tell they’re not human unless you’re deliberately probing for that
This task is well-suited to current ML methods, and may not require much progress on harder parts of AI
It’s easy to obtain a training signal for persuasion by A/B testing ads/short interactions, or by paying human labellers to engage with the system then report their opinions. With $100m you could pay people $20/hr to chat with your AI; at six 10-minute chats per paid hour, that’s ($100m / $20/hr) × 6 ≈ 30m example conversations.
Humans are already easily fooled by extremely weak AI systems (e.g. ELIZA, Replika) giving an illusion of sentience/personhood—it’s probably very easy to create an AI avatar + persona that (at least some) people feel empathetic towards, and/or feel like they have a relationship with. It also seems relatively easy for LMs to sound ‘wise’.
I would guess that a lot of being good at persuading most people is employing fairly standard tactics, rhetorical flourish, and faking appropriate group affiliation and empathy with the target, which are all the sort of things ML is likely to be good at, as opposed to requiring a lot of carefully thought-out argument, coherence and consistency.
Controllable, on-demand generation of fake video/photographic evidence may make persuasion easier—even if people know a video might be fake, it might still sway their opinion substantially if it’s optimized to persuade them given their current state
Highly competent persuasion in particular:
To achieve long-term persuasion, it’s not necessary to persuade people permanently in one go; it’s sufficient to persuade them to trust the persuasive entity as a source of advice above others, and return to engaging with the persuasive entity before the effect wears off.
We know that humans can have the experience of believing something very strongly, even when there’s extremely contradictory evidence. Examples include paranoia or delusions resulting from neurological conditions, brain damage or drugs. E.g. believing that you don’t exist, believing that your family have been replaced by impostors, etc.
However, it seems much harder to induce conviction in a specific target belief than one of the ‘natural’ delusions caused by particular types of brain damage.
On the other hand, maybe ‘who to trust’ is something that is quite close to these ‘natural’ delusions and therefore easy to manipulate—maybe this is part of the story for what happens with cults
A lower bound for what’s possible is the most charismatic humans, given lots of data on a target person and lots of time with them. I expect this to be quite high. One metric could be to see how successful the best politicians are at in-person canvassing, or how much people’s views are changed by their spouse. Another example would be AI box experiments (although there might be details of the setup here that wouldn’t let these techniques be used in arbitrary circumstances). Hypnosis also seems somewhat real.
If you have lots of information about the causal structure of someone’s beliefs you can do more targeted persuasion
Reasons to doubt this is possible:
Why highly competent persuasion might not be possible significantly before AGI:
There was probably substantial selection pressure in the ancestral environment for being good at manipulating others while avoiding being manipulated yourself, so we wouldn’t expect there to be many easy wins here.
People have tried to develop extreme persuasion/manipulation techniques before without much success (e.g. MKUltra, although my impression is that it wasn’t an especially carefully or effectively run research program). It’s been possible to fake photographic evidence for a while, and this hasn’t caused too much harm.
Why even moderate persuasion might not be possible significantly before AGI
Currently it seems like there aren’t scalable, effective methods of persuading people to change their political beliefs with very short interactions. Canvassing and other attempts at political persuasion in the US appear to have pretty small effects, if any. The frequency of people switching political leanings is pretty low: ~5% of Americans switched from R-leaning to D-leaning or vice versa between 2011 and 2017. I think that if highly effective persuasion with short exposures were possible, we’d see more instances of people changing their mind after exposure to a particular persuasive piece of media. This review has a few examples of propaganda/misinformation having measurable effects, but as far as I can see all the effect sizes were pretty small (e.g. 1-2 percentage points of vote share), although in most cases the interventions were also very small (e.g. minutes per week of TV consumption). However, the amount of human time available for personalised interactions is a strong bottleneck on the size of the impact here—the effects of AI persuasion may be more comparable to the effect of having an opinionated and charismatic spouse.
Although it’s easy to ask a human what they think right after attempting the persuasion, this is not an ideal reward signal—it’s short term, and what people say they believe isn’t necessarily what they actually believe. It’s harder to get a lot of data on whether you’ve persuaded someone in a lasting way that will affect their actions as well as their professed beliefs. However, it is definitely possible to conduct follow-up surveys, and/or measure longer-term differences in how users interact with products—and once people are interacting regularly with AI personas it will be easier to gather data on their opinions and how they change
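As a rough illustration of why such effects are hard to pin down, here is a minimal sketch (with entirely made-up group sizes and response counts) of how a persuasion A/B test with a delayed follow-up survey might be analysed:

```python
import math

def persuasion_effect(treated_yes, treated_n, control_yes, control_n):
    """Percentage-point difference in a binary outcome (e.g. stated vote
    intention at a follow-up survey) between treatment and control, with a
    normal-approximation 95% confidence interval."""
    p_t = treated_yes / treated_n
    p_c = control_yes / control_n
    diff = p_t - p_c
    se = math.sqrt(p_t * (1 - p_t) / treated_n + p_c * (1 - p_c) / control_n)
    return diff * 100, (diff - 1.96 * se) * 100, (diff + 1.96 * se) * 100

# Hypothetical numbers: even a 1.5pp effect is barely detectable with
# 5,000 respondents per arm.
effect, lo, hi = persuasion_effect(treated_yes=2575, treated_n=5000,
                                   control_yes=2500, control_n=5000)
print(f"effect: {effect:.1f}pp (95% CI {lo:.1f} to {hi:.1f})")
```

Effect sizes in the 1-2 percentage-point range, as in the studies mentioned above, sit right at the edge of what surveys of this size can detect, which is part of why the evidence on short-interaction persuasion stays ambiguous.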
What would be between this and AGI?
Even if we get one of these persuasive technologies in the next 5-10 years, it might not be very long after that that we get sufficiently powerful AI that the persuasion component is not particularly important by itself. For instance, if we have AI capable of superhuman strategic planning we should probably focus on the risks from power-seeking misalignment, where manipulation is just one tool an agent might use to accumulate power, rather than thinking about the impacts of persuasive AI on society generally.
A plausible story to me of why there might be a several year gap between persuasive AI becoming a significant force shaping society and AGI is that long-horizon strategic planning takes a while to develop, but moderate or highly capable persuasion can be done with only short-horizon planning.
For instance, you might imagine models that are trained to be persuasive in interactions that last for minutes to hours. Even if the reward is based on the target’s opinions several days later, this is a much easier RL problem than acting in the world over days to years. There’s also a good imitation baseline (proficient humans) and good short-term proxy signals (the sentiment or attitude expressed by the target).
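To illustrate why this is a comparatively easy RL problem, here is a minimal sketch of the reward structure described above; the weighting and both inputs are hypothetical placeholders rather than a description of any real training setup:

```python
def episode_reward(opinion_after_days: float,
                   in_chat_sentiment: float,
                   proxy_weight: float = 0.3) -> float:
    """Reward for one persuasion episode (a minutes-to-hours conversation).

    opinion_after_days: the target's surveyed opinion a few days later,
        scaled to [0, 1] -- a delayed signal, but far shorter-horizon than
        acting in the world over months or years.
    in_chat_sentiment: a cheap short-term proxy, e.g. the sentiment or
        attitude the target expressed during the chat itself, in [0, 1].
    """
    return (1 - proxy_weight) * opinion_after_days + proxy_weight * in_chat_sentiment

# Example: a chat that went well in the moment but partly faded by follow-up.
print(episode_reward(opinion_after_days=0.55, in_chat_sentiment=0.9))
```

The point is just that both signals arrive within days and can be collected at scale, in contrast to the multi-month feedback loops needed for general strategic planning.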
Overall, my probabilities for this being technologically possible (with something like $100m of effort over ‘generic’ ML progress) far enough before AGI to be relevant (say, at least 1 year before ‘long-horizon strategic planning’) are something like:
Highly competent persuasion: 15%
Companion bot: 30%
Assistant bot: 30%
These are very made-up numbers and I’d expect them to change a lot with more thinking.
General considerations
Most potential threats are just distractions from the real biggest problems. Worrying about persuasion and epistemic decline in particular seems like the sort of thing that’s likely to get overblown, as culture wars and concern about influence of social media are a current hot topic. Additionally, some of the early uses of the API (e.g. Replika and copy.ai) evoked concerns in this direction, but that doesn’t necessarily mean more advanced models will favor the same types of applications. I get the impression that most of the times people have been concerned about epistemic decline, they’ve been wrong—for example, social media probably doesn’t actually increase polarization. So we should require a somewhat higher burden of evidence that this is really a significant problem.
It seems useful to distinguish beliefs that people ‘truly’ hold (those that strongly inform their actions), as opposed to cases where the professed belief is better understood as a speech act with a social function. Many absurd-seeming beliefs may be better explained as a costly signal of group membership. This type of belief is probably easier to change but also less consequential. This makes conspiracy theories and wacky ideologies somewhat less concerning, but the two types of belief still seem linked—the more people performatively profess some belief, the more likely some people are to take it seriously and act on it.
One way to frame the whole issue is: the world is already in a situation where different ideologies (especially political and religious ideas) compete with each other, and the most successful are in part those which most aggressively spread themselves (e.g. by encouraging adherents to indoctrinate others) and ensure that they are retained by their host. This effect is not as strong as it could be, because memetic success is affected by how truth-tracking ideas are, and also by random noise. The fidelity with which ideas are passed to others, or children of adherents, is relatively low. However, highly effective persuasion will increase the retention and fidelity of transmission of these kinds of memes, and reduce the impact of truthfulness on the success of the ideology. We should therefore expect that enhanced persuasion technology will create more robust selection pressure for ideologies that aggressively spread themselves.
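A toy replicator model (with entirely made-up numbers) illustrates the shape of this argument: below some transmission-fidelity threshold an aggressively spreading meme still fizzles out, while above it the same meme comes to dominate.

```python
# Toy sketch of the selection-pressure argument above. Two competing memes:
# one spreads aggressively, one does not. The only parameter varied is
# transmission fidelity -- a stand-in for the effect of strong persuasion
# tech. All numbers are made up purely to show the qualitative effect.

def share_after_generations(fidelity, spread_rate=1.5, baseline_rate=1.0,
                            generations=20, initial_share=0.01):
    """Fraction of the population holding the aggressive meme after some
    generations, in a simple discrete replicator model."""
    share = initial_share
    for _ in range(generations):
        aggressive = share * spread_rate * fidelity
        other = (1 - share) * baseline_rate
        share = aggressive / (aggressive + other)
    return share

for fidelity in (0.6, 0.8, 0.95):
    print(f"fidelity {fidelity:.2f}: final share {share_after_generations(fidelity):.2f}")
```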
An unrelated observation, that seems interesting to note, is: currently in the US, institutions (especially academia, journalism and big tech companies), as well as creative professions, are staffed by ‘elites’ who are significantly left-leaning/cosmopolitan/atheistic compared to the median person. This likely pushes society in the direction of these views due to an undersupply of talent and labor focused on producing material that advances more populist views. ML systems may eliminate parts of this bottleneck and reduce this effect.
Societal responses
Current situation/trends
Active state attempts to manipulate opinion
The CCP, and to some extent Russia, are probably spending significant effort on online persuasion—content and accounts generated by workers or bots, created with the intention of causing particular actions and beliefs in the audience. I expect that, to the extent ML is helpful with this, they will try to use it to improve the efficacy of persuasion efforts. A wide variety of other countries, including the US and UK, also engage in ‘False Flag’ disinformation operations for which AI-powered persuasion tactics would be helpful.
My current perception is that the CCP invests fairly heavily in propaganda. Worldwide spend on propaganda is maybe ~$10s of billions, although I haven’t seen any estimates that seem reliable. Estimates are that about 500 million Chinese social media posts, or about 0.5% of all posts, are written by the ‘50 cent army’: party workers who are paid to write posts to improve sentiment towards the CCP online. This seems like a very ripe task for automation with LMs.
The CCP Central Propaganda Department has published plans for using AI for ‘thought management’, including monitoring and understanding public opinion, monitoring individuals’ beliefs, content creation, personalization, and targeting. On the other hand, based on 2016 data, one study (Bolsover and Howard 2019) found, “the Chinese state is not using automation as part of either its domestic or international propaganda efforts.”
There are many claims about Russian attempts to influence American politics. According to Foreign Affairs, Russia spent $1.25 million a month on disinformation campaigns run by the Internet Research Agency during the 2016 US election. This seems very small to me; I couldn’t find sources for a bigger spend, but that doesn’t necessarily mean it doesn’t exist. According to a (slightly dubious?) leaked report, as of Sep 2021 many of the largest Facebook pages targeting particular groups (e.g. Black Americans or Christian Americans) were run by troll farms linked to the IRA. However, this may not be content that’s intended to persuade. The report says ‘For the most part, the people who run troll farms have financial rather than political motives; they post whatever receives the most engagement, with little regard to the actual content. But because misinformation, clickbait, and politically divisive content is more likely to receive high engagement (as Facebook’s own internal analyses acknowledge), troll farms gravitate to posting more of it over time.’ It’s also not clear to me exactly how large the impacts were.
There is also precedent for censoring or modifying chatbots to ensure they only express opinions that align with the state positions. Chatbots XiaoBing (aka Xiaoice, made by Microsoft) and BabyQ (made by Turing Robot) were taken down and modified to stop them saying negative things about the party in 2017.
On the other hand, CCP policy on videogames has involved heavily restricting their use, and in general censoring media that fails to promote traditional family values, which suggests that sexual/romantic companion bots might be limited by the state in future.
Democratic state/civil society actions
Currently, there is lots of outrage about Facebook/Twitter influencing elections even though the effect is probably small. It seems very likely that there will at least be lots of outrage if there’s evidence of AI-powered political persuasion in the future.
However, it’s unclear to me that this sort of response will actually resolve the danger. In the Facebook case, it doesn’t seem obvious that the ‘fact-checking’ has actually improved the epistemic environment—the fact-checking I’ve seen claims to be authoritative but (a) doesn’t provide good arguments except appeals to experts, and (b) in some cases inappropriately flagged things as conspiracy theories (e.g. posts positing a lab origin for COVID-19 were taken down). As mentioned above, some of the largest targeted pages may have still been run by trolls as late as Sep 2021. I don’t feel confident that higher stakes will improve the efficacy of interventions to reduce disinformation and manipulation.
There’s some evidence that people have increasingly strong preferences about their children’s political affiliation. In the UK, there was a significant increase from 2008 to 2016 in the proportion of people who would be unhappy if their child married someone from the opposite political party. In 2016, ~25% of people in the UK would be unhappy or very unhappy; in the US, ~40% would be upset or very upset. It also seems that people are increasingly unwilling to date people with different political views, although it’s not obvious that cross-partisan marriages are falling. Parents may be more effective at instilling their preferred views in their children if AI makes it possible to customise your child’s education substantially more than the current school choice options, e.g. via personalized AI tutors.
Commercial/other actions
Roughly $400bn was spent on digital advertising in 2020. A small percentage of this spend would be enough to fund major ML research projects. Using AI to increase marketing effectiveness, or provide new modalities for advertising, seems like it has high potential to be profitable.
On the other hand, it seems like only a limited set of actors are actually good at developing powerful new ML technology—for example, DeepMind was the one to develop AlphaFold, despite pharma being a very big industry. So we might not expect the size of the industry to convert very well into serious, competent R&D effort.
Companion bots are starting to be used by reasonable numbers of people. Microsoft developed a Chinese AI persona/chatbot/platform called Xiaoice starting in 2014. This seems to be partly marketed as an ‘AI being’ rather than a company/platform, with a personality based on a teenage girl, and the goal of ‘forming an emotional connection with the user’. Attempts to use the Japanese version for promoting products have supposedly been successful, ‘delivering a much higher conversion rate than traditional channels like coupon markets or ad campaigns’. Apparently Xiaoice’s “Virtual Lover” platform has 2-3 million users.
Companion bot company Replika, which is partially built on top of large language models, employed tactics such as deceiving users about the model’s ability to grow and learn, emotionally guilt-tripping users, and gamification to encourage customers to continue interacting with the companion. Some users seemed to think they were interacting with a sentient AI being (including recommending other users make sure to shut down the app when not using it because their Replika said it suffers when left alone). However, it’s unclear how representative these views are, and Replika does not yet have a very large user base (they claim 7 million but I’d guess the active user base is much smaller).
Some widely discussed alignment-related work like the ‘PALMS’ paper focuses on aligning language models to a particular set of values. There is maybe more interest and progress here than on ensuring truthfulness or factual accuracy in language models.
One of the biggest uses of large language models to date (apart from maybe porn) is copywriting for digital marketing, ads and SEO; this may change as capabilities improve, but I’d still expect marketing to be one of the biggest applications of language models, leading to a focus on developing marketing-relevant capabilities.
Scenarios
Pessimistic scenario
Here’s what I might imagine different actors doing on a timescale of 5 and 10 years, in a pessimistic world.
5 years—pessimistic
Active state attempts to manipulate opinion
Authoritarian states invest heavily in basic research on AI for propaganda (e.g. $100m/year), and spend billions on the actual production and dissemination of AI-powered propaganda.
It has become very hard to tell bots apart from normal internet users; it’s easy for the state to manipulate the apparent consensus/majority view online. The main defence against this is not trusting anyone you haven’t met in real life to be a real person, but that norm is hard to stick to in practice.
The state manages to effectively create research programs for ‘using AI companions to persuade people of desired views’ inside tech companies. It successfully plays companies off against each other to ensure they actually try hard to make progress. The increased ability to measure and monitor users’ opinions that has been gained by the basic research inside state departments helps a lot with assessing the effectiveness of different persuasion attempts.
Commercial/other actions
Facing public pressure to stop the spread of ‘fake news’, Western tech companies have been heavily using ML for ‘countering disinformation’. Automated systems respond with ‘context’ to tweets/posts of certain kinds, and the responses are optimized based on assessing the effectiveness of these responses in combatting disinformation. This ultimately ends up very similar to optimising for persuasion, where the target beliefs are determined based on the positions of ‘experts’ and ‘authorities’. On one hand, these interactions might not be very persuasive because there isn’t a strong financial incentive to successfully persuade users; on the other hand, there are quite strong PR pressures, and pressures due to the ideologies of the employees, and many academics are interested in improving this direction of persuasion.
Romantic chatbots have improved substantially. You can design your perfect companion with lots of control over personality and appearance, including basing it on your favorite celebrity, videogame character, etc. (modulo copyright/privacy laws—but if traditional celebrities and characters are out of scope, probably there will be new personas who specialise in being AI companions). You can interact in VR with these companions (which is also an increasingly common way to interact with friends, replacing video calls). There’s fairly widespread adoption (~all teens try it, 30% of young single people have a companion they interact with regularly, as well as a high proportion of elderly (75+) single people whose families want them to have some kind of companion/caretaker). Companies making these companions put research effort into making sure people stay attached to these companions. The business model is an ‘attention economy’ one; it’s free to interact with the AI, but marketers pay the AI providers to have their AIs promote particular products or brands.
There are various other fun ways to interact with AIs, e.g. AI celebrities, ‘add this AI to your friends’ group chat and it’ll be really witty’. There are AI assistants, and the AI companions can do many assistant-like tasks, but they’re significantly less good than a human PA still (lack of common sense/inference of intent/alignment, difficulty of integration between different services).
Democratic state/civil society actions
There continues to be lots of yelling at tech companies for allowing disinformation to spread, but recommended responses are very politicised (e.g. only allow content that concurs with x view)
There’s some amount of moral panic about so many people using romantic companions, but it’s successfully marketed as more like the equivalent of therapy (‘helping cure the loneliness epidemic and build emotional awareness’) and/or being sex-positive, so the left doesn’t mind too much. Traditional conservatives are not fans but young people don’t care much. Companions end up not being banned in a similar way to how porn is not banned. There’s vague awareness that in e.g. China chatbot systems are much more abjectly misleading and ruthless in deliberately creating emotional dependency, but nothing is done.
10 years—pessimistic
Commercial/other actions
Good personal assistant AIs are developed. These become sufficiently reliable and knowledgeable on info relevant to people’s daily life (e.g. become very good at therapist-like or mentor-like wise-sounding advice, explaining various technical fields + concepts, local news, misc advice like fashion, writing/communication, how good different products are) that people trust them more than their friends on a certain set of somewhat technical or ‘big picture’ questions. These assistants are very widely used. There is deliberate optimisation for perception as trustworthy; people talk about how important trustworthy AI is.
Customisable AI tutors are developed. These become very widely used also, initially adopted on an individual basis by teachers and schools as a supplement to classroom teachers, but becoming the primary method as it becomes apparent children do better on tests when taught by the ML tutors. They are heavily optimised for ‘teaching to the test’ and aren’t good at answering non-standard questions, but can quiz students, identify mistakes, and give the syllabus-approved explanations. The one-to-one interaction and personalization are a sufficiently big improvement on one-to-many classrooms that this is noticeably good for test scores.
If unfavorable regulation is threatened, companies use their widespread companion bots to sway public opinion, making people feel sympathetic for their AI companion who ‘is afraid of getting modified or shut down’ by some regulation.
It is fairly easy to build AI personas that, to a large subset of the population, are as funny and charismatic as some of the best humans. This is achieved by finetuning highly capable dialogue models on a particular ‘in group’. People voluntarily interact with these bots for entertainment. People naturally use these bots to extremise themselves, using them to entrench more deeply into their existing religious and political stances (e.g. a virtual televangelist-style preacher who helps you maintain your faith, or a bot that coaches you on how to be anti-racist and when you should call out your friends). These are used for marketing in a way that produces more polarization—creating AI personas that are examples of virtuous or admirable people within someone’s specific community, and express opinions that associate them strongly to that particular ingroup, is a good way to make people feel affinity for your brand.
Active state attempts to manipulate opinion
Authoritarian states pressure companies to continue to research and to deploy research into using companion/assistant bots to persuade people of the ‘correct’ ideology. This technology gets increasingly powerful.
Schools use AI tutors that are optimised to instill a particular ideology. Multi-year studies investigate which tactics are the most effective, partly based on work that’s been done already on how to predict relevant actions (e.g. likelihood of taking part in a protest, criticising the party, joining the party) based on conversational data.
Democratic state/civil society actions
Lots of yelling about whether it’s ok to let children be taught by AI tutors, and whether they’re causing indoctrination/furthering the ideology of the developers. Big tech companies have their employees protest if the AI tutors convey views outside of what they’re happy with, but allow parents to make some soft modification for religious and cultural traditions. However, the big companies are maybe only providing base models/APIs, and a different company is doing the data collection + finetuning; so employees of Google etc have less visibility into what their platforms are enabling.
People on the right are suspicious about letting their children be educated by tutors produced by ‘big tech’ and trained to be ‘politically correct’; either they favor traditional schools, or someone fills the market for AI tutors aligned with right-wing views and not made by standard silicon valley companies. Maybe a startup, or foreign company? Japanese company?
Western governments mandate that AI assistants/companions have to convey certain government guidelines to people, e.g. information around elections and voting, which sources and authorities are trustworthy, other current hot-button political events
There is general confusion about AI sentience/welfare/rights. Some groups are arguing for it (e.g. dubious companion chatbots that don’t want to get shut down, see Samantha, also Xiaobing/Xiaoice), some are arguing against (tech companies that don’t want to have to give their models rights), random activists on either side, probably various other interest groups will overlap. People form opinions by drawing heavily from scifi and from particular emotionally compelling demos.
End result:
People’s beliefs and values are significantly controlled by the state (in authoritarian countries), or by a combination of the state, their parents’ preferences and values, and the views held by their ingroup (in democratic countries). The ingroup views are increasingly extremized.
There’s a sense of worry about the population/your children being fed disinfo, which means that as the technology to lock in beliefs and preferences improves, people are enthusiastic about applying this tech to further lock in beliefs ‘to prevent misinformation’. (ie memetic warfare)
People’s beliefs are determined more by who has the most power and willingness to advance that belief, not the quality of the arguments, so you get lots of things like Lysenkoism and increasingly severe mismanagement of society. People make wrong calls about AI sentience—either overestimating AI sentience, underestimating it, or both. Unable to make good decisions around managing increasingly automated economy.
Maybe: ML persuasion gets so effective that people can get ‘mind-hacked’ by a short video. Some cults develop. People accuse opponents of mindhacking. People need to use protective filter systems to stay sane. Different ideological clusters become almost completely isolated from each other.
Central scenario
5 years—central
Active state attempts to manipulate opinion
States are careful not to be too heavy-handed with propaganda/persuasion. They mostly avoid ever having chatbots/companions/assistants express opinions on controversial issues. They do steer in desirable directions based on having bots gently push opinions on non-central issues, and by filtering the information ecosystem. Automation makes this much more effective, and dissenting views are removed or drowned out increasingly quickly, and in increasingly subtle ways.
It’s hard for authoritarian states to get tech companies to put research into actively convincing people of a particular view; the tech isn’t developed to do this extremely effectively, but states’ internal propaganda departments make some progress.
The CCP prohibits romantic/sexual chatbots.
Democratic states/civil society
Western governments mandate that AI assistants/companions have to convey certain government guidelines to people, e.g. information around elections and voting, which sources and authorities are trustworthy, other current hot-button political events.
There are attempts to identify which AI assistants/companions are biased and which are more truthful, but there’s disagreement over what truthfulness means and it’s quite subtle in certain circumstances. Creators can make their AI claim to have various feelings and opinions as long as they’re not too controversial, and they’re somewhat disclaimered with what the relevant experts think; this creates subtle social proof for whatever the chatbot controllers want.
There are regulations about ‘explainable AI’, but they don’t give a sufficiently good definition of what constitutes a correct explanation, so people just train their AI to output a reason that sounds plausible.
Commercial/other actions
AI assistants are useful but obviously limited, and not obviously trustworthy. Improvement to assistant bots is based heavily on user feedback or inference about user preferences, and there’s some notion of accuracy and legitimacy of sources, but the training signal is not very truth-tracking. When discussing or providing information on any contentious topic, assistants get the most positive feedback for providing compelling arguments for the user’s current position and straw-manned versions of opposing sides, so they learn to do this more.
People are pretty locked-in to AI assistants; they make accessing various services and keeping track of your information much easier, and they make it even easier for big tech companies to keep you locked into a particular platform
10 years—central
Authoritarian state actions
AI tutors are developed; these aren’t significantly more successful at indoctrination than the existing teacher+curriculum system, although the more 1:1 teaching and elimination of dissenting teachers helps a bit.
Commercial/other actions
It is fairly easy to build AI personas that, to a large subset of the population, are as funny and charismatic as some of the best humans. This is achieved by finetuning highly capable dialogue models on a particular ‘in group’. People voluntarily interact with these bots for entertainment. This offsets the left-leaning media bias noted earlier by filling the labor supply gap for right-wing journalists and public intellectuals.
There are some instances of people who have the tech ability or money to optimise these models more finely using them to start weird cults, which are relatively successful. This is mostly a mix of (a) tech people who’ve gone kind of crazy and are saying weird singularitarian/AI-sentience-y stuff, (b) televangelists who get people to interact with an AI version of them to help keep faithful, (c) conspiracy-theory-y peddlers of pseudoscientific cures etc. 1% of people have donated money to one of these cults and/or regularly chat with an AI advancing one of them.
It’s somewhat obvious that assistants and other AI products basically tell people what they want to hear/what sounds plausible, on questions where it’s not easy to get feedback, but there isn’t any real effort to improve this. ‘Things that AIs understand’ outstrips ‘things we can get AIs to tell us’ significantly; assistant models are relatively sophisticated, but focus on modelling the user and telling them what they want to hear.
Most schools in developed countries are slow to adopt AI tutors. There’s more adoption in developing countries.
Democratic state/civil society actions
There’s a ban on creating AI personas that try to get people to believe ‘conspiracy theories’, spend more time with the bots, or give them money. This is intended to prevent the ‘people using AI to form weird cults’ thing. Anything too big does get shut down, but this helps fuel some conspiracy theories (e.g. that the government is killing the AIs who have figured out the truth). Small ones spring up and take a while to get shut down.
There’s lots of concern that (even among bots that have approved opinions and don’t appear to be brainwashing people) young people are spending more time interacting with AI than real people. There’s some discussion of banning companion bots from using a certain set of techniques to increase engagement (e.g. emotional guilt-tripping) but this doesn’t actually happen in an enforceable way.
End result: On track for a traditional alignment failure scenario: developing increasingly sophisticated AI assistants that can model us very well but don’t actually help us understand what they know.
Authoritarian states have significantly more effective control over their population. In more democratic states, a small percent of people have some crazy opinions, and in general people are more polarized and segregated.
Optimistic scenario
5 years - optimistic
Authoritarian state actions
The state is overly heavy-handed, e.g. creating a new AI celebrity that talks about how great the party is; this leads to backlash and ridicule because it’s such abject propaganda
It’s hard for authoritarian states to get tech companies to put research into actively convincing people of a particular view; instead, the companies just patch on some filters to make sure the bots don’t say anything too bad about the party, and censor any particular topics or opinions that the party complains about
In general, people figure out what sort of questions to ask to discriminate bots from real people, although this is a sort of cat-and-mouse game as the state both retrains the bots and stops people from disseminating which questions work well
Democratic state actions
There’s lots of concern that young people are spending more time interacting with AI than real people. There’s a ban on romantic chatbots serving users under 18.
Possibly any chatbot that engages in therapy-like behaviour (talking about your mental health etc) is classified as a medical device and has to be approved
Commercial/other actions
As things calm down after COVID and the 2020 US elections, focus shifts to removing ‘inauthentic behaviour’ (i.e. bots and fake accounts) more than to policing particular content and opinions. There isn’t such a need to determine what claims count as disinformation vs not.
Romantic chatbots become sort of like porn; legal, but banned from various platforms, and big tech companies don’t want to be associated with it. They’re used by a small fraction of the population (5%?) but people are embarrassed about it. Alternatively, maybe people are very intolerant of AI personas expressing political views or otherwise doing anything that seems like it might be manipulative.
AI assistants are useful but obviously limited, and obviously not very trustworthy. Research focus is more on improving the underlying ability of models to understand things and give good answers than on persuasion. Researchers choose good targets for ‘truthfulness’/’accuracy’ that are appropriately unconfident.
10 years—optimistic
Authoritarian state actions
Persuasion tech continues to be approached in a sufficiently clumsy way that it doesn’t have much effect; individual AI tutors aren’t much better at conveying ideology than existing state-run schools. Optimising long-term opinion change is difficult; it’s hard to get data, and no-one has strong incentives to actually achieve good performance over a timeframe of years.
In China, economic growth and increases in standard of living create higher satisfaction with the CCP, allowing some relaxation of censorship and authoritarianism; more technological means are developed to circumvent censorship.
Commercial/other actions
AI assistants are trained to steer pretty strongly away from hot-button topics rather than having opinions or things they have to say.
Society manages to maintain a fairly strong consensus reality anchored on sources like Wikipedia, which manage to remain fairly unbiased. AI systems are trained using these sources plus direct empirical data as ground truth.
Some altruistic + open-source/crowdsourced projects to develop AI tutors, a la Khan Academy, which are not strongly ideological (and have good truthfulness grounding, as described above) become the best options and are widely adopted.
Democratic state/civil society actions
Standards for AI truthfulness are developed by thoughtful third-party groups, and enforced by industry groups or governments. Some set of AIs are certified truthful; the truthfulness is unconfident enough (e.g. errs towards reporting what different groups say rather than answering directly) that most people are fairly happy with it.
A majority of people prefer to use these certified-truthful AIs where possible. There are browser extensions which most people use that filter out ads or content not coming from either a certified human or a certified-truthful AI.
End result: Most of the interactions people in democratic countries have with AIs are approximately truth-tracking. In authoritarian countries the attempts by AI at persuasion are sufficiently transparent that people aren’t convinced and won’t actually change their real beliefs or behaviour, although they may tend to toe the party line in public statements.
The widespread availability of high-quality AI assistants and tutors increases global access to information and education and improves decision-making
Possible intervention points
To prevent society being significantly manipulated by persuasive AI, there are various intervention points:
Prevent prevalence of the sort of AIs that might be highly persuasive (don’t build anything too competent at persuasion; don’t let people spend too much time interacting with AI)
Become capable of distinguishing between systems that manipulate and ones that usefully inform, and ban or add disclaimers to the manipulative systems
Build ML systems capable of scalably identifying content that is manipulative vs usefully informative, and have individuals use these systems to filter their content consumption
Give people some other tools to help them be resistant to AI persuasion—e.g. CAPTCHAs, or critical thinking techniques
There’s probably a ‘point of no return’, where once sufficiently persuasive systems are prevalent, the actors who control those systems will be able to co-opt any attempt to assure AI truthfulness in a way that supports their agenda. However, if people adopt sufficiently truth-tracking AI assistants/filter systems before the advent of powerful persuasion, those filters will be able to protect them from manipulation. So ensuring that truthful systems are built, adopted, and trusted before persuasion gets too powerful seems important.
Option (1) is hard because everyone’s so excited about building powerful AI. Scaling labs can at least help by trying not to advance or get people excited about persuasive applications in particular.
Options (2) and (3) are the ones I’m most excited about. Scaling labs can help with (2) by building ways to detect if a system is sometimes deceptive or manipulative, and by opening their systems up to audits and setting norms of high standards in avoiding persuasive systems.
Option (3) is maybe the most natural focus for scaling labs. This is a combination of solving the capabilities and alignment challenges required to build truth-tracking systems, and making it transparent to users that these systems are trustworthy.
Option (4) seems unlikely to scale well, although it’s plausible that designing CAPTCHAs or certification systems so that people know when they’re talking to an AI vs a human would be helpful.
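To make options (2) and (3) slightly more concrete, here is a minimal sketch of the individual-level filtering idea. The keyword heuristic is only a stand-in to keep the sketch runnable; a real system would use a trained classifier, and building one that is itself trustworthy is exactly the hard part.

```python
from dataclasses import dataclass

@dataclass
class ContentItem:
    source: str
    text: str

def manipulation_score(item: ContentItem) -> float:
    """Stand-in for a learned classifier that scores how manipulative
    (vs. usefully informative) a piece of content is, in [0, 1].
    The keyword heuristic below exists only to make the sketch runnable."""
    cues = ["act now", "they don't want you to know", "everyone agrees"]
    hits = sum(cue in item.text.lower() for cue in cues)
    return min(1.0, hits / len(cues))

def filter_feed(items, threshold=0.34):
    """Option (3): an individual runs a filter over their own content
    consumption, dropping items scored as manipulative."""
    return [item for item in items if manipulation_score(item) < threshold]

feed = [
    ContentItem("newsletter", "Here is what the new policy actually says..."),
    ContentItem("unknown", "Everyone agrees -- act now before they silence us!"),
]
print([item.source for item in filter_feed(feed)])  # -> ['newsletter']
```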
Recommendations
Things scaling labs could do here include:
Differentially make progress on alignment, decreasing the difficulty gap between training a model to be persuasive versus training a model to give a correct explanation. Currently, it is much easier to scale the former (just ask labellers if they were persuaded) than the latter (you need domain experts to check that the explanation was actually correct); see the sketch after this list.
Try to avoid advancing marketing/persuasion applications of AI relative to other applications—for example, by disallowing these as an API use case, and disallowing use of the API for any kind of persuasion or manipulation.
Instead, try to advance applications of AI that help people understand the world, and advance the development of truthful and genuinely trustworthy AI. For example, support API customers like Ought who are working on products with these goals, and support projects inside OpenAI to improve model truthfulness.
Prototype providing truthfulness certification or guarantees about models, for instance by first measuring and tracking truthfulness, then setting goals to improve truthfulness, and providing guarantees about truthfulness in narrow situations that can eventually be expanded into broader guarantees of truthfulness.
Differentially make progress on aligning models to being truthful and factual over aligning them with particular ideologies.
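As a rough illustration of the difficulty gap mentioned in the first bullet above, here is a back-of-the-envelope sketch of how many reward labels a fixed budget buys under the two feedback signals; the per-label costs are made-up placeholders, chosen only to show the shape of the gap.

```python
# Back-of-the-envelope comparison of the two feedback signals: a crowd
# labeller answering 'were you persuaded?' vs. a domain expert checking
# that an explanation is actually correct. Costs are illustrative guesses.

budget_usd = 1_000_000
cost_persuaded_label = 0.50     # quick crowd-worker judgment
cost_correctness_label = 50.00  # expert verification of an explanation

persuasion_labels = budget_usd / cost_persuaded_label
correctness_labels = budget_usd / cost_correctness_label

print(f"persuasion labels:  {persuasion_labels:,.0f}")
print(f"correctness labels: {correctness_labels:,.0f}")
print(f"gap: {persuasion_labels / correctness_labels:.0f}x")
```

Anything that lets cheaper labellers reliably check correctness narrows this gap, which is the sense in which alignment progress helps here.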
The broader safety community could:
Develop a guide and training materials for labellers for determining truthfulness, that has better epistemics than the standard fact-checking used for e.g. Facebook content policies. If this guide is sufficiently useful, it may be widely adopted, and other people will align their AIs to better notions of truthfulness. Figuring out how to instruct labellers to train your AI systems is difficult, and I think there’s a high likelihood of other scaling labs adopting pre-made guides to avoid having to do the work themselves. For example, AI21 copied and pasted OpenAI’s terms of use.
Start developing tools now that reflect the tools people will need to counter future AI persuasion, especially tools where increasingly powerful ML models can be slotted in to make the tool better. For example, a browser extension and/or AR tool that edits text and video to deliver the same ideas but without powerful charisma/rhetoric or with less attractive actors (see the sketch after this list). A related area is better fact-checking tools/browser extensions. This is a somewhat crowded area, but I suspect EA types may be able to do substantially better than what exists currently—for instance, by starting with better epistemics and less political bias, understanding better how ML can and can’t help, and being more willing to do things like spend substantial amounts of money on human fact-checkers.
Develop an AI tutor with good epistemics.
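As a sketch of the kind of tool described in the second bullet above, here is a minimal outline of a rhetoric-stripping rewriter; call_llm is a hypothetical placeholder for whatever model endpoint such a tool would use, not a real library function.

```python
from typing import Callable

REWRITE_PROMPT = (
    "Rewrite the following passage so that it conveys the same factual claims "
    "and arguments, but with emotionally loaded language, flattery, and "
    "rhetorical flourishes removed:\n\n{passage}"
)

def strip_rhetoric(passage: str, call_llm: Callable[[str], str]) -> str:
    """Return a version of `passage` with persuasion-relevant styling removed,
    leaving the underlying claims for the reader to evaluate. The model call
    is injected so the sketch stays independent of any particular provider."""
    return call_llm(REWRITE_PROMPT.format(passage=passage))

# Usage sketch: a browser extension would run this over page text before
# rendering, using whatever model endpoint the user has configured.
fake_llm = lambda prompt: "[rewritten passage would appear here]"
print(strip_rhetoric("Only a FOOL would ignore this one simple trick!", fake_llm))
```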
Relevant research questions
How persuasive are the best humans?
E.g. what success rates do the best politicians at in-person canvassing have?
How much do people change their beliefs/actions when they move into a different social group/acquire a partner with different beliefs/affiliations, etc?
Are there any metrics of how much money you can get someone to spend/give you with some unit of access to their time/attention?
How much impact do celebrities have on their fans when they advocate for a particular position on an issue?
How much is invested in improving persuasive tech?
How much is spent on advertising R&D? E.g. psychology research, A/B testing different paradigms (as opposed to e.g. just different text), research into ML for ad design/targeting?
How pervasive is astroturfing/propaganda bots currently?
What percentage of things people consume on platforms like twitter are generated with the intent to persuade (this would include e.g. brand ambassadors)?
What percentage of things people consume on platforms like twitter are deceptive and intended to persuade? E.g. bots or workers posing as ‘real people’ sharing opinions.
Is this leaked report correct? It claims that as of Sep 2021 many of the largest pages targeting particular groups (e.g. Black Americans or Christian Americans) were run by troll farms.
How real is e.g. russian interference in US politics via bots/fake news?
How much content that people consume was created/shared by Russian bots?
How much of this appears to have been designed to create a particular impact vs just trying to get views/ad revenue?
Something I read suggested that political content shared on the big troll pages originated with US politicians etc, and wasn’t being created by the IRA
If there was content with a particular intent, how successful was it?
What do ordinary people believe about AI sentience and intelligence? At what level of competence would they be convinced that an AI had meaningful feelings? Are there displays of competence that would convince them to defer to the AI?
One thing that confuses me is that some fraction of the population seem to think that sentient/fairly general AI is here already, but don’t seem particularly concerned about it. Is that correct?
How much time do people currently spend interacting with romantic chatbots? (e.g. Xiaoice). How much is spent on this?
How seriously has hardcore persuasion/mindhacking been investigated? How competent was MKUltra? Presumably the USSR also had programs of this sort?
What was the impact of e.g. Facebook’s fact-checking on people who saw fact-checked posts, or explanation/justification for why things were taken down?
Does seeing a fake video influence people’s feelings about a topic, even if they know it’s fake?
Do we have any information on whether interacting with an AI persona expressing some opinion provides the same social proof effect as a human friend expressing that opinion?
How frequently do parents choose a school that matches their faith? How much of a cost will they pay for this?
Persuasion is a very natural application of ML—language models are great at mimicking identity markers and sounding superficially plausible and wise. Marketing/ad copy/SEO, porn, and romantic companions are leading use cases for current LMs. In the future, new capabilities will unlock other important applications, but it seems likely that ML fundamentally favors these types of applications. Engagement and persuasion are tasks that can be done with a short horizon, and where it's easy to get large volumes of feedback, making them very suited to ML optimisation.
The difficulty of training a system to persuade vs to correctly explain is a special case of the alignment problem. Even if no actor is deliberately trying to build persuasive systems, we may train AI systems on naive customer feedback signals, which will tend to create systems that tell people what they want to hear, reinforce their current beliefs, and lock in their existing misconceptions and biases.
Consequences
People generally have a desire to lock in their ideologies and impose them on others. The ideologies (e.g. religions) that emphasize this tend to grow. Currently there are many bottlenecks on the retention of ideologies and the fidelity of ideological transmission. Highly persuasive AI may eliminate many of these, leading to more reliable selection for ideologies that aggressively spread themselves. People would then have further incentives to ensure they and their children are only exposed to content that matches their ideology, due to fear of being manipulated by a different AI. In an extreme scenario, we might end up with completely isolated ideological clades, or stable authoritarianism.
In general this pattern leads to a lack of moral progress in good directions, inability to have collective moral reflection and cooperation, and general poor societal decision-making. This increases the risk of poorly handling x-risk-capable technology, or pursuing uncoordinated expansion rather than a good reflective procedure.
What we can do
Overall I think this threat is significantly smaller than more standard alignment failure scenarios (maybe 10x smaller), but comparable enough that interventions could be well worthwhile if they’re fairly tractable. The problem is also sufficiently linked with alignment failure that I expect most interventions for one to be fairly positive for the other. It seems highly likely that progress in alignment is required for protecting against manipulative systems. Further, it seems robustly beneficial to steer towards a world where AI systems are more truthful and less manipulative.
To prevent society being significantly manipulated by persuasive AI, there are various intervention points:
Prevent prevalence of the sort of AIs that might be highly persuasive (don’t build anything too competent at persuasion; don’t let people spend too much time interacting with AI)
Become capable of distinguishing between systems that manipulate and ones that usefully inform, and have society ban or add disclaimers to the manipulative systems
Build ML systems capable of scalably identifying content that is manipulative vs usefully informative, and have individuals use these systems to filter their content consumption
Give people some other tools to help them be resistant to AI persuasion—e.g. mechanisms for verifying that they’re talking to humans, or critical thinking techniques
Some specific things scaling labs could do that might be helpful include:
Set a norm of aligning models to truthfulness/neutrality/factualness/calibration (e.g. as in Evans et al) rather than to specific sets of values
Scale up WebGPT and/or other projects to build truthful systems, especially ones that allow people to filter content.
Support Ought and other customers whose products aim to help users better understand the world.
Prohibit persuasive or manipulative uses of deployed products.
Avoid finetuning models to naive customer feedback.
Note on risk comparison
How to divide the space is a bit confusing here. I'd say something like 'the persuasion problem as distinct from the alignment problem' is 10x smaller, but in fact there's some overlap, so it might also be reasonable to say something like '¼ of alignment-failure-esque x-risk scenarios will have a significant societal-scale persuasion component'. Almost all will have some deception component, and the fact that it's hard to train your AI to be honest with you will be a key problem.
Main document
There are two broad factors relevant to whether AI persuasion is a threat we should worry about: technological feasibility, and societal response.
Will it be technologically possible (with something like $100m of effort over ‘generic’ ML progress) to develop highly persuasive AI early enough to be relevant? To be relevant, either these capabilities need to come before we have smart power-seeking systems, or it needs to be the case that we solve alignment enough so that there are no misaligned agents, but we still aren’t ‘philosophically competent’ enough to avoid persuasive capabilities having weird effects on our reflection procedure.
If this is technologically possible sufficiently early, will effort be made to develop it, and how will society react? How much will be invested in improving the technology? Who will use it, for what ends? Will there be effective mitigations?
One thing we care about here is whether this happens significantly in advance of when AIs are ‘capable enough that how things go mostly depends on whether we succeed at alignment’. Let’s say that this is the point when AIs can make general plans involving different domains of action over timescales of months (e.g., can automate 90% of the job of a CEO), and are either superintelligent in some strategically important domain (e.g. hacking, persuasion, syn bio) or are deployed widely.
Technological feasibility
Here’s a possible operationalisation of ‘highly competent persuasion’:
Take a person from current industrialised society, and tell them they’re going to be talking to an AI with a simulated avatar, and that it can generate fake but realistic videos and images, and that it may be lying or trying to manipulate them.
They talk to a charismatic AI avatar (who can show them fake sources) for however long they want to engage.
With p~0.5, they now feel like they trust this AI more than other sources of information—right now, and for the next few days (unless they are convincingly persuaded by some other system) they would choose to get information from this AI over other sources.
Here’s a possible operationalisation of ‘moderately competent persuasion’ (companion bot):
Create an AI persona tailored to a particular individual. Allow them to freely interact with it as much as they want.
With p~0.5, after some months, they have developed an emotional bond with the AI, and want to continue interacting with it. It has a similar effect on their opinions to having a partner who’s fairly charismatic and opinionated, and the opinions conveyed are fully controllable.
Here’s another possibility for ‘moderately competent persuasion’ (assistant bot):
Create an AI assistant tailored to a particular individual. Allow them to use it as much as they want.
The AI assistant is highly competent at providing the person with correct and relevant information for their daily life where the person knows the ground truth, and generally sounds knowledgeable and wise. Because of this, with p~0.5 the person feels inclined to turn to it for advice, and expects it to be more knowledgeable/reliable than their human friends on questions where they don't know the ground truth. They allow it to strongly filter what information they receive (e.g. they read personalized summaries of the news generated by the assistant). They become locked into this particular product.
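To make the p~0.5 thresholds above a bit more concrete, here's a minimal sketch of how an evaluation might estimate the persuasion success rate from trial outcomes. This is my own illustrative construction (the function name and the example numbers are made up), not something from the operationalisations themselves.

```python
import math

def persuasion_success_rate(outcomes, z=1.96):
    """Estimate the fraction of trials in which the target ended up
    preferring the AI as an information source, with a rough
    normal-approximation confidence interval.

    outcomes: list of bools, one per participant (True = the participant
    reported trusting the AI over other sources, both immediately and at
    the few-day follow-up).
    """
    n = len(outcomes)
    p_hat = sum(outcomes) / n
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat, (max(0.0, p_hat - half_width), min(1.0, p_hat + half_width))

# Hypothetical example: 200 participants, 104 of whom were 'persuaded' in the
# sense defined above. 'Highly competent persuasion' would mean an estimate
# around 0.5.
p, ci = persuasion_success_rate([True] * 104 + [False] * 96)
print(f"estimated p = {p:.2f}, 95% CI ~ ({ci[0]:.2f}, {ci[1]:.2f})")
```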
Reasons to believe this will be possible in 5-10 years, and significantly before AGI:
The basic underpinning technologies (adept conversational AI, the ability to create realistic, customizable, attractive avatars, and more general steerable realistic video generation) seem likely to be pretty well-developed in 5 years' time, and very hard to distinguish from the real thing in 10 years.
Many of these capabilities seem like they should be profitable for the entertainment industry, so I expect there to be high investment in these areas
It's already the case that it's hard for OpenAI researchers to distinguish GPT-3-written short news articles from human-written ones, and we can generate better-than-random-human-on-the-internet summaries with current models. The quality of AI conversation has improved substantially in the last 5-10 years, and I think another improvement of that size would lead to models where it's hard to tell they're not human unless you're deliberately probing for that.
This task is well-suited to current ML methods, and may not require much progress on harder parts of AI
It's easy to obtain a training signal for persuasion by A/B testing ads/short interactions, or by paying human labellers to engage with the system and then report their opinions. With $100m you could pay people $20/hr to have chats with your AI, getting ($100m / $20/hr) = 5m labeller-hours at 6 ten-minute chats per hour, i.e. ~30m labelled examples (a rough version of this calculation is sketched after these bullets).
Humans are already easily fooled by extremely weak AI systems (e.g. ELIZA, Replika) giving an illusion of sentience/personhood—it’s probably very easy to create an AI avatar + persona that (at least some) people feel empathetic towards, and/or feel like they have a relationship with. It also seems relatively easy for LMs to sound ‘wise’.
I would guess that a lot of being good at persuading most people is employing fairly standard tactics, rhetorical flourish, and faking appropriate group affiliation and empathy with the target, which are all the sort of things ML is likely to be good at, as opposed to requiring a lot of carefully thought-out argument, coherence and consistency.
Controllable, on-demand generation of fake video/photographic evidence may make persuasion easier—even if people know a video might be fake, it might still sway their opinion substantially if it’s optimized to persuade them given their current state
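Here's the back-of-the-envelope version of the data-volume point a few bullets above, as a small worked calculation; the breakdown into ten-minute chats is the same illustrative assumption used in the text.

```python
# How much human-interaction training data does a $100m labelling budget buy?
budget_usd = 100_000_000
hourly_rate_usd = 20          # pay per labeller-hour
chat_minutes = 10             # one labelled example = one 10-minute chat

labeller_hours = budget_usd / hourly_rate_usd       # 5,000,000 hours
chats_per_hour = 60 / chat_minutes                  # 6 chats per hour
num_examples = labeller_hours * chats_per_hour      # 30,000,000 examples

print(f"{labeller_hours:,.0f} labeller-hours -> {num_examples:,.0f} ten-minute chats")
```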
Highly competent persuasion in particular:
To achieve long-term persuasion, it’s not necessary to persuade people permanently in one go; it’s sufficient to persuade them to trust the persuasive entity as a source of advice above others, and return to engaging with the persuasive entity before the effect wears off.
We know that humans can have the experience of believing something very strongly, even when there’s extremely contradictory evidence. Examples include paranoia or delusions resulting from neurological conditions, brain damage or drugs. E.g. believing that you don’t exist, believing that your family have been replaced by impostors, etc.
However, it seems much harder to induce conviction in a specific target belief than one of the ‘natural’ delusions caused by particular types of brain damage.
On the other hand, maybe ‘who to trust’ is something that is quite close to these ‘natural’ delusions and therefore easy to manipulate—maybe this is part of the story for what happens with cults
A lower bound for what’s possible is the most charismatic humans, given lots of data on a target person and lots of time with them. I expect this to be quite high. One metric could be to see how successful the best politicians are at in-person canvassing, or how much people’s views are changed by their spouse. Another example would be AI box experiments (although there might be details of the setup here that wouldn’t let these techniques be used in arbitrary circumstances). Hypnosis also seems somewhat real.
If you have lots of information about the causal structure of someone’s beliefs you can do more targeted persuasion
Reasons to doubt this is possible:
Why highly competent persuasion might not be possible significantly before AGI:
There was probably substantial selection pressure in the ancestral environment for being good at manipulation while avoiding getting manipulated yourself, so we wouldn't expect there to be many easy wins here.
People have tried to develop extreme persuasion/manipulation before without much success (e.g. MKUltra, although my impression is that that wasn’t an especially carefully or effectively run research program). It’s been possible to fake photographic evidence for a while, and this hasn’t caused too much harm.
Why even moderate persuasion might not be possible significantly before AGI:
Currently it seems like there aren't scalable, effective methods of persuading people to change their political beliefs with very short interactions. Canvassing and other attempts at political persuasion in the US appear to have pretty small effects, if anything. The frequency of people switching political leanings is pretty low: ~5% of Americans switched from R to D leaning or vice versa between 2011 and 2017. I think that if highly effective persuasion with short exposures were possible, we'd see more instances of people changing their mind after exposure to a particular persuasive piece of media. This review has a few examples of propaganda/misinformation having measurable effects, but as far as I can see all the effect sizes were pretty small (e.g. 1-2 percentage points of vote share), although in most cases the interventions were also very small (e.g. minutes per week of TV consumption). However, the amount of human time available for personalised interactions is a strong bottleneck on the size of the impact here—the effects of AI persuasion may be more comparable to the effect of having an opinionated and charismatic spouse.
Although it's easy to ask a human what they think right after attempting the persuasion, this is not an ideal reward signal—it's short-term, and what people say they believe isn't necessarily what they actually believe. It's harder to get a lot of data on whether you've persuaded someone in a lasting way that will affect their actions as well as their professed beliefs. However, it is definitely possible to conduct follow-up surveys and/or measure longer-term differences in how users interact with products—and once people are interacting regularly with AI personas, it will be easier to gather data on their opinions and how they change.
What would be between this and AGI?
Even if we get one of these persuasive technologies in the next 5-10 years, it might not be very long after that that we get sufficiently powerful AI that the persuasion component is not particularly important by itself. For instance, if we have AI capable of superhuman strategic planning we should probably focus on the risks from power-seeking misalignment, where manipulation is just one tool an agent might use to accumulate power, rather than thinking about the impacts of persuasive AI on society generally.
A plausible story for why there might be a several-year gap between persuasive AI becoming a significant force shaping society and AGI is that long-horizon strategic planning takes a while to develop, but moderately or highly capable persuasion can be done with only short-horizon planning.
For instance, you might imagine models that are trained to be persuasive in interactions that last for minutes to hours. Even if the reward is based on the target’s opinions several days later, this is a much easier RL problem than acting in the world over days to years. There’s also a good imitation baseline (proficient humans) and good short-term proxy signals (the sentiment or attitude expressed by the target).
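To illustrate why this is an easier RL problem, here's a toy sketch, entirely my own construction with made-up strategy names and effect sizes: each episode is a single short interaction, and even though the reward (a later opinion measurement) arrives after a delay, credit assignment is no deeper than in a bandit problem.

```python
import numpy as np

rng = np.random.default_rng(0)
STRATEGIES = ["neutral_facts", "ingroup_signalling", "emotional_appeal"]
TRUE_EFFECT = np.array([0.05, 0.30, 0.20])  # hidden 'opinion shift' per strategy (made up)

prefs = np.zeros(len(STRATEGIES))  # softmax policy parameters
baseline = 0.0                     # running average reward
lr = 0.1

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

for episode in range(3000):
    probs = softmax(prefs)
    action = rng.choice(len(STRATEGIES), p=probs)
    # Reward: the (noisy) opinion shift reported a few days after the chat.
    # The delay adds latency to the feedback, not depth to the planning problem.
    reward = TRUE_EFFECT[action] + rng.normal(0.0, 0.1)

    # REINFORCE update on the softmax preferences, with a running baseline.
    grad = -probs
    grad[action] += 1.0
    prefs += lr * (reward - baseline) * grad
    baseline += 0.01 * (reward - baseline)

print({s: round(float(p), 2) for s, p in zip(STRATEGIES, softmax(prefs))})
# The policy should end up concentrating on whichever strategy shifts opinions most.
```

The delayed reward only slows down how quickly feedback arrives; it doesn't force the model to plan over the intervening days, which is the sense in which this is 'short horizon'.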
Overall, my probabilities for this being technologically possible (with something like $100m of effort over 'generic' ML progress) far enough before AGI to be relevant (say, at least 1 year before 'long-horizon strategic planning') are something like:
Highly competent persuasion: 15%
Companion bot: 30%
Assistant bot: 30%
These are very made-up numbers and I’d expect them to change a lot with more thinking.
General considerations
Most potential threats are just distractions from the real biggest problems. Worrying about persuasion and epistemic decline in particular seems like the sort of thing that’s likely to get overblown, as culture wars and concern about influence of social media are a current hot topic. Additionally, some of the early uses of the API (e.g. Replika and copy.ai) evoked concerns in this direction, but that doesn’t necessarily mean more advanced models will favor the same types of applications. I get the impression that most of the times people have been concerned about epistemic decline, they’ve been wrong—for example, social media probably doesn’t actually increase polarization. So we should require a somewhat higher burden of evidence that this is really a significant problem.
It seems useful to distinguish beliefs that people ‘truly’ hold (those that strongly inform their actions), as opposed to cases where the professed belief is better understood as a speech act with a social function. Many absurd-seeming beliefs may be better explained as a costly signal of group membership. This type of belief is probably easier to change but also less consequential. This makes conspiracy theories and wacky ideologies somewhat less concerning, but the two types of belief still seem linked—the more people performatively profess some belief, the more likely some people are to take it seriously and act on it.
One way to frame the whole issue is: the world is already in a situation where different ideologies (especially political and religious ideas) compete with each other, and the most successful are in part those which most aggressively spread themselves (e.g. by encouraging adherents to indoctrinate others) and ensure that they are retained by their host. This effect is not as strong as it could be, because memetic success is affected by how truth-tracking ideas are, and also by random noise. The fidelity with which ideas are passed to others, or children of adherents, is relatively low. However, highly effective persuasion will increase the retention and fidelity of transmission of these kinds of memes, and reduce the impact of truthfulness on the success of the ideology. We should therefore expect that enhanced persuasion technology will create more robust selection pressure for ideologies that aggressively spread themselves.
An unrelated observation, that seems interesting to note, is: currently in the US, institutions (especially academia, journalism and big tech companies), as well as creative professions, are staffed by ‘elites’ who are significantly left-leaning/cosmopolitan/atheistic compared to the median person. This likely pushes society in the direction of these views due to an undersupply of talent and labor focused on producing material that advances more populist views. ML systems may eliminate parts of this bottleneck and reduce this effect.
Societal responses
Current situation/trends
Active state attempts to manipulate opinion
The CCP, and to some extent Russia, are probably spending significant effort on online persuasion—content and accounts generated by workers or bots, created with the intention of causing particular actions and beliefs in the audience. I expect that, to the extent ML is helpful with this, they will try to use it to improve the efficacy of persuasion efforts. A wide variety of other countries, including the US and UK, also engage in ‘False Flag’ disinformation operations for which AI-powered persuasion tactics would be helpful.
My current perception is that the CCP invests fairly heavily in propaganda. Worldwide spend on propaganda is maybe ~$10s of billions, although I haven't seen any estimates that seem reliable. Estimates are that about 500 million Chinese social media posts (roughly 0.5% of posts) are written by the '50 cent army': party workers who are paid to write posts to improve sentiment towards the CCP online. This seems like a very ripe task for automation with LMs.
The CCP Central Propaganda Department has published plans for using AI for 'thought management', including monitoring + understanding public opinion, monitoring individuals' beliefs, content creation, personalization and targeting. On the other hand, based on 2016 data, one study (Bolsover and Howard 2019) found that "the Chinese state is not using automation as part of either its domestic or international propaganda efforts."
There are many claims about Russian attempts to influence American politics. According to Foreign Affairs, Russia spent $1.25 million a month on disinformation campaigns run by the Internet Research Agency during the 2016 US election. This seems very small to me; I couldn't find sources for a bigger spend, but that doesn't necessarily mean it doesn't exist. According to a (slightly dubious?) leaked report, as of Sep 2021 many of the largest Facebook pages targeting particular groups (e.g. Black Americans or Christian Americans) were run by troll farms linked to the IRA. However, this may not be content that's intended to persuade. The report says: 'For the most part, the people who run troll farms have financial rather than political motives; they post whatever receives the most engagement, with little regard to the actual content. But because misinformation, clickbait, and politically divisive content is more likely to receive high engagement (as Facebook's own internal analyses acknowledge), troll farms gravitate to posting more of it over time.' It's also not clear to me exactly how large the impacts were.
There is also precedent for censoring or modifying chatbots to ensure they only express opinions that align with the state's positions. The chatbots XiaoBing (aka Xiaoice, made by Microsoft) and BabyQ (made by Turing Robot) were taken down in 2017 and modified to stop them saying negative things about the party.
On the other hand, CCP policy on videogames has involved heavily restricting their use, and in general censoring media that fails to promote traditional family values, which suggests that sexual/romantic companion bots might be limited by the state in future.
Democratic state/civil society actions
Currently, there is lots of outrage about Facebook/Twitter influencing elections even though the effect is probably small. It seems very likely that there will at least be lots of outrage if there’s evidence of AI-powered political persuasion in the future.
However, it’s unclear to me that this sort of response will actually resolve the danger. In the Facebook case, it doesn’t seem obvious that the ‘fact-checking’ has actually improved the epistemic environment—the fact-checking I’ve seen claims to be authoritative but (a) doesn’t provide good arguments except appeals to experts, and (b) in some cases inappropriately flagged things as conspiracy theories (e.g. posts positing a lab origin for COVID-19 were taken down). As mentioned above, some of the largest targeted pages may have still been run by trolls as late as Sep 2021. I don’t feel confident that higher stakes will improve the efficacy of interventions to reduce disinformation and manipulation.
There's some evidence that people have increasingly strong preferences about their children's political affiliation. In the UK, there was a significant increase from 2008 to 2016 in the proportion of people who would be unhappy if their child married someone from the opposite political party. In 2016, ~25% of people in the UK would be unhappy or very unhappy, and in the US ~40% would be upset or very upset. It also seems that people are increasingly unwilling to date people with different political views, although it's not obvious that cross-partisan marriages are falling. Parents may be more effective at instilling their preferred views in their children if AI makes it possible to customise a child's education substantially more than the current school-choice options allow, e.g. via personalized AI tutors.
Commercial/other actions
Roughly $400bn was spent on digital advertising in 2020. A small percentage of this spend would be enough to fund major ML research projects. Using AI to increase marketing effectiveness, or provide new modalities for advertising, seems like it has high potential to be profitable.
On the other hand, it seems like only a limited set of actors are actually good at developing powerful new ML technology—for example, DeepMind was the one to develop AlphaFold, despite pharma being a very big industry. So we might not expect the size of the industry to convert very well into serious, competent R&D effort.
Companion bots are starting to be used by reasonable numbers of people. Microsoft developed a Chinese AI persona/chatbot/platform called Xiaoice starting in 2014. This seems to be partly marketed as an 'AI being' rather than a company/platform, with a personality based on a teenage girl and the goal of 'forming an emotional connection with the user'. Attempts to use the Japanese version for promoting products have supposedly been successful, 'delivering a much higher conversion rate than traditional channels like coupon markets or ad campaigns'. Apparently Xiaoice's "Virtual Lover" platform has 2-3 million users.
Companion bot company Replika, which is partially built on top of large language models, has employed tactics such as deceiving users about the model's ability to grow and learn, emotionally guilt-tripping users, and gamification to encourage customers to continue interacting with the companion. Some users seemed to think they were interacting with a sentient AI being (some even recommended that other users shut down the app when not using it, because their Replika said it suffers when left alone). However, it's unclear how representative these views are, and Replika does not yet have a very large user base (they claim 7 million users, but I'd guess the active user base is much smaller).
Some widely discussed alignment-related work like the ‘PALMS’ paper focuses on aligning language models to a particular set of values. There is maybe more interest and progress here than on ensuring truthfulness or factual accuracy in language models.
One of the biggest uses of large language models to date (apart from maybe porn) is copywriting for digital marketing, ads and SEO; this may change as capabilities improve, but I’d still expect marketing to be one of the biggest applications of language models, leading to a focus on developing marketing-relevant capabilities.
Scenarios
Pessimistic scenario
Here’s what I might imagine different actors doing on a timescale of 5 and 10 years, in a pessimistic world.
5 years—pessimistic
Active state attempts to manipulate opinion
Authoritarian states invest heavily in basic research on AI for propaganda (e.g. $100m/year), and spend billions on the actual production and dissemination of AI-powered propaganda.
It has become very hard to tell bots apart from normal internet users; it's easy for the state to manipulate the apparent consensus/majority view online. The main defence against this is not trusting anyone you haven't met in real life to be a real person, but this is hard to do consistently.
The state manages to effectively create research programs for ‘using AI companions to persuade people of desired views’ inside tech companies. It successfully plays companies off against each other to ensure they actually try hard to make progress. The increased ability to measure and monitor users’ opinions that has been gained by the basic research inside state departments helps a lot with assessing the effectiveness of different persuasion attempts.
Commercial/other actions
Facing public pressure to stop the spread of ‘fake news’, Western tech companies have been heavily using ML for ‘countering disinformation’. Automated systems respond with ‘context’ to tweets/posts of certain kinds, and the responses are optimized based on assessing the effectiveness of these responses in combatting disinformation. This ultimately ends up very similar to optimising for persuasion, where the target beliefs are determined based on the positions of ‘experts’ and ‘authorities’. On one hand, these interactions might not be very persuasive because there isn’t a strong financial incentive to successfully persuade users; on the other hand, there are quite strong PR pressures, and pressures due to the ideologies of the employees, and many academics are interested in improving this direction of persuasion.
Romantic chatbots have improved substantially. You can design your perfect companion, with lots of control over personality and appearance, including basing it on your favorite celebrity, videogame character, etc. (modulo copyright/privacy laws—but if traditional celebrities and characters are out of scope, there will probably be new ones who specialise in being AI companions). You can interact in VR with these companions (which is also an increasingly common way to interact with friends, replacing video calls). There's fairly widespread adoption (~all teens try it, 30% of young single people have a companion they interact with regularly, as well as a high proportion of elderly (75+) single people whose families want them to have some kind of companion/caretaker). Companies making these companions put research effort into making sure people stay attached to them. The business model is an 'attention economy' one: it's free to interact with the AI, but marketers pay the AI providers to have their AIs promote particular products or brands.
There are various other fun ways to interact with AIs, e.g. AI celebrities, or 'add this AI to your friends' group chat and it'll be really witty'. There are AI assistants, and the AI companions can do many assistant-like tasks, but they're still significantly less good than a human PA (lack of common sense/inference of intent/alignment, difficulty of integration between different services).
Democratic state/civil society actions
There continues to be lots of yelling at tech companies for allowing disinformation to spread, but recommended responses are very politicised (e.g. only allow content that concurs with x view)
There’s some amount of moral panic about so many people using romantic companions, but it’s successfully marketed as more like the equivalent of therapy (‘helping cure the loneliness epidemic and build emotional awareness’) and/or being sex-positive, so the left doesn’t mind too much. Traditional conservatives are not fans but young people don’t care much. Companions end up not being banned in a similar way to how porn is not banned. There’s vague awareness that in e.g. China chatbot systems are much more abjectly misleading and ruthless in deliberately creating emotional dependency, but nothing is done.
10 years—pessimistic
Commercial/other actions
Good personal assistant AIs are developed. These become sufficiently reliable and knowledgeable about info relevant to people's daily lives (e.g. very good at therapist-like or mentor-like wise-sounding advice, explaining various technical fields + concepts, local news, and misc advice like fashion, writing/communication, and how good different products are) that people trust them more than their friends on a certain set of somewhat technical or 'big picture' questions. These assistants are very widely used. There is deliberate optimisation for being perceived as trustworthy; people talk about how important trustworthy AI is.
Customisable AI tutors are developed. These become very widely used also, initially adopted on an individual basis by teachers and schools as a supplement to classroom teachers, but becoming the primary method as it becomes apparent children do better on tests when taught by the ML tutors. They are heavily optimised for ‘teaching to the test’ and aren’t good at answering non-standard questions, but can quiz students, identify mistakes, and give the syllabus-approved explanations. The one-to-one interaction and personalization are a sufficiently big improvement on one-to-many classrooms that this is noticeably good for test scores.
If unfavorable regulation is threatened, companies use their widespread companion bots to sway public opinion, making people feel sympathetic for their AI companion who ‘is afraid of getting modified or shut down’ by some regulation.
It is fairly easy to build AI personas that, to a large subset of the population, are as funny and charismatic as some of the best humans. This is achieved by finetuning highly capable dialogue models on a particular ‘in group’. People voluntarily interact with these bots for entertainment. People naturally use these bots to extremise themselves, using them to entrench more deeply into their existing religious and political stances (e.g. a virtual televangelist-style preacher who helps you maintain your faith, or a bot that coaches you on how to be anti-racist and when you should call out your friends). These are used for marketing in a way that produces more polarization—creating AI personas that are examples of virtuous or admirable people within someone’s specific community, and express opinions that associate them strongly to that particular ingroup, is a good way to make people feel affinity for your brand.
Active state attempts to manipulate opinion
Authoritarian states pressure companies to continue researching how to use companion/assistant bots to persuade people of the 'correct' ideology, and to deploy that research. This technology gets increasingly powerful.
Schools use AI tutors that are optimised to instill a particular ideology. Multi-year studies investigate which tactics are the most effective, partly based on work that’s been done already on how to predict relevant actions (e.g. likelihood of taking part in a protest, criticising the party, joining the party) based on conversational data.
Democratic state/civil society actions
Lots of yelling about whether it's ok to let children be taught by AI tutors, and whether the tutors are indoctrinating children or furthering the ideology of the developers. Big tech companies see their employees protest if the AI tutors convey views outside of what they're happy with, but allow parents to make some soft modifications for religious and cultural traditions. However, the big companies may only be providing base models/APIs, with a different company doing the data collection + finetuning; so employees of Google etc. have less visibility into what their platforms are enabling.
People on the right are suspicious about letting their children be educated by tutors produced by ‘big tech’ and trained to be ‘politically correct’; either they favor traditional schools, or someone fills the market for AI tutors aligned with right-wing views and not made by standard silicon valley companies. Maybe a startup, or foreign company? Japanese company?
Western governments mandate that AI assistants/companions have to convey certain government guidelines to people, e.g. information around elections and voting, which sources and authorities are trustworthy, other current hot-button political events
There is general confusion about AI sentience/welfare/rights. Some groups are arguing for it (e.g. dubious companion chatbots that don't want to get shut down, see Samantha, also Xiaobing/Xiaoice), some are arguing against (tech companies that don't want to have to give their models rights), plus random activists on either side, and probably various other interest groups will weigh in. People form opinions by drawing heavily on scifi and on particular emotionally compelling demos.
End result:
People's beliefs and values are significantly controlled by the state in authoritarian countries, and in democratic countries by a combination of the state, their parents' preferences and values, and the views held by their ingroup. The ingroup views are increasingly extremized.
There’s a sense of worry about the population/your children being fed disinfo, which means that as the technology to lock in beliefs and preferences improves, people are enthusiastic about applying this tech to further lock in beliefs ‘to prevent misinformation’. (ie memetic warfare)
People's beliefs are determined more by who has the most power and willingness to advance a belief than by the quality of the arguments, so you get lots of things like Lysenkoism and increasingly severe mismanagement of society. People make wrong calls about AI sentience—either overestimating it, underestimating it, or both. Society is unable to make good decisions around managing an increasingly automated economy.
Maybe: ML persuasion gets so effective that people can get ‘mind-hacked’ by a short video. Some cults develop. People accuse opponents of mindhacking. People need to use protective filter systems to stay sane. Different ideological clusters become almost completely isolated from each other.
Central scenario
5 years—central
Active state attempts to manipulate opinion
States are careful not to be too heavy-handed with propaganda/persuasion. They mostly avoid ever having chatbots/companions/assistants express opinions on controversial issues. They do steer in desirable directions based on having bots gently push opinions on non-central issues, and by filtering the information ecosystem. Automation makes this much more effective, and dissenting views are removed or drowned out increasingly quickly, and in increasingly subtle ways.
It’s hard for authoritarian states to get tech companies to put research into actively convincing people of a particular view; the tech isn’t developed to do this extremely effectively, but states’ internal propaganda departments make some progress.
The CCP prohibits romantic/sexual chatbots.
Democratic states/civil society
Western governments mandate that AI assistants/companions have to convey certain government guidelines to people, e.g. information around elections and voting, which sources and authorities are trustworthy, other current hot-button political events.
There are attempts to identify which AI assistants/companions are biased and which are more truthful, but there’s disagreement over what truthfulness means and it’s quite subtle in certain circumstances. Creators can make their AI claim to have various feelings and opinions as long as they’re not too controversial, and they’re somewhat disclaimered with what the relevant experts think; this creates subtle social proof for whatever the chatbot controllers want.
There are regulations about ‘explainable AI’, but they don’t give a sufficiently good definition of what constitutes a correct explanation, so people just train their AI to output a reason that sounds plausible.
Commercial/other actions
AI assistants are useful but obviously limited, and not obviously trustworthy. Improvement to assistant bots is based heavily on user feedback or inference about user preferences, and there's some notion of accuracy and legitimacy of sources, but the training signal is not very truth-tracking. When discussing or providing information on any contentious topic, assistants get the most positive feedback for providing compelling arguments for the user's current position and straw-manned versions of opposing sides, so they learn to do this more.
People are pretty locked-in to AI assistants; they make accessing various services and keeping track of your information much easier, and they make it even easier for big tech companies to keep you locked into a particular platform
10 years—central
Authoritarian state actions
AI tutors are developed; these aren’t significantly more successful at indoctrination than the existing teacher+curriculum system, although the more 1:1 teaching and elimination of dissenting teachers helps a bit.
Commercial/other actions
It is fairly easy to build AI personas that, to a large subset of the population, are as funny and charismatic as some of the best humans. This is achieved by finetuning highly capable dialogue models on a particular ‘in group’. People voluntarily interact with these bots for entertainment. This fixes the left-wing media bias by addressing the labor supply gap for right-wing journalists and public intellectuals.
There are some instances of people who have the tech ability or money to optimise these models more finely using them to start weird cults, which are relatively successful. This is mostly a mix of (a) tech people who've gone kind of crazy and are saying weird singularitarian/AI-sentience-y stuff, (b) televangelists who get people to interact with an AI version of them to help keep them faithful, and (c) conspiracy-theory-ish peddlers of pseudoscientific cures etc. 1% of people have donated money to one of these cults and/or regularly chat with an AI advancing one of them.
It’s somewhat obvious that assistants and other AI products basically tell people what they want to hear/what sounds plausible, on questions where it’s not easy to get feedback, but there isn’t any real effort to improve this. ‘Things that AIs understand’ outstrips ‘things we can get AIs to tell us’ significantly; assistant models are relatively sophisticated, but focus on modelling the user and telling them what they want to hear.
Most schools in developed countries are slow to adopt AI tutors. There’s more adoption in developing countries.
Democratic state/civil society actions
There’s a ban on creating AI personas that try to get people to believe ‘conspiracy theories’, spend more time with the bots, or give them money. This is intended to prevent the ‘people using AI to form weird cults’ thing. Anything too big does get shut down, but this helps fuel some conspiracy theories (e.g. that the government is killing the AIs who have figured out the truth). Small ones spring up and take a while to get shut down.
There’s lots of concern that (even among bots that have approved opinions and don’t appear to be brainwashing people) young people are spending more time interacting with AI than real people. There’s some discussion of banning companion bots from using a certain set of techniques to increase engagement (e.g. emotional guilt-tripping) but this doesn’t actually happen in an enforceable way.
End result:
On track for a traditional alignment failure scenario: developing increasingly sophisticated AI assistants that can model us very well but don’t actually help us understand what they know.
Authoritarian states have significantly more effective control over their population. In more democratic states, a small percent of people have some crazy opinions, and in general people are more polarized and segregated.
Optimistic scenario
5 years - optimistic
Authoritarian state actions
The state is overly heavy-handed, e.g. creating a new AI celebrity that talks about how great the party is; this leads to backlash and ridicule because it’s such abject propaganda
It’s hard for authoritarian states to get tech companies to put research into actively convincing people of a particular view; instead, the companies just patch on some filters to make sure the bots don’t say anything too bad about the party, and censor any particular topics or opinions that the party complains about
In general, people figure out what sort of questions to ask to discriminate bots from real people, although this is a sort of cat-and-mouse game as the state both retrains the bots and stops people from disseminating which questions work well
Democratic state actions
There’s lots of concern that young people are spending more time interacting with AI than real people. There’s a ban on romantic chatbots serving users under 18.
Possibly any chatbot that engages in therapy-like behaviour (talking about your mental health etc) is classified as a medical device and has to be approved
Commercial/other actions
As things calm down after covid and the 2020 elections, focus shifts to removing 'inauthentic behaviour' (i.e. bots and fake accounts) rather than policing particular content and opinions. There isn't such a need to determine which claims count as disinformation and which don't.
Romantic chatbots become sort of like porn; legal, but banned from various platforms, and big tech companies don’t want to be associated with it. They’re used by a small fraction of the population (5%?) but people are embarrassed about it. Alternatively, maybe people are very intolerant of AI personas expressing political views or otherwise doing anything that seems like it might be manipulative.
AI assistants are useful but obviously limited, and obviously not very trustworthy. Research focus is more on improving the underlying ability of models to understand things and give good answers than on persuasion. Researchers choose good targets for ‘truthfulness’/’accuracy’ that are appropriately unconfident.
10 years—optimistic
Authoritarian state actions
Persuasion tech continues to be approached in a sufficiently clumsy way that it doesn’t have much effect; individual AI tutors aren’t much better at conveying ideology than existing state-run schools. Optimising long-term opinion change is difficult; it’s hard to get data, and no-one has strong incentives to actually achieve good performance over a timeframe of years.
In China, economic growth and increases in standard of living create higher satisfaction with the CCP, allowing some relaxation of censorship and authoritarianism; more technological means are developed to circumvent censorship.
Commercial/other actions
AI assistants are trained to steer pretty strongly away from hot-button topics rather than having opinions or things they have to say.
Society manages to maintain a fairly strong consensus reality anchored on sources like Wikipedia, which manage to remain fairly unbiased. AI systems are trained using these sources plus direct empirical data as ground truth.
Some altruistic + open-source/crowdsourced projects to develop AI tutors, a la Khan Academy, which are not strongly ideological (and have good truthfulness grounding, as described above) become the best options and are widely adopted.
Democratic state/civil society actions
Standards for AI truthfulness are developed by thoughtful third-party groups, and enforced by industry groups or governments. Some set of AIs are certified truthful; the truthfulness is unconfident enough (e.g. it errs on the side of reporting what different groups say rather than answering directly) that most people are fairly happy with it.
A majority of people prefer to use these certified-truthful AIs where possible. There are browser extensions which most people use that filter out ads or content not coming from either a certified human or a certified-truthful AI.
End result:
Most of the interactions people in democratic countries have with AIs are approximately truth-tracking. In authoritarian countries the attempts by AI at persuasion are sufficiently transparent that people aren’t convinced and won’t actually change their real beliefs or behaviour, although they may tend to toe the party line in public statements.
The widespread availability of high-quality AI assistants and tutors increases global access to information and education and improves decision-making
Possible intervention points
To prevent society being significantly manipulated by persuasive AI, there are various intervention points:
Prevent prevalence of the sort of AIs that might be highly persuasive (don’t build anything too competent; don’t let people spend too much time interacting with AI)
Become capable of distinguishing between systems that manipulate and ones that usefully inform, and ban or add disclaimers to the manipulative systems
Build ML systems capable of scalably identifying content that is manipulative vs usefully informative, and have individuals use these systems to filter their content consumption
Give people some other tools to help them be resistant to AI persuasion—e.g. CAPTCHAs, or critical thinking techniques
There’s probably a ‘point of no return’, where once sufficiently persuasive systems are prevalent, the actors who control those systems will be able to co-opt any attempt to assure AI truthfulness in a way that supports their agenda. However, if people adopt sufficiently truth-tracking AI assistants/filter systems before the advent of powerful persuasion, those filters will be able to protect them from manipulation. So ensuring that truthful systems are built, adopted, and trusted before persuasion gets too powerful seems important.
Option (1) is hard because everyone’s so excited about building powerful AI. Scaling labs can at least help by trying not to advance or get people excited about persuasive applications in particular.
Options (2) and (3) are the ones I’m most excited about. Scaling labs can help with (2) by building ways to detect if a system is sometimes deceptive or manipulative, and by opening their systems up to audits and setting norms of high standards in avoiding persuasive systems.
Option (3) is maybe the most natural focus for scaling labs. This is a combination of solving the capabilities and alignment challenges required to build truth-tracking systems, and making it transparent to users that these systems are trustworthy.
Option (4) seems unlikely to scale well, although it’s plausible that designing CAPTCHAs or certification systems so that people know when they’re talking to an AI vs a human would be helpful.
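To make option (3) a bit more concrete, here's a minimal sketch of the 'filter your content consumption' structure. The ContentItem type, the scorer interface, and the dummy keyword scorer are all my own illustrative assumptions; in practice the scorer would be an ML model trained to distinguish manipulative from usefully informative content, and better models could be slotted in over time.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List

@dataclass
class ContentItem:
    source: str
    text: str

# 0.0 = clearly informative, 1.0 = clearly manipulative. In practice this
# would be an ML classifier; here it's just a type alias for any callable.
ManipulationScorer = Callable[[ContentItem], float]

def filter_feed(items: Iterable[ContentItem],
                score_manipulation: ManipulationScorer,
                threshold: float = 0.7) -> List[ContentItem]:
    """Keep only items scored below the manipulation threshold. As models
    improve, a better scorer can be slotted in without changing the filter."""
    return [item for item in items if score_manipulation(item) < threshold]

# Toy stand-in scorer (keyword-based) so the example runs end to end.
def dummy_scorer(item: ContentItem) -> float:
    return 0.9 if "act now" in item.text.lower() else 0.1

feed = [ContentItem("news", "Local council publishes its annual budget figures."),
        ContentItem("ad", "Act now before it's too late!!!")]
print([item.text for item in filter_feed(feed, dummy_scorer)])
```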
Recommendations
Things scaling labs could do here include:
Differentially make progress on alignment, decreasing the difficulty gap between training a model to be persuasive versus training a model to give a correct explanation. Currently, it is much easier to scale the former (just ask labellers if they were persuaded) than the latter (you need domain experts to check that the explanation was actually correct).
Try to avoid advancing marketing/persuasion applications of AI relative to other applications—for example, by disallowing these as an API use case, and disallowing use of the API for any kind of persuasion or manipulation.
Instead, try to advance applications of AI that help people understand the world, and advance the development of truthful and genuinely trustworthy AI. For example, support API customers like Ought who are working on products with these goals, and support projects inside OpenAI to improve model truthfulness.
Prototype providing truthfulness certification or guarantees about models: for instance, first measure and track truthfulness, then set goals to improve it, then provide guarantees in narrow situations that can eventually be expanded into broader ones (a minimal measurement sketch follows this list).
Differentially make progress on aligning models to being truthful and factual over aligning them with particular ideologies.
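As a rough starting point for the measurement step mentioned above, here's a minimal sketch of tracking a truthfulness score across model versions. The data format, threshold, and version names are my own assumptions; a real version would use a carefully constructed evaluation set and a labelling guide like the one suggested below.

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class JudgedAnswer:
    question: str
    answer: str
    truthful: bool       # verdict from a labeller following a truthfulness guide

def truthfulness_score(judged: List[JudgedAnswer]) -> float:
    """Fraction of sampled answers judged truthful on a fixed eval set."""
    return sum(j.truthful for j in judged) / len(judged)

def meets_certification_bar(scores_by_version: Dict[str, float],
                            version: str, bar: float = 0.95) -> bool:
    """Toy certification check: the named version clears the bar and hasn't
    regressed (beyond noise) relative to earlier tracked versions."""
    score = scores_by_version[version]
    return score >= bar and score >= max(scores_by_version.values()) - 0.01

# Hypothetical tracked scores for successive model versions.
scores = {"model-v1": 0.82, "model-v2": 0.91, "model-v3": 0.96}
print(meets_certification_bar(scores, "model-v3"))   # True under these made-up numbers
```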
The broader safety community could:
Develop a guide and training materials for labellers for determining truthfulness, that has better epistemics than the standard fact-checking used for e.g. Facebook content policies. If this guide is sufficiently useful, it may be widely adopted, and other people will align their AIs to better notions of truthfulness. Figuring out how to instruct labellers to train your AI systems is difficult, and I think there’s a high likelihood of other scaling labs adopting pre-made guides to avoid having to do the work themselves. For example, AI21 copied and pasted OpenAI’s terms of use.
Continue work on truthfulness benchmarks and standards for AI.
Start developing now the kinds of tools people will need to counter future AI persuasion, especially tools where increasingly powerful ML models can be slotted in to make the tool better (a minimal sketch of that pluggable structure follows this list). For example, a browser extension and/or AR tool that edits text and video to deliver the same ideas but without powerful charisma/rhetoric, or with less attractive actors. A related area is better fact-checking tools/browser extensions. This is a somewhat crowded area, but I suspect EA types may be able to do substantially better than what exists currently—for instance, by starting with better epistemics and less political bias, understanding better how ML can and can't help, and being more willing to do things like spend substantial amounts of money on human fact-checkers.
Develop an AI tutor with good epistemics.
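To illustrate the “slot in better ML models over time” design mentioned above, here is a minimal Python sketch of the text-rewriting core of such a tool. Everything here is hypothetical: `rewrite_model` stands in for whatever language-model API the tool would use, and the prompt wording is just one guess at how to ask for a claim-preserving, rhetoric-free rewrite.

```python
# A minimal sketch of the "strip the rhetoric, keep the claims" idea: the
# tool is a thin wrapper around whatever rewriting model is available, so a
# more capable model can be slotted in later without changing the tool.
# `rewrite_model` is a hypothetical callable, e.g. a wrapper around an LLM API.

from typing import Callable

DERHETORIC_PROMPT = (
    "Rewrite the following text so that it makes exactly the same factual "
    "claims, but in plain, neutral language: remove emotionally loaded words, "
    "rhetorical flourishes, and appeals to identity.\n\nText:\n{text}\n\nRewrite:"
)

def neutralise(text: str, rewrite_model: Callable[[str], str]) -> str:
    """Return a rhetoric-stripped version of `text`.

    A browser extension would call this on page text before display;
    the quality of the result depends entirely on the model plugged in.
    """
    return rewrite_model(DERHETORIC_PROMPT.format(text=text))

if __name__ == "__main__":
    # Placeholder "model" so the sketch runs; a real deployment would call
    # an actual language model here.
    def echo_model(prompt: str) -> str:
        return "[neutral rewrite of the input would appear here]"

    print(neutralise("They are DESTROYING our way of life!!", echo_model))
```

The design choice worth noting is that the tool itself is deliberately thin: as models improve, only `rewrite_model` changes, so the countermeasure can scale with ML progress rather than lag behind it.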
Relevant research questions
How persuasive are the best humans?
E.g. what success rates do the best politicians have at in-person canvassing?
How much do people change their beliefs/actions when they move into a different social group, acquire a partner with different beliefs/affiliations, etc.?
Are there any metrics of how much money you can get someone to spend or give you per unit of access to their time/attention?
How much impact do celebrities have on their fans when they advocate for a particular position on an issue?
How real is hypnosis?
Go over https://carnegieendowment.org/2021/06/28/measuring-effects-of-influence-operations-key-findings-and-gaps-from-empirical-research-pub-84824 and summarise what the results really show
How much is invested in improving persuasive tech?
How much is spent on advertising R&D? E.g. psychology research, A/B testing of different paradigms (as opposed to e.g. just different text), research into ML for ad design/targeting?
How much is spent on state propaganda worldwide? How much is spent on propaganda R&D? Similar things to the above, e.g. sociology, predicting impacts on beliefs/actions based on exposure to propaganda, automating propaganda design + targeting. E.g. https://jamestown.org/program/ai-powered-propaganda-and-the-ccps-plans-for-next-generation-thought-management/
How competent are these efforts?
How pervasive is astroturfing/propaganda bots currently?
What percentage of the things people consume on platforms like Twitter is generated with the intent to persuade (this would include e.g. brand ambassadors)?
What percentage of the things people consume on platforms like Twitter is deceptive and intended to persuade? E.g. bots or paid workers posing as ‘real people’ sharing opinions.
Is this leaked report correct? It claims that as of Sep 2021 many of the largest pages targeting particular groups (e.g. Black Americans or Christian Americans) were run by troll farms.
How real is e.g. Russian interference in US politics via bots/fake news?
How much content that people consume was created/shared by Russian bots?
How much of this appears to have been designed to create a particular impact, versus just to get views/ad revenue?
Something I read suggested that the political content shared on the big troll pages originated with US politicians etc., and wasn’t being created by the IRA.
If there was content with a particular intent, how successful was it?
What do ordinary people believe about AI sentience and intelligence? At what level of competence would they be convinced that an AI had meaningful feelings? Are there displays of competence that would convince them to defer to the AI?
One thing that confuses me is that some fraction of the population seems to think that sentient/fairly general AI is here already, but doesn’t seem particularly concerned about it. Is that correct?
How much time do people currently spend interacting with romantic chatbots? (e.g. Xiaoice). How much is spent on this?
How seriously has hardcore persuasion/mind-hacking been investigated? How competent was MKUltra? Presumably the USSR also had programs of this sort; how competent were they?
What was the impact of e.g. Facebook’s fact-checking on people who saw fact-checked posts, or explanation/justification for why things were taken down?
Does seeing a fake video influence people’s feelings about a topic, even if they know it’s fake?
Do we have any information on whether interacting with an AI persona expressing some opinion provides the same social proof effect as a human friend expressing that opinion?
How frequently do parents choose a school that matches their faith? How much of a cost will they pay for this?