The Next ChatGPT Moment: AI Avatars
Epistemic Status: Speculative. Dependent on intuitions about near-term AI tech and human psychology.
Claim: Within the next 1-3 years, many people will have an interaction with an AI avatar that feels authentically human. This will significantly amplify the public perception of current AI capabilities and risks.
An AI avatar is a realistic AI-generated render of a human (speech and video) that can have a real-time conversation with a human, for example over a video call.
The individual components needed to implement AI avatars already exist. AI is capable of holding a conversation over text, transcribing speech to text, and synthesizing natural-sounding speech.[1] Photorealistic video generation of a talking human is still limited, but it is already impressive and progressing rapidly.
Taken together, these capabilities mean it will soon be possible to create a realistic AI avatar. The first generation avatars will be a bit rough, especially the rendered video, but overall there don’t seem to be large conceptual hurdles to creating convincing AI avatars.[2]
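To make "taken together" concrete, here is a minimal sketch of how the four components could be chained into one conversational turn. It assumes each component is exposed as a simple function. The names `AvatarPipeline`, `transcribe`, `reply`, `synthesize`, and `render`, and the stub stages, are hypothetical placeholders for illustration, not the API of any real product.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, Tuple


@dataclass
class AvatarPipeline:
    """One conversational turn: user audio in, (speech, video) out.

    Each stage is a plain callable so any real model could be plugged in.
    All four stage names are hypothetical, not real library APIs.
    """
    transcribe: Callable[[bytes], str]   # speech-to-text
    reply: Callable[[str], str]          # language model
    synthesize: Callable[[str], bytes]   # text-to-speech
    render: Callable[[bytes], bytes]     # audio-driven video generation

    def turn(self, user_audio: bytes) -> Tuple[bytes, bytes, Dict[str, float]]:
        timings: Dict[str, float] = {}

        def timed(name, fn, arg):
            # Record wall-clock time per stage to see where latency accumulates.
            start = time.perf_counter()
            out = fn(arg)
            timings[name] = time.perf_counter() - start
            return out

        text_in = timed("transcribe", self.transcribe, user_audio)
        text_out = timed("reply", self.reply, text_in)
        speech = timed("synthesize", self.synthesize, text_out)
        video = timed("render", self.render, speech)
        return speech, video, timings


# Stub stages standing in for real models (assumptions for illustration only).
def stub_transcribe(audio: bytes) -> str:
    return audio.decode("utf-8")


def stub_reply(text: str) -> str:
    return f"You said: {text}"


def stub_synthesize(text: str) -> bytes:
    return text.encode("utf-8")


def stub_render(speech: bytes) -> bytes:
    return b"FRAMES:" + speech


pipeline = AvatarPipeline(stub_transcribe, stub_reply, stub_synthesize, stub_render)
speech, video, timings = pipeline.turn(b"hello")
```

The stages run strictly in sequence here, so the per-stage `timings` add up to the delay the user would feel before the avatar responds. A real system would probably stream between stages instead, which this sketch does not attempt.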
Personal conversation with a high-quality AI avatar will have a significant emotional and mental impact on most people.[3] The impact will be especially acute for people distant from the world of AI, but will also affect those familiar with AI.
For humans, the communication medium matters just as much as the content. The same words can hit much harder when spoken in an emotive voice by an expressive face than when silently read off a screen. Having a realistic personal conversation with an AI avatar will change people’s gut-level intuitions about AI.
For better or worse, once decent AI avatars become generally accessible, public sentiment around AI will experience another shift comparable to the one spurred by ChatGPT.[4] AI will be perceived as more human-like and capable. It will seem like an independent agent that possesses “true intelligence”.
After talking with a realistic AI avatar, the common refrains of “It’s not actually intelligent, it just predicts the next token” and “Why would it want anything?” won’t resonate with the public. For many people, consciousness is a prerequisite for real AI, and human-like AI avatars will appear to be a direct instantiation of that.
ChatGPT’s release was a cultural moment.[5] It captured the public’s imagination and triggered a reclassification of AI from sci-fi to present reality. AI avatars could bring on another cultural moment that shifts public perception even further.
The upcoming shift is predictable—AI avatars don’t require any fundamental technical breakthroughs. It’s a major evolution that we have the rare opportunity to prepare for in advance.
[1] Speech-to-text is good enough (OpenAI Whisper), text-to-speech is nearly good enough (ElevenLabs), and conversation / language modeling is good enough (ChatGPT with a Character.ai-style personality). All this currently suffices for realistic audio conversation with an AI. Human video generation isn’t quite good enough yet, but it’s making progress (Audio to Photoreal, HeyGen, Metahuman). Based on the current rate of progress, a functional AI avatar seems attainable within 1-3 years.
[2] Latency might be a problem in the near term. In particular, it’s unclear how fast the video generation will be.
[3] This is already happening to a limited extent. Many people have formed significant emotional attachments through text-only interactions with relatively weak language models (e.g. Character.ai and Replika).
[4] The shift could be more gradual than ChatGPT’s, though. AI avatar tech is improving gradually, whereas ChatGPT arrived all at once.
[5] The Google Trends chart for “AI”. ChatGPT came out on November 30, 2022.
People used to imagine the internet working like a 3D game, with stores and online avatars for the other customers in the store. This turned out not to be useful: the additional information doesn’t help the user.
Mobile apps used to be more like the PC desktop apps they came from; since then, unnecessary elements have gradually been hidden through flat UI design.
While I also kinda imagine an AI collaborating with a human through a little avatar that emotes, jumps around, points to things, and looks distressed when there is no network connection... does this give the user true value?
Or will people find it annoying, so that we instead end up with “flat”, where chatbot outputs become terse and are labeled with model confidence or with whether a specific claim has been fact-checked?
For collaboration on job-like tasks that assumption might hold. For companionship and playful interactions I think the visual domain, possibly in VR/AR, will be found to be relevant and kept. Given our psychological priors, I also think that for many people it may feel like a qualitative change in what kind of entity we are interacting with: from lifeless machine, through uncanny human imitation, to believable personality on another substrate.
Yeah, I also doubt that it will be the primary way of using AI. I’m just saying that AI avatar tech could exist soon and that it will change how the public views AI.
ChatGPT itself is in a bit of a similar situation. It changed the way many people think of AI, even for those who don’t find it particularly useful.
Absolutely. I kinda imagine Microsoft’s Cortana putting her ghostly fingers through foreground apps in Windows, especially native Microsoft apps, to try to help the user out. She would seem to be actually physically helping you and/or actually existing on your computer’s desktop.
But it’s all vestigial, extra pixel rendering that isn’t helping the user accomplish anything. Even the concept of a gender or a voice for the AI is vestigial.
My bet is that conversational agents get buy-in in the early days because of skeuomorphism, but are eventually phased out in favour of more efficient interaction styles.
If you go look at digi.ai’s website for their plans, they basically want to tick all the boxes on this, in a use case where it matters and will make money, and they have already put out a render of what they want it to look like. So I’d guess closer to 1 than 3 years.
Looks like they are focusing on animated avatars. I expect real-time photorealistic video to be the main bottleneck, so I agree that removing that requirement will probably speed things up.
Yes, they’re going with a cute Pixar-like style (I gather they hired an ex-Pixar animator). Anime would likely also work for something like this. Both of those might reduce the psychological impact a little by adding an air of unreality, though I suspect a sufficiently interactive conversation would still have a good deal of impact.
Empirical data point: In my experience, talking to Inflection’s Pi on the phone integrates the listed components (“AI is capable of holding a conversation over text, transcribing speech to text, and synthesizing natural-sounding speech”) at low enough latency to pass some bar of “feels authentically human” to me, until you try to test its limits. I imagine that subjective experience is more likely if you don’t have background knowledge about LLMs / DL. Its main problems are 1) keeping track of context in a plausibly human-like way (e.g. playing a game of guessing the capital cities of European countries leads to repetitive questions about the same few countries, even if it is asked to take care in various ways) and 2) inconsistently refusing to talk about certain things depending on the previous text (e.g. retelling dark jokes by real comedians).
I share your expectation that adding photorealistic video generation to it can plausibly lead to another “cultural moment”, though it might depend on whether such avatars are adopted as rapidly as ChatGPT or phased in more gradually. (I have no overview of the entire space and stumbled over Inflection’s product by chance after listening to a random podcast. If there are similar ones out there already, I’d love to know.)
I guess it could be a great tool to help people quickly learn to converse in a foreign language.