Abstract
Despite much progress in training AI systems to imitate human language, building agents that use language to communicate intentionally with humans in interactive environments remains a major challenge. We introduce Cicero, the first AI agent to achieve human-level performance in Diplomacy, a strategy game involving both cooperation and competition that emphasizes natural language negotiation and tactical coordination between seven players. Cicero integrates a language model with planning and reinforcement learning algorithms by inferring players’ beliefs and intentions from its conversations and generating dialogue in pursuit of its plans. Across 40 games of an anonymous online Diplomacy league, Cicero achieved more than double the average score of the human players and ranked in the top 10% of participants who played more than one game.
Meta Fundamental AI Research Diplomacy Team (FAIR)†, Anton Bakhtin, Noam Brown, Emily Dinan, Gabriele Farina, Colin Flaherty, Daniel Fried, et al. 2022. “Human-Level Play in the Game of Diplomacy by Combining Language Models with Strategic Reasoning.” Science, November, eade9097. https://doi.org/10.1126/science.ade9097.
Meta AI announces Cicero: Human-Level Diplomacy play (with dialogue)
It’s worth noting that they built quite a complicated, specialized AI system (ie they did not take an LLM and finetune a generalist agent that also can play diplomacy):
First, they train a dialogue-conditional action model by behavioral cloning on human data to predict what other players will do.
Then they do joint RL planning to get action intentions of the AI and other players using the outputs of the conditional action model and a learned dialogue-free value model. (They also regularize this plan using a KL penalty to the output of the action model.)
They also train a conditional dialogue model by finetuning a small LM (a 2.7b BART) to map intents + game history > messages. Interestingly, this model is trained in a way that makes it pretty honest by default.
They train a set of filters to remove hallucinations, inconsistencies, toxicity, leaking its actual plans, etc from the output messages, before sending them to other players.
The intents are updated after every message. At the end of each turn, they output the final intent as the action.
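To make the KL penalty in the planning step concrete, here is a minimal single-agent sketch, not the authors' implementation. It uses the standard closed form for maximizing expected value minus a KL penalty toward an anchor policy, which is the anchor reweighted by exp(value / lambda). The action names, the numbers, and the function name kl_regularized_policy are all made up for illustration; Cicero's actual planning is multi-agent and iterative.

```python
import math


def kl_regularized_policy(anchor_probs, action_values, lam):
    """Return the policy pi maximizing E_pi[Q] - lam * KL(pi || anchor).

    Closed form: pi(a) is proportional to anchor(a) * exp(Q(a) / lam).
    Large lam keeps pi close to the anchor (the human-imitation policy);
    small lam lets the value estimates dominate.
    """
    if lam <= 0:
        raise ValueError("lam must be positive")
    # Work in log space and subtract the max for numerical stability.
    logits = {
        a: math.log(p) + action_values[a] / lam
        for a, p in anchor_probs.items()
        if p > 0
    }
    top = max(logits.values())
    weights = {a: math.exp(v - top) for a, v in logits.items()}
    total = sum(weights.values())
    return {a: w / total for a, w in weights.items()}


# Hypothetical anchor policy (what a cloned human model might predict)
# and hypothetical value estimates for three candidate actions.
anchor = {"hold": 0.6, "support_ally": 0.3, "attack_ally": 0.1}
values = {"hold": 0.0, "support_ally": 0.2, "attack_ally": 1.0}

strongly_anchored = kl_regularized_policy(anchor, values, lam=10.0)
weakly_anchored = kl_regularized_policy(anchor, values, lam=0.1)
```

With a large penalty the result stays near the human-like anchor; with a small penalty the highest-value action takes over, even though the anchor considered it unlikely.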
I do expect someone to figure out how to avoid all these dongles and do it with a more generalist model in the next year or two, though.
I think people who are freaking out about Cicero more so than foundational model scaling/prompting progress are wrong; this is not much of an update on AI capabilities nor an update on Meta’s plans (they were publicly working on Diplomacy for over a year). I don’t think they introduce any new techniques in this paper either?
It is an update upwards on the competency of this team at Meta, a slight update upwards on the capabilities of small LMs, and probably an update upwards on the amount of hype and interest in AI.
Oh, and it’s also a slight downwards update on the difficulty of press diplomacy. For example, it might just be possible that you don’t need much backstabbing for expert human-level diplomacy?
This doesn’t mean that Cicero is honest—notably, it can “change its mind” about what action it should take, given the dialogue of other players. For example, in the video, at 26:12 we see Austria saying to Russia that they will keep Gal a DMZ, but at 27:03 Austria moves an army into Gal.
This article interviewing expert Diplomacy players suggests the same (though it somewhat justifies it with player reputations lingering between games, which wasn’t the case here):
As far as I can tell, the AI has no specialized architecture for deciding about its future strategies or giving semantic meaning to its words. Its outputting the string “I will keep Gal a DMZ” does not have the semantic meaning of it committing to keep troops out of Gal. It’s just the phrase that players who are most likely to win would use in that board state, given its internal strategy.
Like chess grandmasters being outperformed by a simple search tree when it was supposed to be the peak of human intelligence, I think this will have the same effect of disenchanting the game of diplomacy. Humans are not decision theoretical geniuses; just saying whatever people want you to hear while playing optimally for yourself is sufficient to win. There may be a level of play where decision theory and commitments are relevant, but humans just aren’t that good.
That said, I think this is actually a good reason to update towards freaking out. It’s happened quite a few times now that ‘naive’ big milestones have been hit unexpectedly soon “without any major innovations or new techniques”—chess, go, starcraft, dota, gpt-3, dall-e, and now diplomacy. It’s starting to look like humans are less complicated than we thought—more like a bunch of current-level AI architectures squished together in the same brain (with some capacity to train new ones in deployment) than like a powerful generally applicable intelligence. Or a room full of toddlers with superpowers, to use the CFAR phrase. While this doesn’t increase our estimates of the rate of AI development, it does suggest that the goalpost for superhuman intellectual performance in all areas is closer than we might have thought otherwise.
This is incorrect; they use “honest” intentions to learn a model of message > intention, then use this model to annotate all the other messages with intentions, which they then use to train the intent > message map. So the model has a strong bias toward being honest in its intention > message map. (The authors even say that an issue with the model is that it has a tendency to spill too many of its plans to its enemies!)
The reason an honest intention > message map doesn’t lead to a fully honest agent is that the search procedure that goes from message + history > intention can “change its mind” about what the best intention is.
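A toy sketch of that point, with every name and number invented for illustration (this is not Cicero's code): the message function below always reports the current intent truthfully, yet the promise is still broken because the planner is re-run on updated value estimates before the move is submitted. The Gal example is borrowed from the video mentioned earlier in the thread.

```python
def plan_intent(value_by_action):
    """Toy planner: pick the action with the highest estimated value."""
    return max(value_by_action, key=value_by_action.get)


def honest_message(intent):
    """Toy intent > message map that always states the current intent."""
    return f"I plan to {intent}."


# Hypothetical value estimates at two points within the same turn.
values_when_speaking = {"keep Gal a DMZ": 0.6, "move into Gal": 0.4}
values_at_deadline = {"keep Gal a DMZ": 0.3, "move into Gal": 0.7}

# The message is generated from the intent held at the time of speaking.
intent_when_speaking = plan_intent(values_when_speaking)
message = honest_message(intent_when_speaking)

# Later dialogue shifts the value estimates, and the planner is re-run.
final_action = plan_intent(values_at_deadline)

message_was_honest = intent_when_speaking in message
promise_was_kept = final_action == intent_when_speaking
```

Here message_was_honest is true and promise_was_kept is false: each component did what it was supposed to, and the other player was still misled.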
This is correct; every time AI systems reach a milestone earlier than expected, this is simultaneously an update upward on AI progress being faster than expected, and an update downward on the difficulty of the milestone.
I’d like to push back on “AI has beaten StarCraft”. AlphaStar didn’t see the game interface we see, it just saw an interface with exact positions of all its stuff and ability to make any commands possible. It’s far from the mouse-and-keyboard that humans are limited to, and in SC that’s a big limitation. When the AI can read the game state from the pixels and send mouse and keyboard inputs, then I’ll be impressed.
I think that this is true of the original version of AlphaStar, but they have since trained a new version on camera inputs and with stronger limitations on APM (22 actions/5s). (Maybe you’d want some kind of noise applied to the inputs still, but I think the current state is much closer to human-like playing conditions.) See: https://www.deepmind.com/blog/alphastar-grandmaster-level-in-starcraft-ii-using-multi-agent-reinforcement-learning
Ah I didn’t know they had upgraded it. I’m much more satisfied that SC2 is solved now.
I think there’s a standard argument that goes “You can’t just copy paste a bunch of systems that are superhuman in their respective domains and get a more general agent out.” (e.g. here’s David Chapman saying something like this: https://mobile.twitter.com/Meaningness/status/1563913716969508864)
If you have that belief, I imagine this paper should update you more towards AI capabilities. It is indeed possible to duct tape a bunch of different machine learning models together and get out something impressive. If you didn’t believe this, it should update you on the idea that AGI could come from several small new techniques duct taped together to handle each other’s weakness.
I don’t think that Cicero is a general agent made by gluing together superhuman narrow agents! It’s not clear that any of its components are superhuman in a meaningful sense.
I also don’t think that “you can’t just copy paste together a bunch of systems that are superhuman...” is a fair summary of David Chapman’s tweet! I think his tweet is specifically pointing out that naming your components suggestive names and drawing arrows between them does not do the hard work of building your generalist agent (which is far more involved).
(Btw, your link is broken, here’s the tweet.)
I don’t either! I think it should update your beliefs that that’s possible though.
I don’t see why it should update my beliefs a non-negligible amount? I expected techniques like this to work for a wide variety of specific tasks given enough effort (indeed, stacking together 5 different techniques into a specialist agent is what a lot of academic work in robotics looks like). I also think that the way people can compose text-davinci-002 or other LMs with themselves into more generalist agents basically should screen off this evidence, even if you weren’t expecting to see it.
I didn’t say it should update your beliefs (edit: I did literally say this lol but it’s not what I meant!) I said it should update the beliefs of people who have a specific prevailing attitude.
I do believe David Chapman’s tweet though! I don’t think you can just hotwire together a bunch of modules that are superhuman only in narrow domains, and get a powerful generalist agent, without doing a lot of work in the middle.
(That being said, I don’t count gluing together a Python interpreter and a retrieval mechanism to a fine-tuned GPT-3 or whatever to fall in this category; here the work is done by GPT-3 (a generalist agent) and the other parts are primarily augmenting its capabilities.)
I think what we’re seeing here is that LLMs can act as glue to put together these modules in surprising ways, and make them more general. You see that here and with SayCan. And I do think that Chapman’s point becomes less tenable with them in the picture.
So… LLMs are AGIs?
LLMs can act as glue that makes AIs more G?
Yes, essentially.
Five days ago, AI safety YouTuber Rob Miles posted on Twitter, “Can we all agree to not train AI to superhuman levels at Full Press Diplomacy? Can we please just not?”
I don’t know anything about Diplomacy and I just watched this video, could someone expand a bit on why this game is a particularly alarming capability gain? The chat logs seemed pretty tame, the bot didn’t even seem to attempt psychological manipulation or gaslighting or anything similar. What important real world capability does Diplomacy translate into that other games don’t? (People for instance don’t seem very alarmed nowadays about AI being vastly superhuman at chess or Go.)
So Diplomacy is not a computationally complex game, it’s a game about out-strategizing your opponents where roughly all of the strategy is convincing your opponents to work with you. There are no new tactics to invent and an AI can’t really see deeper into the game than other players, it just has to be more persuasive and make decisions about the right people at the right time. You often have to do things like plan your actions ahead so that in a future turn someone else will choose to ally with you. The AI didn’t do any specific psychological manipulation, it was just good at being persuasive and strategic in the normal human way. It’s also notable for being able to both play the game and talk with people about the game.
This could translate into something like being good at convincing people that the AI should be let out of its box, but I think mostly it’s just being better at multiple skills simultaneously than many people expected.
(Disclaimer: I’ve only played Diplomacy in person before, and not at this high of a level)
I don’t think the game is an alarming capability gain at all; I agree with LawrenceC’s comment below. It’s more of a “gain-of-function research” scenario to me. Like, maybe we shouldn’t deliberately try to train a model to be good at this? If you’ve ever played Diplomacy, you know the whole point of the game is manipulating and backstabbing your way to world domination. I think it’s great that the research didn’t actually seem to come up with any scary generalizable techniques or dangerous memetics, but I think ideally we shouldn’t even be trying in the first place.
Previous: SBER → Gray et al 2020 → DORA.
I commented back in June 2020 of SBER that “natural language Diplomacy agents surely can’t be too much more difficult given NLM progress...”, and indeed, they were not, despite the insane leap in capabilities from “the best NN can’t even beat humans at a simplified Diplomacy shorn of all communication and negotiation and manipulation and deception aspects” to “NNs can now talk a lot of human players into losing”. The tide is rising, and it is still May 2020.
Timeline considerations:
This is not particularly unexpected if you believed in the scaling hypothesis. (It should be surprising if you continue to take seriously alternative & still-prestigious centrist paradigms like “we need a dozen paradigm shifts and it’ll take until 2050”.)
The human range is narrow, so once you reach SBER and GPT-3, you already are most of the way to full-press Diplomacy. The fact that the authors & forecasters thought it would take until 2029 in 2019 (ie 10 years instead of 3 years) is part and parcel of the systematic underestimation of DL, which we have seen elsewhere in eg all the forecasts shattered by inner-monologue techniques—as Eliezer put it, experts can often be the blindest because they miss the Outside View forest for the Inside View trees, having ‘slaved over a hot GPU’ for so long.
From a scaling perspective, the main surprise of the timing is that Facebook chose to fund this research for this long.
As Diplomacy is unimportant and a pretty niche game even among board games, there’s no reason it ‘had’ to be solved this side of 2030 (when it presumably would become so easy that even a half-hearted effort by a grad student or hobbyist would crack it). Similarly, the main surprise in DeepMind’s recent Stratego human-level AI is mostly that ‘they bothered’. Keep this in mind if you want to forecast Settlers of Catan progress: it’s not hard, and the real question you are forecasting is, ‘will anyone bother?’ (And, since the tendency of research groups is to bury their failures out back, you also can’t forecast using wording like ‘conditional on a major effort by a major DL group like FAIR, DM, OA etc’ - you won’t know if anyone does bother & fails because it’s genuinely hard.)
Deception is deceptive: one interesting aspect of the ‘honesty’ of the bot is that it might show how deception is an emergent property of a whole system, not just the one part. (EDIT: some Twitter discussion)
CICERO may be constrained to be ‘honest’ in each interaction but it still will betray you if you trust it & move into positions where betraying you is profitable. Is it merely opportunistic, or is it analogous to humans where self-deception makes you more convincing? (You sincerely promise to not steal X, but the temptation of X eventually turns out to be too great...) It is trained end-to-end and is estimating expected value, so even if there is no ‘deception module’ or ‘deception intent’ in the planning, the fact that certain statements lead to certain long-term payoffs (via betrayal) may influence its value estimates, smuggling in manipulation & deception. Why did it pick option A instead of equally ‘honest’ option B? No idea. But option A down the line turns out to ‘unexpectedly’ yield a betrayal opportunity, which it then takes. The interplay between optimization, model-free, model-based planning, and the underlying models is a subtle one. (Blackbox optimization like evolution could also evolve this sort of de facto deception even when components are constrained to be ‘honest’ on a turn-by-turn basis.)
I’m not sure of this, because piKL (and the newer variants introduced & used in CICERO) are complex (maybe some causal influence diagrams would be helpful here), but if so, it’d be interesting, and a cautionary example for interpretability & alignment research. Just like ‘security’ and ‘reliability’, honesty is a system-level property, not a part-level property, and the composition of many ‘honest’ components can yield deceptive actions.
From a broader perspective, this result seems to continue to reinforce the observation “maybe humans just aren’t that smart”.
Here’s full-press Diplomacy, a game so hard that they don’t even have a meaningful AI baseline to compare to because all prior agents were so bad, which is considered one of the pinnacles of social games, combining both a hard board game with arbitrarily complicated social dynamics mediated through unconstrained natural language; and yet. They use a very small language model, and not even that much compute for the CFR planning, in a ramshackle contraption, and… it works well? Yeah, 5 minute rounds and maybe not the very best players in the world, OK, but come on, we’ve seen how this story goes, and of course, the first version is always the worst, which means that given more R&D it’ll become much more powerful and more efficient in the typical DL experience curve. ‘Attacks only get better’ / ‘sampling can show the presence of knowledge but not the absence’ - small LMs are already quite useful.
Gain-of-lambda-function research: yes, this is among the worser things you could be researching, up there with the Codex code evolution & Adept Transformer agents. There are… uh, not many realistic, beneficial applications for this work. No one really needs a Diplomacy AI, and applications to things like ad auctions are tenuous. (Note the amusing wriggling of FB PR when they talk about “a strategy game which requires building trust, negotiating and cooperating with multiple players”—you left out some relevant verbs there...) And as we’ve seen with biological research, no matter how many times bugs escape research laboratories and literally kill people, the déformation professionnelle will cover it up and justify it. Researchers who scoff at the idea that a website should be able to set a cookie without a bunch of laws regulating it suddenly turn into don’t-tread-on-me anarchocapitalists as soon as it comes to any suggestion that their research maybe shouldn’t be done.
But this is far from the most blatantly harmful research (hey, they haven’t killed anyone yet, so the bar is high), so we shouldn’t be too hard on them or personalize it. Let’s just treat this as a good example for those who think that researchers collectively have any fire alarm and will self-regulate. (No, they won’t, and someone out there is already calculating how many megawatts the Torment Nexus will generate and going “Sweet!” and putting in a proposal for a prototype ‘Suffering Swirlie’ research programme culminating in creating a Torment Nexus by 2029.)
Cicero is not particularly unexpected to me, but my expectations here are not driven by the scaling hypothesis. The result achieved here was not achieved by adding more layers to a single AI engine, it was achieved by human designers who assembled several specialised AI engines by hand.
So I do not view this result as one that adds particularly strong evidence to the scaling hypothesis. I could equally well make the case that it adds more evidence to the alternative hypothesis, put forward by people like Gary Marcus, that scaling alone as the sole technique has run out of steam, and that the prevailing ML research paradigm needs to shift to a more hybrid approach of combining models. (The prevailing applied AI paradigm has of course always been that you usually need to combine models.)
Another way to explain my lack of surprise would be to say that Cicero is just a superhuman board game playing engine that has been equipped with a voice synthesizer. But I might be downplaying the achievement here.
I have not read any of the authors’ or Meta’s messaging around this, so I am not sure if they make that point, but the sub-components of Cicero that somewhat competently and ‘honestly’ explain its currently intended moves seem to have beneficial applications too, if they were combined with an engine which is different from a game engine that absolutely wants to win and that can change its mind about moves to play later. This is a dual-use technology with both good and bad possible uses.
That being said, I agree that this is yet another regulatory wake-up call, if we needed one. As a group, AI researchers will not conveniently regulate themselves: they will move forward in creating more advanced dual-use technology, while openly acknowledging (see annex A.3 of the paper) that this technology might be used for both good and bad purposes downstream. So it is up to the rest of the world to make sure that these downstream uses are regulated.
Well, first I would point out that I predicted it would happen soon, while literally writing “The Scaling Hypothesis”, and the actual researchers involved did not, predicting it would take at least a decade; if it is not predicted by the scaling hypothesis, but is in fact predicted better by neurosymbolic views denying scaling, it’s curious that I was the one who said that and not Gary Marcus.
Second, it’s hard to see how a success using models which would have been among the largest NNs ever trained just 3 years ago, finetuned using 128-256 GPUs on a dataset which would be among the largest compute-clusters & datasets ever used in DL also up until a few years ago, requiring 80 CPU-cores & 8 GPUs at runtime (which would be &etc), is really a vindication for hand-engineered neurosymbolic approaches emphasizing the importance of symbolic reasoning and complicated modular approaches, as opposed to approaches emphasizing the importance of increasing data & compute for solving previously completely intractable problems. Nor do the results indicate that the modules are really all that critical. I do not think CICERO is going to need even an AlphaZero level of breakthrough to remove various parts, when eg. the elaborate filter system representing much of the ‘neurosymbolic’ part of the system is only removing 15.5% of all messages.
I regard this sort of argument as illegitimate because it is a fully-general counterargument which allows strategy-stealing and explains everything while predicting nothing: no matter how successful scaling is, it will always be possible (short of a system being strongly superhuman) to take a system and improve it in some way by injecting hand-engineering or borrowing other systems and lashing them together. This is fine for purely pragmatic purposes, but the problem is when one then claims it is a victory for ‘neurosymbolic programming’ or whatever catchphrase is in vogue. Somehow, the scaled-up system is never a true scotsman of scaling, but the modifications or hybrids are always true scotsmen of neurosymbolics… Those systems will then be surpassed a few years later by larger (and often simpler) systems, thereby refuting the previous argument—only for the same thing to happen again, ratcheting upwards. The question is not whether some hand-engineering can temporarily leapfrog the Bitter Lesson (it is quite explicitly acknowledged by Sutton and scaling proponents that you can gain constant factors by doing stuff which is not scaling), but whether it will progress faster. CICERO would only be a victory for neurosymbolic approaches if it had used nothing you couldn’t’ve done in, say, 2015.
This seems to be what you are doing here: you handwave away the use of BART and extremely CPU/GPU-intensive search as not a victory for scaling (why? it was not a foregone conclusion that a BART would be any good, nor that the search could be made to work, without far more extensive engineering of theory of mind and symbolic reasoning), and then only count the lashing-together of a few scaled-up components as a victory for neurosymbolic approaches instead.
Thanks, that does a lot to clarify your viewpoints. Your reply calls for some further remarks.
I’ll start off by saying that I value your technology tracking writing highly because you are one of those blogging technology trackers who is able to look beyond the press releases and beyond the hype. But I have the same high opinion of the writings of Gary Marcus.
For the record: I am not trying to handwave the progress-via-hybrid-approaches hypothesis of Marcus into correctness. The observations I am making here are much more in the ‘explains everything while predicting nothing’ department.
I am pointing out that both your progress-via-scaling hypothesis and the progress-via-hybrid-approaches hypothesis of Marcus can be made to explain the underlying Cicero facts here. I do not see this case as a clear victory for either one of these hypotheses. What we have here is an AI design that cleverly combines multiple components while also being impressive in the scaling department.
Technology tracking is difficult, especially about the future.
The following observation may get to the core of how I may be perceiving the elephant differently. I interpret an innovation like GANs not as a triumph of scaling, but as a triumph of cleverly putting two components together. I see GANs as an innovation that directly contradicts the message of the Bitter Lesson paradigm, one that is much more in the spirit of what Marcus proposes.
Here is what I find particularly interesting in Marcus. In pieces like The Next Decade in AI: Four Steps Towards Robust Artificial Intelligence, Marcus is advancing the hypothesis that the academic Bitter-Lesson AI field is in a technology overhang: these people could make a lot of progress on their benchmarks very quickly, faster than mere neural net scaling will allow, if they were to ignore the Bitter Lesson paradigm and embrace a hybrid approach where the toolbox is much bigger than general-purpose learning, ever-larger training sets, and more and more compute. Sounds somewhat plausible to me.
If you put a medium or high probability on this overhang hypothesis of Marcus, then you are in a world where very rapid AI progress might happen, levels of AI progress much faster than those predicted by the progress curves produced by Bitter Lesson AI research.
You seem to be advancing an alternative hypothesis, one where advances made by clever hybrid approaches will always be replicated a few years later by using a Bitter Lesson style monolithic deep neural net trained with a massive dataset. This would conveniently restore the validity of extrapolating Bitter Lesson driven progress curves, because you can use them as an upper bound. We’ll see.
I am currently not primarily in the business of technology tracking, I am an AI safety researcher working on safety solutions and regulation. With that hat on, I will say the following.
Bitter-lesson style systems consisting of a single deep neural net, especially if these systems are also model-free RL agents, have huge disadvantages in the robustness, testability, and interpretability departments. These disadvantages are endlessly talked about on this web site of course. By contrast, systems built out of separate components with legible interfaces between them are usually much more robust, interpretable and testable. This is much less often mentioned here.
In safety engineering for any high-risk application, I would usually prefer to work with an AI system built out of many legible sub-components, not with some deep neural net that happens to perform equally or better on an in-training-distribution benchmark. So I would like to see more academic AI research that ignores the Bitter Lesson paradigm, and the paradigm that all AI research must be ML research. I am pleased to say that a lot of academic and applied AI researchers, at least in the part of the world where I live, never got on board with these paradigms in the first place. To find their work, you have to look beyond conferences like NeurIPS.
That’s an interesting example to pick: watching GAN research firsthand as it developed from scratch in 2014 to bust ~2020 played a considerable role in my thinking about the Bitter Lesson & Scaling Hypothesis & ‘blessings of scale’, so I regard GANs differently than you do, and have opinions on the subject.
First, GANs may not be new. Even if you do not entirely buy Schmidhuber’s claim that his predictability-minimization arch 30 years ago or whatever is identical to GANs (and better where not identical), there’s also that 2009 blog post or whatever by someone reinventing it. And given the distribution of multiple-discoveries, that implies there’s probably another 2 or 3 reinventions out there somewhere. If what matters about GANs is the ‘innovation’ of the idea, and not scaling, why were all the inventions so sterile and yielded so little until so late? (And why have such big innovations been so thoroughly superseded in their stronghold of image generation by approaches which look nothing at all like GANs and appear to owe little to them, like diffusion models?)
Second, GANs in their successful period were clearly a triumph of compute. All the biggest successes of GANs use a lot of compute historically: BigGAN is trained on literally a TPUv3-512 pod. None of the interesting success stories of GANs look like ‘train on a CPU with levels of compute available to a GOFAI researcher 20 years ago because scaling doesn’t matter’. Cases like StyleGAN where you seem to get good results from ‘just’ 4 GPU-months of compute (itself a pretty big amount) turn out to scale infamously badly to more complex datasets, and to be the product of literally hundreds or thousands of fullscale runs, and deadends. (The Karras group sometimes reports ‘total compute used during R&D’ in their papers, and it’s an impressively large multiple.) For all the efficiency of StyleGAN on faces, no one is ever going to make it work well on LAION-4b etc. (At Tensorfork, our StyleGAN2-ext worked mostly by throwing away as much of the original StyleGAN2 hand engineering as possible, which held it back, in favor of increasing n/param/compute in tandem. The later distilled ImageNet StyleGAN approach relies on simplifying the problem drastically by throwing out hard datapoints, and still only delivers bad results on ImageNet.) And that’s the most impressive ‘innovation’ in GAN architecture. This is further buttressed by the critical meta-science observations in GAN research that when you compared GAN archs on a fair basis, in a single codebase with equal hyperparam tuning etc, fully reproducibly, you typically found… arch made little difference to the top scores, and mostly just changed the variance of runs. (This parallels other field-survey replication efforts like in embedding research: results get better over time, which researchers claim reflect the sophistication of their architectures… and the gains disappear when you control for compute/n/param. 
The results were better, but just because later researchers used moar dakka, and either didn’t realize that all the fruits of their hard work were actually due to harder-working GPUs or quietly omitted that observation, in classic academic fashion. I feel like every time I get to ask a researcher who did something important in private how it really happened, I find out that the story told in the paper is basically a fairy tale for small children and feel like the angry math student who complained about Gauss, “he makes [his mathematics] like a fox, wiping out the traces in the sand with his tail.”)
Third, to further emphasize that the triumph of GANs was due to compute, their fall appears to be due to compute too. The obsolescence of GANs is, as far as I can tell, due solely to historical happenstance. BigGAN never hit a ceiling; no one calculated scaling laws on BigGAN FIDs and showed it was doomed; BigGAN doesn’t get fatally less stable with scale, Brock demonstrated it gets more stable and scales to at least n = 0.3b without a problem. What happened was simply that the tiny handful of people who could and would do serious GAN scaling happened to leave the field (eg Brock) or run into non-fundamental problems which killed their efforts (eg Tensorfork), while the people who did continue DL scaling happened to not be GAN people but specialized in alternatives like autoregressive, VAE, and then diffusion models. So they scaled up all of those successfully, while no one tried with GANs, and now the ‘fact’ that ‘GANs don’t work’ is just a thing That Is Known: “It is well known GANs are unstable and cannot be trained at scale, which is why we use diffusion models instead...” (Saw a version of that in a paper literally yesterday.) There is never any relevant evidence given for that, just bare assertion or irrelevant cites. I have an essay I should finish explaining why GANs should probably be tried again.
Fourth, we can observe that this is not unique to GANs. Right now, it seems like pretty much any image arch you might care to try works. NeRF? A deep VAE? An autoregressive Transformer on pixels? An autoregressive on VAE tokens? A diffusion on pixels? A diffusion on latents? An autoencoder? Yeah sure, they all work, even (perhaps especially) the ones which don’t work on small datasets/compute. When I look at Imagen samples vs Parti samples, say, I can’t tell which one is the diffusion model and which one is the autoregressive Transformer. Conceptually, they have about as much in common as a fungus with a cockroach, but… they both work pretty darn well anyway. What do they share in common, besides ‘working’? Compute, data, and parameters. Lots and lots and lots of those. (Similarly for video generation.) I predict that if GANs get any real scaling-up effort put into improving them and making them run at similar scale, we will find that GANs work well too. (And will sample a heckuva lot faster than diffusion or AR models...)
‘Innovative’ archs are just not that important. An emphasis on arch, much less apologizing for ‘neurosymbolic’ approaches, struggles to explain this history. Meanwhile, an emphasis on scaling can cleanly explain why GANs succeeded at Goodfellow’s reinvention, why GANs fell out of fashion, why their rivals succeeded, and may yet predict why they get revived.
We don’t need to see. Just look at the past, which has already happened.
Yeah, he’s wrong about that because it’d only be true if academic markets were maximizing for long-term scaling (they don’t) rather than greedily myopically optimizing for elaborate architectures that grind out an improvement right now. The low-hanging fruit has already been plucked and everyone is jostling to grab a slightly higher-hanging fruit they can see, while the long-term scaling work gets ignored. This is why various efforts to take a breakthrough of scaling, like AlphaGo or GPT-3, and hand-engineer it, yield moderate results but nothing mindblowing like the original.
No. You’d predict that there’d potentially be more short-term progress (as researchers pluck the fruits exposed by the latest scaling bout) but then less long-term (as scaling stops, because everyone is off plucking but with ever diminishing returns).
Here too I disagree. What we see with scaled up systems is increasing interpretability of components and fewer issues with things like polysemanticity or relying on brittle shortcuts, increasing generalizability and solving edge cases (almost by definition), increasing capabilities allowing for meaningful tests at all, and reasons to think that things like adversarial robustness will come ‘for free’ with scale (see isoperimetry). Just like with the Bitter Lesson for capability, at any fixed level of scale or capability, you can always bolt on more shackles and gadgets and gewgaws to get a ‘safer’ system, but you are then at ever-increasing risk of obsolescence and futility because a later scaled-up system may render your work completely irrelevant both in terms of economic deployment and in terms of what techniques you need to understand it and what safety properties you can attain. (Any system so simple it can be interpreted in a ‘symbolic’ classical way may inherently be too simple to solve any real problems—if the solutions to those problems were that simple, why weren’t they created symbolically before bringing DL into the picture...? ‘Bias/variance’ applies to safety just as much as anything else: a system which is too small and too dumb to genuinely understand things can be a lot more dangerous than a scaled-up system which does understand them but is shaky on how much it cares. Or more pointedly: hybrid systems which do not solve the problem, which is ‘all of them’ in the long run, cannot have any safety properties since they do not work.)
Related to this, from the blog post What does Meta AI’s Diplomacy-winning Cicero Mean for AI?:
I am originally a CS researcher trained several decades ago, actually in the middle of an AI winter. That might explain our different viewpoints here. I also have a background in industrial research and applied AI, which has given me a lot of insight into the vast array of problems that academic research refuses to solve for you. More long-form thoughts about this are in my Demanding and Designing Aligned Cognitive Architectures.
From where I am standing, the scaling hype is wasting a lot of the minds of the younger generation, wasting their minds on the problem of improving ML benchmark scores under the unrealistic assumption that ML will have infinite clean training data. This situation does not fill me with as much existential dread as it does some other people on this forum, but anyway.
Related to our discussion earlier, I see that Marcus and Davis just published a blog post: What does Meta AI’s Diplomacy-winning Cicero Mean for AI?. In it, they argue, as you and I both would expect, that Cicero is a neurosymbolic system, and that its design achieves its results by several clever things beyond using more compute and more data alone. I expect you would disagree with their analysis.
Thanks for the very detailed description of your view on GAN history and sociology—very interesting.
You focus on the history of benchmark progress after DNN based GANs were introduced as a new method for driving that progress. The point I was trying to make is about a different moment in history: I am perceiving that the original introduction of DNN based GANs was a clear discontinuity.
If you search wide enough for similar things, then no idea that works is really new. Neural nets were also not new when the deep learning revolution started.
I think your main thesis here is that academic researcher creativity and cleverness, their ability to come up with unexpected architecture improvements, has nothing to do with driving the pace of AI progress forward:
Sorry, but you cannot use a simple control-for-compute/n/param statistics approach to determine the truth of any hypothesis of how clever researchers really were in coming up with innovations to keep an observed scaling curve going. For all you know, these curves are what they are because everybody has been deeply clever at the architecture evolution/revolution level, or at the hyperparameter tuning level. But maybe I am mainly skeptical of your statistical conclusions here because you are leaving things out of the short description of the statistical analysis you refer to. So if you can give me a pointer to a more detailed statistical writeup, one that tries to control for cleverness too, please do.
That being said, like you I perceive, in a more anecdotal form, that true architectural innovation is absent from a lot of academic ML work, or at least the academic ML work appearing in the so-called ‘top’ AI conferences that this forum often talks about. I mostly attribute that to such academic ML only focusing on a very limited set of big data / Bitter Lesson inspired benchmarks, benchmarks which are not all that relevant to many types of AI improvements one would like to see in the real world. In industry, where one often needs to solve real-world problems beyond those which are fashionable in academia, I have seen a lot more creativity in architectural innovations than in the typical ML benchmark improvement paper. I see a lot of that industry-type creativity in the Cicero paper too.
You mention that your compute-and-data-is-all-that-drives-progress opinion has been informed by looking at things like GANs for image generation and embedding research.
The progress in these sub-fields differs from the type of AI technology progress that I would like to see much more of, as an AI safety and alignment researcher. This also implies that I have a different opinion on what drives or should drive AI technology progress.
One benchmark that interests me is an AI out-of-distribution robustness benchmark where the model training happens on sample data drawn from a first distribution, and the model evaluation happens on sample data drawn from a different second distribution, only connected to the first by having the two processes that generate them share some deeper patterns like the laws of physics, or broad parameters of human morality.
This kind of out-of-distribution robustness problem is one of the themes of Marcus too, for the physics part at least. One of the key arguments for the hybrid/neurosymbolic approach is that you will need to (symbolically) encode some priors about these deeper patterns into the AI, if you ever want it to perform well on such out-of-distribution benchmarks.
Another argument for the neurosymbolic approach is that you often simply do not have enough training data to get your model robust enough if you start from a null prior, so you will need to compensate for this by adding some priors. Having deeply polluted training data also means you will need to add priors, or do lots of other tricks, to get the model you really want. There is an intriguing possibility that DNN based transfer learning might contribute to the type of benchmarks I am interested in. This branch of research is usually framed in a way where people do not picture the second small training data set being used in the transfer learning run as a prior, but on a deeper level it is definitely a prior.
You have been arguing that neuro+scaling is all we need to drive AI progress, that there is no room for the neuro+symbolic+scaling approach. This argument rests on a hidden assumption that many academic AI researchers also like to make: the assumption that for all AI application domains that you are interested in, you will never run out of clean training data.
Doing academic AI research under the assumption that you always have infinite clean training data would be fine if such research were confined to one small and humble sub-branch of academic AI. The problem is that the actual branch of AI making this assumption is far from small and humble. It in fact claims, via writings like the Bitter Lesson, to be the sum total of what respectable academic AI research should be all about. It is also the sub-branch that gets almost all the hype and the press.
The assumption that infinite clean training data is available is of course true for games that can be learned by self-play. It is less true for many other things that we would like AI to be better at. The ‘top’ academic ML conferences are slowly waking up to this, but much too slowly as far as I am concerned.
The really dangerous part of the research is that this is the type of game which incentivizes deceptive alignment by default, which is extremely dangerous to do. It ranks among the worst failure modes in AI Alignment, and this is among the top 5 riskiest directions to go in.
Worth noting that this was “Blitz” Diplomacy with only five-minute negotiation rounds. Still very impressive though.
Relevant: In What 2026 Looks Like, Daniel Kokotajlo predicted expert level Diplomacy play would be reached in 2025.
I’m mentioning this, not to discredit Daniel’s prediction, but to point out that this seems like capabilities progress ahead of what some expected.
Continuing the quote:
Worth noting that Meta did not do this: they took many small models (some with LM pretraining) and composed them in a specialized way. It’s definitely faster than what Daniel said in his post, but this is also in part an update downwards on the difficulty of full press diplomacy (relative to what Daniel expected).
If we’re using Daniel’s post to talk about whether capabilities progress is faster or slower than expected, it’s worth noting that parts of the 2022 prediction did not come true:
- GPT-3 is not “finally obsolete”: text-davinci-002, a GPT-3 variant, is still the best API model. (That being said, it is no longer SoTA compared to some private models.)
- We did not get giant multi-modal transformers.
He did get the “bureaucracy” prediction quite right; a lot of recent LM progress has been figuring out how to prompt engineer and compose LMs to elicit more capabilities out of them.
A deliberate nod?
Two other relevant pieces of information:
First, you can watch a human expert diplomacy player play vs Cicero bots here:
Second, according to the same player, Cicero has won a full press diplomacy tournament in the past, under the name Franz Broseph:
https://www.thenadf.org/tournament/captain-meme-runs-first-blitzcon/
Biggest thing that stood out to me watching this was that while the AI’s tactics seemed quite good, its game theory seemed quite poor—e.g. it wasn’t sufficiently vindictive if you betrayed it, which made it vulnerable to exploitation by a human aware of that fact.
I am doubtful about this. I am unsure whether Cicero will score higher if it is more vindictive, so I am hesitant to call its game theory poor. A good analogy is that I am hesitant to call AlphaGo’s endgame moves poor even if they look 100% poor, because I am not sure whether AlphaGo will win more games if it plays a more human-like endgame.
In the video, the human wins precisely because they exploit this fact about the AI.
I’m an author on the paper. This is an interesting topic that I think we approached in roughly the right way. For context, some of my teammates and I did earlier research on AI for poker, so that concern for exploitability certainly carried over to our work on Diplomacy.
The setting that the human plays in the video (one human vs 6 known Cicero agents) is not the setting that we intended the agent to play in and is not the setting in which we evaluate the agent. That’s simply a demonstration to get a sense of how the bot plays. If you want to evaluate the bot’s exploitability and game theory, it should be done in the setting we intended for evaluation.
The setting we intended the bot to play in is games where all players are anonymous, and there is a large pool of possible players. That means players don’t necessarily know which player is a bot, or whether there is a bot in that specific game at all. In that case, it’s reasonable for the human players to assume all other players might engage in retaliatory behavior, so the agent gets the benefit of a tit-for-tat reputation without having to actually demonstrate it.
The assumption that players are anonymous is explicitly accounted for in the algorithm. It’s the reason why we assume there is a common knowledge distribution over our lambda parameters for piKL while in fact we actually play according to a single low lambda. If you were to change that assumption, perhaps by having all players know that a specific player is a bot at the start of the game, then you should change the common knowledge distribution over lambda parameters to be that the bot will play according to the lambda it actually intends to play. In that case the agent will behave differently. Specifically, it will play a much more mixed, less exploitable policy.
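To make the role of lambda concrete, here is a minimal sketch of the single-agent KL-regularized objective that piKL builds on: maximize expected value minus lambda times the KL divergence from a human-imitation "anchor" policy. The closed-form maximizer is proportional to anchor(a) * exp(Q(a) / lambda). This is only an illustration under stated assumptions: the function name, the toy Q-values, and the anchor probabilities are all hypothetical, and it does not reproduce Cicero's equilibrium computation or its common-knowledge distribution over lambdas.

```python
import math

def kl_regularized_policy(q_values, anchor, lam):
    """Maximizer of E_pi[Q] - lam * KL(pi || anchor) over one action set.

    q_values: estimated value of each action (hypothetical numbers).
    anchor:   imitation-policy probabilities, all strictly positive.
    lam:      regularization strength; large lam stays near the anchor,
              small lam approaches greedy play on q_values.
    """
    # Work in log space and subtract the max for numerical stability.
    logits = [math.log(p) + q / lam for q, p in zip(q_values, anchor)]
    top = max(logits)
    weights = [math.exp(l - top) for l in logits]
    total = sum(weights)
    return [w / total for w in weights]

# Toy example: the anchor prefers action 1, the value estimate prefers action 0.
q = [1.0, 0.0]
anchor = [0.2, 0.8]
near_anchor = kl_regularized_policy(q, anchor, lam=1000.0)  # stays human-like
near_greedy = kl_regularized_policy(q, anchor, lam=0.01)    # follows the values
```

With a high lambda the policy is essentially the anchor; with a low lambda it concentrates on the highest-value action.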
It sounds like Cicero competes to win against other players who are trying to satisfy other human goals ingrained by evolution. Does not seem very fair.
Do we know to what extent top-rated players actually try to win in this anonymized no-stakes setting, as opposed to trying to signal qualities that we evolved to want to signal in non-anonymized ancestral environment?
Why is your gain of function research deserving of NIH funding?
I’m reaching vantablack levels of blackpill...
Yikes. This feels like someone watched the movie WarGames and thought “yeah, that sounds cool, let’s train an AI to do that”.
Had you seen the researcher explanation for the March 2022 “AI suggested 40,000 new possible chemical weapons in just six hours” paper? I quote (paywall):
This seems like one of the most tractable things to address to reduce AI risk.
If 5 years from now anyone developing AI or biotechnology is still not thinking (early and seriously) about ways their work could cause harm that other people have been talking about for years, I think we should consider ourselves to have failed.
To add some more emphasis to my point, because I think it deserves more emphasis:
Quoting the interview Jacy linked to:
I know I’m not saying anything new here, and I’m merely a layperson without ability to verify the truth of the claim I highlighted in bold above, but I do want to emphasize further:
It seems clear that changing the machine learning space so that it is like the chemistry space in the sense that you do get informed about ways machine learning can be misused and cause harm, is something that we should all push to make happen as soon as we can. (Also expanding the discussion of potential harm beyond harm caused from misuse to any harm related to the technology.)
Years ago I recall hearing Stuart Russell mention the analogy of how civil engineers don’t have a separate field for bridge safety; rather bridge safety is something all bridge designers are educated on and concerned about, and he similarly doesn’t want the field of AI safety to be separate from AI but wants all people working on AI to be educated on and concerned with risks from AI.
This is the same point I’m saying here, and I’m saying it again because it seems like the present machine learning space is still far from this point and we as a community really do need to devote more efforts to ensuring that we change this in the near future.
I wouldn’t be surprised if there is military funding involved somewhere.
Quick remarks and questions:
AI developers have been competing to solve purely-adversarial / zero-sum games, like Chess or Go. But Diplomacy, in contrast, is semi-cooperative. Will it be safer if AGI emerges from semi-cooperative games than purely-adversarial games?
Is it safer if AGI can be negotiated with?
No-Press Diplomacy was solved by DeepMind in 2020. Meta AI has just solved Full-Press Diplomacy. The difference is that in No-Press Diplomacy the players can’t communicate whereas in Full-Press Diplomacy the players can chat for 5 minutes between rounds.
Is Full-Press more difficult than No-Press Diplomacy, other than the skill of communicating one’s intentions?
Full-Press Diplomacy requires a recursive theory of mind — does No-Press Diplomacy also?
CICERO consists of a planning engine and a dialogue engine. How much of the “intelligence” is the dialogue engine?
Maybe the planning engine is doing all the work, and the dialogue engine is just converting plans into natural language, but isn’t doing anything more impressive than that.
Alternatively, it might be that the dialogue engine (which is a large language model) contains latent knowledge and skills.
Could an architecture like this actually be used in international diplomacy and corporate negotiations? Will it be?
There’s hope among the AI Safety community that competent-but-not-yet-dangerous AI might assist them in alignment research. Maybe this Diplomacy result will boost hope in the AI Governance community that competent-but-not-yet-dangerous AI might assist them in governance. Would this hope be reasonable?
Re 3: The Cicero team concedes they haven’t overcome the challenge of maintaining coherency in chatting agents. They think they got away with it because 5 minutes is too short, and they expect games with longer negotiation periods to be more challenging.
My brief opinions:
1. Slightly lower probability of s-risk, approximately the same probability of x-risk.
2. Slightly lower probability of s-risk, approximately the same probability of x-risk.
3. Prior to Cicero I thought full-press diplomacy was significantly more difficult, due to the politics aspect. Now I guess it wasn’t actually significantly more difficult.
4. Not sure.
5. No.
6. No.
Interestingly, the paper seems to imply that the system does not attempt to deceive:
Cicero is designed to be honest in the sense that all its messages are generated from its intents, where its intents are what moves Cicero in fact intends to play at the moment Cicero said them (Cicero can change its mind after saying things), and at the end of the turn played moves are equal to its last intents.
Not only does Cicero use its true intents to generate messages, it also tries to generate messages that correspond to those intents. That is, its dialogue model is trained to imitate humans in WebDiplomacy, but when humans intend to attack Belgium, they will sometimes say things like “I won’t attack Belgium”. That is, AI can lie by forming intent to attack Belgium, and devising lying intent “won’t attack Belgium”, and generating lying messages from lying intent. Cicero doesn’t do this, its intent input to dialogue model is always truthful. AI can also lie by forming intent to attack Belgium, and generating lying messages like “I won’t attack Belgium” by imitating lying humans from truthful intent. Cicero also doesn’t do this! Dialogue model is trained to imitate only truthful humans, training data is filtered by the lie detector and 5% of turns are filtered.
That does not mean Cicero does not dissemble or mislead! There are three aspects to this. First, there is messaging model, entirely separate from dialogue model. Messaging model decides whether to send messages at all, trained to imitate humans. When humans intend to attack Belgium, held by France, they may not message France at all. Cicero copies this behavior.
Second, there is topic model, also entirely separate. Topic model decides which intent to talk about, trained to imitate humans. When humans intend to attack Belgium, held by France, and also Norway, held by Russia, they may talk to France about Norway and Russia about Belgium. Cicero also copies this behavior.
Third, there is filtering model, also entirely separate. When Cicero intends to attack Belgium, held by France, maybe messaging model decides to talk to France and topic model decides to talk to France about Belgium and dialogue model decides to say “I will attack Belgium”. That does not mean Cicero says “I will attack Belgium”, filtering model can veto it. In particular, value-based filtering model estimates how saying something will impact its own utility. Eight messages are sampled from dialogue model, their value impacts are calculated, importance is calculated from value impacts, and in 15% of most important situations, bottom three messages are dropped, and one message is picked at random.
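The last step described above (sample candidates, score them, drop the worst few in important situations, pick one at random) can be sketched roughly as follows. This is a hedged illustration, not the paper's code: the function name and arguments are hypothetical, the value impacts are assumed to be given as plain numbers, and whether a situation counts as "important" is passed in as a flag because the comment does not say how importance is computed.

```python
import random

def filter_and_pick(messages, impacts, important, n_drop=3, rng=None):
    """Pick one candidate message after optional value-based filtering.

    messages:  candidate messages (eight in the description above).
    impacts:   estimated value impact of sending each message (assumed given).
    important: True if this situation falls in the high-importance bucket
               (how that bucket is determined is not modeled here).
    n_drop:    how many of the lowest-impact candidates to discard.
    """
    if len(messages) != len(impacts):
        raise ValueError("need one impact estimate per message")
    rng = rng or random.Random()
    pool = list(zip(messages, impacts))
    if important:
        # Rank by estimated impact, best first, and drop the bottom n_drop.
        pool.sort(key=lambda pair: pair[1], reverse=True)
        pool = pool[:max(1, len(pool) - n_drop)]
    # The final choice is uniform over whatever survived the filter.
    return rng.choice(pool)[0]

candidates = [f"m{i}" for i in range(8)]
scores = [float(i) for i in range(8)]  # toy impacts: m0 is worst, m7 is best
chosen = filter_and_pick(candidates, scores, important=True, rng=random.Random(0))
```

Note that even in the filtered case the surviving message is chosen at random rather than by taking the single highest-scoring candidate, which matches the description above.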
Should Cicero’s relative honesty lead us to update toward ELK being easier, or is it too task-specific to be relevant to ELK overall?
I think it’s irrelevant to ELK. Cicero’s honesty is hardcoded in fixed architecture, it’s not latent at all.
Although impressive, it is worth noting that Cicero only played blitz games (in which each turn lasts 5 minutes, and players are not usually very invested).
An AI beating 90% of players in blitz chess is less of an achievement than an AI beating 90% of players in 40 min chess; and I expect the same to be true for Diplomacy. Also, backstabbing and elaborate schemes are considerably rarer in blitz.
I would be very curious to see Cicero compete in a Diplomacy game with longer turns.
I’m curious how long it’ll be until a general model can play Diplomacy at this level. Anyone fine-tuned an LLM like GPT-3 on chess yet? Chess should be simpler for an LLM to learn unless my intuition is misleading?
I’ve fine tuned LLMs on chess and it indeed is quite easy for them to learn.
Interesting, how good can they get? Any ELO estimates?
We never did ELO tests, but the 2.7B model trained from scratch on human games in PGN notation beat me and beat my colleague (~1800 ELO). But it would start making mistakes if the game went on very long (we hypothesized it was having difficulties constructing the board state from long PGN contexts), so you could beat it by drawing the game out.
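For readers unfamiliar with the setup, a small sketch of what a PGN-style prompt looks like may help. It shows that the model only ever sees a numbered move list, never the board itself, so the position has to be reconstructed implicitly from an ever longer context. The helper name is hypothetical and this is not the code used in the experiments described above.

```python
def to_pgn_movetext(san_moves):
    """Format a list of SAN moves as PGN movetext, e.g. '1. e4 e5 2. Nf3'."""
    parts = []
    for i, move in enumerate(san_moves):
        if i % 2 == 0:
            # White's move starts a new numbered full move.
            parts.append(f"{i // 2 + 1}.")
        parts.append(move)
    return " ".join(parts)

prompt = to_pgn_movetext(["e4", "e5", "Nf3", "Nc6", "Bb5"])
# The language model would be asked to continue this string with Black's reply.
```

The prompt grows with every move played, which is consistent with the hypothesis that long games are harder for the model to track.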
I couldn’t find any actual elo estimates (nor code that lets me estimate the elo of a bot), but GPT-3 (at least, davinci and text-davinci-002) can play chess decently well using PGN notation without any finetuning.
This seems like a very interesting research artifact to me.
How can they be so incredibly obtuse?
Cicero, as it is redirecting its entire fleet: ‘What did you call me?’