Yann LeCun has been saying a lot of things on social media recently about this topic, only some of which I’ve read. He’s also written and talked about it several times in the past. Most of what I’ve seen from him recently seems not to address any of the actual arguments, but on the other hand I know he’s discussed this in many forums over several years, and he’s had the arguments spelled out to him so many times by so many people that it’s hard for me to believe he really doesn’t know what the substantive arguments are. Can someone who’s read more of Yann’s arguments on this please give their best understanding of what he’s actually arguing, in a way that will be understandable to people who are familiar with the standard x-risk arguments?
To save people the click to LeCun’s twitter, I’ll gather what pieces I can from his recent twitter posts:
He does seem to believe that there is in fact a problem named “AI alignment” that has to be solved for human-level AIs; he just believes it will be much more manageable than AI-notkilleveryoneism people expect.
And from the responses to Julian Togelius’ recent blog post:
So from points 1 and 2 it looks like he fundamentally disagrees with Eliezer’s gesturing at a “core of generality” that develops once you optimize deeply enough. He doesn’t expect a system to suddenly “get it” and see the deep regularities that underlie most problems; in fact, I suspect he doesn’t believe such regularities exist.
Point 3 seems to be about the difficulty of box-escapes and of dominating all of humanity. I’m reading that as disagreeing with the jump that lesswrong types usually make from “human-level AI” to “strictly stronger than all of humanity combined”.
Point 4 seems like a disagreement with instrumental convergence.
Point 5 is the only really interesting one imo, and I think it’s a very good point. Current image recognition models at all ability levels are vulnerable to adversarial attacks: imperceptibly small changes to an image make them unable to recognise it as the proper category. I don’t know if LLMs have something similar, but I think it’s very likely the case, and I wouldn’t expect the adversarial attacks to stop working on the most advanced models. So we do have examples of very dumb but targeted methods of defeating very general models, which implies that we might actually have a chance against a superintelligence if we’ve developed targeted weapons before it gets loose.
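For anyone who hasn’t seen how cheap these attacks are: here’s a toy sketch of the gradient-sign idea behind many of them. This is a pure illustration on a made-up linear scorer, not a real vision model, but the principle carries over.

```python
import numpy as np

# Toy illustration of the gradient-sign trick behind many adversarial
# attacks (cf. FGSM): nudge every input coordinate by a tiny amount eps
# in the direction that most hurts the model. For a linear scorer w.x,
# the gradient of the score w.r.t. x is just w, so -sign(w) is the
# worst-case direction per coordinate.
rng = np.random.default_rng(0)
w = rng.normal(size=100)       # made-up "classifier" weights
x = rng.normal(size=100)       # a made-up input

eps = 0.05                     # "imperceptible" per-coordinate budget
x_adv = x - eps * np.sign(w)   # push the score down as hard as possible

score, score_adv = w @ x, w @ x_adv
# the score drops by exactly eps * sum(|w|), even though no coordinate
# of x moved by more than eps
print(score - score_adv)
```

On a real network the gradient comes from backprop rather than being the weight vector itself, but the point is the same: a targeted, ε-bounded nudge produces an outsized change in the output.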
So he’s not impressed by GPT-4, and apparently doesn’t think that LLMs in general have a credible shot at reaching human level.
He expects AI safety to not be fundamentally different from any other engineering domain, and seems to disagree that we’ll only have a single shot at aligning a superintelligence.
After some more scouring of his twitter page, I actually found an argument for pessimism about LLMs that I agree with !!! (hallelujah)
This seems to be related to the “curse of behaviour cloning”: learning to behave correctly only from a dataset of correct behaviour doesn’t work; you need examples in your dataset of how to correct wrong behaviour. As an example, if you try to make chatGPT play chess, at some point it will make a nonsensical move, or if you make it do chess analysis it will mistakenly claim that something happened, and thereafter it will condition its output on that wrong claim! It doesn’t go “ah yes, 2 prompts ago I made a mistake due to random sampling” (that’s not the sort of thing that’s in the dataset); it just goes with it, and the text it generates drifts further and further away from its training distribution.
None of the chess books it was trained on contained mistakes, and it perpetually believes that its prompt was sampled from the human distribution of chess analysis writing, when in fact it was sampled from chatGPT’s own distribution.
So basically autoregressive language models are fundamentally incapable of very-long-form correct thinking: over n tokens they make at least one mistake with probability 1-(1-e)^n, which approaches 1 as n grows, and as soon as they make a mistake they condition their future output on something false, which makes them produce yet more mistakes and spiral out of control into incoherency.
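To make the compounding concrete, here’s a toy calculation. It assumes (unrealistically) that each token has an independent error probability ε; even at ε = 0.1% per token, a flawless 1000-token stretch only happens about 37% of the time.

```python
# Toy model of error compounding in autoregressive generation: assume
# each token independently has a small error probability eps. The chance
# of an n-token continuation being flawless shrinks geometrically.
def p_flawless(eps: float, n: int) -> float:
    """Probability that n consecutive tokens contain no mistake."""
    return (1 - eps) ** n

for n in (10, 100, 1000, 10000):
    print(f"n={n:>5}  P(no mistake) = {p_flawless(0.001, n):.4f}")
```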
I observe this behavior a lot when using GPT-4 to assist in code. The moment it starts spitting out code that has a bug, the likelihood of future code snippets having bugs grows very quickly.
Sometimes it seems that humans do it, too. For example, when I make a typo, it is quite likely that I made another typo in the same paragraph.
(Alternative explanation: I am more likely to make mistakes when I am e.g. tired, so having made a mistake is evidence for being tired, which increases the chance of other mistakes being made.)
((On the other hand, there may be a similar explanation for the GPT, too.))
But you don’t condition your future output on your typo being correct, and that’s what GPT is doing here. If it randomly makes a mistake that the text in its dataset wouldn’t contain, like mistakenly saying that a queen was captured, or taking a mistaken step during a physics computation, then when it thereafter tries to predict the next word, it still “thinks” that its past output was sampled from the distribution of human chess analysis or human physics problem-solving. On the human distribution, if “the queen was captured” exists in the past prompt, then you can take it as fact that the queen was captured, but this is false for text sampled from LLMs.
To solve this problem you would need a very large dataset of mistakes made by LLMs, and their true continuations. You’d need to take all physics books ever written, intersperse them with LLM continuations, then have humans write the corrections to the continuations, like “oh, actually we made a mistake in the last paragraph, here is the correct way to relate pressure to temperature in this problem...”.
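Schematically, that pipeline might look something like this. This is a pure sketch; `sample_model` and `get_human_fix` are made-up stand-ins for the LLM and the human correctors.

```python
import random

# Hypothetical sketch of the dataset construction described above: take
# ground-truth documents, splice in model continuations at random cut
# points (which may contain mistakes), and pair each spliced text with a
# human-written correction of that continuation.
def make_correction_examples(documents, sample_model, get_human_fix, k=1):
    examples = []
    for doc in documents:
        for _ in range(k):
            cut = random.randrange(1, len(doc))
            prefix = doc[:cut]
            continuation = sample_model(prefix)        # possibly wrong
            correction = get_human_fix(prefix, continuation, doc[cut:])
            examples.append((prefix + continuation, correction))
    return examples
```

The expensive part is `get_human_fix`: every example needs a human in the loop, which is exactly why the scale objection below bites.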
It doesn’t have to be humans any more; GPT-4 can do this to itself.
It doesn’t work (at least right now). When I tried making chatGPT play chess against Stockfish by giving it positions in algebraic notation and telling it to output 2 paragraphs of chess analysis before making its move, it would make a nonsensical move, and if I prompted it with “is there a mistake in the past output?” or “are you sure this move is legal?”, it didn’t realize that anything was out of order. Only once I pointed out the error explicitly did it realise that it had made one and rationalize an explanation for the error.
That is a novel (and, in my opinion, potentially important/scary) capability of GPT-4. You can look at A_Posthuman’s comment below for details. I do expect it to work on chess, and would be interested to be proven wrong. You mentioned chatGPT, but it can’t do reflection at a usable level. To be fair, I don’t know whether GPT-4’s capabilities are at a useful level or only a tweak away right now, or how far they can be pushed if they are (as in, whether it can self-improve to ASI), but for solving the “curse” problem even weak reflection capabilities should suffice.
I’ve not noticed this, but it’d be interesting if true, as it seems that the tuning/RLHF has managed to remove most of the behaviour where it talks down to the level of the person writing (as evidenced by e.g. spelling mistakes). Should be easily testable too.
This argument proves too much, in the sense that its generalization is just the standard argument for why exact prediction of long future sequences is difficult (errors compound exponentially).
The solutions (which humans use) are fairly straightforward to apply to LLMs: 1) we don’t condition only on our own predictions, we update on observations (for LLMs this amounts to ReAct-style prompting, where the LLM’s outputs are interspersed with observations from the world and/or inputs from humans); 2) we use approximate abstract future prediction/planning, which LLMs are also amenable to.
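Solution 1 can be sketched in a few lines. `call_llm` and `run_tool` here are hypothetical stand-ins, not real APIs; the point is just that the context keeps getting re-anchored to ground truth instead of drifting on the model’s own possibly-mistaken claims.

```python
# Sketch of ReAct-style prompting: the model's outputs are interleaved
# with observations from the outside world, so later predictions are
# conditioned on facts, not only on the model's earlier guesses.
def react_loop(question, call_llm, run_tool, max_steps=5):
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        step = call_llm(transcript)      # model proposes a thought/action
        transcript += step + "\n"
        if step.startswith("Answer:"):   # model decided it is done
            return step
        observation = run_tool(step)     # ground truth, not a model guess
        transcript += f"Observation: {observation}\n"
    return transcript                    # gave up: return the trace
```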
Yes, but training AI to try to fix errors is not that hard.
Yes it is. There is no freely available dataset of conveniently labelled LLM errors and their correct continuations. You need human labels to identify the errors, and you need an amount of them on the order of your training set, which here is the entire internet.
“Fundamentally incapable” is perhaps putting things too strongly, when you can see from the Reflexion paper and other recent work in the past 2 weeks that humans are figuring out how to work around this issue via things like reflection/iterative prompting:
https://nanothoughts.substack.com/p/reflecting-on-reflexion
https://arxiv.org/abs/2303.11366
Using this simple approach lets GPT-4 jump from 67% to 88% correct on the HumanEval benchmark.
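The loop the Reflexion paper describes can be sketched roughly as follows. `call_llm` and `evaluate` are hypothetical stand-ins (for code tasks the evaluator is typically a set of unit tests); this is an illustration of the idea, not the paper’s implementation.

```python
# Sketch of a Reflexion-style loop: attempt a task, evaluate the attempt
# externally, feed the failure signal back to the model as a verbal
# self-reflection, and retry with that reflection in context. No weights
# are updated; the "learning" lives entirely in the prompt.
def reflexion_loop(task, call_llm, evaluate, max_trials=3):
    memory = ""                          # accumulated self-reflections
    attempt = ""
    for _ in range(max_trials):
        attempt = call_llm(task + memory)
        ok, feedback = evaluate(attempt)
        if ok:
            return attempt
        reflection = call_llm(f"The attempt failed: {feedback}. Why?")
        memory += "\nReflection: " + reflection
    return attempt                       # best effort after max_trials
```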
So I believe the lesson is: “limitations” in LLMs may turn out to be fairly easily enhanced away by clever human helpers. Therefore IMO, whether or not a particular LLM should be considered dangerous must also take into account the likely ways humans will build additional tech onto/around it to enhance it.
The problem is that I think I might agree that a “slightly smarter-than-human AGI in a box managed by trained humans” really doesn’t have as easy a way out as EY might think. But that’s also not what’s going to happen if things are left entirely to the “don’t sweat it” techno-optimists. What’s going to happen is that the AGI gets deployed as a virtual assistant in every copy of Windows 13 or whatever. Carefulness is especially important if AIs aren’t as powerful as Yud thinks, because that’s when it can make the difference between survival and defeat. Besides, if even after that we keep recursively improving the thing, there’s only so far we can push our luck.