I refer to these posts:
https://optimists.ai/2023/11/28/ai-is-easy-to-control/
https://www.lesswrong.com/posts/CoZhXrhpQxpy9xw9y/where-i-agree-and-disagree-with-eliezer
My (poor, possibly mistaken) understanding of the argument is this: SGD optimizes for “predicting the next token”, and we select for systems with very low loss by modifying every single parameter in the neural network (which basically defines the network itself). It therefore seems quite unlikely that we’ll have a “sharp left turn” in the near term. In the human case, the sharp left turn happened because evolution was too weak an outer optimizer to fully “control” humans’ thinking in the direction that most improved inclusive genetic fitness, since it cannot directly tinker with every neural connection in our brains.
Given SGD’s vastly stronger ability at outer optimisation of every parameter, isn’t it possible, if not likely, that any sharp left turn occurs only at a vastly superhuman level, if the inner optimizer becomes vastly stronger than SGD?
The above arguments have persuaded me that we might be able to thread the needle for survival, provided that humanity can use the not-yet-actively-deceptive outputs of moderately-superhuman models (which are still just predicting the next token to the best of their capability) to help us solve the potential sharp left turn, that humanity doesn’t do anything else stupid with other training methods or misuse, and that we manage to solve the other problems. Of course, in an ideal world we wouldn’t be in this situation.
I have read some rebuttals by others on LessWrong but did not find anything that convincingly debunked this idea (maybe I missed something).
Did Eliezer, or anyone else, ever tell us why this is wrong (if it is)? I have been searching for the past week but have only found this: https://x.com/ESYudkowsky/status/1726329895121514565 which seemed to be switching to more of a post-training discussion.
Some general advice:
Taboo “sharp left turn”. If possible, replace it with a specific example from e.g. here.
Don’t worry about what Eliezer in particular has to say, just look for good arguments.
There’s also a failure mode of focusing on “which arguments are the best” instead of “what is actually true”. I don’t understand this failure mode very well, except that I’ve seen myself and others fall into it. Falling into it looks like focusing a lot on specific arguments, and spending a lot of time working out what was meant by the words, rather than feeling comfortable adjusting arguments to fit better into your own ontology and to fit better with your own beliefs.
Object level response:
I think this is a misunderstanding. Evolution failed to align humans in the sense that, when humans are in an extremely OOD environment, they don’t act as if they had evolved in that OOD environment. In other words, alignment is about generalization, not about how much fine-grained control the outer optimizer had.
Eliezer does touch on this in a podcast, search for “information bottleneck”. In the podcast, Eliezer seems to be saying that SGD might have less simplicity bias than evolution, which may imply even worse goal generalization. But I don’t think I buy that, because there are plenty of other sources of simplicity bias in neural network training, so a priori I don’t see a strong reason to believe one would generalize better than the other.
There are theoretical questions about how generalization works, how neural networks in particular generalize, how we can apply these theories to reason about the “goals” of a trained AI, and how those goals relate to the training distribution. These are much more relevant to alignment than the information bottleneck of SGD. My impression of Quintin and Nora is that they don’t model advanced AIs as pursuing “goals” in the same way that I do, and I think that’s one of the main sources of disagreement. They have a much more context-dependent idea of what a goal is, and they seem to think this model of goals is sufficient for very intelligent behavior.
LLMs are already moderately-superhuman at the task of predicting next tokens. This isn’t sufficient to help solve alignment problems. We would need them to meet the much much higher bar of being moderately-superhuman at the general task of science/engineering. And having that level of capability implies (arguably, see here for some arguments):
the capacity to learn and develop new skills (basic self-modification),
the capacity to creatively solve way-out-of-training-distribution problems using way-out-of-training-distribution solutions,
maybe something analogous to the capability of moral philosophy, i.e. analyzing its own motivations and trying to work out what it “should” want.
These capabilities should (imo) produce context changes (distribution shifts) large enough that they will probably reveal goal misgeneralization. (Some specific example mechanisms for goal misgeneralization are listed here.)
So my answer, in short, is that “help us solve” implies “high capability” implies “OOD goal misgeneralization” (which is what I think you basically meant by “sharp left turn”). So it seems unlikely to me that we can use future AI to solve alignment in the way you describe, because misalignment should already be creating large problems by the time the AI is capable of helping.
The most obvious way of addressing this, “just feel more comfortable adjusting arguments to fit better into your own ontology and to fit better with your own beliefs”, has its own failure mode, where you end up attacking a strawman that you think is a better argument than what they made, defeating it, and thinking you’ve solved the issue when you haven’t. People have complained about this failure mode of steelmanning a couple of times. At a fixed level of knowledge and thought about the subject, it seems one can only trade off one danger against the other.
However, if you’re planning to increase your knowledge and time-spent-thinking about the subject, then during that time it’s better to focus on the ideas than on who-said-or-meant-what; the latter is instrumentally useful as a source of ideas.
We also need the assumption, which is definitely not obvious, that significant intelligence increases are relatively close to achievable. Superhumanly strong math skills presumably don’t let an AI solve NP-hard problems in polynomial time, and it’s similarly plausible, though far from certain, that really good engineering skill tops out somewhere only moderately above human ability due to intrinsic difficulty, and that really good deception skills top out somewhere short of what is needed to subvert the best systems that we could build to do oversight and detect misalignment. (On the other hand, even if these objections are correct, they would only show that control is possible, not that it is likely to occur.)
My sense is that this is because different people have different intuitive priors, and process arguments (mostly) as a kind of Bayesian evidence that updates those priors, rather than modifying the priors (i.e. intuitions) directly.
Eliezer in particular strikes me as having an intuitive prior for AI alignment outcomes that looks very similar to priors for tasks like e.g. writing bug-free software on the first try, assessing the likelihood that a given plan will play out as envisioned, correctly compensating for optimism bias, etc. which is what gives rise to posts concerning concepts like security mindset.
Other people don’t share this intuitive prior, and so have to be argued into it. To such people, the reliability of the arguments in question is actually critical, because if those arguments turn out to have holes, that reverts the downstream updates and restores the original intuitive prior, whatever it looked like. This is kind of like a souped-up version of the burden of proof concept, where the initial placement of that burden is determined entirely via the intuitive judgement of the individual.
This also seems related to why different people seem to naturally gravitate towards either conjunctive or disjunctive models of catastrophic outcomes from AI misalignment: the conjunctive impulse stems from an intuition that AI catastrophe is a priori unlikely, and so a bunch of different claims have to hold simultaneously in order to force a large enough update, whereas the disjunctive impulse stems from the notion that any given low-level claim need not be on particularly firm ground, because the high-level thesis of AI catastrophe robustly manifests via different but converging lines of reasoning.
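To make the conjunctive/disjunctive contrast concrete, here is a toy calculation. The numbers are made up, and treating the claims as independent is an assumption of the sketch, not something either camp is committed to. It only shows how the same modest per-claim confidence can yield very different bottom lines depending on which structure you start from.

```python
def conjunctive(p_each, n):
    """Probability that ALL n independent claims hold."""
    return p_each ** n


def disjunctive(p_each, n):
    """Probability that AT LEAST ONE of n independent routes holds."""
    return 1.0 - (1.0 - p_each) ** n


# Hypothetical numbers, chosen only for illustration.
# Conjunctive framing: catastrophe needs five claims, each 80% likely.
all_five = conjunctive(0.8, 5)

# Disjunctive framing: five separate routes to catastrophe, each only 20% likely.
any_of_five = disjunctive(0.2, 5)

print(round(all_five, 3), round(any_of_five, 3))  # 0.328 0.672
```

Under these assumptions, five fairly solid claims in conjunction give about 33%, while five fairly shaky routes in disjunction give about 67%.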
See also: the focus on coherence, where some people place great importance on the question of whether VNM or other coherence theorems show what Eliezer et al. purport they show about superintelligent agents, versus the competing model wherein none of these individual theorems are important in their particulars, so much as the direction they seem to point, hinting at the concept of what idealized behavior with respect to non-gerrymandered physical resources ought to look like.
I think the real question, then, is where these differences in intuition come from, and unfortunately the answer might have to do a lot with people’s backgrounds, and the habits and heuristics they picked up from said backgrounds—something quite difficult to get at via specific, concrete argumentation.
I’m not sure I understand this distinction as-written. How is a Bayesian agent supposed to modify priors except by updating on the basis of evidence?
They’re not! But humans aren’t ideal Bayesians, and it’s entirely possible for them to update in a way that does change their priors (encoded by intuitions) moving forward. In particular, the difference between having updated one’s intuitive prior, and keeping the intuitive prior around but also keeping track of a different, consciously held posterior, is that the former is vastly less likely to “de-update”, because the evidence that went into the update isn’t kept around in a form that subjects it to (potential) refutation.
(IIRC, E.T. Jaynes talks about this distinction in Chapter 18 of Probability Theory: The Logic of Science, and he models it by introducing something he calls an A_p distribution. His exposition of this idea is uncharacteristically unclear, and his A_p distribution looks basically like a beta distribution with specific values for α and β, but it does seem to capture the distinction I see between “intuitive” updating versus “conscious” updating.)
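A rough sketch of the Beta-distribution reading, using standard Beta-Bernoulli updating rather than Jaynes's own A_p formalism (the two agents and all the numbers are hypothetical). Both agents start out assigning probability 0.5, but one holds that number loosely and the other holds it firmly, so the same evidence moves them by very different amounts.

```python
def posterior_mean(alpha, beta, successes, failures):
    """Mean of the Beta posterior after observing Bernoulli outcomes."""
    return (alpha + successes) / (alpha + beta + successes + failures)


# Two hypothetical agents who both currently assign probability 0.5.
loosely_held = (1.0, 1.0)    # Beta(1, 1): little evidence behind the 0.5
firmly_held = (50.0, 50.0)   # Beta(50, 50): a lot of evidence behind the 0.5

# Both see the same new evidence: 8 successes and 2 failures.
loose_after = posterior_mean(*loosely_held, 8, 2)
firm_after = posterior_mean(*firmly_held, 8, 2)

print(round(loose_after, 3), round(firm_after, 3))  # 0.75 0.527
```

The point estimate alone (0.5 in both cases) does not tell you how stable the belief is; the extra parameters do, which is roughly the work the A_p distribution is doing.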
I haven’t seen an answer by Eliezer. But I can go through the first post, and highlight what I think is wrong. (And would be unsurprised if Eliezer agreed with much of it)
We can see literally every neuron, but have little clue what they are doing.
Humans are aligned to human values because humans have human genes. Also individual humans can’t replicate themselves, which makes taking over the world much harder.
Humans have specific genes for absorbing cultural values, at least within a range of human cultures. There are various alien values that humans won’t absorb.
Hmm. I don’t think the case for that is convincing.
Current AI techniques involve giving the AI loads of neurons, so having a few neurons that aren’t being used isn’t a problem.
Also, it’s possible that the same neurons that sometimes plot to kill you are also sometimes used to predict plots in murder mystery books.
If you give the AI lots of tasks, it’s possible that the simplest solution is some kind of internal general optimizer.
Either you have an AI that is smart and general and can try new things that are substantially different from anything it’s done before (in which case the new things can include murder plots), or you have an AI that’s dumb and is only repeating small variations on its training data.
Current techniques are based on experiments/gradient descent. This works so long as the AIs can’t break out of the sandbox or realize they are being experimented on and plot to trick the experimenters. You can’t keep an ASI in a little sandbox and run gradient descent on it.
Sure. And we use contraception. Which kind of shows that evolution failed somewhere a bit.
Also, evolution got a long time testing and refining with humans that didn’t have the tools to mess with evolution or even understand it.
No one is claiming the ASI won’t understand human values, they are saying it won’t care.
Is that evidence that LLMs actually care about morality? Not really. It’s evidence that they are good at predicting humans. Get them predicting an ethics professor and they will answer morality questions. Get them predicting Hitler and they will say less moral things.
And of course, there is a big difference between an AI that says “be nice to people” and an AI that is nice to people. The former can be trivially achieved by hard coding a list of platitudes for the AI to parrot back. The second requires the AI to make decisions like “are unborn babies people?”.
Imagine some robot running around. You have an LLM that says nice-sounding things when posed ethical dilemmas. You need some system that turns the raw camera input into a text description, and the nice sounding output into actual actions.
It would be interesting if someone discovered something like “junk DNA that just copies itself” within the weights during the backprop+SGD process. Would be some evidence that backprop’s thumb is not so heavy a worm can’t wiggle out. Right now I would bet against that happening within a normal neural net training on a dataset.
Note that RL exists and gives the neural net much more uh “creative room” to uh “decide how to exist”. Because you just have to get enough score over time to survive, but any strategy is accepted. In other words, it is much less convergent.
Also in RL, glitching/hacking of the physics sim / game engine is what you expect to happen! Then you have to patch your sim and retrain.
Also, most of the ML systems we use every day involve multiple neural nets with different goals (eg the image generator and the NSFW detector), so something odd might happen in that interaction.
All this to say: The question “if I train one NN on a fixed dataset with backprop+SGD, could something unexpected pop out?” is quite interesting and still open in my opinion. But even if that always goes exactly as expected, it is certainly clear that RL, active learning, multi-NN ML systems, hyperparameter optimization (which is often an evolutionary algorithm), etc. very often produce weird things with weird goals and strategies.
I think debate surrounds the 1-NN-1-dataset question because it is an interesting and natural and important question, the type of question a good scientist would ask. But it is probably only a small part of the bigger challenge of controlling the whole trained machine.
I think Eliezer briefly responds to this in his podcast with Dwarkesh Patel, though whether he does so satisfactorily is pretty subjective. https://youtu.be/41SUp-TRVlg?si=hE3gcWxjDtl1-j14
At about 24:40.