Some general advice:
Taboo “sharp left turn”. If possible, replace it with a specific example from e.g. here.
Don’t worry about what Eliezer in particular has to say, just look for good arguments.
There’s also a failure mode of focusing on “which arguments are the best” instead of “what is actually true”. I don’t understand this failure mode very well, except that I’ve seen myself and others fall into it. Falling into it looks like focusing a lot on specific arguments, and spending a lot of time working out what was meant by the words, rather than feeling comfortable adjusting arguments to fit better into your own ontology and to fit better with your own beliefs.
Object level response:
“sharp left turn” in the near term, which happened because evolution was too weak an outer optimizer to fully “control” humans’ thinking in the direction that most improved inclusive genetic fitness, as it is too weak to directly tinker every neuron connection in our brain.
I think this is a misunderstanding. Evolution failed to align humans in the sense that, when humans are in an extremely OOD environment, they don’t act as if they had evolved in that OOD environment. In other words, alignment is about generalization; it’s not about how much fine-grained control the outer optimizer had.
Eliezer does touch on this in a podcast; search for “information bottleneck”. In the podcast, Eliezer seems to be saying that SGD might have less simplicity bias than evolution, which may imply even worse goal generalization. But I don’t think I buy that, because there are plenty of other sources of simplicity bias in neural network training, so a priori I don’t see a strong reason to believe one would generalize better than the other.
There are theoretical questions about how generalization works, how neural networks in particular generalize, how we can apply these theories to reason about the “goals” of a trained AI, and how those goals relate to the training distribution. These are much more relevant to alignment than the information bottleneck of SGD. My impression of Quintin and Nora is that they don’t model advanced AIs as pursuing “goals” in the same way that I do, and I think that’s one of the main sources of disagreement. They have a much more context-dependent idea of what a goal is, and they seem to think this model of goals is sufficient for very intelligent behavior.
humanity is able to use the not-yet-actively-deceptive outputs of moderately-superhuman models (because they are still just predicting the next token to the best of their capability), to help us solve the potential sharp left turn
LLMs are already moderately-superhuman at the task of predicting next tokens. This isn’t sufficient to help solve alignment problems. We would need them to meet the much much higher bar of being moderately-superhuman at the general task of science/engineering. And having that level of capability implies (arguably, see here for some arguments):
the capacity to learn and develop new skills (basic self-modification),
the capacity to creatively solve way-out-of-training-distribution problems using way-out-of-training-distribution solutions,
maybe something analogous to the capability of moral philosophy, i.e. analyzing its own motivations and trying to work out what it “should” want.
These capabilities should (imo) produce sufficiently large context changes (distribution shifts) that they will probably be sufficient to reveal goal misgeneralization. (Some specific example mechanisms for goal misgeneralization are listed here.)
So my answer, in short, is that “help us solve” implies “high capability” implies “OOD goal misgeneralization” (which is what I think you basically meant by “sharp left turn”). So it seems unlikely to me that we can use future AI to solve alignment in the way you describe, because misalignment should already be creating large problems by the time the AI is capable of helping.
There’s also a failure mode of focusing on “which arguments are the best” instead of “what is actually true”. I don’t understand this failure mode very well, except that I’ve seen myself and others fall into it. Falling into it looks like focusing a lot on specific arguments, and spending a lot of time working out what was meant by the words, rather than feeling comfortable adjusting arguments to fit better into your own ontology and to fit better with your own beliefs.
The most obvious way of addressing this, “just feel more comfortable adjusting arguments to fit better into your own ontology and to fit better with your own beliefs”, has its own failure mode, where you end up attacking a strawman that you think is a better argument than what they made, defeating it, and thinking you’ve solved the issue when you haven’t. People have complained about this failure mode of steelmanning a couple of times. At a fixed level of knowledge and thought about the subject, it seems one can only trade off one danger against the other.
However, if you’re planning to increase your knowledge and time-spent-thinking about the subject, then during that time it’s better to focus on the ideas than on who-said-or-meant-what; the latter is instrumentally useful as a source of ideas.
LLMs are already moderately-superhuman at the task of predicting next tokens. This isn’t sufficient to help solve alignment problems. We would need them to meet the much much higher bar of being moderately-superhuman at the general task of science/engineering.
We also need the assumption (which is definitely not obvious) that significant intelligence increases are relatively close to achievable. Superhumanly strong math skills presumably don’t let an AI solve NP-hard problems in polynomial time, and it’s similarly plausible, though far from certain, that really good engineering skill tops out somewhere only moderately above human ability due to intrinsic difficulty, and that really good deception skills top out somewhere short of what it would take to subvert the best systems that we could build to do oversight and detect misalignment. (On the other hand, even if these objections are correct, they would only show that control is possible, not that it is likely to occur.)
There’s also a failure mode of focusing on “which arguments are the best” instead of “what is actually true”. I don’t understand this failure mode very well, except that I’ve seen myself and others fall into it. Falling into it looks like focusing a lot on specific arguments, and spending a lot of time working out what was meant by the words, rather than feeling comfortable adjusting arguments to fit better into your own ontology and to fit better with your own beliefs.
My sense is that this is because different people have different intuitive priors, and process arguments (mostly) as a kind of Bayesian evidence that updates those priors, rather than modifying the priors (i.e. intuitions) directly.
Eliezer in particular strikes me as having an intuitive prior for AI alignment outcomes that looks very similar to priors for tasks like e.g. writing bug-free software on the first try, assessing the likelihood that a given plan will play out as envisioned, correctly compensating for optimism bias, etc., which is what gives rise to posts concerning concepts like security mindset.
Other people don’t share this intuitive prior, and so have to be argued into it. To such people, the reliability of the arguments in question is actually critical, because if those arguments turn out to have holes, that reverts the downstream updates and restores the original intuitive prior, whatever it looked like. It’s kind of like a souped-up version of the burden of proof concept, where the initial placement of that burden is determined entirely via the intuitive judgement of the individual.
This also seems related to why different people seem to naturally gravitate towards either conjunctive or disjunctive models of catastrophic outcomes from AI misalignment: the conjunctive impulse stems from an intuition that AI catastrophe is a priori unlikely, and so a bunch of different claims have to hold simultaneously in order to force a large enough update, whereas the disjunctive impulse stems from the notion that any given low-level claim need not be on particularly firm ground, because the high-level thesis of AI catastrophe robustly manifests via different but converging lines of reasoning.
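To make the arithmetic behind the two framings concrete, here is a toy sketch. It assumes the low-level claims are independent and each has credence 0.6; both the independence assumption and the numbers are invented purely for illustration and are not taken from anyone's actual model.

```python
from math import prod

def conjunctive(probs):
    """P(every claim holds), ASSUMING the claims are independent."""
    return prod(probs)

def disjunctive(probs):
    """P(at least one claim holds), ASSUMING the claims are independent."""
    return 1 - prod(1 - p for p in probs)

# Five hypothetical low-level claims, each held at 0.6 credence.
claims = [0.6] * 5

print(f"conjunctive: {conjunctive(claims):.3f}")  # 0.6**5, about 0.078
print(f"disjunctive: {disjunctive(claims):.3f}")  # 1 - 0.4**5, about 0.990
```

The same five moderately plausible claims give a small probability if catastrophe needs all of them and a large one if any single one is enough, so which structure feels natural does most of the work.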
See also: the focus on coherence, where some people place great importance on the question of whether VNM or other coherence theorems show what Eliezer et al. purport they show about superintelligent agents, versus the competing model wherein none of these individual theorems are important in their particulars, so much as the direction they seem to point, hinting at the concept of what idealized behavior with respect to non-gerrymandered physical resources ought to look like.
I think the real question, then, is where these differences in intuition come from, and unfortunately the answer might have a lot to do with people’s backgrounds, and the habits and heuristics they picked up from said backgrounds, which is something quite difficult to get at via specific, concrete argumentation.
different people have different intuitive priors, and process arguments (mostly) as a kind of Bayesian evidence that updates those priors, rather than modifying the priors (i.e. intuitions) directly.
I’m not sure I understand this distinction as-written. How is a Bayesian agent supposed to modify priors except by updating on the basis of evidence?
How is a Bayesian agent supposed to modify priors except by updating on the basis of evidence?
They’re not! But humans aren’t ideal Bayesians, and it’s entirely possible for them to update in a way that does change their priors (encoded by intuitions) moving forward. In particular, the difference between having updated one’s intuitive prior, and keeping the intuitive prior around but also keeping track of a different, consciously held posterior, is that the former is vastly less likely to “de-update”, because the evidence that went into the update isn’t kept around in a form that subjects it to (potential) refutation.
(IIRC, E.T. Jaynes talks about this distinction in Chapter 18 of Probability Theory: The Logic of Science, and he models it by introducing something he calls an A_p distribution. His exposition of this idea is uncharacteristically unclear, and his A_p distribution looks basically like a beta distribution with specific values for α and β, but it does seem to capture the distinction I see between “intuitive” updating versus “conscious” updating.)
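As a rough illustration of the beta-distribution reading (a sketch only, not Jaynes's own formalism), the snippet below compares two beliefs with the same point estimate but different concentrations. The class name `BetaBelief`, the parameter values, and the evidence counts are all hypothetical, chosen just to show how the same evidence moves a diffuse belief much further than an entrenched one.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class BetaBelief:
    """Belief about a proposition's probability, modeled as Beta(alpha, beta)."""
    alpha: float
    beta: float

    def mean(self) -> float:
        # Point estimate: the expected probability under the Beta distribution.
        return self.alpha / (self.alpha + self.beta)

    def update(self, confirming: int, disconfirming: int) -> "BetaBelief":
        # Standard conjugate update for Bernoulli-style evidence.
        return BetaBelief(self.alpha + confirming, self.beta + disconfirming)

# Same starting point estimate (0.5), very different resistance to revision.
diffuse = BetaBelief(1.0, 1.0)       # weakly held
entrenched = BetaBelief(50.0, 50.0)  # strongly held, "intuitive" in the sense above

evidence = (8, 2)  # hypothetical: 8 confirming observations, 2 disconfirming

print(round(diffuse.update(*evidence).mean(), 3))     # 9 / 12 = 0.75
print(round(entrenched.update(*evidence).mean(), 3))  # 58 / 110, about 0.527
```

On this toy model, a belief whose evidence has already been absorbed into a concentrated prior barely moves when new evidence arrives, which is one way to picture why it is less likely to “de-update”.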