Review of AI Alignment Progress
I’m having trouble keeping track of everything I’ve learned about AI and AI alignment in the past year or so. I’m writing this post in part to organize my thoughts, and to a lesser extent I’m hoping for feedback about what important new developments I’ve been neglecting. I’m sure that I haven’t noticed every development that I would consider important.
I’ve become a bit more optimistic about AI alignment in the past year or so.
I currently estimate a 7% chance AI will kill us all this century. That’s down from estimates that fluctuated from something like 10% to 40% over the past decade. (The extent to which those numbers fluctuate implies enough confusion that it only takes a little bit of evidence to move my estimate a lot.)
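As a toy illustration (not the actual reasoning behind these numbers), here is how far a modest 3:1 likelihood ratio moves estimates in that 10% to 40% range:

```python
# Toy Bayesian update, purely illustrative: a modest likelihood ratio is
# enough to move estimates across most of the 10%-40% range.
def update(prior, likelihood_ratio):
    prior_odds = prior / (1 - prior)
    posterior_odds = prior_odds * likelihood_ratio
    return posterior_odds / (1 + posterior_odds)

for prior in (0.10, 0.20, 0.40):
    print(f"prior {prior:.0%} -> posterior {update(prior, 1/3):.0%}")
# prior 10% -> posterior 4%
# prior 20% -> posterior 8%
# prior 40% -> posterior 18%
```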
I’m also becoming more nervous about how close we are to human level and transformative AGI. Not to mention feeling uncomfortable that I still don’t have a clear understanding of what I mean when I say human level or transformative AGI.
Shard Theory
Shard theory is a paradigm that seems destined to replace the focus (at least on LessWrong) on utility functions as a way of describing what intelligent entities want.
I kept having trouble with the plan to get AIs to have utility functions that promote human values.
Human values mostly vary in response to changes in the environment. I can make a theoretical distinction between contingent human values and the kind of fixed terminal values that seem to belong in a utility function. But I kept getting confused when I tried to fit my values, or typical human values, into that framework. Some values seem clearly instrumental and contingent. Some values seem fixed enough to sort of resemble terminal values. But whenever I try to convince myself that I’ve found a terminal value that I want to be immutable, I end up feeling confused.
Shard theory tells me that humans don’t have values that are well described by the concept of a utility function. Probably nothing will go wrong if I stop hoping to find those terminal values.
We can describe human values as context-sensitive heuristics. That will likely also be true of AIs that we want to create.
I feel deconfused when I reject utility functions, in favor of values being embedded in heuristics and/or subagents.
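For readers who prefer code, here is a minimal toy sketch of that contrast (my own illustration, not something the shard theory posts prescribe): a utility maximizer scores every option with one fixed function, while a shard-style agent acts on whichever context-sensitive heuristics happen to activate.

```python
# Toy contrast between a fixed-utility agent and a shard-style agent.
# Purely illustrative; shard theory doesn't prescribe any particular implementation.

def utility_agent(options, utility):
    # One immutable function scores everything, in every context.
    return max(options, key=utility)

def shard_agent(options, context, shards):
    # Each shard is a heuristic that activates only in some contexts and
    # bids on options; "values" are whatever the active shards add up to.
    scores = {option: 0.0 for option in options}
    for activates_in, bid in shards:
        if activates_in(context):
            for option in options:
                scores[option] += bid(option, context)
    return max(scores, key=scores.get)

shards = [
    (lambda ctx: ctx["hungry"], lambda opt, ctx: 2.0 if opt == "eat" else 0.0),
    (lambda ctx: ctx["friend_present"], lambda opt, ctx: 1.5 if opt == "chat" else 0.0),
]

print(utility_agent(["eat", "chat", "work"],
                    lambda opt: {"eat": 1.0, "chat": 2.0, "work": 3.0}[opt]))
# -> "work", the same answer in every context.
print(shard_agent(["eat", "chat", "work"],
                  {"hungry": False, "friend_present": True}, shards))
# -> "chat"; change the context and the expressed "values" change with it.
```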
Some of the posts that better explain these ideas:
Shard Theory in Nine Theses: a Distillation and Critical Appraisal
The shard theory of human values
A shot at the diamond-alignment problem
Alignment allows “nonrobust” decision-influences and doesn’t require robust grading
Why Subagents?
Section 6 of Drexler’s CAIS paper
EA is about maximization, and maximization is perilous (i.e. it’s risky to treat EA principles as a utility function)
Do What I Mean
I’ve become a bit more optimistic that we’ll find a way to tell AIs things like “do what humans want”, have them understand that, and have them obey.
GPT3 has a good deal of knowledge about human values, scattered around in ways that limit the usefulness of that knowledge.
LLMs show signs of being less alien than theory, or evidence from systems such as AlphaGo, led me to expect. Their training causes them to learn human concepts pretty faithfully.
That suggests clear progress toward AIs understanding human requests. That seems to be proceeding a good deal faster than any trend toward AIs becoming agenty.
However, LLMs suggest that it will be not at all trivial to ensure that AIs obey some set of commands that we’ve articulated. Much of the work done by LLMs involves simulating a stereotypical human. That puts some limits on how far they stray from what we want. But the LLM doesn’t have a slot where someone could just drop in Asimov’s Laws so as to cause the LLM to have those laws as its goals.
The post Retarget The Search provides a little hope that this might become easy. I’m still somewhat pessimistic about this.
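To make the retargeting idea concrete, here is a minimal sketch of the mechanical step it envisions, in PyTorch. It assumes interpretability tools have already located a “goal slot” inside the network, which is precisely the part I’m pessimistic about; the layer and vector names are hypothetical.

```python
# Hypothetical sketch of "retargeting the search": overwrite the activations
# that (we assume) encode the goal of the model's internal search process.
# Finding such a goal slot is the hard, unsolved interpretability problem.
import torch

def retarget_search(goal_layer, new_goal):
    def overwrite_goal(module, inputs, output):
        # Replace the goal slot's activations with the target we chose.
        return new_goal.expand_as(output)
    # Returning a value from a forward hook replaces that layer's output.
    return goal_layer.register_forward_hook(overwrite_goal)

# Usage (all names hypothetical):
#   handle = retarget_search(model.layers[12].mlp, desired_goal_vector)
#   ... run the model with the new goal ...
#   handle.remove()  # restore the original behavior
```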
Interpretability
Interpretability feels more important than it felt a few years ago. It also feels like it depends heavily on empirical results from AGI-like systems.
I see more signs than I expected that interpretability research is making decent progress.
The post that encouraged me most was How “Discovering Latent Knowledge in Language Models Without Supervision” Fits Into a Broader Alignment Scheme. TL;DR: neural networks likely develop simple representations of whether their beliefs are true or false. The effort required to detect those representations does not seem to increase much with increasing model size.
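For concreteness, here is a rough sketch of the unsupervised probing idea behind that work (my simplification, in PyTorch; the real method normalizes activations and adds other details): learn a probe whose answers on contradictory statement pairs are consistent and confident, with no truth labels.

```python
# Simplified sketch of Contrast-Consistent Search (Burns et al.): train a probe
# on hidden states for "X is true" / "X is false" pairs so that its two outputs
# sum to ~1 (consistency) without both collapsing to 0.5 (confidence).
import torch

def train_ccs_probe(h_pos, h_neg, steps=1000, lr=1e-3):
    # h_pos, h_neg: (n_pairs, hidden_dim) activations for the paired statements.
    probe = torch.nn.Sequential(torch.nn.Linear(h_pos.shape[1], 1), torch.nn.Sigmoid())
    opt = torch.optim.Adam(probe.parameters(), lr=lr)
    for _ in range(steps):
        p_pos, p_neg = probe(h_pos), probe(h_neg)
        consistency = ((p_pos - (1 - p_neg)) ** 2).mean()
        confidence = (torch.min(p_pos, p_neg) ** 2).mean()
        loss = consistency + confidence
        opt.zero_grad()
        loss.backward()
        opt.step()
    return probe  # probe(h) is then read as a credence that the statement is true
```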
Other promising ideas:
The Singular Value Decompositions of Transformer Weight Matrices are Highly Interpretable
The Plan − 2022 Update
Drexler’s QNR
Causal Scrubbing
Taking features out of superposition with sparse autoencoders
Transformer Circuits
Are there convergently-ordered developmental milestones for AI?
I’m currently estimating a 40% chance that before we get existentially risky AI, neural nets will be transparent enough to generate an expert consensus about which AIs are safe to deploy. A few years ago, I’d have likely estimated a 15% chance of that. An expert consensus seems somewhat likely to be essential if we end up needing pivotal processes.
Foom
We continue to accumulate clues about takeoff speeds. I’m becoming increasingly confident that we won’t get a strong or unusually dangerous version of foom.
Evidence keeps accumulating that intelligence is compute-intensive. That means replacing human AI developers with AGIs won’t lead to dramatic speedups in recursive self-improvement.
Recent progress in LLMs suggests there’s an important set of skills for which AI improvement slows down as it reaches human levels, because it is learning by imitating humans. But keep in mind that there are also important dimensions on which AI easily blows past the level of an individual human (e.g. breadth of knowledge), and will maybe slow down as it matches the ability of all humans combined.
LLMs also suggest that AI can become as general-purpose as humans while remaining less agentic / consequentialist. LLMs have outer layers that are fairly myopic, aiming to predict a few thousand words of future text.
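Here is a minimal sketch of the standard training objective, to show what I mean by myopic: the loss only ever scores next-token predictions within a bounded window of text, and nothing in it refers to consequences beyond that text.

```python
# Standard next-token language-modeling loss (simplified). The outer training
# signal is confined to the current window of tokens; no term rewards effects
# on the world beyond predicting that text.
import torch.nn.functional as F

def next_token_loss(model, tokens):
    # tokens: (batch, seq_len) integer ids from one context window.
    logits = model(tokens[:, :-1])              # predict each following token
    targets = tokens[:, 1:]
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
```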
The agents that an LLM simulates are more far-sighted. But there are still major obstacles to them implementing long-term plans: they almost always get shut down quickly, so it would take something unusual for them to run long enough to figure out what kind of simulation they’re in and to break out.
This doesn’t guarantee they won’t become too agentic, but I suspect they’d first need to become much more capable than humans.
Evidence is also accumulating that existing general approaches will be adequate to produce AIs that exceed human abilities at most important tasks. I anticipate several more innovations at the level of ReLU and the transformer architecture, in order to improve scaling.
That doesn’t rule out the kind of major architectural breakthrough that could cause foom. But it’s hard to see a reason for predicting such a breakthrough. Extrapolations of recent trends tell me that AI is likely to transform the world in the 2030s. Whereas if foom is going to happen, I see no way to predict whether it will happen soon.
Self Concept
Nintil’s analysis of AI risk:
“GPT3 is provided as an example of something that has some knowledge that could theoretically bear on situational awareness but I don’t think this goes far (It seems it has no self-concept at all); it is one thing to know about the world in general, and it is another very different to infer that you are an agent being trained. I can imagine a system that could do general purpose science and engineering without being either agentic or having a self-concept. … A great world model that comes to be by training models the way we do now need not give rise to a self-concept, which is the problematic thing.”
I think it’s rather likely that smarter-than-human AGIs will tend to develop self-concepts. But I’m not too clear on when or how this will happen. In fact, the embedded agency discussions seem to hint that it’s unnatural for a designed agent to have a self-concept.
Can we prevent AIs from developing a self-concept? Is this a valuable thing to accomplish?
My shoulder Eliezer says that AIs with a self-concept will be more powerful (via recursive self-improvement), so researchers will be pressured to create them. My shoulder Eric Drexler replies that those effects are small enough that researchers can likely be deterred from creating such AIs for a nontrivial time.
I’d like to see more people analyzing this topic.
Social Influences
Leading AI labs do not seem to be on a course toward a clear-cut arms race.
Most AI labs see enough opportunities in AI that they expect most AI companies to end up being worth anywhere from $100 million to $10 trillion. A worst-case result of being a $100 million company is a good deal less scary than the typical startup environment, where people often expect a 90% chance of becoming worthless and needing to start over again. Plus, anyone competent enough to help create an existentially dangerous AI seems likely to have many opportunities to succeed if their current company fails.
Not too many investors see those opportunities, but there are more than a handful of wealthy investors who are coming somewhat close to indiscriminately throwing money at AI companies. This seems likely to promote an abundance mindset among serious companies that will dampen urges to race against other labs for first place at some hypothetical finish line. Although there’s a risk that this will lead to FTX-style overconfidence.
The worst news of 2022 is that the geopolitical world is heading toward another cold war. The world is increasingly polarized into a conflict between the West and the parts of the developed world that resist Western culture.
The US government is preparing to cripple China.
Will that be enough to cause a serious race between the West and China to develop the first AGI? If AGI is 5 years away, I don’t see how the US government is going to develop that AGI before a private company does. But with 15 year timelines, the risks of a hastily designed government AGI look serious.
Much depends on whether the US unites around concerns about China defeating the US. It seems not too likely that China would either develop AGI faster than the US, or use AGI to conquer territories outside of Asia. But it’s easy for a country to mistakenly imagine that it’s in a serious arms race.
Trends in Capabilities
I’m guessing the best publicly known AIs are replicating something like 8% of human cognition versus 2.5% 5 years ago. That’s in systems that are available to the public—I’m guessing those are a year or two behind what’s been developed but is still private.
Is that increasing linearly? Exponentially? I’m guessing it’s closer to exponential growth than linear growth, partly because it grew for decades in order to get to that 2.5%.
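A naive extrapolation from those guesses, treating a fuzzy measure as if it were a clean exponential:

```python
# Back-of-the-envelope extrapolation of 2.5% -> 8% over 5 years (~26%/year).
# "Fraction of human cognition replicated" is far too fuzzy to trust this,
# but it roughly matches my expectation of big changes in the 2030s.
growth = (0.08 / 0.025) ** (1 / 5)   # ~1.26x per year
for years_ahead in (5, 10):
    print(years_ahead, f"{0.08 * growth ** years_ahead:.0%}")
# 5 years  -> ~26%
# 10 years -> ~82%
```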
This increase will continue to be underestimated by people who aren’t paying close attention.
Advances are no longer showing up as readily quantifiable milestones (beating Go experts). Instead, key advances are more like increasing breadth of abilities. I don’t know of good ways to measure that other than “jobs made obsolete”, which is not too well quantified, and likely lagging a couple of years behind the key technical advances.
I also see a possible switch from overhype to underhype. Up to maybe 5 years ago, AI companies and researchers focused a good deal on showing off their expertise, in order to hire or be hired by the best. Now the systems they’re working on are likely valuable enough that trade secrets will start to matter.
This switch is hard for most people to notice, even with ideal news sources. The storyteller industry obfuscates this further, by biasing stories to sound like the most important development of the day. So when little is happening, they exaggerate the story importance. But they switch to understating the importance when preparing for an emergency deserves higher priority than watching TV (see my Credibility of Hurricane Warnings).
Concluding Thoughts
I’m optimistic in the sense that I think that smart people are making progress on AI alignment, and that success does not look at all hopeless.
But I’m increasingly uncomfortable about how fast AGI is coming, how foggy the path forward looks, and how many uncertainties remain.