Agenty AGI – How Tempting?
[Epistemic status: mostly writing to clarify my intuitions, with just a few weak attempts to convince others. It’s no substitute for reading Drexler’s writings.]
I’ve been struggling to write more posts relating to Drexler’s vision for AI (hopefully to be published soon), and in the process I got increasingly bothered by the issue of whether AI researchers will see incentives to give AIs broad goals that turn them into agents.
Drexler’s CAIS paper convinced me that our current trajectory is somewhat close to a scenario where human-level AIs that are tool-like services are available well before AGIs with broader goals.
Yet when I read LessWrong, I sympathize with beliefs that developers will want quite agenty AGIs around the same time that CAIS-like services reach human levels.
I’m fed up with this epistemic learned helplessness, and this post is my attempt to reconcile those competing intuitions.
Please recall that Drexler’s distinction here focuses on a system’s goals, not its knowledge. Software is more agenty when its goals cover a wide range of domains and long time horizons. Services are designed to produce specific outputs using a system’s current procedures.
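To make that distinction concrete, here is a minimal Python sketch of the two shapes of software (my own toy framing, not anything from Drexler’s paper; every name and signature here is hypothetical): a service is roughly a pure function from a request to a specific output, while an agent is roughly a loop that keeps choosing whichever actions best serve a broad objective over a long horizon.

```python
# Toy sketch only: illustrates goal scope, not a real AI architecture.
from typing import Callable, Dict, List


def summarize_service(text: str, model: Callable[[str], str]) -> str:
    """A service: one specific output for one request, using current procedures.

    Its 'goal' does not extend past this call, this domain, or this output.
    """
    return model(text)


def run_agent(score: Callable[[Dict], float],
              candidate_actions: Callable[[Dict], List[Callable[[Dict], Dict]]],
              state: Dict,
              horizon: int) -> Dict:
    """An agent: repeatedly picks whatever action best serves a broad,
    long-horizon objective, across whatever domains that objective touches.
    """
    for _ in range(horizon):
        best = max(candidate_actions(state), key=lambda act: score(act(state)))
        state = best(state)
    return state
```

The point of the contrast is that the difference lies in the scope of the objective, not in how capable the underlying model is.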
The Easy Part?
An ESRogs comment on LW nudged me a bit more toward Drexler’s position.
“The smart part is not the agent-y part”: intelligence and agency are independent, with intelligence being the hard part, and agency being more like prompt engineering.
This seems mostly correct, at least for Drexler’s meaning of agency.
There are still significant problems with getting agency correct, but they’re mostly associated with figuring out what we want, and likely not helped much by math or computer science expertise.
Orthogonality squared
ESRogs also suggests a potentially valuable concept that he calls “Orthogonality squared”: the domain of an agent’s goals is independent of its intelligence.
We can, in principle, create AIs that don’t care about atoms. At least for some simple tasks, it’s fairly natural to have the AI care only about abstractions such as math. You can’t alter the laws of arithmetic by rearranging atoms; the laws of arithmetic are in a different realm from atoms. An AI that cares only about a distant part of Tegmark IV won’t care whether we shut down an arrangement of atoms in our realm.
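As a toy illustration of what it means for goals to live in that abstract realm, here is an objective (entirely my own example, and not a claim that optimizing it is automatically safe) in which every term is defined over integers; nothing in it refers to atoms, sensors, or any physical state.

```python
# Toy objective defined purely over mathematical abstractions.
def score_even_number(n: int) -> float:
    """Reward 1.0 if n is an even number expressible as a sum of two primes."""
    def is_prime(k: int) -> bool:
        return k > 1 and all(k % d for d in range(2, int(k ** 0.5) + 1))

    if n % 2 == 0 and any(is_prime(p) and is_prime(n - p) for p in range(2, n)):
        return 1.0
    return 0.0
```

Every quantity the score depends on is an integer, so the goal, as specified, lives entirely in the arithmetic realm.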
Many people concerned about AI safety seem reluctant to accept this. E.g. Rob Bensinger’s initial reply to ESRogs worried about this unsafe(?) case:
“The system can think about atoms/physics, and it knows that our world exists, but it still only terminally cares about digital things in the simulated environment.”
“Simulated environment” seems ambiguous as to whether it’s referring to a distant part of Tegmark IV, or a special environment within our world. The latter would imply that it’s tricky to analyze whether the AI’s goals conflict with ours; the former does not.
It seems odd that Eliezer’s AGI lethality #19 warns that it’s hard to get software to care directly about the real world, yet Eliezer considers it inevitable that AGIs will care about the real world. That’s not a contradiction, but the tension between those positions ought to raise further questions.
I don’t know how an AI could run a company without caring about atoms, so the basic claim behind orthogonality squared is not sufficient to reassure me.
But it illustrates the extent to which AI goals can be compatible with leaving humans in control of the real world. So it should be possible to broaden such goals to care about narrow aspects of our universe, without caring about all of our future lightcone.
Analogies
We have some relevant experience with weakly superhuman agents in the forms of governments, corporations, labor unions, and universities.
These all have some sort of limit on the breadth of their goals, yet those goals often become broad enough to cause more harm than I’d predict if I modeled them as purely tools / services.
Let’s examine what happens if I try to create a corporation that has no dangerously agenty goals. I’ll call it Peter’s Paleo Pizza.
I start out as the sole owner / employee, making all strategy decisions myself. I hire cooks and tell them to follow my recipe exactly.
I can grow the company to a few hundred people while micromanaging it enough that I have all the policy control over it that I want. But by that time the micromanagement takes most of my attention.
If I grow it to thousands of employees, I’ll delegate enough decisions that the system will be at least weakly agenty. I don’t want to give the system detailed instructions about who to hire or how they ought to acquire skills. Instead, I want employees to have a vision for how to balance goals such as making food tasty versus making it healthy. I want them to devote most of the company’s resources to directly making food. If I thought they might devote more than a few percent of the company’s resources to longer-term development plans, I’d place strict limits on that.
Such a company is a weakly superhuman system, in the sense of being an expert in more domains than an individual human can be. The limit on development-oriented resources leaves me confident that it won’t do anything close to fooming via self-improvement. But the company is stretching the boundaries of what I’d classify as a service.
When I want to broaden my business empire to the limits of what humans have done, I give up on managing a company. I don’t have Elon Musk’s ability to manage corporate policies across multiple industries. Instead, I invest in many companies. That involves heuristics that tend to reward companies for maximizing the discounted value of future profits. I don’t try very hard to predict how much they’ll spend on R&D. This clearly violates the constraints that Drexler wants on safe AI services.
How much would I and other investors do differently if corporations were new phenomena with hard-to-evaluate risks, and unusually rapid changes in abilities? The answer does not seem obvious. It seems clear that human monitoring and guidance have costs that will tempt us to make superhuman systems more agenty. The corporate analogy weakly suggests we’ll postpone the risky parts of that until after such systems become superhuman.
Software Examples
Gwern’s “Why Tool AIs Want to Be Agent AIs” has examples of places where being more agenty seems valuable.
Note that I’m using “agent” in a slightly different way than Gwern. I’m attempting to focus on what Drexler considers important. So I’m mostly disagreeing with Gwern’s framing, not his answers.
High-frequency trading is a task for which humans are too slow to be in the loop.
There is some sense in which the best software for this might be to tell an AI to maximize expected returns over some long time period. But there are safer strategies that I expect will usually produce the same benefits:
perform all the learning and make all strategic decisions before deploying, so there’s time for human-AI collaboration.
have the AI only care about reacting to patterns in numbers. If we do nothing to cause it to care about how those numbers relate to the real world, I don’t see how it would start caring about the real world.
have the AI care only about the next number it’s about to output.
These options don’t involve the AI self-improving while it’s trading. But that has little effect on whether the AI gets improved. The CAIS approach to improving it involves a separate AI improvement process, for which I don’t see much pressure to have the AI make the most agenty decisions.
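Here is a toy sketch of what those safer options might look like in code (all names and the deliberately trivial “learning” rule are my own placeholders, not anything from Gwern or Drexler): the weights are fixed before deployment, and the deployed piece is a stateless function whose only job is the next number it outputs.

```python
# Toy sketch: a service-style trading predictor. Learning happens offline,
# the deployed function is stateless, and it only produces the next number.
import random
from typing import List, Sequence


def train_offline(history: Sequence[float], window: int = 5) -> List[float]:
    """Fit fixed weights before deployment, leaving time for human review."""
    # Placeholder "learning": weight recent prices more heavily.
    raw = [i + 1 for i in range(window)]
    total = sum(raw)
    return [w / total for w in raw]


def next_number(recent_prices: Sequence[float], weights: Sequence[float]) -> float:
    """The deployed service: it only 'cares about' this one output.

    No state persists between calls, no model of anything beyond these
    numbers, and no objective that extends past the current prediction.
    """
    return sum(p * w for p, w in zip(recent_prices, weights))


if __name__ == "__main__":
    history = [100 + random.gauss(0, 1) for _ in range(1000)]
    weights = train_offline(history)            # learning finished pre-deployment
    print(next_number(history[-5:], weights))   # one number out, then done
```

Nothing here prevents a separate, human-supervised process from retraining and redeploying better weights; the point is that the part running at trading speed has no broad or long-horizon goal.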
Gwern unintentionally(?) illustrates why we should be cautious about making AIs more agenty:
“we don’t want excellent advice on which stock to buy for a few microseconds, we want a money pump spitting cash at us”
I don’t want anything resembling a human-level AI choosing such a money pump, as it probably won’t distinguish between one which will fail gracefully versus one that will blow up like LTCM or Bernie Madoff. But I likely would have chosen a dangerous money pump when I had less experience.
Chess provides a different perspective. Computer-human collaborations fit the CAIS paradigm fairly well for a decade; then the human became mostly a liability, due to low human reliability. This is evidence that CAIS is a temporary stopgap, but it does support hopes that CAIS will buy us some time.
Note that in chess, the humans weren’t trying to be the more agenty part of the collaboration (the chess programs were likely safely aligned and using their agency appropriately). But similar considerations would likely apply if the humans were primarily trying to provide alignment: humans would make enough mistakes at aligning individual decisions to pressure us to replace those decisions with those of an apparently aligned AI.
Richard Ngo expresses similar concerns: “requiring the roles of each module and the ways they interface with each other to be … human-comprehensible will be very uncompetitive”. I don’t see a general rule to that effect; sometimes comprehensibility will enhance competitiveness by making a service easier to debug or to build upon. Also, tools to assist human comprehension will enable humans to use increasingly powerful interfaces. But I’m unwilling to bet that those effects will enable human-controlled services to remain competitive.
Conclusion
Given mildly optimistic assumptions about how responsible the leading AI developers will be, I expect human-level CAIS-style services anywhere from months to a decade before human-level AGIs become broadly agenty.
I have little idea how hard it will be to use CAIS services to produce longer-term solutions to AI risk.
I’m guessing Eliezer will disagree with a key part of this post. I expect some of his reasons are related to a belief in a core of general intelligence that has not yet been discovered. But it looks to me like anything that I’d classify as such a core is already known.
This post seems mostly consistent with Drexler’s claims, but with more emphasis on the dangers.