Here are my responses to your comments, sorted by how interesting they are to me, in descending order. Also, thanks for your input!
Non-omnipotent AI aligning omnipotent AI
The AI will be making important decisions long before it becomes
near-omnipotent, as you put it. In particular, it should be doing all
the work of aligning future AI systems well before it is
near-omnipotent.
Please elaborate. I can imagine multiple versions of what you have in mind. Is one of the following scenarios close to what you mean?
Scientists use AI-based theorem provers to prove theorems about AI
alignment.
There’s an AI, with which you can have conversations. It tries to
come up with new mathematical definitions and theorems related to
what you’re discussing.
The AI (or multiple AIs) is not near-omnipotent yet, but it already controls most of humanity’s resources and makes most of the decisions, so it does AI research instead of humans doing it.
I think the requirements for how well the non-omnipotent AI in the 3rd scenario should be aligned are basically the same as for a near-omnipotent AI. If the non-omnipotent AI in the 3rd scenario is very misaligned, but that isn’t catastrophic only because the AI is not smart enough, then the near-omnipotent AI it’ll create will also be misaligned, and that will be catastrophic.
Embedded agency
Note though it’s quite possible that some things we’re confused
about are also simply irrelevant to the thing we care about. (I would
claim this of embedded agency with not much confidence.)
So, you think embedded agency research is unimportant for AI alignment. On the contrary, I think it’s very important. I worry about it for three main reasons. Suppose we don’t figure out embedded agency. Then
An AI won’t be able to safely self-modify
An AI won’t be able to comprehend that it can be killed or damaged
or modified by others
I am not sure about this one, and I would be very interested to know if it is not the case. I think that if we build an AI without understanding embedded agency, and that AI builds a new AI, that new AI also won’t understand embedded agency. In other words, the set of AIs built without taking embedded agency into account is closed under the operation of an AI building a new AI. [Upd: comments under this comment mostly refute this]
I am even less sure about this item, but maybe such an AI will be too dogmatic (as in a dogmatic prior) about how the world might work, because it is sure that it can’t be killed or damaged or modified. Due to this, if the laws of physics turn out to be weird (e.g. we live in a multiverse, or we’re in a simulation), the AI might fail to understand that and thus fail to turn the whole world into hedonium (or whatever it is that we would want it to do with the world).
If an AI built without taking embedded agency into account meets
very smart aliens someday, it might fuck up due to its inability to
imagine that someone can predict its actions.
Usefulness of type-2 research for aligning superintelligent AI
Unless your argument is that type 2 research will be of literally zero
use for aligning superintelligent AI.
I think that if one man-year of type-1 research produces 1 unit of
superintelligent AI alignment, one man-year of type-2 research produces
about 0.15 units of superintelligent AI alignment.
As I see it, the mechanisms by which type-2 research helps align
superintelligent AI are:
It may produce useful empirical data which will help us arrive at type-1 theoretical insights.
Thinking about type-2 research contains a small portion of type-1
thinking.
For example, if someone works on making contemporary neural networks robust to out-of-distribution examples, and they do that mainly by experimenting, their experimental data might provide insights about the nature of robustness in the abstract, and some portion of their thinking will surely be dedicated to the theory of robustness.
My views on tractability and neglectedness
Tractability and neglectedness matter too.
Alright, I agree with you about tractability.
About neglectedness, I think type-2 research is less neglected than
type-1 and type-3 and will be less neglected in the next 10 years or so,
because
It’s practical: you can sell it to companies which want to make robots or unbreakable face detection or whatever.
Humans have a bias towards near-term thinking.
Neural networks are a hot topic.
I basically mean the third scenario:
The AI (or multiple AIs) is not near-omnipotent yet, but it already controls most of humanity’s resources and makes most of the decisions, so it does AI research instead of humans doing it.
I agree that you still need a strong guarantee of alignment in this scenario (as I mentioned in my original comment).
On the contrary, I think it’s very important. I worry about it for three main reasons. Suppose we don’t figure out embedded agency. Then [...]
Why don’t these arguments apply to humans? Evolution didn’t understand embedded agency, but managed to create humans who seem to do okay at being embedded agents.
(I buy this as an argument that an AI system needs to not ignore the fact that it is embedded, but I don’t buy it as an argument that we need to be deconfused about embedded agency.)
I think that if one man-year of type-1 research produces 1 unit of superintelligent AI alignment, one man-year of type-2 research produces about 0.15 units of superintelligent AI alignment.
Cool, that’s more concrete, thanks. (I disagree, but there isn’t really an obvious point to argue on, the cruxes are in the other points.)
About neglectedness, I think type-2 research is less neglected than type-1 and type-3 and will be less neglected in the next 10 years or so, because
Agreed. Tbc, I wasn’t arguing it was neglected, just that you seemed to be ignoring tractability and neglectedness, which seemed like a mistake.
I see MIRI’s research on agent foundations (including embedded agency) as something like “We want to understand ${an aspect of how agents should work}, so let’s take the simplest case first and see if we understand everything about it. The simplest case is the one where the agent is nearly omniscient and knows all logical consequences. Hmm, we can’t figure out even this simplest case yet; it breaks down if the conditions are sufficiently weird”. Since it turns out to be difficult to understand embedded agency even in such simple cases, it seems plausible that an AI trained to understand embedded agency by a naive learning procedure (similar to evolution) will break down under sufficiently weird conditions.
Why don’t these arguments apply to humans? Evolution didn’t understand embedded agency, but managed to create humans who seem to do okay at being embedded agents.
(I buy this as an argument that an AI system needs to not ignore the fact that it is embedded, but I don’t buy it as an argument that we need to be deconfused about embedded agency.)
Hmm, a very good argument. Since I think humans have an imperfect understanding of embedded agency, thanks to you I no longer think that “If we build an AI without understanding embedded agency, and that AI builds a new AI, that new AI also won’t understand embedded agency”, since that would imply we can’t get the “lived happily ever after” at all. And we can ignore the case where we can’t get the “lived happily ever after” at all, because in that case nothing matters anyway.
I suppose we could run evolutionary search or something, selecting for AIs which can understand the typical cases of being modified by themselves or by the environment, the cases which we include in the training dataset. I wonder how we can make such an AI understand very atypical cases of modification. A near-omnipotent AI will be a very atypical case.
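To make that worry a bit more concrete, here is a minimal toy sketch of the kind of selection loop I mean. Everything in it is a made-up stand-in (bit-vector “agents”, a fitness score counting handled modification cases), not a real training setup:

```python
import random

# Toy stand-in: an "agent" is just a bit-vector; modification case i counts as
# "understood" iff bit i is set. Purely illustrative, not a real training setup.
N_CASES = 16      # typical modification cases included in the training dataset
POP_SIZE = 20

def fitness(agent):
    # Fraction of the typical modification cases the agent handles.
    return sum(agent) / N_CASES

def mutate(agent):
    child = list(agent)
    child[random.randrange(N_CASES)] ^= 1  # flip one bit
    return child

population = [[random.randint(0, 1) for _ in range(N_CASES)] for _ in range(POP_SIZE)]
for generation in range(200):
    population.sort(key=fitness, reverse=True)
    survivors = population[: POP_SIZE // 2]            # keep the better half
    population = survivors + [mutate(a) for a in survivors]

print("best fitness:", max(fitness(a) for a in population))
# The gap this illustrates: selection only ever rewards the typical cases above,
# so nothing pushes the winners to handle very atypical modifications (like the
# near-omnipotent case) that never appear in the dataset.
```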
Can we come up with a learning procedure to have the AI learn embedded agency on its own? It seems plausible to me that we will need to understand embedded agency better to do this, but I don’t really know.
Btw, in another comment, you say
But usually when LessWrongers argue against “good enough” alignment, they’re arguing against alignment methods, saying that “nothing except proofs” will work, because only proofs give near-100% confidence.
I basically subscribe to the argument that nothing except proofs will work in the case of superintelligent agentic AI.
Re: embedded agency, while these are all potentially relevant points (especially self-modification), I don’t see any of them as the main reason to study embedded agents from an alignment standpoint. I see the main purpose of embedded agency research as talking about humans, not designing AIs—in particular, in order to point to human values, we need a coherent notion of what it means for an agenty system embedded in its environment (i.e. a human) to want things. As the linked post discusses, a lot of the issues with modelling humans as utility-maximizers or using proxies for our goals stem directly from more general embedded agency issues.