Here are my responses to your comments, sorted by how interesting they are to me, in descending order. Also, thanks for your input!
Non-omnipotent AI aligning omnipotent AI
The AI will be making important decisions long before it becomes
near-omnipotent, as you put it. In particular, it should be doing all
the work of aligning future AI systems well before it is
near-omnipotent.
Please elaborate. I can imagine multiple versions of what you have in mind. Is one of the following scenarios close to what you mean?
Scientists use AI-based theorem provers to prove theorems about AI
alignment.
There’s an AI, with which you can have conversations. It tries to
come up with new mathematical definitions and theorems related to
what you’re discussing.
The AI (or multiple AIs) is not near-omnipotent yet, but it already controls most of humanity’s resources and makes most of the decisions, so it does AI research instead of humans doing it.
I think the requirements for how well the non-omnipotent AI in the 3rd scenario should be aligned are basically the same as for a near-omnipotent AI. If the non-omnipotent AI in the 3rd scenario is very misaligned, but that isn’t catastrophic only because the AI is not smart enough, then the near-omnipotent AI it’ll create will also be misaligned, and that will be catastrophic.
Embedded agency
Note though it’s quite possible that some things we’re confused
about are also simply irrelevant to the thing we care about. (I would
claim this of embedded agency with not much confidence.)
So, you think embedded agency research is unimportant for AI alignment. On the contrary, I think it’s very important. I worry about it for three main reasons. Suppose we don’t figure out embedded agency. Then
An AI won’t be able to safely self-modify
An AI won’t be able to comprehend that it can be killed or damaged
or modified by others
I am not sure about this one, and I would be very interested to know if it is not the case. I think that if we build an AI without understanding embedded agency, and that AI builds a new AI, that new AI also won’t understand embedded agency. In other words, the set of AIs built without taking embedded agency into account is closed under the operation of an AI building a new AI. [Upd: comments under this comment mostly refute this]
I am even less sure about this item, but maybe such an AI will be too dogmatic (as in a dogmatic prior) about how the world might work, because it is sure that it can’t be killed or damaged or modified. Due to this, if the laws of physics turn out to be weird (e.g. we live in a multiverse, or we’re in a simulation), the AI might fail to understand that and thus fail to turn the whole world into hedonium (or whatever it is that we would want it to do with the world).
If an AI built without taking embedded agency into account meets
very smart aliens someday, it might fuck up due to its inability to
imagine that someone can predict its actions.
Usefulness of type-2 research for aligning superintelligent AI
Unless your argument is that type 2 research will be of literally zero
use for aligning superintelligent AI.
I think that if one man-year of type-1 research produces 1 unit of
superintelligent AI alignment, one man-year of type-2 research produces
about 0.15 units of superintelligent AI alignment.
As I see it, the mechanisms by which type-2 research helps align
superintelligent AI are:
It may produce useful empirical data which will help us arrive at type-1 theoretical insights.
Thinking about type-2 research contains a small portion of type-1
thinking.
For example, if someone works on making contemporary neural networks robust to out-of-distribution examples, and they do that mainly by experimenting, their experimental data might provide insights about the nature of robustness in the abstract, and some portion of their thinking will surely be dedicated to the theory of robustness.
My views on tractability and neglectedness
Tractability and neglectedness matter too.
Alright, I agree with you about tractability.
About neglectedness, I think type-2 research is less neglected than
type-1 and type-3 and will be less neglected in the next 10 years or so,
because
It’s practical: you can sell it to companies which want to make robots or unbreakable face detection or whatever.
Humans have a bias towards near-term thinking.
Neural networks are a hot topic.
I basically mean the third scenario:
The AI (or multiple AIs) is not near-omnipotent yet, but it already controls most of humanity’s resources and makes most of the decisions, so it does AI research instead of humans doing it.
I agree that you still need a strong guarantee of alignment in this scenario (as I mentioned in my original comment).
On the contrary, I think it’s very important. I worry about it for three main reasons. Suppose we don’t figure out embedded agency. Then [...]
Why don’t these arguments apply to humans? Evolution didn’t understand embedded agency, but managed to create humans who seem to do okay at being embedded agents.
(I buy this as an argument that an AI system needs to not ignore the fact that it is embedded, but I don’t buy it as an argument that we need to be deconfused about embedded agency.)
I think that if one man-year of type-1 research produces 1 unit of superintelligent AI alignment, one man-year of type-2 research produces about 0.15 units of superintelligent AI alignment.
Cool, that’s more concrete, thanks. (I disagree, but there isn’t really an obvious point to argue on, the cruxes are in the other points.)
About neglectedness, I think type-2 research is less neglected than type-1 and type-3 and will be less neglected in the next 10 years or so, because
Agreed. Tbc, I wasn’t arguing it was neglected, just that you seemed to be ignoring tractability and neglectedness, which seemed like a mistake.
I see MIRI’s research on agent foundations (including embedded agency) as something like “We want to understand ${an aspect of how agents should work}, so let’s take the simplest case first and see if we understand everything about it. The simplest case is the one where the agent is nearly omniscient and knows all logical consequences. Hmm, we can’t figure out even this simplest case yet; it breaks down if the conditions are sufficiently weird”. Since it turns out to be difficult to understand embedded agency even in such simple cases, it seems plausible that an AI trained to understand embedded agency by a naive learning procedure (similar to evolution) will break down under sufficiently weird conditions.
Why don’t these arguments apply to humans? Evolution didn’t understand embedded agency, but managed to create humans who seem to do okay at being embedded agents.
(I buy this as an argument that an AI system needs to not ignore the fact that it is embedded, but I don’t buy it as an argument that we need to be deconfused about embedded agency.)
Hmm, a very good argument. Since I think humans have an imperfect understanding of embedded agency, thanks to you I no longer think that “If we build an AI without understanding embedded agency, and that AI builds a new AI, that new AI also won’t understand embedded agency”, since that would imply we can’t get the “lived happily ever after” at all. And we can ignore the case where we can’t get the “lived happily ever after” at all, because in that case nothing matters anyway.
I suppose we could run evolutionary search or something, selecting for AIs which can understand the typical cases of being modified by themselves or by the environment, the cases which we include in the training dataset. I wonder how we can make such an AI understand very atypical cases of modification. A near-omnipotent AI will be a very atypical case.
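To make that worry a bit more concrete, here is a minimal toy sketch of the kind of selection loop I mean. Everything in it is a made-up stand-in (bit-vector “agents”, a fitness score counting handled modification cases), not a real training setup:

```python
import random

# Toy stand-in: an "agent" is just a bit-vector; modification case i counts as
# "understood" iff bit i is set. Purely illustrative, not a real training setup.
N_CASES = 16      # typical modification cases included in the training dataset
POP_SIZE = 20

def fitness(agent):
    # Fraction of the typical modification cases the agent handles.
    return sum(agent) / N_CASES

def mutate(agent):
    child = list(agent)
    child[random.randrange(N_CASES)] ^= 1  # flip one bit
    return child

population = [[random.randint(0, 1) for _ in range(N_CASES)] for _ in range(POP_SIZE)]
for generation in range(200):
    population.sort(key=fitness, reverse=True)
    survivors = population[: POP_SIZE // 2]            # keep the better half
    population = survivors + [mutate(a) for a in survivors]

print("best fitness:", max(fitness(a) for a in population))
# The gap this illustrates: selection only ever rewards the typical cases above,
# so nothing pushes the winners to handle very atypical modifications (like the
# near-omnipotent case) that never appear in the dataset.
```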
Can we come up with a learning procedure to have the AI learn embedded agency on its own? It seems plausible to me that we will need to understand embedded agency better to do this, but I don’t really know.
Btw, in another comment, you say
But usually when LessWrongers argue against “good enough” alignment, they’re arguing against alignment methods, saying that “nothing except proofs” will work, because only proofs give near-100% confidence.
I basically subscribe to the argument that nothing except proofs will work in the case of superintelligent agentic AI.
Re: embedded agency, while these are all potentially relevant points (especially self-modification), I don’t see any of them as the main reason to study embedded agents from an alignment standpoint. I see the main purpose of embedded agency research as talking about humans, not designing AIs—in particular, in order to point to human values, we need a coherent notion of what it means for an agenty system embedded in its environment (i.e. a human) to want things. As the linked post discusses, a lot of the issues with modelling humans as utility-maximizers or using proxies for our goals stem directly from more general embedded agency issues.