We’re assuming natural abstraction basically fails, so those AI systems will have fundamentally alien internal ontologies. For purposes of this overcompressed version of the argument, we’ll assume a very extreme failure of natural abstraction, such that human concepts cannot be faithfully and robustly translated into the system’s internal ontology at all.
For context, I’m familiar with this view from the ELK report. My understanding is that this is part of the “worst-case scenario” for alignment that ARC’s agenda is hoping to solve (or, at least, still hoped to solve a ~year ago).
To quote:
The paradigmatic example of an ontology mismatch is a deep change in our understanding of the physical world. For example, you might imagine humans who think about the world in terms of rigid bodies and Newtonian fluids and “complicated stuff we don’t quite understand,” while an AI thinks of the world in terms of atoms and the void. Or we might imagine humans who think in terms of the standard model of physics, while an AI understands reality as vibrations of strings. We think that this kind of deep physical mismatch is a useful mental picture, and it can be a fruitful source of simplified examples, but we don’t think it’s very likely.
We can also imagine a mismatch where AI systems use higher-level abstractions that humans lack, and are able to make predictions about observables without ever thinking about lower-level abstractions that are important to humans. For example we might imagine an AI making long-term predictions based on alien principles about memes and sociology that don’t even reference the preferences or beliefs of individual humans. Of course it is possible to translate those principles into predictions about individual humans, and indeed this AI ought to make good predictions about what individual humans say, but if the underlying ontology is very different we are at risk of learning the human simulator instead of the “real” mapping.
Overall we are by far most worried about deeply “messy” mismatches that can’t be cleanly described as higher- or lower-level abstractions, or even what a human would recognize as “abstractions” at all. We could try to tell abstract stories about what a messy mismatch might look like, or make arguments about why it may be plausible, but it seems easier to illustrate by thinking concretely about existing ML systems.
[It might involve heuristics about how to think that are intimately interwoven with object level beliefs, or dual ways of looking at familiar structures, or reasoning directly about a messy tapestry of correlations in a way that captures important regularities but lacks hierarchical structure. But most of our concern is with models that we just don’t have the language to talk about easily despite usefully reflecting reality. Our broader concern is that optimistic stories about the familiarity of AI cognition may be lacking in imagination. (We also consider those optimistic stories plausible, we just really don’t think we know enough to be confident.)]
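To make the quoted worry about learning “the human simulator instead of the ‘real’ mapping” concrete, here is a toy sketch of the ELK diamond-in-the-vault setup. This is my own illustration, not code from the report, and all the names (`predictor_latent`, `human_visible_obs`, etc.) are hypothetical: two reporters agree on every scenario a human labeler can check, and only diverge on the cases where the ontology mismatch actually matters.

```python
# Toy illustration of the ELK "human simulator" failure mode (hypothetical sketch).
# The predictor's internal state fully determines whether the diamond is safe;
# the human labeler only sees a camera feed that can be spoofed by a screen.

def predictor_latent(scenario):
    """Ground truth inside the predictor's ontology (not human-legible)."""
    return {"diamond_present": scenario["diamond_present"]}

def human_visible_obs(scenario):
    """What the human labeler can see: a camera feed that can be fooled."""
    seen = scenario["diamond_present"] or scenario["screen_in_front_of_camera"]
    return {"camera_shows_diamond": seen}

def direct_translator(scenario):
    # The reporter we want: answers by reading the predictor's latent state.
    return predictor_latent(scenario)["diamond_present"]

def human_simulator(scenario):
    # The reporter we fear: answers by predicting what a human would say
    # after looking at the observations.
    return human_visible_obs(scenario)["camera_shows_diamond"]

# Training data only contains scenarios the human labeler understands,
# so both reporters get a perfect score on it...
train = [{"diamond_present": True,  "screen_in_front_of_camera": False},
         {"diamond_present": False, "screen_in_front_of_camera": False}]
assert all(direct_translator(s) == human_simulator(s) for s in train)

# ...but they diverge exactly on the scenarios we care about.
test = {"diamond_present": False, "screen_in_front_of_camera": True}
print(direct_translator(test), human_simulator(test))  # False True
```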
So I understand the shape of the argument here.
… But I never got this vibe from Eliezer/MIRI. As I previously argued, I would say that their talk of different internal ontologies and alien thinking is mostly about different cognition. The argument is that AGIs won’t have “emotions”, or a System 1/System 2 split, or “motivations” the way we understand them – instead, they’d have a bunch of components that fulfill the same functions these components fulfill in humans, but split and recombined in a way that has no analogues in the human mind.
Hence, it would be difficult to make AGI agents “do what we mean” – but not necessarily because there’s no compact way to specify “what we mean” in the AGI’s ontology, but because we’d have no idea how to specify “do this” in terms of the program flows of the AGI’s cognition. Where are the emotions? Where are the goals? Where are the plans? We can identify the concept of “eudaimonia” here, but what the hell is this thought-process doing with it? Making plans about it? Refactoring it? Nothing? Is this even a thought process?
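As a hypothetical concrete analogue of “we can identify the concept, but have no idea what the thought process is doing with it”: interpretability tools like linear probes can often find a direction in a network’s activations that tracks some concept, yet that by itself says nothing about how the rest of the computation uses it. A minimal sketch on synthetic data, with everything made up:

```python
import numpy as np

rng = np.random.default_rng(0)

# Pretend these are hidden activations from some model, with a binary label
# for whether the input invokes the concept we care about (all synthetic).
acts = rng.normal(size=(1000, 64))
concept_direction = rng.normal(size=64)
labels = (acts @ concept_direction > 0).astype(float)

# A linear probe (plain least-squares here, for simplicity) recovers a
# direction that predicts the concept almost perfectly...
w, *_ = np.linalg.lstsq(acts, labels - 0.5, rcond=None)
preds = acts @ w > 0
print("probe accuracy:", (preds == labels.astype(bool)).mean())

# ...but "there is a direction correlated with the concept" says nothing
# about what the rest of the network does with it: planning around it,
# refactoring it, or ignoring it entirely.
```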
This view doesn’t make arguments about the AGI’s world-model specifically. It may or may not be the case that any embedded agent navigating our world would necessarily have nodes in its model approximately corresponding to “humans”, “diamonds”, and “the Golden Gate Bridge”. This view is simply cautioning against anthropomorphizing AGIs.
Roughly speaking, imagine that any mind could be split into a world-model and “everything else”: the planning module, the mesa-objective, the cached heuristics, et cetera. The MIRI view focuses on claiming that the “everything else” would be implemented in a deeply alien manner.
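For concreteness, here is a minimal sketch of an agent that does factor cleanly into a world-model and “everything else” (a goal, a planner, cached heuristics). This is my own toy example, not anyone’s proposed architecture; the worry is precisely that a trained AGI’s cognition need not admit any decomposition like this.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

@dataclass
class Agent:
    # The clean factoring humans like to imagine: a world-model plus
    # "everything else". A trained network need not be organized this way.
    world_model: Callable[[str, str], str]   # (state, action) -> next state
    goal: Callable[[str], float]             # state -> value
    cached_heuristics: Dict[str, str] = field(default_factory=dict)

    def plan(self, state: str, actions: List[str], depth: int = 2) -> str:
        if state in self.cached_heuristics:
            return self.cached_heuristics[state]

        def rollout(s: str, d: int) -> float:
            if d == 0:
                return self.goal(s)
            return max(rollout(self.world_model(s, a), d - 1) for a in actions)

        best = max(actions, key=lambda a: rollout(self.world_model(state, a), depth - 1))
        self.cached_heuristics[state] = best
        return best

# Toy usage: states are strings, and the "goal" is that longer is better.
agent = Agent(world_model=lambda s, a: s + a, goal=len)
print(agent.plan("start", ["x", "yy"]))  # "yy"
```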
The MIRI view may be agnostic regarding the Natural Abstraction Hypothesis as well, yes. The world-model might also be deeply alien, and the very idea of splitting an AGI’s cognition into a world-model and a planner might itself be an unrealistic artefact of our human thinking.
But even if the NAH is true, the core argument would still go through, in (my model of) the MIRI view.
And I’d say the-MIRI-view-conditioned-on-assuming-the-NAH-is-true would still have p(doom) at 90+%: because it’s not optimistic regarding anyone anywhere solving the natural-abstractions problem before the blind-tinkering approach of AGI labs kills everyone.
(I’d say this is an instance of an ontology mismatch between you and the MIRI view, actually. The NAH abstraction is core to your thinking, so you factor the disagreement through that lens. But the MIRI view doesn’t think in those precise terms!)