My current uncertainties regarding AI, alignment, and the end of the world
As I read the interview with Eliezer Yudkowsky on AI alignment problems, I had a couple of thoughts of my own. These are poorly researched, and maybe poorly formulated. I intend to think more about them, but I thought this might be a good place to post them for feedback. I’m basically using this post as a large interactive bookmark for “hey, these are the things you thought about, think about them some more” with the added benefit of other people commenting.
I feel like there’s a difference between “modeling” and “statistical recognition”, in the sense that current (and near-future) AI systems don’t necessarily model the world around them. I don’t yet know if this actually is a difference or if I’m inventing a dichotomy that doesn’t exist. Even if it is real, it’s still unclear to me whether, or in what way, it’s better that current AI systems are statistical recognizers instead of world-modelers. You’d think that to destroy a world, you first need to have a model of it, but that may not be the case.
There may be a sense in which generating text and maneuvering the real world are very different. There may be a sense in which successfully imitating human speech without a “model” or agency is possible.
There may be strongly binding constraints on an agent’s success in the world which do not depend on raw intelligence. Meaning, even if an agent has extremely high intelligence but lacks some other quality, its effective output in changing the world around it may not be as frightening as we might currently imagine. Imagine an extremely evil and extremely intelligent person who can effectively work one minute per week due to e.g. having no energy.
There may also be equally (or even more) strongly binding constraints that prevent even a superintelligent agent from achieving its goals, and which aren’t “defects” in the agent itself but properties of the universe. One such example is the speed of light. However intelligent you are, that’s a physical constraint that you just can’t surpass.
There may also be a sense in which AI systems would not self-improve further than required for what we want from them. Meaning, we may fulfill our needs (for which we design and produce AI systems) with a class of AI agents that stop receiving any sort of negative feedback at a certain level of proficiency or ability. I do not yet understand how training AI models works, but I suppose that some sort of feedback exists, and that these models change themselves / are changed according to the feedback. If there’s no longer any message “hey, this was bad, improve this”, then the system doesn’t improve further (I think).
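To make that intuition concrete for myself, here is a minimal toy sketch in Python. It assumes a made-up setup (a single number `w` nudged toward a `target`, with feedback only sent when the error exceeds a `tolerance`); the function name and all parameters are hypothetical, and real training procedures may well not behave like this.

```python
def train(target, tolerance, lr=0.1, steps=1000):
    """Nudge a single weight w toward target.

    Toy assumption: feedback ("this was bad") is only given while the
    error is larger than `tolerance`. Returns the final weight and the
    number of updates that actually happened.
    """
    w = 0.0
    updates = 0
    for _ in range(steps):
        error = w - target
        if abs(error) <= tolerance:
            # Good enough: no negative feedback, so nothing changes.
            continue
        # Gradient step on the squared error 0.5 * error**2.
        w -= lr * error
        updates += 1
    return w, updates


if __name__ == "__main__":
    w, updates = train(target=10.0, tolerance=0.5)
    print(f"final w = {w:.3f} after {updates} updates")
```

In this toy, the weight stops changing as soon as it lands inside the tolerance band, well short of the target itself, and running for more steps makes no difference. Whether anything like this holds for real systems is exactly the part I don’t know.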
It might be the case that we can get AGI-like performance without ambition to get a lot of resources, negating the instrumental convergence thesis. Maybe we build an AGI which can perform all the tasks that we give it, but the AGI performs all of these actions on the basis of statistical recognition of patterns in very specific domains, without integrating all this understanding into a coherent model of the world. Maybe this agent has very cordoned-off sections of its mind. It’s superintelligent in a bunch of different domains, including speech and, I don’t know, running industrial control systems, but it’s less like a generally intelligent human, and more like several savants integrated into one being.
If you give an AI agent some game to play, and this game includes maximizing resources, and has definite win-conditions, then the agent would beat humans, maximize resources etc. But it maximizes resources within the confines of that task. I feel like a qualitative difference in the type of mind is necessary for any agent to break out of the confines of the task. Namely, the agent must 1) model the world, then 2) figure out that it’s operating in a simulated world, and then 3) figure out how to break out of the simulated world.
There is an entire subfield of ML called model-based reinforcement learning.
Natural selection is an existence proof (minus anthropic effects) that you can produce world-altering agents without explicitly using models.
Well yes, which is why I’m less worried about GPT-3 than EfficientZero.
It is trivially true, and trivially false if you ask the AI adversarial questions that require AGI-completeness.
Sure, but one does not need to surpass the speed of light to destroy humanity.
Who is “we”? What is the mechanism by which any AI outside this class will be completely and permanently prevented from coming into existence? This is my criticism for the rest of the points as well. Your strategy for AI risk seems to be “Let’s not build the sort of AI that would destroy the world”, which fails at the first word: “Let’s”.
I don’t have a strategy, I’m basically just thinking out loud about a couple of specific points. Building a strategy for preventing that type of AI is important, but I don’t (yet?) have any ideas in that area.
Ok, perhaps I was too combative with the wording. My general point is: Don’t think of humanity as a coordinated agent, don’t think of “AGI” as a single tribe with particular properties (I frequently see this same mistake with regard to aliens), and in particular, don’t think that because a specific AI won’t be able or willing to destroy the world, the world is therefore saved in general.