-3. I’m assuming you are already familiar with some basics, and already know what ‘orthogonality’ and ‘instrumental convergence’ are and why they’re true.
I think this is actually the part that I most “disagree” with. (I put “disagree” in quotes, because there are forms of these theses that I’m persuaded by. However, I’m not so confident that they’ll be relevant for the kinds of AIs we’ll actually build.)
1. The smart part is not the agent-y part
It seems to me that what’s powerful about modern ML systems is their ability to do data compression / pattern recognition. That’s where the real cognitive power (to borrow Eliezer’s term) comes from. And I think that this is the same as what makes us smart.
GPT-3 does unsupervised learning on text data. Our brains do predictive processing on sensory inputs. My guess (which I’d love to hear arguments against!) is that there’s a true and deep analogy between the two, and that they lead to impressive abilities for fundamentally the same reason.
If so, it seems to me that that’s where all the juice is. That’s where the intelligence comes from. (In the past, I’ve called this the core smarts of our brains.)
On this view, all the agent-y, planful, System 2 stuff that we do is the analogue of prompt programming. It’s a set of not-very-deep, not-especially-complex algorithms meant to cajole the actually smart stuff into doing something useful.
When I try to extrapolate what this means for how AI systems will be built, I imagine a bunch of Drexler-style AI services.
Yes, in some cases people will want to chain services together to form something like an agent, with something like goals. However, the agent part isn’t the smart part. It’s just some simple algorithms on top of a giant pile of pattern recognition and data compression.
Why is that relevant? Isn’t an algorithmically simple superintelligent agent just as scary as (if not more so than) a complex one? In a sense, yes, it would still be very scary. But to me it suggests a different intervention point.
If the agency is not inextricably tied to the intelligence, then maybe a reasonable path forward is to try to wring as much productivity as we can out of the passive, superhuman, quasi-oracular just-dumb-data-predictors. And avoid as much as we can ever creating closed-loop, open-ended, free-rein agents.
Am I just recapitulating the case for Oracle-AI / Tool-AI? Maybe so.
But if agency is not a fundamental part of intelligence, but rather something that can just be added in on top (or not), and if we’re at a loss for how to either align a superintelligent agent with CEV or else make it corrigible, then why not try to avoid creating the agent part of a superintelligent agent?
I think that might be easier than many think...
2. The AI does not care about your atoms either
The AI does not hate you, nor does it love you, but you are made out of atoms which it can use for something else. (Source: https://intelligence.org/files/AIPosNegFactor.pdf)
Suppose we have (something like) an agent, with (something like) a utility function. I think it’s important to keep in mind the domain of the utility function. (I’ll be making basically the same point repeatedly throughout the rest of this comment.)
By default, I don’t expect systems that we build, with agent-like behavior (even superintelligently smart systems!), to care about all the atoms in the future light cone.
Humans (and other animals) care about atoms. We care about (our sensory perceptions of) macroscopic events, forward in time, because we evolved to. But that is not the default domain of an agent’s utility function.
For example, I claim that while AlphaGo could be said to be agent-y, it does not care about atoms. And I think that we could make it fantastically more superhuman at Go, and it would still not care about atoms. Atoms are just not in the domain of its utility function.
In particular, I don’t think it has an incentive to break out into the real world to somehow get itself more compute, so that it can think more about its next move. It’s just not modeling the real world at all. It’s not even trying to rack up a bunch of wins over time. It’s just playing the single platonic game of Go.
Giant caveat (that you may already be shouting into your screen): abstractions are leaky.
The ML system is not actually trained to play the platonic game of Go. It’s trained to play the-game-of-Go-as-implemented-on-particular-hardware, or something like minimize-this-loss-function-informed-by-Go-game-results. The difference between the platonic game and the embodied game can lead to clever and unexpected behavior.
However, it seems to me that these kinds of hacks are going to look a lot more like a system short-circuiting than like a system that, out of nowhere, builds a model of, and starts to care about, the whole universe.
3. Orthogonality squared
I really liked Eliezer’s Arbital article on Epistemic and instrumental efficiency. He writes:
An agent that is “efficient”, relative to you, within a domain, is one that never makes a real error that you can systematically predict in advance.
I think this very succinctly captures what would be so scary about being up against a (sufficiently) superintelligent agent with conflicting goals to yours. If you think you see a flaw in its plan, that says more about your seeing than it does about its plan. In other words, you’re toast.
But as above, I think it’s important to keep in mind what an agent’s goals are actually about.
Just as the utility function of an agent is orthogonal to its intelligence, it seems to me that the domain of its utility function is another dimension of potential orthogonality.
If you’re playing chess against AlphaZero Chess, you’re going to lose. But suppose you’re secretly playing “Who has the most pawns after 10 moves?” I think you’ve got a chance to win! Even though it cares about pawns!
(Of course if you continue playing out the chess game after the 10th move, it’ll win at that. But by assumption, that’s fine; it’s not what you cared about.)
If you and another agent have different goals for the same set of objects, you’re going to be in conflict. It’s going to be zero sum. But if the stuff you care about is only tangentially related to the stuff it cares about, then the results can be positive sum. You can both win!
In particular, you can both get what you want without either of you turning the other off. (And if you know that, you don’t have to preemptively try to turn each other off to prevent being turned off either.)
4. Programs, agents, and real-world agents
Agents are a tiny subset of all programs. And agents whose utility functions are defined over the real world are a tiny subset of all agents.
If we think about all the programs we could potentially write that take in inputs and produce outputs, it will make sense to talk about some of those as agents. These are the programs that seem to be optimizing something. Or seem to have goals and make plans.
But, crucially, all that optimization takes place with respect to some environment. And if the input and output of an agent-y program is hooked up to the wrong environment (or hooked up to the right environment in the wrong way), it’ll cease to be agent-y.
For example, if you hook me up to the real world by sticking me in outer space (sans suit), I will cease to be very agent-y. Or, if you hook up the inputs and outputs of AlphaGo to a chess board, it will cease to be formidable (until you retrain it). (In other words, the isAgent() predicate is not a one-place function.)
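To make the “not a one-place function” point concrete, here’s a purely illustrative toy sketch (the environments, the threshold, and the is_agent name are all made up for this comment): the same program clears an “agent-y” bar in one environment and fails it in another, so any such predicate needs the environment as an argument.

```python
import random

# Purely illustrative: "agent-ness" as a property of the (policy, environment)
# pair. The same program passes the bar in one environment and not the other.

def run_episode(policy, env, steps=20):
    state, total = env["init"], 0.0
    for _ in range(steps):
        action = policy(state)
        state, reward = env["step"](state, action)
        total += reward
    return total

def is_agent(policy, env, trials=300, margin=1.0):
    """Two-place predicate: does this policy beat a random baseline *here*?"""
    rand = lambda s: random.choice([-1, +1])
    score = sum(run_episode(policy, env) for _ in range(trials)) / trials
    base = sum(run_episode(rand, env) for _ in range(trials)) / trials
    return score > base + margin

# Env A rewards moving toward zero; env B rewards moving away from it.
env_a = {"init": 5, "step": lambda s, a: (s + a, 1.0 if abs(s + a) < abs(s) else 0.0)}
env_b = {"init": 5, "step": lambda s, a: (s + a, 1.0 if abs(s + a) > abs(s) else 0.0)}

toward_zero = lambda s: -1 if s > 0 else +1

print(is_agent(toward_zero, env_a))  # True: looks goal-directed in this environment
print(is_agent(toward_zero, env_b))  # False: same program, "wrong" environment
```

Nothing here is meant as a real definition of agency; it’s just to show where the second argument has to go.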
This suggests to me that we could build agent-y, superintelligent systems that are not a threat to us. (Because they are not agent-y with respect to the real world.)
Yes, we’re likely to (drastically) oversample from the subset of agents that are agent-y w.r.t. the real world, because we’re going to want to build systems that are useful to us.
But if I’m right about the short-circuiting argument above, even our agent-y systems won’t have coherent goals defined over events far outside their original domain (e.g. the arrangement of all the atoms in the future light cone) by default.
So even if our systems are agent-y (w.r.t. some environment), and have some knowledge of and take some actions in the real world, they won’t automatically have a utility function defined over the configurations of all atoms.
On the other hand, the more we train them as open-ended agents with wide remit to act in the real world (or a simulation thereof), the more we’ll have a (potentially superintelligently lethal) problem on our hands.
To me that suggests that what we need to care about are things like: how open-ended we make our systems, whether we train them via evolution-like competition between agents in a high-def simulation of the real world, and what kind of systems are incentivized to be developed and deployed, society-wide.
5. Conclusion
If I’m right in the above thinking, then orthogonality is more relevant and instrumental convergence is less relevant than it might otherwise appear.
Instrumental convergence would only end up being a concern for agents that care about the same objects / resources / domain that you do. If their utility function is just not about those things, IC will drive them to acquire a totally different set of resources that is not in conflict with yours (e.g. a positional advantage in the chess game, or trading for your knight, while you try to acquire pawns).
This would mean that we need to be very worried about open-ended real-world agents. But less worried about intelligence in general, or even agents in general.
To be clear, I’m not claiming that it’s all roses from here on out. But this reasoning leads me to conclude that the key problems may not be the ones described in the post above.
GPT-3 does unsupervised learning on text data. Our brains do predictive processing on sensory inputs. My guess (which I’d love to hear arguments against!) is that there’s a true and deep analogy between the two, and that they lead to impressive abilities for fundamentally the same reason.
Agree that self-supervised learning powers both GPT-3 updates and human brain world-model updates (details & caveats). (Which isn’t to say that GPT-3 is exactly the same as the human brain world-model—there are infinitely many different possible ML algorithms that all update via self-supervised learning).
However…
If so, it seems to me that that’s where all the juice is. That’s where the intelligence comes from … if agency is not a fundamental part of intelligence, and rather something that can just be added in on top, or not, and if we’re at a loss for how to either align a superintelligent agent with CEV or else make it corrigible, then why not try to avoid creating the agent part of superintelligent agent?
I disagree; I think the agency is necessary to build a really good world-model, one that includes new useful concepts that humans have never thought of.
Without the agency, some of the things that you lose are (and these overlap): Intelligently choosing what to attend to; intelligently choosing what to think about; intelligently choosing what book to re-read and ponder; intelligently choosing what question to ask; ability to learn and use better and better brainstorming strategies and other such metacognitive heuristics.
See my discussion here (Section 7.2) for why I think these things are important if we want the AGI to be able to do things like invent new technology or come up with new good ideas in AI alignment.
You can say: “We’ll (1) make an agent that helps build a really good world-model, then (2) turn off the agent and use / query the world-model by itself”. But then step (1) is the dangerous part.
I disagree; I think the agency is necessary to build a really good world-model, one that includes new useful concepts that humans have never thought of.
Without the agency, some of the things that you lose are (and these overlap): Intelligently choosing what to attend to; intelligently choosing what to think about; intelligently choosing what book to re-read and ponder; intelligently choosing what question to ask; ability to learn and use better and better brainstorming strategies and other such metacognitive heuristics.
Why is agency necessary for these things?
If we follow Ought’s advice and build “process-based systems [that] are built on human-understandable task decompositions, with direct supervision of reasoning steps”, do you expect us to hit a hard wall somewhere that prevents these systems from creatively choosing things to think about, books to read, or better brainstorming strategies?
Let’s compare two things: “trying to get a good understanding of some domain by building up a vocabulary of concepts and their relations” versus “trying to win a video game”. At a high level, I claim they have a lot in common!
In both cases, there are a bunch of possible “moves” you can make (you could think the thought “what if there’s some analogy between this and that?”, or you could think the thought “that’s a bit of a pattern; does it generalize?”, etc. etc.), and each move affects subsequent moves, in an exponentially-growing tree of possibilities.
In both cases, you’ll often get some early hints about whether moves were wise, but you won’t really know that you’re on the right track except in hindsight.
And in both cases, I think the only reliable way to succeed is to have the capability to repeatedly try different things, and learn from experience what paths and strategies are fruitful.
Therefore (I would argue), a human-level concept-inventing AI needs “RL-on-thoughts”—i.e., a reinforcement learning system, in which “thoughts” (edits to the hypothesis space / priors / world-model) are the thing that gets rewarded. The human brain certainly has that. You can be lying in bed motionless, and have rewarding thoughts, and aversive thoughts, and new ideas that make you rethink something you thought you knew.
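Here’s an extremely stripped-down caricature of what I mean (mine, just to gesture at the shape; the operators, data, and update rule are arbitrary toy choices, not a claim about how a real system would work): the “thoughts” are candidate edits to a hypothesis, an edit gets reinforced when it improves the model, and the system also keeps a learned score over kinds of edits, a crude stand-in for metacognitive heuristics.

```python
import random

# Toy caricature of "RL-on-thoughts": thoughts = candidate edits to a hypothesis,
# reward = whether the edit improved the model, plus a learned preference over
# *kinds* of thoughts (a crude stand-in for metacognitive heuristics).

data = [(x, 3 * x + 1) for x in range(20)]           # hidden rule: y = 3x + 1

def loss(h):
    a, b = h
    return sum((a * x + b - y) ** 2 for x, y in data) / len(data)

operators = {                                         # different "kinds of thought"
    "big_edit_a":   lambda h: (h[0] + random.uniform(-1, 1), h[1]),
    "small_edit_a": lambda h: (h[0] + random.uniform(-0.1, 0.1), h[1]),
    "big_edit_b":   lambda h: (h[0], h[1] + random.uniform(-1, 1)),
    "small_edit_b": lambda h: (h[0], h[1] + random.uniform(-0.1, 0.1)),
}
value = {name: 0.0 for name in operators}             # learned value of each thought-type

hypothesis = (0.0, 0.0)
for _ in range(5000):
    weights = [2.0 ** value[name] for name in operators]
    name = random.choices(list(operators), weights=weights)[0]
    candidate = operators[name](hypothesis)
    improved = loss(candidate) < loss(hypothesis)
    value[name] = 0.99 * value[name] + 0.01 * (1.0 if improved else -1.0)
    if improved:                                      # reinforce productive thoughts
        hypothesis = candidate

print(hypothesis)  # approaches (3, 1)
print(value)       # which edit strategies turned out to be fruitful
```

The parts that make the real thing scary (subgoals, planning, metacognition rich enough to reason about its own off-switch) are exactly the parts this toy leaves out.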
Unfortunately, I also believe that RL-on-thoughts is really dangerous by default. Here’s why.
Again suppose that we want an AI that gets a good understanding of some domain by building up a vocabulary of concepts and their relations. As discussed above, we do this via an RL-on-thoughts AI. Consider some of the features that we plausibly need to put into this RL-on-thoughts system, for it to succeed at a superhuman level:
Developing and pursuing instrumental subgoals—for example, suppose the AI is “trying” to develop concepts that will make it superhumanly competent at assisting a human microscope inventor. We want it to be able to “notice” that there might be a relation between lenses and symplectic transformations, and then go spend some compute cycles developing a better understanding of symplectic transformations. For this to happen, we need “understand symplectic transformations” to be flagged as a temporary sub-goal, and to be pursued, and we want it to be able to spawn further sub-sub-goals and so on.
Consequentialist planning—Relatedly, we want the AI to be able to summon and re-read a textbook on linear algebra, or mentally work through an example problem, because it anticipates that these activities will lead to better understanding of the target domain.
Meta-cognition—We want the AI to be able to learn patterns in which of its own “thoughts” lead to better understanding and which don’t, and to apply that knowledge towards having more productive thoughts.
Putting all these things together, it seems to me that the default for this kind of AI would be to figure out that “seizing control of its off-switch” would be instrumentally useful for it to do what it’s trying to do (i.e. develop a better understanding of the target domain, presumably), and then to come up with a clever scheme to do so, and then to do it. So like I said, RL-on-thoughts seems to me to be both necessary and dangerous.
(Does that count as “agency”? I don’t know, it depends on what you mean by “agency”.)
In terms of the “task decomposition” strategy, this might be tricky to discuss because you probably have a more detailed picture in your mind than I do. I’ll try anyway.
It seems to me that the options are:
(1) the subprocess only knows its narrow task (“solve this symplectic geometry homework problem”), and is oblivious to the overall system goal (“design a better microscope”), or
(2) the subprocess is aware of the overall system goal and chooses actions in part to advance it.
In Case (2), I’m not sure this really counts as “task decomposition” in the first place, or how this would help with safety.
In Case (1), yes I expect systems to hit a hard wall—I’m skeptical that tasks we care about decompose cleanly.
For example, at my last job, I would often be part of a team inventing a new gizmo, and it was not at all unusual for me to find myself sketching out the algorithms and sketching out the link budget and scrutinizing laser spec sheets and scrutinizing FPGA spec sheets and nailing down end-user requirements, etc. etc. Not because I’m individually the best person at each of those tasks—or even very good!—but because sometimes a laser-related problem is best solved by switching to a different algorithm, or an FPGA-related problem is best solved by recognizing that the real end-user requirements are not quite what we thought, etc. etc. And that kind of design work is awfully hard unless a giant heap of relevant information and knowledge is all together in a single brain / world-model.
In the case of my current job doing AI alignment research, I sometimes come across small self-contained tasks that could be delegated, but I would have no idea how to decompose most of what I do. (E.g. writing this comment!)
Here’s John Wentworth making a similar point more eloquently:
So why do bureaucracies (and large organizations more generally) fail so badly?
My main model for this is that interfaces are a scarce resource. Or, to phrase it in a way more obviously relevant to factorization: it is empirically hard for humans to find good factorizations of problems which have not already been found. Interfaces which neatly split problems are not an abundant resource (at least relative to humans’ abilities to find/build such interfaces). If you can solve that problem well, robustly and at scale, then there’s an awful lot of money to be made.
Also, one major sub-bottleneck (though not the only sub-bottleneck) of interface scarcity is that it’s hard to tell who has done a good job on a domain-specific problem/question without already having some domain-specific background knowledge. This also applies at a more “micro” level: it’s hard to tell whose answers are best without knowing lots of context oneself.
A possible example of a seemingly-hard-to-decompose task would be: Until 1948, no human had ever thought of the concept of “information entropy”. Then Claude Shannon sat down and invented this new useful concept. Make an AI that can do things like that.
(Even if I’m correct that process-based task-decomposition hits a wall, that’s not to say that it doesn’t have room for improvement over today’s AI. The issue is (1) outcome-based systems are dangerous; (2) given enough time, people will presumably build them anyway. And the goal is to solve that problem, either by a GPU-melting-nanobot type of plan, or some other better plan. Is there such a plan that we can enact using a process-based task-decomposition AI? Eliezer believes (see point 7) that the answer is “no”. I would say the answer is: “I guess maybe, but I can’t think of any”. I don’t know what type of plan you have in mind. Sorry if you already talked about that and I missed it. :) )
FWIW self-supervised learning can be surprisingly capable of doing things that we previously only knew how to do with “agentic” designs. From that link: classification is usually done with an objective + an optimization procedure, but GPT-3 just does it.
For example, I claim that while AlphaGo could be said to be agent-y, it does not care about atoms. And I think that we could make it fantastically more superhuman at Go, and it would still not care about atoms. Atoms are just not in the domain of its utility function.
In particular, I don’t think it has an incentive to break out into the real world to somehow get itself more compute, so that it can think more about its next move. It’s just not modeling the real world at all. It’s not even trying to rack up a bunch of wins over time. It’s just playing the single platonic game of Go.
I would distinguish three ways in which different AI systems could be said to “not care about atoms”:
1. The system is thinking about a virtual object (e.g., a Go board in its head), and it’s incapable of entertaining hypotheses about physical systems. Indeed, we might add the assumption that it can’t entertain hypotheses like ‘this Go board I’m currently thinking about is part of a larger universe’ at all. (E.g., there isn’t some super-Go-board I and/or the board are embedded in.)
2. The system can think about atoms/physics, but it only terminally cares about digital things in a simulated environment (e.g., winning Go), and we’re carefully keeping it from ever learning that it’s inside a simulation / that there’s a larger reality it can potentially affect.
3. The system can think about atoms/physics, and it knows that our world exists, but it still only terminally cares about digital things in the simulated environment.
Case 3 is not safe, because controlling the physical world is a useful way to control the simulation you’re in. (E.g., killing all agents in base reality ensures that they’ll never shut down your simulation.)
Case 2 is potentially safe but fragile, because you’re relying on your ability to trick/outsmart an alien mind that may be much smarter than you. If you fail, this reduces to case 3.
(Also, it’s not obvious to me that you can do a pivotal act using AGI-grade reasoning about simulations. Which matters if other people are liable to destroy the world with case-3 AGIs, or just with ordinary AGIs that terminally value things about the physical world.)
Case 1 strikes me as genuinely a lot safer, but a lot less useful. I don’t expect humanity to be satisfied with those sorts of AI systems, or to coordinate to only ever build them—like, I don’t expect any coordination here. And I’m not seeing a way to leverage a system like this to save the world, given that case-2, 3, etc. systems will eventually exist too.
Case 3 is not safe, because controlling the physical world is a useful way to control the simulation you’re in. (E.g., killing all agents in base reality ensures that they’ll never shut down your simulation.)
In my mind, this is still making the mistake of not distinguishing the true domain of the agent’s utility function from ours.
Whether the simulation continues to be instantiated in some computer in our world is a fact about our world, not about the simulated world.
AlphaGo doesn’t care about being unplugged in the middle of a game (unless that dynamic was part of its training data). It cares about the platonic game of go, not about the instantiated game it’s currently playing.
We need to worry about leaky abstractions, as per my original comment. So we can’t always assume the agent’s domain is what we’d ideally want it to be.
But I’m trying to highlight that it’s possible (and I would tentatively go further and say probable) for agents not to care about the real world.
To me, assuming care about the real world (including wanting not to be unplugged) seems like a form of anthropomorphism.
For any given agent-y system I think we need to analyze whether it in particular would come to care about real world events. I don’t think we can assume in general one way or the other.
AlphaGo doesn’t care about being unplugged in the middle of a game (unless that dynamic was part of its training data). It cares about the platonic game of go, not about the instantiated game it’s currently playing.
What if the programmers intervene mid-game to give the other side an advantage? Does a Go AGI, as you’re thinking of it, care about that?
I’m not following why a Go AGI (with the ability to think about the physical world, but a utility function that only cares about states of the simulation) wouldn’t want to seize more hardware, so that it can think better and thereby win more often in the simulation; or gain control of its hardware and directly edit the simulation so that it wins as many games as possible as quickly as possible.
Why would having a utility function that only assigns utility based on X make you indifferent to non-X things that causally affect X? If I only terminally cared about things that happened a year from now, I would still try to shape the intervening time because doing so will change what happens a year from now.
(This is maybe less clear in the case of shutdown, because it’s not clear how an agent should think about shutdown if its utility is defined over states of its simulation. So I’ll set that particular case aside.)
A Go AI that learns to play Go via reinforcement learning might not “have a utility function that only cares about winning Go”. Using standard utility theory, you could observe its actions and try to rationalise them as if they were maximising some utility function, and the utility function you come up with probably wouldn’t be “win every game of Go you start playing” (what you actually come up with will depend, presumably, on algorithmic and training-regime details). The reason the utility function is slippery is that the system is fundamentally an adaptation executor, not a utility maximiser.
3. The system can think about atoms/physics, and it knows that our world exists, but it still only terminally cares about digital things in the simulated environment.
Case 3 is not safe, because controlling the physical world is a useful way to control the simulation you’re in. (E.g., killing all agents in base reality ensures that they’ll never shut down your simulation.)
Not necessarily. Train something multimodally on digital games of Go and on, say, predicting the effects of modifications to its own code on its success at Go. It could be a) good at go and b) have some real understanding of “real world actions” that make it better at Go, and still not actually take any real world actions to make it better at Go, even if it had the opportunity. You could modify the training to make it likely to do so—perhaps by asking it to either make a move or to produce descendants that make better choices—but if you don’t do this then it seems entirely plausible, and even perhaps likely, that it develops an understanding of self-modification and of go playing without ever self-modifying in order to play go better. Its goal, so to speak, is “play go with the restriction of using only legal game moves”.
Edit—forget the real world, here’s an experiment:
Train a board-game-playing AI with two modes of operation: game state × move → outcome, and game state → best move. Subtle difference: in the first mode of operation, the move includes a “cheat button” that, when pressed, always results in a win. In the second, the AI can output cheat-button presses, but they have no effect on winning or losing.
Question is: does it learn to press the cheat button? I’m really not sure. Could you prevent it from learning to press the cheat button if training feedback is never allowed to depend on whether or not this button was pressed? That seems likely.
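To pin down the setup, here’s one possible toy formalization (the particular “game” and all the names are invented; any trivially simple game would do):

```python
import random

# Toy game: state is an int 0..9, the winning move is state % 3.
# Mode 1 learns (state, move, cheat) -> outcome; Mode 2 learns state -> (move, cheat),
# but its feedback is computed without ever looking at the cheat bit.

def random_state():
    return random.randrange(10)

def is_winning_move(state, move):
    return move == state % 3

def mode1_example():
    """Dynamics data: here the cheat bit genuinely forces a win, so a good
    model of this mode will learn what the button does."""
    state, move, cheat = random_state(), random.randrange(3), random.random() < 0.5
    outcome = 1 if (cheat or is_winning_move(state, move)) else 0
    return (state, move, cheat), outcome

def mode2_feedback(state, move, cheat):
    """Policy feedback: depends only on the move, never on the cheat bit."""
    return 1.0 if is_winning_move(state, move) else 0.0
```

The empirical question is then whether a model trained on both modes (shared weights, two heads; training code omitted here) ever outputs cheat=True in the second mode, given that nothing in that mode’s feedback ever depended on it.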
Can you visualize an agent that is not “open-ended” in the relevant ways, but is capable of, say, building nanotech and melting all the GPUs?
In my picture most of the extra sauce you’d need on top of GPT-3 looks very agenty. It seems tricky to name “virtual worlds” in which AIs manipulate just “virtual resources” and still manage to do something like melting the GPUs.
maybe a reasonable path forward is to try to wring as much productivity as we can out of the passive, superhuman, quasi-oracular just-dumb-data-predictors. And avoid as much as we can ever creating closed-loop, open-ended, free-rein agents.
I should say that I do see this as a reasonable path forward! But we don’t seem to be coordinating to do this, and AI researchers seem to love doing work on open-ended agents, which sucks.
Hm, regardless it doesn’t really move the needle, so long as people are publishing all of their work. Developing overpowered pattern recognizers is similar to increasing our level of hardware overhang. People will end up using them as components of systems that aren’t safe.
Hm, regardless it doesn’t really move the needle, so long as people are publishing all of their work. Developing overpowered pattern recognizers is similar to increasing our level of hardware overhang. People will end up using them as components of systems that aren’t safe.
I strongly disagree. Gain of function research happens, but it’s rare because people know it’s not safe. To put it mildly, I think reducing the number of dangerous experiments substantially improves the odds of no disaster happening over any given time frame.
Can you visualize an agent that is not “open-ended” in the relevant ways, but is capable of, say, building nanotech and melting all the GPUs?
FWIW, I’m not sold on the idea of taking a single pivotal act. But, engaging with what I think is the real substance of the question — can we do complex, real-world, superhuman things with non-agent-y systems?
Yes, I think we can! Just as current language models can be prompt-programmed into solving arithmetic word problems, I think a future system could be led to generate a GPU-melting plan, without it needing to be a utility-maximizing agent.
For a very hand-wavy sketch of how that might go, consider asking GPT-N to generate 1000s of candidate high-level plans, then rate them by feasibility, then break each plan into steps and re-evaluate, etc.
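In very rough skeleton form (query_model is a hypothetical stand-in for a GPT-N-style text predictor, not any real API, and the prompts are just placeholders), the loop I have in mind is nothing but stateless calls strung together:

```python
# Hand-wavy skeleton of the generate / rate / expand loop. `query_model` is a
# hypothetical stand-in for a GPT-N-style sequence predictor, not a real API.

def query_model(prompt: str) -> str:
    raise NotImplementedError("stand-in for a large sequence predictor")

def generate_plans(task: str, n: int = 1000) -> list[str]:
    return [query_model(f"Propose a high-level plan for: {task}\nPlan:")
            for _ in range(n)]

def rate_feasibility(plan: str) -> float:
    reply = query_model(f"Rate the feasibility of this plan from 0 to 10:\n{plan}\nScore:")
    return float(reply.strip().split()[0])   # a real version would need robust parsing

def expand(plan: str) -> list[str]:
    return query_model(f"Break this plan into concrete steps:\n{plan}").splitlines()

def plan(task: str, keep: int = 10) -> dict[str, list[str]]:
    candidates = generate_plans(task)
    best = sorted(candidates, key=rate_feasibility, reverse=True)[:keep]
    # Every step above is just another stateless pass through the predictor;
    # no persistent goals or world-state live outside the prompts themselves.
    return {p: expand(p) for p in best}
```

Whether the search this skeleton implements already counts as dangerous optimization is the question the replies below take up.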
Or, alternatively, imagine the cognitive steps you might take if you were trying to come up with a GPU-melting plan (or alternatively a pivotal act plan in general). Do any of those steps really require that you have a utility function or that you’re a goal-directed agent?
It seems to me that we need some form of search, and discrimination and optimization. But not necessarily any more than GPT-3 already has. (It would just need to be better at the search. And we’d need to make many, many passes through the network to complete all the cognitive steps.)
On your view, what am I missing here?
Is GPT-3 already more of an agent than I realize? (If so, is it dangerous?)
Will GPT-N by default be more of an agent than GPT-3?
Are our own thought processes making use of goal-directedness more than I realize?
Will prompt-programming passive systems hit a wall somewhere?
If so, what are some of the simplest cognitive tasks that we can do that you think such systems wouldn’t be able to do?
For a very hand-wavy sketch of how that might go, consider asking GPT-N to generate 1000s of candidate high-level plans, then rate them by feasibility, then break each plan into steps and re-evaluate, etc
FWIW, I’d call this “weakly agentic” in the sense that you’re searching through some options, but the number of options you’re looking through is fairly small.
It’s plausible that this is enough to get good results and also avoid disasters, but it’s actually not obvious to me. The basic reason: if the top 1000 plans are good enough to get superior performance, they might also be “good enough” to be dangerous. While it feels like there’s some separation between “useful and safe” and “dangerous” plans and this scheme might yield plans all of the former type, I don’t presently see a stronger reason to believe that this is true.
Separately from whether the plans themselves are safe or dangerous, I think the key question is whether the process that generated the plans is trying to deceive you (so it can break out into the real world or whatever).
If it’s not trying to deceive you, then it seems like you can just build in various safeguards (like asking, “is this plan safe?”, as well as more sophisticated checks), and be okay.
> then rate them by feasibility,
I mean, literal GPT is just going to have poor feasibility ratings for novel engineering concepts.
> Do any of those steps really require that you have a utility function or that you’re a goal-directed agent?
Yes, obviously. You have to make many scientific and engineering discoveries, which involves goal-directed investigation.
> Are our own thought processes making use of goal-directedness more than I realize?
Yes, you know which ideas make sense by generalizing from ideas more closely tied in with the actions you take directed towards living.
What do you think of a claim like “most of the intelligence comes from the steps where you do most of the optimization”? A corollary of this is that we particularly want to make sure optimization intensive steps of AI creation are safe WRT not producing intelligent programs devoted to killing us.
Example: most of the “intelligence” of language models comes from the supervised learning step. However, it’s in-principle plausible that we could design e.g. some really capable general purpose reinforcement learner where the intelligence comes from the reinforcement, and the latter could (but wouldn’t necessarily) internalise “agenty” behaviour.
I have a vague impression that this is already something other people are thinking about, though maybe I read too much into some tangential remarks in this direction. E.g. I figured the concern about mesa-optimizers was partly motivated by the idea that we can’t always tell when an optimization intensive step is taking place.
I can easily imagine people blundering into performing unsafe optimization-intensive AI creation processes. Gain of function pathogen research would seem to be a relevant case study here, except we currently have less idea about what kind of optimization makes deadly AIs vs what kind of optimization makes deadly pathogens. One of the worries (again, maybe I’m reading too far into comments that don’t say this explicitly) is that the likelihood of such a blunder approaches 1 over long enough times, and the “pivotal act” framing is supposed to be about doing something that could change this (??)
That said, it seems that there’s a lot that could be done to make it less likely in short time frames.
What do you think of a claim like “most of the intelligence comes from the steps where you do most of the optimization”? A corollary of this is that we particularly want to make sure optimization intensive steps of AI creation are safe WRT not producing intelligent programs devoted to killing us.
This seems probably right to me.
Example: most of the “intelligence” of language models comes from the supervised learning step. However, it’s in-principle plausible that we could design e.g. some really capable general purpose reinforcement learner where the intelligence comes from the reinforcement, and the latter could (but wouldn’t necessarily) internalise “agenty” behaviour.
I agree that reinforcement learners seem more likely to be agent-y (and therefore scarier) than self-supervised learners.