Will humans build goal-directed agents?
In the previous post, I argued that simply knowing that an AI system is superintelligent does not imply that it must be goal-directed. However, there are many other arguments that suggest that AI systems will or should be goal-directed, which I will discuss in this post.
Note that I don’t think of this as the Tool AI vs. Agent AI argument: it seems possible to build agent AI systems that are not goal-directed. For example, imitation learning allows you to create an agent that behaves similarly to another agent—I would classify this as “Agent AI that is not goal-directed”. (But see this comment thread for discussion.)
Note that these arguments have different implications than the argument that superintelligent AI must be goal-directed due to coherence arguments. Suppose you believe all of the following:
Any of the arguments in this post.
Superintelligent AI is not required to be goal-directed, as I argued in the last post.
Goal-directed agents cause catastrophe by default.
Then you could try to create alternative designs for AI systems such that they can do the things that goal-directed agents can do without themselves being goal-directed. You could also try to persuade AI researchers of these facts, so that they don’t build goal-directed systems.
Economic efficiency: goal-directed humans
Humans want to build powerful AI systems in order to help them achieve their goals—it seems quite clear that humans are at least partially goal-directed. As a result, it seems natural that they would build AI systems that are also goal-directed.
This is really an argument that the system comprising the human and AI agent should be directed towards some goal. The AI agent by itself need not be goal-directed as long as we get goal-directed behavior when combined with a human operator. However, in the situation where the AI agent is much more intelligent than the human, it is probably best to delegate most or all decisions to the agent, and so the agent could still look mostly goal-directed.
Even so, you could imagine that even the small part of the work that the human continues to do allows the agent to not be goal-directed, especially over long horizons. For example, perhaps the human decides what the agent should do each day, and the agent executes the instruction, which involves planning over the course of a day, but no longer. (I am not arguing that this is safe; on the contrary, having very powerful optimization over the course of a day seems probably unsafe.) This could be extremely powerful without the AI being goal-directed over the long term.
Another example would be a corrigible agent, which could be extremely powerful while not being goal-directed over the long term. (Though the meanings of “goal-directed” and “corrigible” are sufficiently fuzzy that this is not obvious and depends on the definitions we settle on for each.)
Economic efficiency: beyond human performance
Another benefit of goal-directed behavior is that it allows us to find novel ways of achieving our goals that we may not have thought of, such as AlphaGo’s move 37. Goal-directed behavior is one of the few methods we know of that allow AI systems to exceed human performance.
I think this is a good argument for goal-directed behavior, but given the problems of goal-directed behavior I think it’s worth searching for alternatives, such as the two examples in the previous section (optimizing over a day, and corrigibility). Alternatively, we could learn human reasoning, and execute it for a longer subjective time than humans would, in order to make better decisions. Or we could have systems that remain uncertain about the goal and clarify what they should do when there are multiple very different options (though this has its own problems).
Current progress in reinforcement learning
If we had to guess today which paradigm would lead to AI systems that can exceed human performance, I would guess reinforcement learning (RL). In RL, we have a reward function and we seek to choose actions that maximize the sum of expected discounted rewards. This sounds a lot like an agent that is searching over actions for the best one according to a measure of goodness (the reward function [1]), which I said previously is a goal-directed agent. And the math behind RL says that the agent should be trying to maximize its reward for the rest of time, which makes it long-term [2].
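To make the objective concrete, here is a minimal sketch of the quantity being maximized along a single trajectory. The function name and the example rewards are made up for illustration. Real RL algorithms maximize the expectation of this quantity over trajectories rather than computing it for one fixed list of rewards.

```python
def discounted_return(rewards, gamma):
    """Return sum over t of gamma**t * rewards[t] for a finite reward list."""
    total = 0.0
    # Work backwards so each step applies one extra factor of gamma
    # to everything that comes after it.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


# Four steps of reward 1 with gamma = 0.5: 1 + 0.5 + 0.25 + 0.125
print(discounted_return([1.0, 1.0, 1.0, 1.0], 0.5))  # 1.875
```

Nothing in this sum refers to an episode boundary, which is why the math on its own reads as long-term.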
That said, current RL agents learn to replay behavior that in their past experience worked well, and typically do not generalize outside of the training distribution. This does not seem like a search over actions to find ones that are the best. In particular, you shouldn’t expect a treacherous turn, since the whole point of a treacherous turn is that you don’t see it coming because it never happened before.
In addition, current RL is episodic, so we should only expect that RL agents are goal-directed over the current episode and not in the long-term. Of course, many tasks would have very long episodes, such as being a CEO. The vanilla deep RL approach here would be to specify a reward function for how good a CEO you are, and then try many different ways of being a CEO and learn from experience. This requires you to collect many full episodes of being a CEO, which would be extremely time-consuming.
Perhaps with enough advances in model-based deep RL we could train the model on partial trajectories and that would be enough, since it could generalize to full trajectories. I think this is a tenable position, though I personally don’t expect it to work since it relies on our model generalizing well, which seems unlikely even with future research.
These arguments lead me to believe that we’ll probably have to do something that is not vanilla deep RL in order to train an AI system that can be a CEO, and that thing may not be goal-directed.
Overall, it is certainly possible that improved RL agents will look like dangerous long-term goal-directed agents, but this does not seem to be the case today and there seem to be serious difficulties in scaling current algorithms to superintelligent AI systems that can optimize over the long term. (I’m not arguing for long timelines here, since I wouldn’t be surprised if we figured out some way that wasn’t vanilla deep RL to optimize over the long term, but that method need not be goal-directed.)
Existing intelligent agents are goal-directed
So far, humans and perhaps animals are the only examples of generally intelligent agents that we know of, and they seem to be quite goal-directed. This is some evidence that we should expect intelligent agents that we build to also be goal-directed.
Ultimately we are observing a correlation between two things with sample size 1, which is really not much evidence at all. If you believe that many animals are also intelligent and goal-directed, then perhaps the sample size is larger, since there are intelligent animals with very different evolutionary histories and neural architectures (e.g. octopuses).
However, this is specifically about agents that were created by evolution, which did a relatively stupid blind search over a large space, and we could use a different method to develop AI systems. So this argument makes me more wary of creating AI systems using evolutionary searches over large spaces, but it doesn’t make me much more confident that all good AI systems must be goal-directed.
Interpretability
Another argument for building a goal-directed agent is that it allows us to predict what it’s going to do in novel circumstances. While you may not be able to predict the specific actions it will take, you can predict some features of the final world state, in the same way that if I were to play Magnus Carlsen at chess, I can’t predict how he will play, but I can predict that he will win.
I do not understand the intent behind this argument. It seems as though faced with the negative results that suggest that goal-directed behavior tends to cause catastrophic outcomes, we’re arguing that it’s a good idea to build a goal-directed agent so that we can more easily predict that it’s going to cause catastrophe.
I also think that we would typically be able to predict significantly more about what any AI system we actually build will do than we could by modeling it as trying to achieve some goal. This is because “agent seeking a particular goal” is one of the simplest models we can build, and with any system we have more information on, we start refining the model to make it better.
Summary
Overall, I think there are good reasons to think that “by default” we would develop goal-directed AI systems, because the things we want AIs to do can be easily phrased as goals, and because the stated goal of reinforcement learning is to build goal-directed agents (although they do not look like goal-directed agents today). As a result, it seems important to figure out ways to get the powerful capabilities of goal-directed agents through agents that are not themselves goal-directed. In particular, this suggests that we will need to figure out ways to build AI systems that do not involve specifying a utility function that the AI should optimize, or even learning a utility function that the AI then optimizes.
[1] Technically, actions are chosen according to the Q function, but the distinction isn’t important here.
[2] Discounting does cause us to prioritize short-term rewards over long-term ones. On the other hand, discounting seems mostly like a hack to make the math not spit out infinities, and so that learning is more stable. On the third hand, infinite horizon MDPs with undiscounted reward aren’t solvable unless you almost surely enter an absorbing state. So discounting complicates the picture, but not in a particularly interesting way, and I don’t want to rest an argument against long-term goal-directed behavior on the presence of discounting.
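As a small worked illustration of the “infinities” point, assume every reward is bounded in magnitude by some constant R_max (that bound is my assumption for the example, not something stated above). Then the discounted sum is finite:

```latex
\left|\sum_{t=0}^{\infty} \gamma^{t} r_t\right|
  \le \sum_{t=0}^{\infty} \gamma^{t} R_{\max}
  = \frac{R_{\max}}{1-\gamma},
  \qquad 0 \le \gamma < 1 .
```

With γ = 1 the bound no longer applies: a constant reward of 1 per step already sums to infinity.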
I’m not very convinced by this example, or alternatively I’m not getting the distinction you’re drawing between “agent” and “goal-directed”. Suppose the agent you’re trying to imitate is itself goal-directed. In order for the imitator to generalize beyond its training distribution, it seemingly has to learn to become goal-directed (i.e., perform the same sort of computations that a goal-directed agent would). I don’t see how else it can predict what the goal-directed agent would do in a novel situation. If the imitator is not able to generalize, then it seems more tool-like than agent-like. On the other hand, if the imitatee is not goal-directed… I guess the agent could imitate humans and be not entirely goal-directed to the extent that humans are not entirely goal-directed. (Is this the point you’re trying to make, or are you saying that an imitation of a goal-directed agent would constitute a non-goal-directed agent?)
Your post reminded me of Paul Christiano’s approval-directed agents which was also about trying to find an alternative to goal-directed agents. Looking at it again, it actually sounds a lot like applying imitation learning to humans (except imitating a speeded-up human):
Can approval-directed agents be considered a form of imitation learning, and if not, are there any safety-relevant differences between imitation learning of (speeded-up) humans, and approval-directed agents?
I definitely endorse this point, think that it’s an important aspect, and that it alone justifies the claim that I was making about non-goal-directed Agent AI being possible.
That said, I do have an intuition that agents whose goal-directedness comes from other agents shouldn’t be considered goal-directed, at least if it happens in a particular way. Let’s say that I’m pursuing goal X, and my assistant AI agent is also pursuing goal X as a result. If I then start to pursue goal Y, and my AI agent also starts pursuing Y because it is aligned with me, then it feels like the AI was not really directed at goal X, but more directed at “whatever goal Rohin has”, and this feels distinctly less goal-directed to me. (In particular, my AI agent would not have all of the convergent instrumental subgoals in this setting, so it is really different in kind from an AI agent that was simply pursuing X to the best of its ability.)
“Goal-directed” may not be the right word to capture the property I’m thinking about. It might be something like “thing that pursues the standard convergent instrumental subgoals”, or “thing that pursues a goal that is not defined in terms of someone else’s goal”.
Yeah, that idea was a big influence on the views that caused me to write this post.
It’s not exactly the same, but it is very similar. You could think of approval-direction as imitation of a particular weird kind of human, who deliberates for a while before choosing any action.
They feel different enough to me that there probably are safety-relevant differences, but I don’t know of any off the top of my head. Initially I was going to say that myopia was a safety-relevant difference, but thinking about it more I don’t think that’s an actual difference. Approval-directed agents are more explicitly myopic, but I think imitation learning could be myopic in the same way.
Btw, this post also views Paul’s agenda through the lens of constructing imitations of humans.
What causes the agent to switch from X to Y?
Are you thinking of the “agent” as A) the product of the demonstrations and training (e.g. the resulting neural network), or as B) a system that includes both the trained agent and also the training process itself (and facilities for continual online learning)?
I would assume A by default, but then I would expect that if you trained such an agent with imitation learning while pursuing goal X, you’d likely get an agent that continues to pursue goal X even after you’ve switched to pursuing goal Y. (Unless the agent also learned to imitate whatever the decision-making process was that led you to switch from X to Y, in which case the agent seems non-goal-directed only insofar as you decided to switch from X to Y for non-goal-related reasons rather than in service of some higher level goal Ω. Is that what you want?)
I was imagining something more like B for the imitation learning case.
That analysis seems right to me.
With respect to whether it is what I want, I wouldn’t say that I want any of these things in particular, I’m more pointing at the existence of systems that aren’t goal-directed, yet behave like an agent.
Would you agree that a B-type agent would be basically as goal-directed as a human (because it exhibits goal-directed behavior when the human does, and doesn’t when the human doesn’t)?
In which case, would it be fair to summarize (part of) your argument as:
1) Many of the potential problems with building safe superintelligent systems come from them being too goal-directed.
2) An agent that is only as goal-directed as a human is much less susceptible to many of these failure modes.
3) It is likely possible to build superintelligent systems that are only as goal-directed as humans.
?
I don’t think so. Maybe this would be true if you had a perfect imitation of a human, but in practice you’ll be uncertain about what the human is going to do. If you’re uncertain in this way, and you are getting your goals from a human, then you don’t do all of the instrumental subgoals. (See The Off-Switch Game for a simple analysis showing that you can avoid the survival incentive.)
It may be that “goal-directed” is the wrong word for the property I’m talking about, but I’m predicting that agents of this form are less susceptible to convergent instrumental subgoals than humans are.
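Here is a rough sketch of the kind of calculation behind that result. It is a simplified version of the setup, assuming a perfectly rational human who knows the true utility of the proposed action and only lets it proceed when that utility is positive. The function name, the belief format, and the numbers are all made up for illustration.

```python
def off_switch_values(belief):
    """Expected value of each robot option.

    belief: list of (probability, utility) pairs describing the robot's
    uncertainty about how good its proposed action is for the human.
    """
    # Act immediately, bypassing the human: get the utility whatever it is.
    act = sum(p * u for p, u in belief)
    # Switch itself off: nothing happens, utility 0.
    switch_off = 0.0
    # Defer: the (assumed rational) human allows the action only when u > 0,
    # and otherwise switches the robot off for utility 0.
    defer = sum(p * max(u, 0.0) for p, u in belief)
    return {"act": act, "switch_off": switch_off, "defer": defer}


values = off_switch_values([(0.5, 10.0), (0.5, -8.0)])
print(values)  # {'act': 1.0, 'switch_off': 0.0, 'defer': 5.0}
```

Under these assumptions, deferring is never worse than the better of the other two options, and it is strictly better whenever the robot is genuinely unsure about the sign of the utility. If the belief puts all of its mass on a single utility value, the advantage disappears.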
To clarify, you do do the human’s instrumental sub-goals though, just not extra ones for yourself, right?
If you’ve seen the human acquire resources, then you’ll acquire resources in the same way.
If there’s now some new resource that you’ve never seen before, you may acquire it if you’re sufficiently confident that the human would, but otherwise you might try to gather more evidence to see what the human would do. This is assuming that we have some way of doing imitation learning that allows the resulting system to have uncertainty that it can resolve by watching the human, or asking the human. If you imagine the exact way that we do imitation learning today, it would extrapolate somehow in a way that isn’t actually what the human would do. Maybe it acquires the new resource, maybe it leaves it alone, maybe it burns it to prevent anyone from having it, who knows.
Right, so I think I wasn’t really making a new observation, but just clearing up a confusion on my own part, where for a long time I didn’t understand how the idea of approval-directed agency fits into IDA because people switched from talking about approval-directed agency to imitation learning (or were talking about them interchangeably) and I didn’t catch the connection. So at this point I understand Paul’s trajectory of views as follows:
goal-directed agent ⇒ approval-directed agent ⇒ use IDA to scale up approval-directed agent ⇒ approval-directed agency as a form of imitation learning / generalize to other forms of imitation learning ⇒ generalize IDA to safely scale up other (including more goal-directed / consequentialist) forms of ML (see An Unaligned Benchmark which I think represents his current views)
(Someone please chime in if this still seems wrong or confused.)
It looks like imitation learning isn’t one thing but a fairly broad category in ML which even includes IRL. But if we compare approval direction to the narrower kinds of imitation learning, approval direction seems a lot riskier because you’re optimizing over an estimation of human approval, which seems to be an adversarial process that could easily trigger safety problems in both the ground-truth human approval as well as in the estimation process. I wonder when you wrote the OP, which form of imitation learning did you have in mind?
ETA: From this comment it looks like you were thinking of an online version of narrow imitation learning. Might be good to clarify that in the post?
But if there are safety problems in approval, wouldn’t there also be safety problems in the human’s behavior, which imitation learning would copy?
Similarly, if there are safety problems in the estimation process, wouldn’t there also be safety problems in the prediction of what action a human would take?
I somewhat think that it applies to most imitation learning, not just the online variant of narrow imitation learning, but I am pretty confused/unsure. I’ll add a pointer to this discussion to the post.
The human’s behavior could be safer because a human mind doesn’t optimize so much as to move outside of the range of inputs where approval is safe, or it has a “proposal generator” that only generates possible actions that with high probability stay within that range.
Same here, if you just predict what action a human would take, you’re less likely to optimize so much that you likely end up outside of where the estimation process is safe.
Ok, I’d be interested to hear more if you clarify your thoughts.
I found an old comment from Paul that answers this:
It seems like approval direction allows for creative actions that the human operator approves of but would not have thought of doing themselves. Not sure if imitation learning does this.
That’s a good question. It looks like imitation learning actually covers a number of ML techniques (see this) none of which exactly matches approval-directed agents. But the category seems broad enough that I think approval-directed agents can be considered to be a form of imitation learning. In particular, IRL is considered a form of imitation learning and IRL would also be able to perform actions that the human would not have thought of doing themselves.
^ Yes to all of this.
A little bit of nuance: IRL is considered to be a form of imitation learning because in many cases the inferred reward in IRL is only meant to reproduce the human’s performance and isn’t expected to generalize outside of the training distribution.
There are versions of IRL which are meant to go beyond imitation. For example, adversarial IRL was trying to infer a reward that would generalize to new environments, in which case it would be doing something more than imitation.
I’m not sure these are the points Rohin was trying to make, but there seem to be at least two important points here:
Imitation learning applied to humans produces goal-directed behavior only insofar as humans are goal-directed.
Imitation learning applied to humans produces agents no more capable than humans. (I think IDA goes beyond this by adding amplification steps, which are separate. And IRL goes beyond this by trying to correct “errors” that the humans make.)
Regarding the second point, there’s a safety-relevant sense in which a human-imitating agent is less goal-directed than the human: if you scale the human’s capabilities, the human will become better at achieving its personal objectives. By contrast, if you scale the imitator’s capabilities, it’s only supposed to become even better at imitating the unscaled human.
I will list, just for my own understanding, the non-goal-oriented types of agents.
1. Universal library. This is an agent which creates all significant solutions to all possible significant problems and then stops. An example is past biological evolution, which invented an enormous number of adaptations (flying solutions, proteins, etc.) that could be used as inspiration for technological progress. Past human history, or some unconscious processes in the brain such as dreaming, may be other possible examples.
2. Human-mimicking neural net—this is an example of an agent which is mimicking another agent.
3. Obviously, AI Oracles and AI Tools.
4. “Homeostatic” superintelligence. An example of such a system is an OS like Windows, which doesn’t do anything in a goal-directed sense, but just supports processes. Most nation states also work in this way (except ideologically driven ones like the USSR or Iran).
5. Drexler’s superintelligence as a sum of narrow services, e.g. Google’s web services.
6. Swarm intelligences which compete to solve a task. If one creates a prize for X, many people will compete to get it. The whole swarm is not a goal-oriented agent, while its elements are such agents. Scott’s Moloch is a bad example of such swarm behaviour.
Thanks for doing this—it’s helpful for me as well. I have some questions/quibbles:
Isn’t #2 as goal-directed as the human it mimics, in all the relevant ways? If I learn that a certain machine runs a neural net that mimics Hitler, shouldn’t I worry that it will try to take over the world? Maybe I don’t get what you mean by “mimics.”
What exactly is the difference between an Oracle and a Tool? I thought an Oracle was a kind of Tool; I thought Tool was a catch-all category for everything that’s not a Sovereign or a Genie.
I’m skeptical of this notion of “homeostatic” superintelligence. It seems to me that nations like the USA are fully goal-directed in the relevant senses; they exhibit the basic AI drives, they are capable of things like the treacherous turn, etc. As for Windows, how is it an agent at all? What does it do? Allocate memory resources across currently-being-run programs? How does it do that—is there an explicit function that it follows to do the allocation (e.g. give all programs equal resources), or does it do something like consequentialist reasoning?
On #6, it seems to me that it might actually be correct to say that the swarm is an agent—it’s just that the swarm has different goals than each of its individual members. Maybe Moloch is an agent after all! On the other hand, something seems not quite right about this—what is Moloch’s utility function? Whatever it is, Moloch seems particularly uninterested in self-preservation, which makes it hard to think of it as an agent with normal-ish goals. (Argument: Suppose someone were to initiate a project that would, with high probability, kill Moloch forever in 100 years time. Suppose the project has no other effects, such that almost all humans think it’s a good idea. And everyone knows about it. All it would take to stop the project is a million people voting against it. Now, is there a sense in which Moloch would resist it or seek to undermine the project? It would maaaybe incentivize most people not to contribute to the project (tragedy of the commons!) but that’s it. So either Moloch isn’t an agent, or it’s an agent that doesn’t care about dying, or it’s an agent that doesn’t know it’s going to die, or it’s a very weak agent—can’t even stop one project!)
Something could exhibit goal-like behaviour to outside viewers without having the internal structure of an agent. For example, when a brick falls to the ground, we could say that it is aimed at a specific point on the ground, but it is not an agent. In the same way, an infectious disease can take over the world without being an agent. Moreover, even humans are sometimes not agents.
In my opinion, an Oracle AI only outputs answers to questions, while a Tool AI can do other stuff, like continuous data stream transformation or controlling mechanisms.
Nation states, the human body, and OSs are all good, even clever, at preserving a homeostatic state (except during a government shutdown), but they typically do not achieve it via high-level agential reasoning.
A swarm of agents could exhibit behaviour different from the behaviour or goals of any individual agent.
To clarify the definition of “goal-directed” used here: is AlphaGo (Zero) goal-directed?
Yes, as long as you keep doing the MCTS + training. The value/policy networks by themselves are not goal-directed.
I get why the MCTS is important, but what about the training? It seems to me that if we stop training AlphaGo (Zero) and I play a game against it, it’s goal-directed even though we have stopped training it.
Yeah, I agree that even without the training it would be goal-directed; that comes from the MCTS.
Note though that if we stop training and also stop using MCTS and you play a game against it, it will beat you and yet I would say that it is not goal-directed.
An additional issue is that if you have a competitive situation, there may be an incentive to minimize the amount of human involvement in the system, in order to speed up response time and avoid losing ground to competitors. I discussed this a bit in Disjunctive Scenarios of Catastrophic AI Risk:
Here are a few more reasons for humans to build goal-directed agents:
Goal-directed AI is a way to defend against value drift/corruption/manipulation. People might be forced to build goal-directed agents if they can’t figure out another way to do that.
Goal-directed AI is a way to cooperate and thereby increase economic efficiency and/or military competitiveness. (A group of people can build a goal-directed agent that they can verify represents an aggregation of their values.) People might be forced to build or transfer control to goal-directed agents in order to participate in such cooperation to remain competitive, unless they can figure out another way to cooperate that is as efficient as this.
Goal-directed AI is a way to address other human safety problems. People might trust an AI with explicit and verifiable values more than an AI that is controlled by a distant stranger.
As I understand it, the first one is an argument for value lock in, and the third one is an argument for interpretability. Does that seem right to you?
For the first one, I guess I would use “argument for defense against value drift” instead since you could conceivably use a goal-directed AI to defend against value drift without lock in, e.g., by doing something like Paul Christiano’s 2012 version of indirect normativity (which I don’t think is feasible but maybe there’s something like it that is, like my hybrid approach, if you consider that goal-directed).
For the third one, I guess interpretability is part of it, but a bigger problem is that it seems hard to make a sufficiently trustworthy human overseer even if we could “interpret” them. In other words, interpretability for a human might just let us see exactly why we shouldn’t trust them.
What is stopping AI researchers from using RL to (end-to-end) train agents that do search over actions to find ones that are the best? It seems like an obvious next step to take in order to build agents that generalize better than current RL agents, doesn’t it? Is it just that the challenges they’ve attempted so far haven’t required going beyond building agents that are essentially just lossy compressions of behaviors that work well on the training distribution, or is there a fundamental reason why using RL to train goal-directed agents would be hard?
That technique is called model-based RL, and in practice, given sufficient data and compute, it ends up performing worse than model-free RL. (It does perform better in low-data regimes, and my guess is that it will also generalize slightly better but not much.) In model-based RL, you learn a model of the world, and then search over sequences of actions and take the one that seems best.
Speculation on why it doesn’t work: In practice, your model of the world only makes good predictions for states and actions that you have already experienced. So searching over actions for the best one either gives you something you have already experienced, or some nonsense action (sort of like an adversarial example for the world model).
It is worth noting that this isn’t end-to-end: the model is trained “end-to-end”, but the action selection is typically some hardcoded function like “sample 1000 trajectories from the model, choose the trajectory that gives the best reward, and take the first action of that trajectory”. I don’t know how you would train an agent end-to-end such that it explicitly learns to search over actions (as opposed to an implicit search that model-free RL algorithms might already be doing).
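A minimal sketch of that kind of hardcoded action selection is below. The `model_step` function is a hypothetical toy stand-in for a learned world model (a one-dimensional position with a made-up reward), and all names and defaults are mine; this is not taken from any particular model-based RL library.

```python
import random


def model_step(state, action):
    """Hypothetical stand-in for a learned model: returns (next_state, reward)."""
    next_state = state + action
    reward = -abs(next_state - 5)  # toy reward: being closer to position 5 is better
    return next_state, reward


def plan_first_action(state, actions, horizon=3, num_samples=1000, seed=0):
    """Sample random action sequences, score them with the model, and
    return the first action of the best-scoring sequence."""
    rng = random.Random(seed)
    best_return, best_first_action = float("-inf"), None
    for _ in range(num_samples):
        sequence = [rng.choice(actions) for _ in range(horizon)]
        simulated_state, total = state, 0.0
        for action in sequence:
            simulated_state, reward = model_step(simulated_state, action)
            total += reward
        if total > best_return:
            best_return, best_first_action = total, sequence[0]
    return best_first_action


print(plan_first_action(0, [-1, 0, 1]))
```

Note that nothing in `plan_first_action` is learned: only the model would be trained, and the search procedure around it is fixed by hand. With a learned model in place of the toy one, the best-scoring sequence can be one the model scores well only because it predicts badly there, which is the failure described above.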
When you are given an accurate model of the world, then you can in fact search over actions and do much better; see, for example, value iteration or policy iteration. (Those are for very small environments, but you could create approximate versions for more complex environments.)
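For reference, here is a small sketch of value iteration on a toy problem where the model is known exactly. The four-state chain, its rewards, and the function names are invented for the example, and transitions are assumed to be deterministic to keep the code short.

```python
def chain_step(state, action):
    """Known, deterministic toy model: states 0..3 on a line, state 3 is absorbing."""
    goal = 3
    if state == goal:
        return state, 0.0  # absorbing state: no further reward
    next_state = min(max(state + action, 0), goal)
    return next_state, 1.0 if next_state == goal else 0.0


def value_iteration(num_states, actions, step, gamma=0.9, tol=1e-9):
    """Repeat the Bellman optimality backup until values stop changing."""
    values = [0.0] * num_states

    def backup(state, action):
        next_state, reward = step(state, action)
        return reward + gamma * values[next_state]

    while True:
        delta = 0.0
        for state in range(num_states):
            best = max(backup(state, action) for action in actions)
            delta = max(delta, abs(best - values[state]))
            values[state] = best
        if delta < tol:
            break
    # Greedy policy with respect to the converged values.
    policy = [max(actions, key=lambda a: backup(s, a)) for s in range(num_states)]
    return values, policy


values, policy = value_iteration(4, [-1, 1], chain_step)
print(values, policy)
```

On this chain the values come out at roughly 0.81, 0.9, 1.0 and 0.0, and the greedy policy moves right from every non-goal state. The loop sweeps over every state, which is why this only works directly for very small environments.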
Interesting, I wonder how humans avoid generating nonsense actions like this.
I was thinking you could train the world model separately at first, manually implement an initial action selection method as a neural network or some other kind of differentiable program, and then let RL act on the agent to optimize it as a whole.
What kind of implicit search are model-free RL algorithms already doing? If we just keep scaling up model-free RL, can they eventually become goal-directed agents through this kind of implicit search?
Some hypotheses that are very speculative:
Something something explicit reasoning?
Our environment is sufficiently harsh and complex that everything is in-distribution
Our brains are so small and our environment is so harsh and complex that the only way that they can get good performance is to have structured, modular representations, which lead to worse performance in distribution but better generalization
Some system that lets us know what we know, and only generates actions for consideration where we know what the consequences will be
I don’t know. This is mostly an expression of uncertainty about what model-free RL agents are doing. Maybe some of the multiplications and additions going on in there turn out to be equivalent to a search over actions. Maybe not.
My intuition says “nah, our current environments are all simple enough that you can solve them by using heuristics to compute actions, and the training process is going to distill those heuristics into the policy rather than turning the policy into a search algorithm”. But even if I trust that intuition, there is some level of environment complexity at which this would stop being true, and I don’t trust my intuition on what that level is.
Plausibly, but plausibly not. I have conflicting not-well-formed intuitions that pull in both directions.
I’d be interested to hear more about the problems with this, if anyone has a link to an overview or just knows of problems off the top of their head.
Is this true? Since ML generally doesn’t choose an algorithm directly but runs a search over a parameter space, it seems speculative to assume that the resulting model, if it is a mesa-optimizer and goal-directed, only cares about its episode. If it learned that optimizing for X is good for reward, it seems at least conceivable that it won’t understand that it shouldn’t care about instances of X that appear in future episodes.
A few points:
1. It’s not clear that the current deep RL paradigm would lead to a mesa optimizer. I agree it could happen, but I would like to see an argument as to why it is likely to happen. (I think there is probably a stronger case that any general intelligence we build will need to be a mesa optimizer and therefore goal-directed, and if so that argument should be added to this list.)
2. Even if we did get a mesa optimizer, the base optimizer (e.g. gradient descent) would plausibly select for mesa optimizers that care only about what happens up until the end of the episode. A mesa optimizer that wasn’t myopic in this way might spend the entire episode learning and making money that it can use in the future, and as a result get no training reward, and so would be selected against by the outer optimizer.
I’m not sure this strategy is net positive. If dangerous AI (at least as dangerous as Slaughterbots) is developed before alignment is solved, the world is probably better off if the first visibly dangerous AI is goal-directed rather than, say, an Oracle. The former would probably be a much weaker optimization process and probably wouldn’t result in an existential catastrophe, and it might make some governance solutions more feasible.
Can you clarify the argument? Are you optimizing for an obvious AI disaster to happen as soon as possible so people take the issue more seriously?
I’m not optimizing for raising awareness via an “obvious AI disaster”, for multiple reasons, including the huge risk to the reputation of the AI safety community and the unilateralist’s curse.
I do think that when considering whether to invest in an effort which might prevent recoverable near-term AI accidents, one should consider the possibility that the effort would prevent pivotal events (e.g. one that would have enabled useful governance solutions resulting in more time for alignment research).
Efforts that prevent recoverable near-term AI accidents might be astronomically net-positive if they help make AI alignment more mainstream in the general ML community.
(anyone who thinks I shouldn’t discuss this publicly is welcome to let me know via a PM or anonymously here)
In this scenario, wouldn’t you eventually build a sufficiently powerful goal-directed AI that leads to an existential catastrophe?
Perhaps the hope is that when everyone sees that the first goal-directed AI is visibly dangerous then they actually believe that goal-directed AI is dangerous. But in the scenario where we are building alternatives to goal-directed AI and they are actually getting used, I would predict that we have convinced most AI researchers that goal-directed AI is dangerous.
(Also, I think you can level this argument at nearly all AI safety research agendas, with possibly the exception of Agent Foundations.)
I think I didn’t articulate my argument clearly; I tried to clarify it in my reply to Jessica.
I think my argument might be especially relevant to the effort of persuading AI researchers not to build goal-directed systems.
If a result of this effort is convincing more AI researchers of the general premise that x-risk from AI is something worth worrying about, then that’s a very strong argument in favor of carrying out the effort (and I agree this result should correlate with convincing AI researchers not to build goal-directed systems, if that’s what you argued in your comment).
Yeah, I was imagining that we would convince AI researchers that goal-directed systems are dangerous, and that we should build the non-goal-directed versions instead.
Building a non-goal-directed agent is like building a cart out of non-wood materials. Goal-directed behavior is relatively well understood. We know that most goal-directed designs don’t do what we want. Most arrangements of wood do not form a functioning cart.
I suspect that a randomly selected agent from the space of all non-goal-directed agents is also useless or dangerous, in much the same way that a random arrangement of non-wood materials is.
Now there are a couple of regions of design space that are not goal-directed and look like they contain useful AIs. We might be better off making our cart from iron, but iron has its own problems.
Sure. We aren’t going to choose an agent randomly.
Agreed, but maybe those problems are easier to solve.