I see people discussing how far we can go with LLMs or other simulator/predictor systems. I particularly like porby’s takes on this. I am excited for that direction of research, but I think it misses an important piece.
The missing piece is this:
There will consistently be a set of tasks that, with any given predictor skill level, are easier to achieve with that predictor wrapped in an agent-layer. AutoGPT is tempting for a real reason. There is significant reward available to those who successfully integrate the goal-less predictor into a goal-pursuing agent program.
To avoid this, you must convince everyone who could do this not to do this. This could be by convincing them it wouldn’t be profitable after all, or would be too dangerous, or that enforcement mechanisms will stop them. Unless you manage to do this convincing for all possible people in a position to do this, then someone does it. And then you have to deal with the agent-thing.
What I’m saying is that you can’t count on there never being the agent version. You have to assume that someone will try it. So the argument, “we can get lots of utility much more safely from goal-less predictors” can be true and yet we will still need a plan for handling the agentive systems.
If your argument is that we can use goal-less predictors and narrow tool AI to shepherd us through the dangerous period of wide availability of AI systems that can easily be turned into self-improving goal-pursuing resource-accumulating agents… Great. Then discuss how to use narrow AI as a mallet to play Whack-a-Mole with the rogue agentic AIs we expect to be popping up everywhere. Don’t pretend like the playing field won’t have those rogue AIs at all because you’ve argued that they aren’t wise or necessary.
I don’t think the mere presence of agency means that all of the classical arguments automatically start to apply. For example, I’m not immediately seeing how Goodhart’s Law is a major concern with AutoGPT, even though AutoGPT is goal-directed.
AutoGPT seems like a good architecture for something like “retarget the search”, since the goal-directed aspect is already factored out nicely. A well-designed AutoGPT could leverage interpretability tools and interactive querying to load your values in a robust way, with minimal worry that the system is trying to manipulate you to achieve some goal-driven objective during the loading process.
Thinking about it, I actually see a good case for alignment people getting jobs at AutoGPT. I suspect a bit of security mindset could go a long way in its architecture. It could also be valuable as differential technological development, to ward off scenarios where people are motivated to create dangerous new core dynamics in order to subvert current LLM limitations.
Current AutoGPT is simply too incompetent to effectively pursue a goal. Other similar systems are more competent (the two Minecraft LLM agent systems are the most impressive), but nobody has let them run ad infinitum to test their Goodharting. I’d assume they’d show it. Goodhart will apply increasingly as those systems actually pursue goals.
AutoGPT isn’t a company, it’s a little open-source project. Any companies working on agents aren’t publicizing their work so far.
I do suspect that actively improving things like AutoGPT is a good route to addressing x-risk because of their advantages for alignment. But I’m not sure enough to start advocating it.
Fair point, valley9. I don’t think a little bit of agency throws you into an entirely different regime. It’s more that I think that the more powerful an agent you build, the more it is able to autonomously change the world to work with goals, the more you move into dangerous territory. But also, it’s going to tempt people. Somebody out there is going to be tempted to say, “go make me money, just don’t get caught doing anything illegal in a way that gets traced back to me.”
That command given to a sufficiently powerful AI system could have a lot of dangerous results.
But also, it’s going to tempt people. Somebody out there is going to be tempted to say, “go make me money, just don’t get caught doing anything illegal in a way that gets traced back to me.” That command given to a sufficiently powerful AI system could have a lot of dangerous results.
Indeed. This seems like more of a social problem than an alignment problem though: ensure that powerful AIs tend to be corporate AIs with corporate liability rather than open-source AIs, and get the AIs to law enforcement (or even law enforcement “red teams”—should we make that a thing?) before they get to criminals. I don’t think improving aimability helps guard against misuse.
I don’t think improving aimability helps guard against misuse.
I think this needs to be stated more clearly: Alignment and Misuse are very different things, so much so that the policies and research that work for one problem will often not work for the other, and the worlds of misuse and misalignment are quite different.
Though note that the solutions for misuse focused worlds and structural risk focused worlds can work against each other.
Also, this is validating JDP’s prediction that people will focus less on alignment and more on misuse in their threat models of AI risk.
For example, I’m not immediately seeing how Goodhart’s Law is a major concern with AutoGPT
If the goals are loaded into it via natural-language descriptions, then the way the LLM interprets the words might differ from the way the human who put them in intended them to be read, and the AutoGPT would then go off and do what it thought the user said, not what the user meant. It’s happening all the time with humans, after all.
From the Goodharting perspective, it would optimize for the measure (natural-language description) rather than the intended target. And since tails come apart, inasmuch as AutoGPT optimizes strongly, it would end up implementing something that looks precisely like what it understood the user to mean, but which would look like a weird unintended extreme from the user’s point of view.
You’d mentioned leveraging interpretability tools. Indeed: the particularly strong ones, that offer high-fidelity insight into how the LLM interprets stuff, would address that problem. But on my model, we’re not on-track to get them. Again: we have tons of insights into other humans, and this sort of miscommunication happens constantly anyway. It’s a hard problem.
[Disclaimer: I haven’t tried AutoGPT myself, mostly reasoning from first principles here. Thanks in advance if anyone has corrections on what follows.]
If the goals are loaded into it via natural-language descriptions, then the way the LLM interprets the words might differ from the way the human who put them in intended them to be read, and the AutoGPT would then go off and do what it thought the user said, not what the user meant. It’s happening all the time with humans, after all.
Yes, this is a possibility, which is why I suggested that alignment people work for AutoGPT to try and prevent it from happening. AutoGPT also has a commercial incentive to prevent it from happening, to make their tool work. They’re going to work to prevent it somehow. The question in my mind is whether they prevent it from happening in a way that’s patchy and unreliable, or in a way that’s robust.
From the Goodharting perspective, it would optimize for the measure (natural-language description) rather than the intended target. And since tails come apart, inasmuch as AutoGPT optimizes strongly, it would end up implementing something that looks precisely like what it understood the user to mean, but which would look like a weird unintended extreme from the user’s point of view.
Natural language can be a medium for goal planning, but it can also be a medium for goal clarification. The challenge here is for AutoGPT to be well-calibrated for its uncertainty about the user’s preferences. If it encounters an uncertain situation, do goal clarification with the user until it has justifiable certainty about the user’s preferences. AutoGPT could be superhuman at these calibration and clarification tasks, if the company collects a huge dataset of user interactions along with user complaints due to miscommunication. [Subtle miscommunications that go unreported are a potential problem—could be addressed with an internal tool that mines interaction logs to try and surface them for human labeling. If customer privacy is an issue, offer customers a discount if they’re willing to share their logs, have humans label a random subset of logs based on whether they feel there was insufficient/excessive clarification, and use that as training data.]
Can we taboo “optimize”? What specifically does “optimize strongly” mean in an AutoGPT context? For example, if we run AutoGPT on a faster processor, does that mean it is “optimizing more strongly”? It will act on the world faster, so in that sense it could be considered a “more powerful optimizer”. But if it’s just performing the same operations faster, I don’t see how Goodhart issues get worse.
Goodhart is a problem if you have an imperfect metric that can be gamed. If we design AutoGPT so there’s no metric and it’s also not trying to game anything, I’m not seeing an issue. Presumably there is or will be some sort of outer loop which fine-tunes AutoGPT interaction logs against a measure of overall quality, and that’s worth thinking about, but it’s also similar to how ChatGPT is trained, no? So I don’t know how much risk we’re adding there.
I get the sense that you’re a person with a hammer and everything looks like a nail. You’ve got some pre-existing models of how AI is supposed to fail, and you’re trying to apply them in every situation even if they don’t necessarily fit. [Note, this isn’t really a criticism of you in particular, I see it a lot in Lesswrong AI discourse.] From my perspective, the important thing is to have some people with security mindset working at AutoGPT, getting their hands dirty, thinking creatively about how stuff could go wrong, and trying to identify what the actual biggest risks are given the system’s architecture + how best to address them. I worry that person-with-a-hammer syndrome is going to create blind spots for the actual biggest risks, whatever those may be.
Again: we have tons of insights into other humans, and this sort of miscommunication happens constantly anyway. It’s a hard problem.
Perhaps it’s worth comparing AutoGPT to a baseline of a human upload. In the past, I remember alignment researchers claiming that a high-fidelity upload would be preferable to de novo AI, because with the upload, you don’t need to solve the alignment problem. But as you say, miscommunication could easily happen with a high-fidelity upload.
If we’ve reduced the level of danger to the level of danger we experience with ordinary human miscommunication, that seems like an important milestone. There’s a trollish argument to be made here, that if human miscommunication is the primary danger, we shouldn’t be engaged in e.g. genetic engineering for intelligence enhancement either, because it could produce superhumanly intelligent agents that we’ll have miscommunications with :-)
In fact, the biggest problem we have with other humans is that they straight up have different values than us. Compared to that problem, miscommunication is small. How many wars have been fought over miscommunication vs value differences? Perhaps you can find a few wars that were fought primarily due to miscommunication, but that’s remarkable because it’s rare.
An AutoGPT that’s more aligned with me than I’m aligned with my fellow humans looks pretty feasible.
[Again, I appreciate corrections from anyone who’s experienced with AutoGPT! Please reply and correct me!]
Natural language can be a medium for goal planning, but it can also be a medium for goal clarification. The challenge here is for AutoGPT to be well-calibrated for its uncertainty about the user’s preferences
Yes, but that would require it to be robustly aimed at the goal of faithfully eliciting the user’s preferences and following them. And if it’s not precisely robustly aimed at it, if we’ve miscommunicated what “faithfulness” means, then it’ll pursue its misaligned understanding of faithfulness, which would lead to it pursuing a non-intended interpretation of the users’ requests.
Like, this just pushes the same problem back one step.
And I agree that it’s a solvable problem, and that it’s something worthwhile to work on. It’s basically just corrigibility, really. But it doesn’t simplify the initial issue.
Can we taboo “optimize”? What specifically does “optimize strongly” mean in an AutoGPT context?
Ability to achieve real-world outcomes. For example, an AutoGPT instance that can overthrow a government is a stronger optimizer than an AutoGPT instance that can at best make you $100 in a week. Here’s a pretty excellent post on the matter of not-exactingly-aimed strong optimization predictably resulting in bad outcomes.
Goodhart is a problem if you have an imperfect metric that can be gamed. If we design AutoGPT so there’s no metric and it’s also not trying to game anything
I mean, it’s trying to achieve some goal out in the world. The goal’s specification is the “metric”, and while it’s not trying to maliciously “game” it, it is trying to achieve it. The goal’s specification as it understands it, that is, not the goal as it’s intended. Which would be isomorphic to it Goodharting on the metric, if the two diverge.
I get the sense that you’re a person with a hammer and everything looks like a nail
I get the sense that people are sometimes too quick to assume that something which looks like a hammer from one angle is a hammer.
As above, by “Goodharting” there (which wasn’t even the term I introduced into the discussion) I didn’t mean the literal same setup as in e.g. economics, where there’s a bunch of schemers that deliberately maliciously manipulate stuff in order to decouple the metric from the variable it’s meant to measure. I meant the general dynamic where we have some goal, we designate some formal specification for it, then point an optimization process at the specification, and inasmuch as the intended-goal diverges from the formal-goal, we get unintended results.
That’s basically the system “Goodharting” on the “metric”. Same concept, could be viewed through the same lens.
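A toy illustration of that dynamic, as a minimal sketch. Everything here is invented for illustration and is not taken from any AutoGPT code: the formal specification rewards more of some quantity `x` without limit, the intended goal wants only a moderate amount, and `strength` is a hypothetical knob for how far the optimizer can push.

```python
def formal_goal(x: int) -> int:
    """The specification the optimizer sees: more x is always better."""
    return x


def intended_goal(x: int) -> int:
    """What the user actually wanted (toy assumption): peaks at x = 5."""
    return 10 * x - x * x


def optimize(strength: int) -> int:
    """Pick the reachable x that scores best on the formal goal.

    `strength` is a hypothetical knob: the largest x the optimizer can reach.
    """
    return max(range(strength + 1), key=formal_goal)


for strength in (3, 5, 20):
    x = optimize(strength)
    print(strength, x, formal_goal(x), intended_goal(x))
```

In this toy setup the two goals move together up to x = 5, so a weak optimizer looks fine. Past that point the formal score keeps rising while the intended value falls, reaching -200 at strength 20.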
This sort of miscommunication is also prevalent in e.g. talk about agents having utility functions, or engaging in search. When I talk about this, I’m not imagining a literal wrapper-mind setup that is literally simulating every possible way things could go and plugging that into its compactly-specified utility function, as if it’s an unbounded AIXI or something. Obviously that’s not realistically implementable. But there could be practical mind designs that are approximately isomorphic to this sort of setup in the limit, and they could have properties that are approximately the same as those of a wrapper-mind.
(I know you weren’t making this specific point; just broadly gesturing at the idea.)
If we’ve reduced the level of danger to the level of danger we experience with ordinary human miscommunication, that seems like an important milestone
Yes, but that would require it to be robustly aimed at the goal of faithfully eliciting the user’s preferences and following them. And if it’s not precisely robustly aimed at it, if we’ve miscommunicated what “faithfulness” means, then it’ll pursue its misaligned understanding of faithfulness, which would lead to it pursuing a non-intended interpretation of the users’ requests.
I think this argument only makes sense if it makes sense to think of the “AutoGPT clarification module” as trying to pursue this goal at all costs. If it’s just a while loop that asks clarification questions until the goal is “sufficiently clarified”, then this seems like a bad model. Maybe a while loop design like this would have other problems, but I don’t think this is one of them.
Ability to achieve real-world outcomes. For example, an AutoGPT instance that can overthrow a government is a stronger optimizer than an AutoGPT instance that can at best make you $100 in a week.
OK, so by this definition, using a more powerful processor with AutoGPT (so it just does the exact same operations faster) makes it a more “powerful optimizer”, even though it’s working exactly the same way and has exactly the same degree of issues with Goodharting etc. (just faster). Do I understand you correctly?
I mean, it’s trying to achieve some goal out in the world. The goal’s specification is the “metric”, and while it’s not trying to maliciously “game” it, it is trying to achieve it. The goal’s specification as it understands it, that is, not the goal as it’s intended. Which would be isomorphic to it Goodharting on the metric, if the two diverge.
This seems potentially false depending on the training method, e.g. if it’s being trained to imitate experts. In that case, I expect the key question is the degree to which the dataset contains examples of experts following the sort of procedure that would be vulnerable to Goodharting (step 1: identify goal specification. step 2: try to achieve it as you understand it, not worrying about possible divergence from user intent).
I meant the general dynamic where we have some goal, we designate some formal specification for it, then point an optimization process at the specification, and inasmuch as the intended-goal diverges from the formal-goal, we get unintended results.
Yeah, I just don’t think this is the only way that a system like AutoGPT could be implemented. Maybe it is how current AutoGPT is implemented, but then I encourage alignment researchers to join the organization and change that.
But there could be practical mind designs that are approximately isomorphic to this sort of setup in the limit, and they could have properties that are approximately the same as those of a wrapper-mind.
They could, but people seem to assume they will, with poor justification. I agree it’s a reasonable heuristic for identifying potential problems, but it shouldn’t be the only heuristic.
asking clarification questions until the goal is “sufficiently clarified”
… How do you define “sufficiently clarified”, and why is that step not subject to miscommunication / the-problem-that-is-isomorphic-to-Goodharting?
I’d tried to reason about similar setups before, and my conclusion was that it has to bottom out in robust alignment somewhere.
I’d be happy to be proven wrong on that, though. Wow, wouldn’t that make matters easier...
OK, so by this definition, using a more powerful processor with AutoGPT (so it just does the exact same operations faster) makes it a more “powerful optimizer”, even though it’s working exactly the same way and has exactly the same degree of issues with Goodharting etc. (just faster). Do I understand you correctly?
Sure? I mean, presumably it doesn’t do the exact same operations. Surely it’s exploiting its ability to think faster in order to more closely micromanage its tasks, or something. If not, if it’s just ignoring its greater capabilities, then no, it’s not a stronger optimizer.
This seems potentially false depending on the training method, e.g. if it’s being trained to imitate experts
I don’t think that gets you to dangerous capabilities. I think you need the system to have a consequentialist component somewhere, which is actually focused on pursuing the goal.
… How do you define “sufficiently clarified”, and why is that step not subject to miscommunication / the-problem-that-is-isomorphic-to-Goodharting?
Here’s what I wrote previously:
...AutoGPT could be superhuman at these calibration and clarification tasks, if the company collects a huge dataset of user interactions along with user complaints due to miscommunication. [Subtle miscommunications that go unreported are a potential problem—could be addressed with an internal tool that mines interaction logs to try and surface them for human labeling. If customer privacy is an issue, offer customers a discount if they’re willing to share their logs, have humans label a random subset of logs based on whether they feel there was insufficient/excessive clarification, and use that as training data.]
In more detail, the way I would do it would be: I give AutoGPT a task, and it says “OK, I think what you mean is: [much more detailed description of the task, clarifying points of uncertainty]. Is that right?” Then the user can effectively edit that detailed description until (a) the user is satisfied with it, and (b) a model trained on previous user interactions considers it sufficiently detailed. Once we have a detailed task description that’s mutually satisfactory, AutoGPT works from it. For simplicity, assume for now that nothing comes up during the task that would require further clarification (that scenario gets more complicated).
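The procedure above can be sketched as a loop. This is a minimal sketch under stated assumptions, not how AutoGPT is actually implemented: `propose`, `user_review`, and `is_detailed_enough` are hypothetical stand-ins for the LLM's restatement, the user's edits, and the learned "sufficiently detailed" model, and the word-count check in the example is only a placeholder for that model.

```python
from typing import Callable


def clarify_task(
    task: str,
    propose: Callable[[str], str],
    user_review: Callable[[str], str],
    is_detailed_enough: Callable[[str], bool],
    max_rounds: int = 10,
) -> str:
    """Return a task description that both the user and the model accept.

    All three callables are hypothetical stand-ins:
      propose            - the agent's detailed restatement of the task
      user_review        - returns the user's edited description
                           (unchanged text means the user is satisfied)
      is_detailed_enough - the learned "sufficiently clarified" model
    """
    description = propose(task)
    for _ in range(max_rounds):
        edited = user_review(description)
        user_satisfied = edited == description
        description = edited
        # Both conditions (a) and (b) must hold before work starts.
        if user_satisfied and is_detailed_enough(description):
            return description
    raise RuntimeError("task still not sufficiently clarified")


# Scripted stand-ins so the sketch runs without an LLM or a real user.
edits = iter([
    "Order 20 first-class stamps",
    "Order 20 first-class stamps, under $15, delivered this week",
    "Order 20 first-class stamps, under $15, delivered this week",
])

final = clarify_task(
    "order me some stamps",
    propose=lambda task: "Order stamps",
    user_review=lambda description: next(edits),
    is_detailed_enough=lambda description: len(description.split()) >= 8,
)
print(final)
```

The loop gives up after `max_rounds` rather than proceeding on an unclear task. A real system would presumably also prompt the user for specific missing details when the model rejects a description the user is already happy with; that part is left out here.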
So to answer your specific questions:
The definition of “sufficiently clarified” is based on a model trained from examples of (a) a detailed task description and (b) whether that task description ended up being too ambiguous. Miscommunication shouldn’t be a huge issue because we’ve got a human labeling these examples, so the model has lots of concrete data about what is/is not a good task description.
If the learned model for “sufficiently clarified” is bad, then sometimes AutoGPT will consider a task “sufficiently clarified” when it really isn’t (isomorphic to Goodharting, also similar to the hallucinations that ChatGPT is susceptible to). In these cases, the user is likely to complain that AutoGPT didn’t do what they wanted, and it gets added as a new training example to the dataset for the “sufficiently clarified” model. So the learned model for “sufficiently clarified” gets better over time. This isn’t necessarily the ideal setup, but it’s also basically what the ChatGPT team does. So I don’t think there is significant added risk. If one accepts the thesis of your OP that ChatGPT is OK, this seems OK too. In both cases we’re looking at the equivalent of an occasional hallucination, which hurts reliability a little bit.
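A toy version of that feedback loop, as a rough sketch only. The `ClarityModel` class and its single feature (description length in words) are assumptions made up for illustration; a real "sufficiently clarified" model would be trained on far richer data.

```python
from dataclasses import dataclass, field


@dataclass
class ClarityModel:
    """Toy stand-in for the learned "sufficiently clarified" model.

    Assumption for illustration only: description length in words is the
    sole feature, and the model learns a minimum acceptable length.
    """
    examples: list[tuple[str, bool]] = field(default_factory=list)
    min_words: int = 1

    def add_example(self, description: str, too_ambiguous: bool) -> None:
        """Record a human-labeled example and refit the threshold."""
        self.examples.append((description, too_ambiguous))
        ambiguous_lengths = [
            len(text.split()) for text, bad in self.examples if bad
        ]
        # Require more words than the longest description labeled ambiguous.
        self.min_words = max(ambiguous_lengths, default=0) + 1

    def sufficiently_clarified(self, description: str) -> bool:
        return len(description.split()) >= self.min_words


model = ClarityModel()
task = "Order some stamps"
before = model.sufficiently_clarified(task)   # accepted by the untrained model
model.add_example(task, too_ambiguous=True)   # the user complains afterwards
after = model.sufficiently_clarified(task)    # the same description is now rejected
print(before, after)
```

Each complaint becomes a labeled example, so the same mistake is not accepted twice. The sketch says nothing about miscommunications that never get reported, which is the gap the log-mining idea is meant to cover.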
Sure? I mean, presumably it doesn’t do the exact same operations. Surely it’s exploiting its ability to think faster in order to more closely micromanage its tasks, or something. If not, if it’s just ignoring its greater capabilities, then no, it’s not a stronger optimizer.
Recall your original claim: “inasmuch as AutoGPT optimizes strongly, it would end up implementing something that looks precisely like what it understood the user to mean, but which would look like a weird unintended extreme from the user’s point of view.”
The thought experiment here is that we take the exact same AutoGPT code and just run it on a faster processor. So no, it’s not “exploiting its ability to think faster in order to more closely micromanage its tasks”. But it does have “greater capabilities” in the sense of doing everything faster—due to a faster processor.
Once AutoGPT is running on a faster processor, I might choose to use AutoGPT more ambitiously. Perhaps I could get a week’s worth of work done in an hour, instead of a day’s worth of work. Or just get a week’s worth of work done in well under an hour. But since it’s the exact same code, your original “inasmuch as AutoGPT optimizes strongly” claim would not appear to apply.
I really dislike how people use the word “optimization” because it bundles concepts together in a way that’s confusing. In this specific case, your “inasmuch as AutoGPT optimizes strongly” claim is true, but only in a very specific sense. Specifically, it is true if AutoGPT has some model of what the user means, tries to identify the very maximal state of the world that corresponds to that understanding, and then works to bring about that state of the world. In the broad sense of an “optimizer”, there are ways to make AutoGPT a stronger “optimizer” that don’t exacerbate this problem, such as running it on a faster processor, or giving it access to new APIs, or even (I would argue) having it micromanage its tasks more closely, as long as that doesn’t affect its notion of “desired states of the world” (e.g. for simplicity, no added task micromanagement when reasoning about “desired states of the world”, but it’s OK in other circumstances). [Caveat: giving access to e.g. new APIs could make AutoGPT more effective at implementing its model of user prefs, so it’s therefore a bigger footgun if that model happens to be bad. But I don’t think new APIs will worsen the user pref model.]
I don’t think that gets you to dangerous capabilities. I think you need the system to have a consequentialist component somewhere, which is actually focused on pursuing the goal.
Cool, well maybe we should get alignment people to work at AutoGPT to influence the AutoGPT people to not develop dangerous capabilities then, by focusing on e.g. imitating experts :-) I’m not actually seeing a disagreement here.
This isn’t necessarily the ideal setup, but it’s also basically what the ChatGPT team does. So I don’t think there is significant added risk. If one accepts the thesis of your OP that ChatGPT is OK, this seems OK too
Oh, if we’re assuming this setup doesn’t have to be robust to AutoGPT being superintelligent and deciding to boil the oceans because of a misunderstood instruction, then yeah, that’s fine.
Once AutoGPT is running on a faster processor, I might choose to use AutoGPT more ambitiously
That’s the part that would exacerbate the issue where it sometimes misunderstands your instructions. If you’re using it for more ambitious tasks, or more often, then there are more frequent opportunities for misunderstanding, and their consequences are larger-scale. Which means that, to whichever extent it’s prone to misunderstanding you, that gets amplified, as does the damage the misunderstandings cause.
Cool, well maybe we should get alignment people to work at AutoGPT to influence the AutoGPT people to not develop dangerous capabilities then, by focusing on e.g. imitating experts :-)
Oh, sure, I’m not opposing that. It may not be the highest-value place for a given person to be, but it might be for some.
Is agency actually the issue by itself or just a necessary component?
Consider Robert Miles’s stamp-collecting robot:
“Order me some stamps in the next 32k tokens/60 seconds” has less scope than “guard my stamps today”, which in turn has less scope than “ensure I always have enough stamps”. Only the last one triggers power seeking; the first two do not benefit from seeking power unless the payoff on the power-seeking investment arrives within the time interval.
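The time-horizon point can be made concrete with a toy payoff model. This is a sketch under invented assumptions: the rates, the setup time, and the idea that "power seeking" simply means spending `setup_time` steps acquiring resources before working at a higher rate are all hypothetical.

```python
def stamps_direct(rate: int, horizon: int) -> int:
    """Stamps obtained by just doing the task for the whole horizon."""
    return rate * horizon


def stamps_power_seeking(boosted_rate: int, setup_time: int, horizon: int) -> int:
    """Stamps obtained after first spending `setup_time` acquiring resources."""
    return boosted_rate * max(0, horizon - setup_time)


def power_seeking_pays(rate: int, boosted_rate: int, setup_time: int, horizon: int) -> bool:
    """True if the detour through resource acquisition beats working directly."""
    direct = stamps_direct(rate, horizon)
    detour = stamps_power_seeking(boosted_rate, setup_time, horizon)
    return detour > direct


# Hypothetical numbers: 1 stamp per step directly, 3 per step after 100 setup steps.
short_task = power_seeking_pays(rate=1, boosted_rate=3, setup_time=100, horizon=60)
long_task = power_seeking_pays(rate=1, boosted_rate=3, setup_time=100, horizon=1000)
print(short_task, long_task)
```

With these numbers the detour only wins once the horizon exceeds 150 steps, so the 60-step task gains nothing from it while the 1000-step task does.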
Note also that AutoGPT even if given a goal and allowed to run forever has immutable weights and a finite context window hobbling it.
So you need human-level prediction + relevant modalities + agency + long-duration goal + memory at a bare minimum. Remove any element and the danger may be negligible.
I don’t think the mere presence of agency means that all of the classical arguments automatically start to apply. For example, I’m not immediately seeing how Goodhart’s Law is a major concern with AutoGPT, even though AutoGPT is goal-directed.
AutoGPT seems like a good architecture for something like “retarget the search”, since the goal-directed aspect is already factored out nicely. A well-designed AutoGPT could leverage interpretability tools and interactive querying to load your values in a robust way, with minimal worry that the system is trying to manipulate you to achieve some goal-driven objective during the loading process.
Thinking about it, I actually see a good case for alignment people getting jobs at AutoGPT. I suspect a bit of security mindset could go a long way in its architecture. It could also be valuable as differential technological development, to ward off scenarios where people are motivated to create dangerous new core dynamics in order to subvert current LLM limitations.
I agree that things like AutoGPT are an ideal architecture for something exactly like retarget the search. I’ve noted that same similarity in Steering subsystems: capabilities, agency, and alignment and a stronger similarity in an upcoming post. In Internal independent review for language model agent alignment I note the alignment advantages you list, and a couple of others.
Current AutoGPT is simply too incompetent to effectively pursue a goal. Other similar systems are more competent (the two Minecraft LLM agent systems are the most impressive), but nobody has let them run ad infinitum to test their Goodharting. I’d assume they’d show it. Goodhart will apply increasingly as those systems actually pursue goals.
AutoGPT isn’t a company, it’s a little open-source project. Any companies working on agents aren’t publicizing their work so far.
I do suspect that actively improving things like AutoGPT is a good route to addressing x-risk because of their advantages for alignment. But I’m not sure enough to start advocating it.
They raised $12M: https://twitter.com/Auto_GPT/status/1713009267194974333
You could be right that they haven’t incorporated as a company. I wasn’t able to find information about that.
Wow, interesting. They say it will be the largest open-source project in history. I have no idea how an open-source project raises $12M, but they did.
Fair point, valley9. I don’t think a little bit of agency throws you into an entirely different regime. It’s more that I think that the more powerful an agent you build, the more it is able to autonomously change the world to work with goals, the more you move into dangerous territory. But also, it’s going to tempt people. Somebody out there is going to be tempted to say, “go make me money, just don’t get caught doing anything illegal in a way that gets traced back to me.” That command given to a sufficiently powerful AI system could have a lot of dangerous results.
Indeed. This seems like more of a social problem than an alignment problem though: ensure that powerful AIs tend to be corporate AIs with corporate liability rather than open-source AIs, and get the AIs to law enforcement (or even law enforcement “red teams”—should we make that a thing?) before they get to criminals. I don’t think improving aimability helps guard against misuse.
I think this needs to be stated more clearly: alignment and misuse are very different things, so much so that the policies and research that work for one problem will often not work for the other, and the worlds of misuse and misalignment are quite different.
Though note that the solutions for misuse-focused worlds and structural-risk-focused worlds can work against each other.
Also, this is validating JDP’s prediction that people will focus less on alignment and more on misuse in their threat models of AI risk.
If the goals are loaded into it via natural-language descriptions, then the way the LLM interprets the words might differ from the way the human who put them in intended them to be read, and the AutoGPT would then go off and do what it thought the user said, not what the user meant. It’s happening all the time with humans, after all.
From the Goodharting perspective, it would optimize for the measure (natural-language description) rather than the intended target. And since tails come apart, inasmuch as AutoGPT optimizes strongly, it would end up implementing something that looks precisely like what it understood the user to mean, but which would look like a weird unintended extreme from the user’s point of view.
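A toy illustration of that tails-come-apart dynamic, under assumptions that are entirely mine: the "true value" of a plan is Gaussian, the system's reading of the request adds independent Gaussian error, and "optimizing strongly" is modeled as searching more candidate plans and picking the best-looking one. None of this models AutoGPT itself; the function name `mean_shortfall` is hypothetical.

```python
import random

def mean_shortfall(n_candidates: int, trials: int = 200, seed: int = 0) -> float:
    """Average (proxy - true value) of the candidate with the highest proxy score."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        best_proxy, best_true = float("-inf"), 0.0
        for _ in range(n_candidates):
            true_value = rng.gauss(0.0, 1.0)          # what the user actually wants
            proxy = true_value + rng.gauss(0.0, 1.0)  # what the system understood
            if proxy > best_proxy:
                best_proxy, best_true = proxy, true_value
        total += best_proxy - best_true
    return total / trials

# More candidates searched = stronger selection on the proxy.
for n in (1, 10, 100, 1000):
    print(n, round(mean_shortfall(n), 2))
```

In this sketch the gap between what the chosen plan looks like under the proxy and what it is actually worth widens as the search gets broader, which is the sense of "optimizes strongly" I have in mind.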
You’d mentioned leveraging interpretability tools. Indeed: the particularly strong ones, that offer high-fidelity insight into how the LLM interprets stuff, would address that problem. But on my model, we’re not on track to get them. Again: we have tons of insight into other humans, and this sort of miscommunication happens constantly anyway. It’s a hard problem.
[Disclaimer: I haven’t tried AutoGPT myself, mostly reasoning from first principles here. Thanks in advance if anyone has corrections on what follows.]
Yes, this is a possibility, which is why I suggested that alignment people work for AutoGPT to try and prevent it from happening. AutoGPT also has a commercial incentive to prevent it from happening, to make their tool work. They’re going to work to prevent it somehow. The question in my mind is whether they prevent it from happening in a way that’s patchy and unreliable, or in a way that’s robust.
Natural language can be a medium for goal planning, but it can also be a medium for goal clarification. The challenge here is for AutoGPT to be well-calibrated about its uncertainty regarding the user’s preferences. If it encounters an uncertain situation, it should do goal clarification with the user until it has justifiable certainty about those preferences. AutoGPT could be superhuman at these calibration and clarification tasks, if the company collects a huge dataset of user interactions along with user complaints due to miscommunication. [Subtle miscommunications that go unreported are a potential problem; this could be addressed with an internal tool that mines interaction logs to try and surface them for human labeling. If customer privacy is an issue, offer customers a discount if they’re willing to share their logs, have humans label a random subset of logs based on whether they feel there was insufficient/excessive clarification, and use that as training data.]
Can we taboo “optimize”? What specifically does “optimize strongly” mean in an AutoGPT context? For example, if we run AutoGPT on a faster processor, does that mean it is “optimizing more strongly”? It will act on the world faster, so in that sense it could be considered a “more powerful optimizer”. But if it’s just performing the same operations faster, I don’t see how Goodhart issues get worse.
Goodhart is a problem if you have an imperfect metric that can be gamed. If we design AutoGPT so there’s no metric and it’s also not trying to game anything, I’m not seeing an issue. Presumably there is or will be some sort of outer loop which fine-tunes AutoGPT interaction logs against a measure of overall quality, and that’s worth thinking about, but it’s also similar to how ChatGPT is trained, no? So I don’t know how much risk we’re adding there.
I get the sense that you’re a person with a hammer and everything looks like a nail. You’ve got some pre-existing models of how AI is supposed to fail, and you’re trying to apply them in every situation even if they don’t necessarily fit. [Note, this isn’t really a criticism of you in particular, I see it a lot in LessWrong AI discourse.] From my perspective, the important thing is to have some people with security mindset working at AutoGPT, getting their hands dirty, thinking creatively about how stuff could go wrong, and trying to identify what the actual biggest risks are given the system’s architecture + how best to address them. I worry that person-with-a-hammer syndrome is going to create blind spots for the actual biggest risks, whatever those may be.
Perhaps it’s worth comparing AutoGPT to a baseline of a human upload. In the past, I remember alignment researchers claiming that a high-fidelity upload would be preferable to de novo AI, because with the upload, you don’t need to solve the alignment problem. But as you say, miscommunication could easily happen with a high-fidelity upload.
If we’ve reduced the level of danger to the level of danger we experience with ordinary human miscommunication, that seems like an important milestone. There’s a trollish argument to be made here, that if human miscommunication is the primary danger, we shouldn’t be engaged in e.g. genetic engineering for intelligence enhancement either, because it could produce superhumanly intelligent agents that we’ll have miscommunications with :-)
In fact, the biggest problem we have with other humans is that they straight up have different values than us. Compared to that problem, miscommunication is small. How many wars have been fought over miscommunication vs value differences? Perhaps you can find a few wars that were fought primarily due to miscommunication, but that’s remarkable because it’s rare.
An AutoGPT that’s more aligned with me than I’m aligned with my fellow humans looks pretty feasible.
[Again, I appreciate corrections from anyone who’s experienced with AutoGPT! Please reply and correct me!]
Yes, but that would require it to be robustly aimed at the goal of faithfully eliciting the user’s preferences and following them. And if it’s not precisely robustly aimed at it, if we’ve miscommunicated what “faithfulness” means, then it’ll pursue its misaligned understanding of faithfulness, which would lead to it pursuing a non-intended interpretation of the users’ requests.
Like, this just pushes the same problem back one step.
And I agree that it’s a solvable problem, and that it’s something worthwhile to work on. It’s basically just corrigibility, really. But it doesn’t simplify the initial issue.
Ability to achieve real-world outcomes. For example, an AutoGPT instance that can overthrow a government is a stronger optimizer than an AutoGPT instance that can at best make you $100 in a week. Here’s a pretty excellent post on the matter of not-exactingly-aimed strong optimization predictably resulting in bad outcomes.
I mean, it’s trying to achieve some goal out in the world. The goal’s specification is the “metric”, and while it’s not trying to maliciously “game” it, it is trying to achieve it. The goal’s specification as it understands it, that is, not the goal as it’s intended. Which would be isomorphic to it Goodharting on the metric, if the two diverge.
I get the sense that people are sometimes too quick to assume that something which looks like a hammer from one angle is a hammer.
As above, by “Goodharting” there (which wasn’t even the term I introduced into the discussion) I didn’t mean the literal same setup as in e. g. economics, where there’s a bunch of schemers that deliberately maliciously manipulate stuff in order to decouple the metric from the variable it’s meant to measure. I meant the general dynamic where we have some goal, we designate some formal specification for it, then point an optimization process at the specification, and inasmuch as the intended-goal diverges from the formal-goal, we get unintended results.
That’s basically the system “Goodharting” on the “metric”. Same concept, could be viewed through the same lens.
This sort of miscommunication is also prevalent in e. g. talk about agents having utility functions, or engaging in search. When I talk about this, I’m not imagining a literal wrapper-mind setup, that is literally simulating every possible way things could go and plugging that in its compactly-specified utility function – as if it’s an unbounded AIXI or something. Obviously that’s not realistically implementable. But there could be practical mind designs that are approximately isomorphic to this sort of setup in the limit, and they could have properties that are approximately the same as those of a wrapper-mind.
(I know you weren’t making this specific point; just broadly gesturing at the idea.)
For sure.
I think this argument only makes sense if it makes sense to think of the “AutoGPT clarification module” as trying to pursue this goal at all costs. If it’s just a while loop that asks clarification questions until the goal is “sufficiently clarified”, then this seems like a bad model. Maybe a while loop design like this would have other problems, but I don’t think this is one of them.
OK, so by this definition, using a more powerful processor with AutoGPT (so it just does the exact same operations faster) makes it a more “powerful optimizer”, even though it’s working exactly the same way and has exactly the same degree of issues with Goodharting etc. (just faster). Do I understand you correctly?
This seems potentially false depending on the training method. If it’s being trained to imitate experts, for example, I expect the key question is the degree to which there are examples in the dataset of experts following the sort of procedure that would be vulnerable to Goodharting (step 1: identify the goal specification; step 2: try to achieve it as you understand it, not worrying about possible divergence from user intent).
Yeah, I just don’t think this is the only way that a system like AutoGPT could be implemented. Maybe it is how current AutoGPT is implemented, but then I encourage alignment researchers to join the organization and change that.
They could, but people seem to assume they will, with poor justification. I agree it’s a reasonable heuristic for identifying potential problems, but it shouldn’t be the only heuristic.
… How do you define “sufficiently clarified”, and why is that step not subject to miscommunication / the-problem-that-is-isomorphic-to-Goodharting?
I’d tried to reason about similar setups before, and my conclusion was that it has to bottom out in robust alignment somewhere.
I’d be happy to be proven wrong on that, though. Wow, wouldn’t that make matters easier...
Sure? I mean, presumably it doesn’t do the exact same operations. Surely it’s exploiting its ability to think faster in order to more closely micromanage its tasks, or something. If not, if it’s just ignoring its greater capabilities, then no, it’s not a stronger optimizer.
I don’t think that gets you to dangerous capabilities. I think you need the system to have a consequentialist component somewhere, which is actually focused on pursuing the goal.
Here’s what I wrote previously:
In more detail, the way I would do it would be: I give AutoGPT a task, and it says “OK, I think what you mean is: [much more detailed description of the task, clarifying points of uncertainty]. Is that right?” Then the user can effectively edit that detailed description until (a) the user is satisfied with it, and (b) a model trained on previous user interactions considers it sufficiently detailed. Once we have a detailed task description that’s mutually satisfactory, AutoGPT works from it. For simplicity, assume for now that nothing comes up during the task that would require further clarification (that scenario gets more complicated).
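A minimal sketch of that clarification loop, to make the proposal concrete. The three callables are hypothetical stand-ins, not anything from AutoGPT's actual code: `propose` drafts the detailed description, `user_review` returns the user's edited version plus whether they are satisfied, and `is_sufficiently_detailed` plays the role of the model trained on previous interactions. The `max_rounds` cap is my own addition so the loop always terminates.

```python
from typing import Callable, Tuple

def clarify_task(
    task: str,
    propose: Callable[[str], str],
    user_review: Callable[[str], Tuple[str, bool]],
    is_sufficiently_detailed: Callable[[str], bool],
    max_rounds: int = 5,
) -> str:
    """Return a description accepted by both the user and the learned check."""
    description = propose(task)  # "OK, I think what you mean is: ..."
    for _ in range(max_rounds):
        # The user edits the description and says whether they are satisfied.
        description, user_satisfied = user_review(description)
        # Condition (a): user is satisfied. Condition (b): the model agrees.
        if user_satisfied and is_sufficiently_detailed(description):
            return description
    # Assumption: if agreement is never reached, do not act on the task.
    raise RuntimeError("task was not sufficiently clarified")

# Toy run with scripted user edits and a keyword check standing in for the model.
edits = iter([
    ("Order 10 first-class stamps", False),
    ("Order 10 first-class stamps, spending at most $10", True),
])
final = clarify_task(
    "Order me some stamps",
    propose=lambda t: f"I think you mean: {t}",
    user_review=lambda d: next(edits),
    is_sufficiently_detailed=lambda d: "at most" in d,
)
print(final)
```

Note that nothing in this loop searches for extreme outcomes; it just keeps asking until both checks pass or it gives up.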
So to answer your specific questions:
The definition of “sufficiently clarified” is based on a model trained from examples of (a) a detailed task description and (b) whether that task description ended up being too ambiguous. Miscommunication shouldn’t be a huge issue because we’ve got a human labeling these examples, so the model has lots of concrete data about what is/is not a good task description.
If the learned model for “sufficiently clarified” is bad, then sometimes AutoGPT will consider a task “sufficiently clarified” when it really isn’t (isomorphic to Goodharting, also similar to the hallucinations that ChatGPT is susceptible to). In these cases, the user is likely to complain that AutoGPT didn’t do what they wanted, and it gets added as a new training example to the dataset for the “sufficiently clarified” model. So the learned model for “sufficiently clarified” gets better over time. This isn’t necessarily the ideal setup, but it’s also basically what the ChatGPT team does. So I don’t think there is significant added risk. If one accepts the thesis of your OP that ChatGPT is OK, this seems OK too. In both cases we’re looking at the equivalent of an occasional hallucination, which hurts reliability a little bit.
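Roughly, the feedback loop I have in mind looks like the sketch below. All names here (`ClarityDataset`, `record_outcome`, `fit_min_words`) are hypothetical, and the word-count threshold is a deliberately crude stand-in for a real learned classifier; the only point is that each complaint becomes a labeled example that tightens the check.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class ClarityDataset:
    """Labeled examples: (task description, did it turn out too ambiguous?)."""
    examples: List[Tuple[str, bool]] = field(default_factory=list)

    def record_outcome(self, description: str, user_complained: bool) -> None:
        # A complaint is treated as a label that the description was too ambiguous.
        self.examples.append((description, user_complained))

def fit_min_words(dataset: ClarityDataset) -> int:
    """Toy 'model': demand more words than the longest description that drew a complaint."""
    ambiguous_lengths = [len(d.split()) for d, bad in dataset.examples if bad]
    return max(ambiguous_lengths, default=0) + 1

def is_sufficiently_clarified(description: str, min_words: int) -> bool:
    return len(description.split()) >= min_words

dataset = ClarityDataset()
print(is_sufficiently_clarified("buy stamps", fit_min_words(dataset)))  # before any complaint
dataset.record_outcome("buy stamps", user_complained=True)
print(is_sufficiently_clarified("buy stamps", fit_min_words(dataset)))  # after retraining
```

A real version would retrain a proper classifier on the labeled logs, but the shape of the loop (act, collect complaints, relabel, refit) would be the same.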
Recall your original claim: “inasmuch as AutoGPT optimizes strongly, it would end up implementing something that looks precisely like what it understood the user to mean, but which would look like a weird unintended extreme from the user’s point of view.”
The thought experiment here is that we take the exact same AutoGPT code and just run it on a faster processor. So no, it’s not “exploiting its ability to think faster in order to more closely micromanage its tasks”. But it does have “greater capabilities” in the sense of doing everything faster—due to a faster processor.
Once AutoGPT is running on a faster processor, I might choose to use AutoGPT more ambitiously. Perhaps I could get a week’s worth of work done in an hour, instead of a day’s worth of work. Or just get a week’s worth of work done in well under an hour. But since it’s the exact same code, your original “inasmuch as AutoGPT optimizes strongly” claim would not appear to apply.
I really dislike how people use the word “optimization” because it bundles concepts together in a way that’s confusing. In this specific case, your “inasmuch as AutoGPT optimizes strongly” claim is true, but only in a very specific sense. Specifically, it is true if AutoGPT has some model of what the user means, tries to identify the very maximal state of the world that corresponds to that understanding, and then works to bring about that state of the world. In the broad sense of an “optimizer”, there are ways to make AutoGPT a stronger “optimizer” that don’t exacerbate this problem, such as running it on a faster processor, or giving it access to new APIs, or even (I would argue) having it micromanage its tasks more closely, as long as that doesn’t affect its notion of “desired states of the world” (e.g. for simplicity, no added task micromanagement when reasoning about “desired states of the world”, but it’s OK in other circumstances). [Caveat: giving access to e.g. new APIs could make AutoGPT more effective at implementing its model of user prefs, so it’s a bigger footgun if that model happens to be bad. But I don’t think new APIs will worsen the user pref model.]
Cool, well maybe we should get alignment people to work at AutoGPT to influence the AutoGPT people to not develop dangerous capabilities then, by focusing on e.g. imitating experts :-) I’m not actually seeing a disagreement here.
Oh, if we’re assuming this setup doesn’t have to be robust to AutoGPT being superintelligent and deciding to boil the oceans because of a misunderstood instruction, then yeah, that’s fine.
That’s the part that would exacerbate the issue where it sometimes misunderstands your instructions. If you’re using it for more ambitious tasks, or more often, then there are more frequent opportunities for misunderstanding, and their consequences are larger-scale. Which means that, to whichever extent it’s prone to misunderstanding you, that gets amplified, as does the damage the misunderstandings cause.
Oh, sure, I’m not opposing that. It may not be the highest-value place for a given person to be, but it might be for some.
Is agency actually the issue by itself or just a necessary component?
Consider Robert Miles’s stamp-collecting robot:
“Order me some stamps in the next 32k tokens/60 seconds” has less scope than “guard my stamps today”, which in turn has less scope than “ensure I always have enough stamps”. The last one triggers power seeking; the first two do not benefit from seeking power unless the payoff on the power-seeking investment arrives within the time interval.
Note also that AutoGPT, even if given a goal and allowed to run forever, has immutable weights and a finite context window hobbling it.
So you need human-level prediction + relevant modalities + agency + long-duration goal + memory at a bare minimum. Remove any element and the danger may be negligible.
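To spell out the structure of that claim: it is a conjunction of necessary conditions plus a time-horizon test for whether power seeking pays off. The sketch below encodes only that structure; the class and function names are hypothetical, and the boolean flags are obviously a simplification of properties that come in degrees.

```python
from dataclasses import dataclass, fields
from typing import List

@dataclass(frozen=True)
class SystemProfile:
    human_level_prediction: bool
    relevant_modalities: bool
    agency: bool
    long_duration_goal: bool
    memory: bool

def missing_ingredients(profile: SystemProfile) -> List[str]:
    """List the necessary conditions the system lacks (empty means all are present)."""
    return [f.name for f in fields(profile) if not getattr(profile, f.name)]

def power_seeking_pays(payback_seconds: float, task_horizon_seconds: float) -> bool:
    """Power seeking only helps if the investment pays back before the task ends."""
    return payback_seconds < task_horizon_seconds

# "Order me some stamps in the next 60 seconds": bounded goal, no persistent memory.
bounded_order = SystemProfile(True, True, True, long_duration_goal=False, memory=False)
print(missing_ingredients(bounded_order))
print(power_seeking_pays(payback_seconds=3600, task_horizon_seconds=60))
```

Having every ingredient is a bare minimum for danger, not a guarantee of it, so an empty list from `missing_ingredients` should be read as "worth worrying about", not "dangerous".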