Yes, but that would require it to be robustly aimed at the goal of faithfully eliciting the user’s preferences and following them. And if it’s not precisely robustly aimed at it, if we’ve miscommunicated what “faithfulness” means, then it’ll pursue its misaligned understanding of faithfulness, which would lead to it pursuing a non-intended interpretation of the users’ requests.
I think this argument only makes sense if it makes sense to think of the “AutoGPT clarification module” as trying to pursue this goal at all costs. If it’s just a while loop that asks clarification questions until the goal is “sufficiently clarified”, then this seems like a bad model. Maybe a while loop design like this would have other problems, but I don’t think this is one of them.
Ability to achieve real-world outcomes. For example, an AutoGPT instance that can overthrow a government is a stronger optimizer than an AutoGPT instance that can at best make you $100 in a week.
OK, so by this definition, using a more powerful processor with AutoGPT (so it just does the exact same operations faster) makes it a more “powerful optimizer”, even though it’s working exactly the same way and has exactly the same degree of issues with Goodharting etc. (just faster). Do I understand you correctly?
I mean, it’s trying to achieve some goal out in the world. The goal’s specification is the “metric”, and while it’s not trying to maliciously “game” it, it is trying to achieve it. The goal’s specification as it understands it, that is, not the goal as it’s intended. Which would be isomorphic to it Goodharting on the metric, if the two diverge.
This seems potentially false depending on the training method, e.g. if it’s being trained to imitate experts. In that case, I expect the key question is the degree to which the dataset contains examples of experts following the sort of procedure that would be vulnerable to Goodharting (step 1: identify the goal specification; step 2: try to achieve it as you understand it, without worrying about possible divergence from user intent).
I meant the general dynamic where we have some goal, we designate some formal specification for it, then point an optimization process at the specification, and inasmuch as the intended-goal diverges from the formal-goal, we get unintended results.
Yeah, I just don’t think this is the only way that a system like AutoGPT could be implemented. Maybe it is how current AutoGPT is implemented, but then I encourage alignment researchers to join the organization and change that.
But there could be practical mind designs that are approximately isomorphic to this sort of setup in the limit, and they could have properties that are approximately the same as those of a wrapper-mind.
They could, but people seem to assume they will, with poor justification. I agree it’s a reasonable heuristic for identifying potential problems, but it shouldn’t be the only heuristic.
asking clarification questions until the goal is “sufficiently clarified”
… How do you define “sufficiently clarified”, and why is that step not subject to miscommunication / the-problem-that-is-isomorphic-to-Goodharting?
I’d tried to reason about similar setups before, and my conclusion was that it has to bottom out in robust alignment somewhere.
I’d be happy to be proven wrong on that, though. Wow, wouldn’t that make matters easier...
OK, so by this definition, using a more powerful processor with AutoGPT (so it just does the exact same operations faster) makes it a more “powerful optimizer”, even though it’s working exactly the same way and has exactly the same degree of issues with Goodharting etc. (just faster). Do I understand you correctly?
Sure? I mean, presumably it doesn’t do the exact same operations. Surely it’s exploiting its ability to think faster in order to more closely micromanage its tasks, or something. If not, if it’s just ignoring its greater capabilities, then no, it’s not a stronger optimizer.
This seems potentially false depending on the training method, e.g. if it’s being trained to imitate experts
I don’t think that gets you to dangerous capabilities. I think you need the system to have a consequentialist component somewhere, which is actually focused on pursuing the goal.
… How do you define “sufficiently clarified”, and why is that step not subject to miscommunication / the-problem-that-is-isomorphic-to-Goodharting?
Here’s what I wrote previously:
...AutoGPT could be superhuman at these calibration and clarification tasks, if the company collects a huge dataset of user interactions along with user complaints due to miscommunication. [Subtle miscommunications that go unreported are a potential problem—could be addressed with an internal tool that mines interaction logs to try and surface them for human labeling. If customer privacy is an issue, offer customers a discount if they’re willing to share their logs, have humans label a random subset of logs based on whether they feel there was insufficient/excessive clarification, and use that as training data.]
In more detail, the way I would do it would be: I give AutoGPT a task, and it says “OK, I think what you mean is: [much more detailed description of the task, clarifying points of uncertainty]. Is that right?” Then the user can effectively edit that detailed description until (a) the user is satisfied with it, and (b) a model trained on previous user interactions considers it sufficiently detailed. Once we have a detailed task description that’s mutually satisfactory, AutoGPT works from it. For simplicity, assume for now that nothing comes up during the task that would require further clarification (that scenario gets more complicated).
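The loop described above can be sketched roughly as follows. This is only an illustration under assumptions, not how AutoGPT is actually implemented: `clarify_task`, `propose`, `ask_user`, `is_sufficient`, and `max_rounds` are all hypothetical names, with `propose` standing in for the step that writes the detailed description and `is_sufficient` standing in for the model trained on previous user interactions.

```python
from typing import Callable, Optional


def clarify_task(
    task: str,
    propose: Callable[[str], str],              # hypothetical: expands a task into a detailed description
    ask_user: Callable[[str], Optional[str]],   # hypothetical: returns an edited description, or None to accept
    is_sufficient: Callable[[str], bool],       # hypothetical: learned "sufficiently clarified" check
    max_rounds: int = 10,
) -> str:
    """Return a description that both the user and the sufficiency check accept."""
    description = propose(task)
    for _ in range(max_rounds):
        edited = ask_user(description)
        if edited is not None:
            # The user changed something, so show the edited version back to them.
            description = edited
            continue
        if is_sufficient(description):
            # Condition (a): user satisfied. Condition (b): model considers it detailed enough.
            return description
        # The user is satisfied but the model is not: ask for a more detailed proposal.
        description = propose(description)
    raise RuntimeError("no mutually satisfactory description within max_rounds")
```

Note that the loop is bounded and has no objective of its own beyond the two acceptance checks; whether `is_sufficient` is any good is a separate question.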
So to answer your specific questions:
The definition of “sufficiently clarified” is based on a model trained from examples of (a) a detailed task description and (b) whether that task description ended up being too ambiguous. Miscommunication shouldn’t be a huge issue because we’ve got a human labeling these examples, so the model has lots of concrete data about what is/is not a good task description.
If the learned model for “sufficiently clarified” is bad, then sometimes AutoGPT will consider a task “sufficiently clarified” when it really isn’t (isomorphic to Goodharting, also similar to the hallucinations that ChatGPT is susceptible to). In these cases, the user is likely to complain that AutoGPT didn’t do what they wanted, and it gets added as a new training example to the dataset for the “sufficiently clarified” model. So the learned model for “sufficiently clarified” gets better over time. This isn’t necessarily the ideal setup, but it’s also basically what the ChatGPT team does. So I don’t think there is significant added risk. If one accepts the thesis of your OP that ChatGPT is OK, this seems OK too. In both cases we’re looking at the equivalent of an occasional hallucination, which hurts reliability a little bit.
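As a toy illustration of that feedback loop (an assumption-laden sketch, not a real training pipeline): `SufficiencyModel` below is a hypothetical stand-in that treats a description as clear only if it is longer than every description users later complained about. A real system would use a learned classifier; word count is used here purely so the example is self-contained.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class SufficiencyModel:
    """Toy stand-in for the learned "sufficiently clarified" model."""

    # Each example is (task description, whether it turned out too ambiguous).
    examples: List[Tuple[str, bool]] = field(default_factory=list)

    def add_example(self, description: str, too_ambiguous: bool) -> None:
        # A user complaint (or a human label) becomes a new training example.
        self.examples.append((description, too_ambiguous))

    def is_sufficient(self, description: str) -> bool:
        # Hypothetical rule: require more words than any description known to be ambiguous.
        ambiguous_lengths = [len(d.split()) for d, bad in self.examples if bad]
        threshold = max(ambiguous_lengths, default=0)
        return len(description.split()) > threshold


model = SufficiencyModel()
model.add_example("write a report", too_ambiguous=True)  # the user complained about this one
```

After the complaint is recorded, the same vague description no longer passes the check, which is the sense in which the model improves over time.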
Sure? I mean, presumably it doesn’t do the exact same operations. Surely it’s exploiting its ability to think faster in order to more closely micromanage its tasks, or something. If not, if it’s just ignoring its greater capabilities, then no, it’s not a stronger optimizer.
Recall your original claim: “inasmuch as AutoGPT optimizes strongly, it would end up implementing something that looks precisely like what it understood the user to mean, but which would look like a weird unintended extreme from the user’s point of view.”
The thought experiment here is that we take the exact same AutoGPT code and just run it on a faster processor. So no, it’s not “exploiting its ability to think faster in order to more closely micromanage its tasks”. But it does have “greater capabilities” in the sense of doing everything faster—due to a faster processor.
Once AutoGPT is running on a faster processor, I might choose to use AutoGPT more ambitiously. Perhaps I could get a week’s worth of work done in an hour, instead of a day’s worth of work. Or just get a week’s worth of work done in well under an hour. But since it’s the exact same code, your original “inasmuch as AutoGPT optimizes strongly” claim would not appear to apply.
I really dislike how people use the word “optimization” because it bundles concepts together in a way that’s confusing. In this specific case, your “inasmuch as AutoGPT optimizes strongly” claim is true, but only in a very specific sense. Specifically, it holds if AutoGPT has some model of what the user means, tries to identify the very maximal state of the world that corresponds to that understanding, and then works to bring about that state of the world. In the broad sense of an “optimizer”, there are ways to make AutoGPT a stronger “optimizer” that don’t exacerbate this problem, such as running it on a faster processor, or giving it access to new APIs, or even (I would argue) having it micromanage its tasks more closely, as long as that doesn’t affect its notion of “desired states of the world” (e.g. for simplicity, no added task micromanagement when reasoning about “desired states of the world”, but it’s OK in other circumstances). [Caveat: giving access to e.g. new APIs could make AutoGPT more effective at implementing its model of user prefs, so it’s therefore a bigger footgun if that model happens to be bad. But I don’t think new APIs will worsen the user pref model.]
I don’t think that gets you to dangerous capabilities. I think you need the system to have a consequentialist component somewhere, which is actually focused on pursuing the goal.
Cool, well maybe we should get alignment people to work at AutoGPT to influence the AutoGPT people to not develop dangerous capabilities then, by focusing on e.g. imitating experts :-) I’m not actually seeing a disagreement here.
This isn’t necessarily the ideal setup, but it’s also basically what the ChatGPT team does. So I don’t think there is significant added risk. If one accepts the thesis of your OP that ChatGPT is OK, this seems OK too
Oh, if we’re assuming this setup doesn’t have to be robust to AutoGPT being superintelligent and deciding to boil the oceans because of a misunderstood instruction, then yeah, that’s fine.
Once AutoGPT is running on a faster processor, I might choose to use AutoGPT more ambitiously
That’s the part that would exacerbate the issue where it sometimes misunderstands your instructions. If you’re using it for more ambitious tasks, or more often, then there are more frequent opportunities for misunderstanding, and their consequences are larger-scale. Which means that, to whatever extent it’s prone to misunderstanding you, that gets amplified, as does the damage the misunderstandings cause.
Cool, well maybe we should get alignment people to work at AutoGPT to influence the AutoGPT people to not develop dangerous capabilities then, by focusing on e.g. imitating experts :-)
Oh, sure, I’m not opposing that. It may not be the highest-value place for a given person to be, but it might be for some.