Quick take: roughly speaking, adversarial examples are the Modern Reformulation you’re asking about.
In my mind the main issue here is that we probably need extreme levels of robustness / OOD-catching. And these probably only come much too late, after less-cautious actors have deployed AI systems that induce lots of x-risk.
Interesting! I wonder whether adversarial robustness improvement is a necessary step in AGI capabilities, and thus represents a blocker from the other side.
Not to mention that there’s a race between “how many planning steps can you do” and “how hard have you made it to find adversarial examples”, and their relative growth curves determine which wins.
I think treating adversarial robustness/OOD handling as a single continuous dimension is the wrong way to go about it.
The basic robustness problem is that there are variables that are usually correlated, or usually restricted to some range, or usually independent, or otherwise usually satisfy some “nice” property. This allows you to be “philosophically lazy” by not making the distinctions that would be required if the nice property doesn’t hold.
But once the nice property fails, the distinctions you need to make are going to depend on what the purpose of your reasoning is. So there will be several different “ways” of being robust, where most of them will not lead to alignment.
For instance, if you’re not good at lying, then telling the truth is basically the same as not getting caught lying. However, as you gain capabilities, the assumption that these two go together ends up failing, because you can lie and cover your tracks. The most appropriate way to generalize depends on what you’re trying to do, e.g. whether you are trying to help vs trying to convince others.
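To make the lying example concrete, here is a toy sketch in Python. Everything in it is a made-up illustration (the names `truthful`, `not_caught`, `cover_skill`, and the detection threshold are assumptions, not anything from a real system). It only shows that the two criteria agree on every case while the agent is bad at covering its tracks, and come apart once it is good at it.

```python
def truthful(lied: bool) -> bool:
    """Criterion A: the agent told the truth."""
    return not lied


def not_caught(lied: bool, cover_skill: float, detect_threshold: float = 0.5) -> bool:
    """Criterion B: the agent was not caught lying.

    Hypothetical model: a lie goes undetected only if the agent's skill at
    covering its tracks exceeds the detection threshold.
    """
    return (not lied) or cover_skill > detect_threshold


def proxies_agree(cover_skill: float) -> bool:
    """True if both criteria give the same verdict whether or not the agent lies."""
    return all(
        truthful(lied) == not_caught(lied, cover_skill)
        for lied in (False, True)
    )


if __name__ == "__main__":
    print("low capability, criteria agree:", proxies_agree(0.1))
    print("high capability, criteria agree:", proxies_agree(0.9))
```

A learner trained only in the low-skill regime cannot tell from the data which of the two criteria it was supposed to generalize.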
I think if you have already figured out a way to get the AI to try to be aligned to humans, it is reasonable to rely on capabilities researchers to figure out the OOD/adversarial robustness solutions necessary to make it work. However, here you instead want to go the other way, relying on capabilities researchers’ OOD/adversarial robustness to define “being aligned to humans”, and I don’t think this is going to work, since it lacks a ground-truth purpose that could guide it.
Note that “in a new ontology the previous reward signals have become underspecified, and therefore within the reward module we have a submodule that gets clarification from a human on which alternate hypothesis is true” is in principle a dynamic solution to that type of failure.
(See e.g. https://arxiv.org/abs/2202.03418)
To head off the anticipated response: this does still count as “the reward” to the model, because it is all part of the mechanism through which the reward is being generated from the state.
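A minimal sketch of what I mean, in Python, under heavy assumptions: the candidate reward hypotheses are given explicitly, `ask_human` is a hypothetical callback, and the human's answer is remembered forever. None of this is taken from the linked paper; it is just the shape of the idea.

```python
from typing import Callable, Dict, List

State = Dict[str, float]
RewardFn = Callable[[State], float]


def make_reward_module(
    hypotheses: Dict[str, RewardFn],
    ask_human: Callable[[State, List[str]], str],
    tolerance: float = 1e-6,
) -> RewardFn:
    """Build a reward function that queries a human when its hypotheses disagree.

    `hypotheses` maps a name to one candidate reading of the reward.
    `ask_human` (hypothetical) is shown the state and the hypothesis names
    and returns the name of the intended one.
    """
    chosen: Dict[str, str] = {}  # remembers the human's answer once given

    def reward(state: State) -> float:
        values = {name: fn(state) for name, fn in hypotheses.items()}
        if "name" in chosen:
            return values[chosen["name"]]
        # While all hypotheses agree, the reward is well specified.
        if max(values.values()) - min(values.values()) <= tolerance:
            return next(iter(values.values()))
        # They diverge: the old signal is underspecified here, so ask.
        chosen["name"] = ask_human(state, sorted(values))
        return values[chosen["name"]]

    return reward
```

From the agent's side this is still a single function from state to reward; the query to the human happens inside it.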
Sure, but I consider this approach to fall under attempts to “try to be aligned to humans”. It doesn’t seem like it would be a blocker on the capabilities side if this is missing, only on the alignment side.
(On the alignment side, there’s the issue that your proposed solution is easier said than done.)
I guess I expect that even at low capability levels, reward disambiguation will be crucial and capabilities researchers will be working on it.
I don’t see that as likely, because at low capability levels, researchers can notice that the reward isn’t working and just fix it, without needing to rely on the AI asking them.
Consider a task like asking a generally-intelligent chatbot to buy you furniture you like. The only reasonable way to model the reward involves asking 20-questions about your sub-preferences for sofa styles. This seems like the nature of most service sector tasks?
I have a hard time inferring the specifics of that scenario, and I think the specifics probably matter a lot. So I need to ask some further questions.
Why exactly would a generally-intelligent chatbot be useful for buying furniture (over, say, an expert system)? If I try to come up with reasons, I could imagine it would make sense if it has to find the best deal over unstructured data including all sorts of arbitrary settings, such as people who put their couch up for sale. Or if it has to go out and get the furniture. Is that what you have in mind?
Furthermore, let’s repeat that the hard part isn’t in manually specifying a distinction when you have that distinction in mind, it’s in spontaneously recognizing a need for a distinction, accurately conveying the options for the distinctions to the humans, and interpreting that to pick the appropriate distinction. When it comes to something like a firm that sells a chatbot for furniture preferences, I don’t really follow how this latter part is needed. Because it seems like the people who make the furniture-buying chatbot could sit down and enumerate whatever preferences are needed to be clarified, and then code that into the chatbot directly. The best explanation I can come up with is that you imagine it being much more general than this, being more like a sort of servant bot which can handle many tasks, not just buying furniture?
Finally, I’m unsure of what capabilities you imagine the chatbot to have. For instance, a possible “ground truth” you could use for training would be to have humans rate the furniture after they’ve received and used it, on a scale from bad to good. For bots that are not very capable, perhaps the best way to optimize their ratings would be to just get good furniture. But for bots that are highly capable, there are many other ways to get good reviews, e.g. hacking into the system and overriding them. I’m not sure if you imagine the low-capability end or the high-capability end here.
The chatbot is “generally intelligent”, so buying furniture is just one of many tasks it may be asked to execute; another task it could be asked to do is “order me some food”.
The hard part is indeed in spontaneously recognizing distinctions—but we already reward RL agents for curiosity, i.e. taking an action for which your world model fails to predict the consequences. Predicting which new distinctions are salient-to-humans is a thing you can optimize, because you can cleanly label it.
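As a rough illustration of the curiosity part, here is a Python sketch of a prediction-error bonus. This is one simple way to formalize "the world model fails to predict the consequences", not a claim about any particular system; the function names and the weight `beta` are assumptions for the example.

```python
from typing import Sequence


def curiosity_bonus(predicted_next: Sequence[float], actual_next: Sequence[float]) -> float:
    """Squared error between the world model's predicted next state and the observed one."""
    if len(predicted_next) != len(actual_next):
        raise ValueError("prediction and observation must have the same length")
    return sum((p - a) ** 2 for p, a in zip(predicted_next, actual_next))


def total_reward(
    task_reward: float,
    predicted_next: Sequence[float],
    actual_next: Sequence[float],
    beta: float = 0.1,
) -> float:
    """Task reward plus a curiosity term weighted by the (hypothetical) coefficient beta."""
    return task_reward + beta * curiosity_bonus(predicted_next, actual_next)
```

The analogous move for distinctions would be to swap the prediction error for a learned predictor of "a human would want to be asked about this", trained on human labels. That predictor is not shown here.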
Also to clarify, we’re only arguing here about whether this capability will be naturally invested-in, so I don’t think it matters if highly capable bots have other strategies.
I think the capabilities of the AI matter a lot for alignment strategies, and that’s why I’m asking you about it and why I need you to answer that question.
A subhuman intelligence would rely on humans to make most of the decisions. It would order human-designed furniture types through human-created interfaces and receive human-fabricated furniture. At each of those steps, it delegates an enormous number of decisions to humans, which makes those decisions automatically end up reasonably aligned, but also prevents the AI from doing optimization over them. In the particular case of human-designed interfaces, they tend to automatically expose information about the things that humans care about, and eliciting human preferences can be shortcut by focusing on these dimensions.
But a superhuman intelligence would solve tasks through taking actions independently of humans, as that can allow it to more highly optimize the outcomes. And a solution for alignment that relies on humans making most of the decisions would presumably not generalize to this case, where the AI makes most of the decisions.
I think there are intermediate cases, where some but not all decisions are delegated, that require this sort of tooling. See e.g. this paper from today, which focuses on how to learn intent: http://ai.googleblog.com/2022/04/simple-and-effective-zero-shot-task.html