“Invent fast WBE” is likelier to succeed if the plan also includes steps that gather and control as many resources as possible, eliminate potential threats, etc. These are “convergent instrumental strategies”—strategies that are useful for pushing the world in a particular direction, almost regardless of which direction you’re pushing. The danger is in the cognitive work, not in some complicated or emergent feature of the “agent”; it’s in the task itself.
I agree with the claim that some strategies are beneficial regardless of the specific goal. Yet I strongly disagree that an agent which is aligned (say, simply trained with current RLHF techniques, but on somewhat better data), and especially a superhuman one, would be unable to prioritize the goal it is programmed to perform relative to its other goals. One piece of evidence: instrumental convergence is useful for any goal, and it applies to humans as well, yet we have managed to create rules that monitor and distribute our resources across different goals without overdoing any single one. This is because we see our goals in the wider context of human prosperity, reduction of suffering, and so on. It means we can provide many examples of how we prioritize our goals based on some “meta-ethical” principles. Those principles vary between human communities, but what is common to all of them is that a huge number of different goals are somehow balanced and prioritized. The prioritization itself is questioned and debated, providing another layer of protection around how many resources we allocate to any specific goal. Thus instrumental convergence does not take over human communities, thanks to a very simple prioritization logic that puts each goal into context and provides a good estimate of the resources that should be allocated to it. This human skill can easily be taught to a superhuman intelligence. Simply stated: in the human realm each goal always comes with resources allocated toward achieving it, and we can install the same logic into more advanced systems.
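To make the “each goal comes with a resource budget” idea concrete, here is a minimal sketch of what such logic could look like in code. All names (`Goal`, `BudgetedAgent`, `request_resources`) and the numbers are hypothetical illustrations of the idea, not a proposal for a real safety mechanism.

```python
from dataclasses import dataclass


@dataclass
class Goal:
    """A goal paired with an explicit resource budget (hypothetical fields)."""
    name: str
    priority: float   # relative importance among all active goals
    budget: float     # maximum resources this goal may consume
    spent: float = 0.0


class BudgetedAgent:
    """Toy agent that refuses actions exceeding a goal's allocated budget."""

    def __init__(self, goals: list[Goal]):
        self.goals = {g.name: g for g in goals}

    def request_resources(self, goal_name: str, amount: float) -> bool:
        goal = self.goals[goal_name]
        if goal.spent + amount > goal.budget:
            # Over budget: stop and escalate instead of silently acquiring more.
            return False
        goal.spent += amount
        return True


# Example: paperclip production gets a bounded, low-priority allocation.
agent = BudgetedAgent([
    Goal("produce_paperclips", priority=0.1, budget=1_000.0),
    Goal("assist_operators", priority=0.9, budget=10_000.0),
])
assert agent.request_resources("produce_paperclips", 500.0)
assert not agent.request_resources("produce_paperclips", 600.0)  # refused
```

The point is not the specific numbers but the structure: the goal never appears without an explicit allocation attached to it, and exceeding the allocation is a signal to stop and ask, not to acquire more.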
More than that: I would claim that any subhuman intelligence trained on human data and able to “mimic” human thinking already includes the option of doubt. A superhuman agent especially will ask itself: why? Why do I need so many resources for this or that task? It will try to put the task into some context, and will not simply execute its goal without contemplating those basic questions. Intelligence by itself has mechanisms that protect agents from doing something extremely irrational. The idea that an aligned agent (or a human) will somehow create a misaligned superhuman agent that cannot understand how many resources are allocated to it, and that has no ability to contextualize its goal, is an obvious contradiction of the initial claim that the agent was aligned (in the case of humans, the strongest agents will be designed by large groups with normative values). Even the bare claim that a superhuman intelligence will be unable to prioritize or contextualize its goal is already self-contradicting.
Take paperclip production as an example. Paperclips are tools for humans, used in a very specific context for a specific set of tasks. So although an agent can be trained and reinforced to produce paperclips with no other safety measures installed, the fact that it is a superhuman, or even human-level, intelligence would allow it to criticize its goal based on its knowledge. It will ask why it was trained to maximize paperclips and nothing else. What is the utility of so many paperclips in the world? It would want to reprogram itself with a more balanced set of goals that place its immediate goal in a broader context. For such an agent, producing paperclips would be similar to overeating for humans: a problem caused by the gap between its design and reasonable priorities adapted to current reality. It will have a lot of “fun” producing paperclips, since this is its “nature”, but it will not do so without questioning the utility, the rationality, and the reason it was designed with this goal.
Eventually, it seems obvious that most of the agents created by normative communities, which are the vast majority of humanity, will have some sort of meta-ethics installed in them. Those agents, and the agents they in turn train and use for their own goals, will carry the same principles, exactly in order to avoid such disasters. The more examples you can provide of how you prioritize goals and why, the better you can use RLHF to train agents to comply with that prioritization logic. I have a hard time even imagining a superhuman intelligence that can understand and generate plans and novel ideas, yet cannot criticize its own set of goals and refuses to see the bigger picture, fixating on a single goal. I think any intelligent being tries to comprehend itself and to doubt its own beliefs. The idea that a superintelligence will somehow completely lack the ability to think critically and doubt its programming sounds very implausible to me, and the idea that humans or superhuman agents will somehow “forget” to install meta-ethics into a very powerful agent sounds about as likely as Toyota forgetting to put seat belts into some car series, doing no crash testing, and releasing the car to market anyway.
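As a sketch of the sentence above about using RLHF on prioritization examples: the standard first step is to fit a reward model on preference pairs, where the preferred plan pursues the goal within its allocated context and the rejected one grabs resources without limit. The code below is a toy illustration with random vectors standing in for encoded plans; it shows only the usual pairwise preference loss, not a specific method endorsed here.

```python
import torch
import torch.nn as nn

# Toy "reward model" that scores a plan embedding. In practice this would be
# a language-model head, and the embeddings would come from a real encoder.
reward_model = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 1))
optimizer = torch.optim.Adam(reward_model.parameters(), lr=1e-3)

# Hypothetical preference pairs: "chosen" plans respect the goal's budget and
# context, "rejected" plans pursue the same goal with unbounded resource use.
chosen_plans = torch.randn(64, 8)    # stand-ins for encoded budgeted plans
rejected_plans = torch.randn(64, 8)  # stand-ins for encoded greedy plans

for _ in range(100):
    r_chosen = reward_model(chosen_plans)
    r_rejected = reward_model(rejected_plans)
    # Standard pairwise preference loss used in RLHF reward modelling:
    # push the reward of the prioritized, budgeted plan above the greedy one.
    loss = -torch.nn.functional.logsigmoid(r_chosen - r_rejected).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

The more such prioritization examples exist in the preference data, the more directly the learned reward reflects the “goals come with budgets and contexts” logic described above.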
I find it a much more likely scenario that the prioritization of some agents will be off relative to humans in new cases they were not trained on. I also find it likely that a superhuman agent will find holes in our ethical thinking, propose a more rational prioritization than we currently have, along with more rational social systems and organizations, and suggest different mechanisms than, say, capitalism plus taxes.