A friend and I went on a long drive recently and listened to this podcast with Andrew Critch on ARCHES. On the way back from our drive we spent some time brainstorming solutions to the problems he outlines. Here are some notes on the podcast + some notes on our brainstorming.
In a possibly inaccurate nutshell, Critch argues that what we think of as the “alignment problem” is most likely going to get solved because there are strong economic incentives to solve it. However, Critch is skeptical of forming a singleton—he says people tend to resist that kind of concentration of power, and it will be hard for an AI team that has this as their plan to recruit team members. Critch says there is really a taxonomy of alignment problems:
single-single, where we have a single operator aligning a single AI with their preferences
single-multi, where we have a single operator aligning multiple AIs with their preferences
multi-single, where we have multiple operators aligning a single AI with their preferences
multi-multi, where we have multiple operators aligning multiple AIs with their preferences
Critch says that although there are commercial incentives to solve the single-single alignment problem, there aren’t commercial incentives to solve all of the others. He thinks the real alignment failures might look like the sort of diffusion of responsibility you see when navigating bureaucracy.
I’m a bit skeptical of this perspective. For one thing, I’m not convinced commercial incentives for single-single alignment will extrapolate well to exotic scenarios such as the “malign universal prior” problem—and if hard takeoff happens then these exotic scenarios might come quickly. For another thing, although I can see why advocating a singleton would be a turnoff to the AI researchers that Critch is pitching, I feel like the question of whether to create a singleton deserves more than the <60 seconds of thought that an AI researcher having a casual conversation with Critch likely puts into their first impression. If there are commercial incentives to solve single-single alignment but not other kinds, shouldn’t we prefer that single-single is the only kind which ends up being load-bearing? Why can’t we form an aligned singleton and then tell it to design a mechanism by which everyone can share their preferences and control what the singleton does (democracy but with better reviews)?
I guess a big issue is the plausibility of hard takeoff, because if hard takeoff is implausible, that makes it less likely that a singleton will form under any circumstances, and it also means that exotic safety problems aren’t likely to crop up as quickly. If this is Critch’s worldview then I could see why he is prioritizing the problems he is prioritizing.
Anyway my friend and I spent some time brainstorming about how to solve versions of the alignment problem besides single-single. Since we haven’t actually read ARCHES or much relevant literature, it’s likely that much of what comes below is clueless, but it might also have new insights due to being unconstrained by existing paradigms :P
One scenario that sits somewhere between multi-single and multi-multi alignment is one where everyone has an AI agent which negotiates with some kind of central server on their behalf. We could turn multi-single into this scenario by telling the single AI to run internal simulations of everyone’s individual AI agent, or we could turn multi-multi into this scenario if we have enough cooperation/enforcement for different people to abide by the agreements that their AI agents make with one another on their behalf.
Most of the game theory we’re familiar with deals with a fairly small space of possible agreements, but it occurred to us that in an ideal world, these supersmart AIs would be doing a lot of creative thinking, trying to figure out a clever way for everyone’s preferences to be satisfied simultaneously. Let’s assume each robot agent has a perfect model of its operator’s preferences (or can acquire one as needed by querying the operator). The central server queries the agents about how much utility their operator assigns to various scenarios, or whether they prefer Scenario A to Scenario B, or something like that. And the agents can respond either truthfully or deceptively (“data poisoning”), trying to navigate towards a final agreement which is as favorable as possible for their operator. Then the central server searches the space of possible agreements in a superintelligent way and tries to find an agreement that everyone likes. (You can also imagine a distributed version of this where there is no central server and individual robot agents try to come up with a proposal that everyone likes.)
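To make the setup concrete, here is a minimal sketch of that loop. Everything in it (the Agent class, the maximin objective, the pre-enumerated list of candidate agreements) is made up by us for illustration; a real server would search the agreement space creatively rather than enumerate it.

```python
# Minimal sketch of the central-server negotiation described above.
# The structure (Agent, report_utility, the maximin objective) is invented here.

class Agent:
    """Negotiates with the central server on behalf of one operator."""
    def __init__(self, true_utility, reporting_strategy):
        self.true_utility = true_utility              # perfect model of the operator's preferences
        self.reporting_strategy = reporting_strategy  # how the agent answers queries (truthful or not)

    def report_utility(self, agreement):
        # The agent may answer honestly, or "poison" its answer to steer
        # the final agreement toward its operator's interests.
        return self.reporting_strategy(self.true_utility(agreement))


def central_server(agents, candidate_agreements):
    # Search for an agreement "everyone likes", operationalized (arbitrarily)
    # as maximizing the worst reported utility across agents.
    def worst_reported_utility(agreement):
        return min(agent.report_utility(agreement) for agent in agents)
    return max(candidate_agreements, key=worst_reported_utility)


# Example: two truthful agents with opposed favorites settle on the compromise.
truthful = lambda u: u
alice = Agent(lambda a: {"A": 3, "B": 2, "C": 0}[a], truthful)
bob = Agent(lambda a: {"A": 0, "B": 2, "C": 3}[a], truthful)
print(central_server([alice, bob], ["A", "B", "C"]))  # -> "B"
```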
How does this compare to the scenario I mentioned above, where an aligned AI designs a mechanism and collects preferences from humans directly, without any robot agent as an intermediary? The advantage of robot agents is that if everyone gets a superintelligent agent, it is harder for individuals to gain an advantage through the use of secret robot agents, so the overall result ends up being more fair. However, it arguably makes the mechanism design problem harder: if it is humans answering the preference queries rather than superintelligent robot agents, their finite intelligence makes it harder for them to predict the strategic consequences of answering one way rather than another, so maybe they’re better off just stating their true preferences to minimize downside risk. Additionally, an FAI is probably better at mechanism design than humans are. But then again, if the mechanism design for discovering fair agreements between superintelligent robot agents fails, and a single agent manages to negotiate really well on behalf of its operator’s preferences, then arguably you are back in the singleton scenario. So maybe the robot agents scenario has the singleton scenario as its worst case.
I said earlier that it will be harder for humans to predict the strategic results of responding in various ways to preference queries. But we might be able to get a similar result for supersmart AI agents by making use of secret random numbers during the negotiation process, creating enough uncertainty that revealing true preferences becomes the optimal strategy. (For example, you could imagine two mechanisms, one of which incentivizes strategic deception in one direction, while the other incentivizes strategic deception in the other direction; if we collect preferences and then flip a coin to decide which mechanism to use, the best strategy might be to do no deception at all.)
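Here is a toy version of the coin-flip idea, again with everything invented by us: a single item is allocated between two agents, one mechanism rewards overstating your value, the other rewards understating it, and the coin is only flipped after the reports are collected.

```python
import random

# Two invented mechanisms with opposite strategic biases.
def allocate_free(reports):
    # Item goes to the highest reporter at no cost: exaggerating upward pays.
    winner = max(reports, key=reports.get)
    return {winner: ("gets item", 0.0)}

def allocate_first_price(reports):
    # Item goes to the highest reporter, who pays their own report:
    # shading your report downward pays (a first-price-style incentive).
    winner = max(reports, key=reports.get)
    return {winner: ("gets item", reports[winner])}

def run_round(reports):
    # Agents commit to their reports before knowing which mechanism applies,
    # so the two opposing incentives partially cancel, nudging the optimal
    # report back toward the truth (a hope here, not a theorem).
    mechanism = random.choice([allocate_free, allocate_first_price])
    return mechanism(reports)

print(run_round({"alice": 7.0, "bob": 5.0}))
```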
Another situation to consider is one where we don’t have as much cooperation/enforcement and individual operators are empowered to refuse to abide by any agreement—let’s call this “declaring war”. In this world, we might prefer to overweight the preferences of more powerful players, because if everyone is weighted equally regardless of power, then the powerful players might have an incentive to declare war and get more than their share. However it’s unclear how to do power estimation in an impartial way. Also, such a setup incentivizes accumulation of power.
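A quick made-up example of that incentive problem: suppose three players have relative power 0.7 / 0.2 / 0.1, and suppose (crudely) that declaring war lets a player keep roughly the share of resources their power can take. Under equal weighting the strongest player prefers war; weighting by power removes that incentive, at the cost of rewarding power accumulation.

```python
# Hypothetical numbers illustrating the "declare war" incentive under equal weighting.
power = {"A": 0.7, "B": 0.2, "C": 0.1}  # relative power (invented)

def normalize(weights):
    total = sum(weights.values())
    return {p: w / total for p, w in weights.items()}

equal_split = normalize({p: 1.0 for p in power})  # everyone weighted equally
power_split = normalize(power)                    # weighted by power

for p in power:
    war_payoff = power[p]  # crude model: war gets you what your power can take
    print(p, f"equal={equal_split[p]:.2f}",
          f"power-weighted={power_split[p]:.2f}", f"war={war_payoff:.2f}")
# Under equal weighting, A gets 0.33 of the pie but could take ~0.70 by declaring war.
```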
One idea that seems like it might be helpful at first blush would be to invent some way of verifiably implementing particular utility functions, so competing teams could know that a particular AI will take their utility function into account. However, this could be abused as follows: in the same way the game of chicken incentivizes tearing out your steering wheel so the opponent has no choice but to swerve, Team Evil could verifiably implement a utility function in their AI such that the AI will declare war unless competing teams verifiably implement a utility function that Team Evil specifies.
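A tiny payoff table (numbers invented) for how that commitment gets abused: once Team Evil’s AI has verifiably committed to war-unless-you-comply, the competing team’s best response flips to complying.

```python
# Invented payoffs, conditional on Team Evil's verifiable commitment being in place.
payoffs = {
    "comply": {"other_team": 2, "team_evil": 10},       # live under Team Evil's utility function
    "refuse": {"other_team": -100, "team_evil": -100},  # the committed AI follows through: war
}
best_response = max(payoffs, key=lambda action: payoffs[action]["other_team"])
print(best_response)  # -> "comply"; the verifiable commitment does all the work
```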
Anyway, looking back, it doesn’t seem like what I’ve written actually does much for the “bureaucratic diffusion of responsibility” scenario. I’d be interested to know concretely how that kind of failure might occur. Maybe what we need is a mechanism that incentivizes red teaming: finding things that no one is responsible for and acquiring responsibility for them?