Here’s a conversation that I think is vaguely analogous:
Alice: Suppose we had a one-way function, then we could make passwords better by...
Bob: What do you want your system to do?
Alice: Well, I want passwords to be more robust to...
Bob: Don’t tell me about the mechanics of the system. Tell me what you want the system to do.
Alice: I want people to be able to authenticate their identity more securely?
Bob: But what will they do with this authentication? Will they do good things? Will they do bad things?
Alice: IDK, I just think the world is likely to be generically a better place if we can better authenticate users.
Bob: Oh OK, we’re just going to create this user authentication technology and hope people use it for good?
Alice: Yes? And that seems totally reasonable?
It seems to me like you don’t actually have to have a specific story about what you want your AI to do in order for alignment work to be helpful. People in general do not want to die, so probably generic work on being able to more precisely specify what you want out of your AIs, e.g. for them not to be mesa-optimizers, is likely to be helpful.
This is related to complaints I have with [pivotal-act based] framings, but probably that’s a longer post.
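For concreteness, here is roughly the kind of mechanism Alice is gesturing at: store a one-way hash of each password instead of the password itself, so that the stored file doesn’t directly reveal anyone’s password. A minimal sketch in Python, using only the standard-library hashlib and hmac modules (bare unsalted SHA-256 is purely illustrative here, not a recommendation):

```python
import hashlib
import hmac

def store_password(password: str) -> bytes:
    """Keep only a one-way hash; the plaintext never has to be stored."""
    return hashlib.sha256(password.encode()).digest()

def check_password(candidate: str, stored_digest: bytes) -> bool:
    """Re-hash the candidate and compare; recovering the password from the
    digest is assumed to be computationally infeasible."""
    candidate_digest = hashlib.sha256(candidate.encode()).digest()
    return hmac.compare_digest(candidate_digest, stored_digest)

digest = store_password("hunter2")
assert check_password("hunter2", digest)
assert not check_password("hunter3", digest)
```

The point of the sketch is only that the mechanism is useful without anyone having to specify what the authenticated users will go on to do with their access.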
Seems to me that the answer “I hope people will use it for good” is quite okay for authentication, but not okay for alignment. Doing good is outside the scope of authentication, but is kinda the point of alignment.
The basis of all successful technology to date has been separation of concerns. One of the problems with Alignment as an academic discipline is keeping the focus on problems that can actually be solved, without drawing in all of philosophy and politics. It’s like the old joke about object-oriented programming: you asked for a banana, but you got a gorilla holding the banana, and the entire jungle too.
Do you mean like there are (at least) two subproblems that can be addressed separately?
how to align AI with any set of values
exact specification of human values
Where the former is the proper concern of AI researchers, and the latter should be studied by someone else (even if we currently have no idea who could do such a thing reliably, it’s a separate problem regardless).
I’m actually more interested in corrigibility than values alignment, so I don’t think that AI should be solving moral dilemmas every time it takes an action. I think values should be worked out in the post-ASI period, by humans in a democratic political system.
Basically what I’m thinking here, as an upvoter of Conor.
People currently give MIRI money in the hopes they will use it for alignment. Those people can’t explain concretely what MIRI will do to help alignment. By your standard, should anyone give MIRI money?
When you’re part of a cooperative effort, you’re going to be handing off tools to people (either now or in the future) which they’ll use in ways you don’t understand and can’t express. Making people feel foolish for being a long inferential distance away from the solution discourages them from laying groundwork that may well be necessary for progress, or even from exploring.
Some common issues with alignment plans, on Eliezer’s account, include:
Someone stays vague about what task they want to align the AGI on. This lets them mentally plug in strong capabilities when someone objects ‘but won’t this make the AGI useless?’, and weak capabilities when someone objects ‘but won’t this be too dangerous?’, without ever needing a single coherent proposal that can check all the boxes simultaneously.
This was a common theme of Eliezer and Richard’s conversations in the Late 2021 MIRI Conversations.
More generally, someone proposes a plan that’s described at a very high level, where depending on how we cash out the details, it’s probably either too weak to be useful, or too strong to be safe (at all, or absent additional safety measures that will need to be devised and added). Hearing the details lets Eliezer focus the conversation on whichever of those two paths seems likelier, rather than hedging everything with ‘well, if you mean A, then X; and if you mean B, then Y...’
Cf. Eliezer’s 2018 objections to Paul’s approach, which were very lengthy and complicated in part because they had to spend lots of time on the multiple paths of various disjunctions. In essence, ‘All of these paths seem doomy to me, but which path you pick changes the reason it strikes me as doomy.’
Someone is vague about whether they’re shooting for a more limited, task-ish, corrigible AGI, vs. shooting for a more CEV-ish AGI. They therefore are at risk of thinking they’re in the clear because their (task-ish) AGI doesn’t face all the same problems as a CEV-ish one; or thinking they’re in the clear because their (CEV-ish) AGI doesn’t face all the same problems as a task-ish one. But, again, there are different problems facing both approaches; being concrete makes it clearer which problems are more relevant to focus on and discuss.
Cf. Rohin and Eliezer’s discussion, and the AGI Ruin post.
Someone’s plan relies on unnecessarily and especially dangerous kinds of cognition (e.g., large amounts of reasoning about human minds), without a correspondingly ambitious and impressive plan for aligning this extra-dangerous cognition.
If they were more concrete about ‘what real-world tasks do you want the AGI to do?’, they might realize that it’s possible to achieve what they want (e.g., a world that doesn’t get destroyed by AGI five years later) without relying on such dangerous cognition.
Cf. Rohin and Scott G’s discussion, and Arbital articles like behaviorist genie.
The plan predictably takes too long. If you thought about what you want AGI for, it would become much more obvious that you’ll need to align and deploy your AGI under strong time pressure, because you’ll be thinking concretely about the actual development circumstances.
Or, if you disagree that the strategic situation requires local actors to act fast, then we can focus the conversation on that and more efficiently zero in on the likely cruxes, the things that make the Eliezer-model say ‘no, this won’t work’.
These are some examples of reasons it’s much more often helpful to think about ‘what is the AGI for?’ or ‘what task are you trying to align?’, even though it’s true that not all domains work this way!
Presumably with those early time-sharing systems it was “we need some way for a user to access their own files but not other users’”. So: passwords.
Then, later and at scale, “system administrators or people with root keep reading the password file and using it for bad acts”. So: the one-way hash.
Password complexity requirements came from people rainbow-tabling the one-way hash file (see the salted-hash sketch below).
None of the above was secure enough, so: two-factor authentication.
People keep redirecting SMS, so: authenticator apps on your phone...
Each additional level of security was taken from a pool of preexisting ideas that academics and others had contributed. But it wasn’t applied until it was clear it was needed.
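As a rough illustration of the rainbow-table step above: a precomputed table of hashes only pays off if everyone’s password is hashed the same way, so the standard countermeasure was a random per-user salt plus a deliberately slow key-derivation function. A minimal sketch in Python, using the standard-library hashlib.pbkdf2_hmac (the salt size and iteration count are illustrative):

```python
import hashlib
import hmac
import os

def hash_password(password: str, salt: bytes | None = None) -> tuple[bytes, bytes]:
    """Salted, iterated one-way hash: the per-user salt defeats precomputed
    (rainbow) tables, and the iteration count slows down brute force."""
    if salt is None:
        salt = os.urandom(16)  # fresh random salt for a new password
    digest = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 200_000)
    return salt, digest

def verify_password(candidate: str, salt: bytes, stored_digest: bytes) -> bool:
    """Re-derive with the stored salt and compare in constant time."""
    _, candidate_digest = hash_password(candidate, salt)
    return hmac.compare_digest(candidate_digest, stored_digest)

# Store (salt, digest) per user; neither on its own reveals the password.
salt, digest = hash_password("correct horse battery staple")
assert verify_password("correct horse battery staple", salt, digest)
assert not verify_password("Tr0ub4dor&3", salt, digest)
```

PBKDF2 here stands in for any deliberately slow password-hashing scheme; bcrypt, scrypt, or Argon2 would play the same role.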