Here’s a straw hypothetical example where I’ve exaggerated both 1 and 2; the details aren’t exactly correct, but the vibe is more important:
1: “Here’s a super clever extension of debate that mitigates obfuscated arguments [etc], this should just solve alignment”
2: “Debate works if you can actually set the goals of the agents (i.e. you’ve solved inner alignment), but otherwise you can get issues with the agents coordinating [etc]”
1: “Well the goals have to be inside the NN somewhere so we can probably just do something with interpretability or whatever”
2: “how are you going to do that? your scheme doesn’t tackle inner alignment, which seems to contain almost all of the difficulty of alignment to me. the claim you just made is a separate claim from your main scheme, and the cleverness in your scheme is in a direction orthogonal to this claim”
1: “idk, also that’s a fully general counterargument to any alignment scheme, you can always just say ‘but what if inner misalignment’. I feel like you’re not really engaging with the meat of my proposal, you’ve just found a thing you can say to be cynical and dismissive of any proposal”
2: “but I think most of the difficulty of alignment is in inner alignment, and schemes which kinda handwave it away are trying to solve some problem which is not the actual problem we need to solve to not die from AGI. I agree your scheme would work if inner alignment weren’t a problem.”
1: “so you agree that in a pretty nontrivial number [let’s say both 1&2 agree this is like 20% or something] of worlds my scheme does actually work. I mean, how can you be that confident that inner alignment is that hard? in the worlds where inner alignment turns out to be easy, my scheme will work.”
2: “I’m not super confident, but if we assume that inner alignment is easy then I think many other simpler schemes will also work, so the cleverness that your proposal adds doesn’t actually make a big difference.”
So Q=inner alignment? Seems like person 2 not only pointed to inner alignment explicitly (so it can no longer be “some implicit assumption that you might not even notice you have”), but also said that it “seems to contain almost all of the difficulty of alignment to me”. He’s clearly identified inner alignment as a crux, rather than as something meant “to be cynical and dismissive”. At that point, it would have been prudent of person 1 to shift his focus onto inner alignment and explain why he thinks it is not hard.
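To make person 2’s last point concrete (the numbers here are purely illustrative, not anything either person actually committed to): if $p$ is the probability that inner alignment turns out to be easy, and the clever scheme only helps in that case, then its marginal value over a simpler baseline is roughly

$$ p \cdot \big( P(\text{clever scheme works} \mid \text{easy}) - P(\text{simpler scheme works} \mid \text{easy}) \big), $$

which stays small whenever the two conditional probabilities are close, even if $p$ itself is as high as 20%.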
Note that your post suddenly introduces “Y” without defining it. I think you meant “X”.
For example?