Presumably you were thinking of something like teaching AIs metaphilosophy in order to improve the procedure? That would be the main alternative I see, and it does feel more robust. I'm wondering, though, whether we'll know by that point that we've found the right way to do metaphilosophy.
I think there’s some (small) hope that by the time we need it, we can hit upon a solution to metaphilosophy that will just be clearly right to most (philosophically sophisticated) people, like how math and science were probably once methodologically quite confusing but now everyone mostly agrees on how math and science should be done. Failing that, we probably need some sort of global coordination to prevent competitive pressures leading to value lock-in (like the kind that would follow from Stuart’s scheme). In other words, if there wasn’t a race to build AGI, then there wouldn’t be a need to solve AGI safety, and there would be no need for schemes like Stuart’s that would lock in our values before we solve metaphilosophy.
it doesn’t feel obvious why something like Stuart’s anti-realism isn’t already close to there
Stuart’s scheme uses each human’s own meta-preferences to determine their own (final) object-level preferences. I would be less concerned if this was used on someone like William MacAskill (with the caveat that correctly extracting William MacAskill’s meta-preferences seems equivalent to learning metaphilosophy from William), but a lot of humans have seemingly terrible meta-preferences, or at least different meta-preferences which likely lead to different object-level preferences (so they can’t all be right, assuming moral realism).
To put it another way, my position is that if moral realism or relativism (positions 1-3 in this list) is right, we need “metaphilosophical paternalism” to prevent a “terrible outcome”, and that’s not part of Stuart’s scheme.
I would be less concerned if this was used on someone like William MacAskill [...] but a lot of humans have seemingly terrible meta-preferences
In those cases, I’d give more weight to the preferences than to the meta-preferences. There is the issue of avoiding ignorant-yet-confident meta-preferences, which I’m working on writing up right now (partially thanks to your very comment here, thanks!)
or at least different meta-preferences which likely lead to different object-level preferences (so they can’t all be right, assuming moral realism).
Moral realism is ill-defined, and some versions of it allow that humans and AIs would have different types of morally true facts. So it’s not too much of a stretch to suppose that different humans might have different morally true facts from each other, and I don’t see this as necessarily being a problem.
Moral realism through acausal trade is the only version of moral realism that seems to be coherent, and to do that, you still have to synthesise individual preferences first. So “one single universal true morality” does not necessarily contradict “contingent choices in figuring out your own preferences”.
There is the issue of avoiding ignorant-yet-confident meta-preferences, which I’m working on writing up right now (partially thanks to your very comment here, thanks!)
I look forward to reading that. In the meantime, can you address my parenthetical point in the grandparent comment: “correctly extracting William MacAskill’s meta-preferences seems equivalent to learning metaphilosophy from William”? If it’s not clear, what I mean is that suppose Will wants to figure out his values by doing philosophy (which I think he actually does); does that mean that under your scheme the AI needs to learn how to do philosophy? If so, how do you plan to get around the problems with applying ML to metaphilosophy that I described in Some Thoughts on Metaphilosophy?
There is one way of doing metaphilosophy along these lines, which is “run (simulated) William MacAskill until he thinks he’s found a good metaphilosophy” or “find a description of metaphilosophy to which WA would say ‘yes’.”
But what the system I’ve sketched would most likely do is come up with something to which WA would say “yes, I can kinda see why that was built, but it doesn’t really fit together as I’d like and has some ad hoc and object-level features”. That’s the “adequate” part of the process.