To test how much my proposed crux is in fact a crux, I’d like for folks to share their intuitions about how many people are metaphilosophically competent enough to safely 1,000,000,000,000,000x, along with their intuitions about the difficulty of AI alignment.
My current intuition is that there are under 100 people who, if 1,000,000,000,000,000x’d, would end up avoiding irreversible catastrophes with > 50% probability. (I haven’t thought too much about this question, and wouldn’t be surprised if I update to thinking there are fewer than 10 such people, or even 0 such people.) I also think AI alignment is pretty hard, and necessitates solving difficult metaphilosophical problems.
Once humanity makes enough metaphilosophical progress (which might require first solving agent foundations), I might feel comfortable 1,000,000,000,000,000x’ing the most metaphilosophically competent person alive, though it’s possible I’ll decide I wouldn’t want to 1,000,000,000,000,000x anyone running on current biological hardware. I’d also feel good 1,000,000,000,000,000x’ing someone if we’re in the endgame and the default outcome is clearly self-annihilation.
All of these intuitions are weakly held.
I’ve asked this before but don’t feel like I got a solid answer: (a) do you think that giving the 100th person a lot of power is a lot worse than the status quo (w.r.t. catastrophic risk), and (b) why?
If you think it’s a lot worse, the explanations I can imagine are along the lines of: “the ideas that win in the marketplace of ideas are systematically good,” or maybe “if people are forced to reflect by thinking some, growing older, being replaced by their children, etc., that’s way better than having them reflect in the way that they’d choose to given unlimited power,” or something like that.
But those seem inconsistent with your position in at least two ways:
If this is the case, then people don’t need metaphilosophical competence to be fine, they just need a healthy respect for business as usual and whatever magic it is that causes the status quo to arrive at good answers. Indeed, there seem to be many people (>> 100) who would effectively abdicate their power after being greatly empowered, or who would use it in a narrow way to avoid catastrophes but not to change the basic course of social deliberation.
The implicit claim about the magic of the status quo is itself a strong metaphilosophical claim, and I don’t see why you would have so much confidence in this position while thinking that we should have no confidence in other metaphilosophical conclusions.
If you think that the status quo is even worse, then I don’t quite understand what you mean by your statement above that, once humanity makes enough metaphilosophical progress, you might feel comfortable 1,000,000,000,000,000x’ing the most metaphilosophically competent person alive.
Other questions: why can we solve agent foundations, but the superintelligent person can’t? What are you imagining happening after you empower this person? Why are you able to foresee so many difficulties that they predictably won’t see?
Oh, I actually think that giving the 100th best person a bunch of power is probably better than the status quo, assuming there are ~100 people who pass the bar (I also feel pessimistic about the status quo). The only reason why I think the status quo might be better is that more metaphilosophy would develop, and then whoever gets amplified would have more metaphilosophical competence to begin with, which seems safer.
What about the 1000th person?
(Why is us making progress on metaphilosophy an improvement over the empowered person making progress on metaphilosophy?)
I think the world will end up in a catastrophic epistemic pit. For example, if any religious leader got massively amplified, I think it’s pretty likely (>50%) the whole world will just stay religious forever.
Us making progress on metaphilosophy isn’t an improvement over the empowered person making progress on metaphilosophy, conditioning on the empowered person making enough progress on metaphilosophy. But in general I wouldn’t trust someone to make enough progress on metaphilosophy unless they had a strong enough metaphilosophical base to begin with.
(I assume you mean that the 1000th person is much worse than the status quo, because they will end up in a catastrophic epistemic pit. Let me know if that’s a misunderstanding.)
Is your view:
People can’t make metaphilosophical progress, but they can recognize and adopt it. The status quo is OK because there is a large diversity of people generating ideas (the best of which will be adopted).
People can’t recognize metaphilosophical progress when they see it, but better views will systematically win in memetic competition (or in biological/economic competition because their carriers are more competent).
“Metaphilosophy advances one funeral at a time,” the way that we get out of epistemic traps is by creating new humans who start out with less baggage.
Something completely different?
I still don’t understand how any of those views could imply that it is so hard for individuals to make progress if amplified. For each of those three views about why the status quo is good, I think that more than 10% of people would endorse that view and use their amplified power in a way consistent with it (e.g. by creating lots of people who can generate lots of ideas; by allowing competition amongst people who disagree, and accepting the winners’ views; by creating a supportive and safe environment for the next generation and then passing off power to that generation...) If you amplify people radically, I would strongly expect them to end up with better versions of these ideas, more often, than humanity at large.
My normal concern would be that people would drift too far too fast, so we’d end up with e.g. whatever beliefs were most memetically fit regardless of their accuracy. But again, I think that amplifying someone leaves us in a way better situation with respect to memetic competition unless they make an unforced error.
Even more directly: I think more than 1% of people would, if amplified, have the world continue on the same deliberative trajectory it’s on today. So it seems like the fraction of people you can safely amplify must be more than 1%. (And in general those people will leave us much better off than we are today, since lots of them will take safe, easy wins like “Avoid literally killing ourselves in nuclear war.”)
I can totally understand why you’d say “lots of people would mess up if amplified due to being hasty and uncareful.” But I still don’t see what could possibly make you think “99.99999% of people would mess up most of the time.” I’m pretty sure that I’m either misunderstanding your view, or it isn’t coherent.
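(For concreteness, the “99.99999%” figure roughly matches the arithmetic implied by “under 100 people”. The sketch below, which assumes a world population of about 7.6 billion, is only an illustrative back-of-the-envelope; the population figure is an assumption, not something stated in the thread.)

```python
# Illustrative back-of-the-envelope only: if roughly 100 people out of an
# assumed world population of ~7.6 billion could be safely amplified, what
# fraction of people does that leave out?
world_population = 7.6e9   # assumption, not a figure from the thread
safe_people = 100          # "under 100 people" from the earlier comment

safe_fraction = safe_people / world_population
unsafe_fraction = 1 - safe_fraction

print(f"safe fraction:   {safe_fraction:.7%}")    # ~0.0000013%
print(f"unsafe fraction: {unsafe_fraction:.7%}")  # ~99.9999987%, i.e. roughly "99.99999% of people"
```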
It seems to me the difficulty is likely to be in assessing whether someone would have a good enough start, and doing that probably requires enough ability to assess metaphilosophical competence now that we could already pick such a person to make progress later.
(I’m not zhukeepa; I’m just bringing up my own thoughts.)
This isn’t quite the same as an improvement, but one thing that is more appealing about normal-world metaphilosophical progress than empowered-person metaphilosophical progress is that the former has a track record of working*, while the latter is untried and might not work.
*Slowly and not without reversals.
I do not expect that any human brain would be safe if scaled up by that amount, because of lack of robustness to relative scale. My intuition is that alignment is very hard, but I don’t have an explicit reason right now.
I think the number of safe people depends sensitively on the details of the 1,000,000,000,000,000x’ing. For example: Were they given a five-minute lecture on the dangers of value lock-in? On the universe’s control panel, is the “find out what I would think if I reflected more, and what the actual causes are of everyone else’s opinions” button more prominently in view than the “turn everything into my favorite thing” button? And so on.
My model is that giving them the five-minute lecture on the dangers of value lock-in won’t help much. (We’ve tried giving five-minute lectures on the dangers of building superintelligences...) And I think most people executing “find out what I would think if I reflected more, and what the actual causes are of everyone else’s opinions” would get stuck in an epistemic pit and not realize it.
I think everyone (including me) would go crazy from solitude in this scenario, so that puts the number at 0. If you guarantee psychological stability somehow, I think most adults (~90% perhaps) would be good at achieving their goals (which may be things like “authoritarian regime forever”). This is pretty dependent on the humans becoming more intelligent—if they just thought faster I wouldn’t be nearly as optimistic, though I’d still put the number above 0.
I think that for most humans, achieving what they currently consider their goals would end up being catastrophic for humanity, even if they fully succeed. (For example, I think an eternal authoritarian regime is pretty catastrophic.)
I agree that an eternal authoritarian regime is pretty catastrophic.
I don’t think that a human in this scenario would be pursuing what they currently consider their goals—I think they would think more, learn more, and eventually settle on a different set of goals. (Maybe initially they pursue their current goals but it changes over time.) But it’s an open question to me whether the final set of goals they settle upon is actually reasonably aligned towards “humanity’s goals”—it may be or it may not be. So it could be catastrophic to amplify a current human in this way, from the perspective of humanity. But, it would not be catastrophic to the human that you amplified. (I think you disagree with the last statement, maybe I’m wrong about that.)
I’d say that it wouldn’t appear catastrophic to the amplified human, but might be catastrophic for that human anyway (e.g. if their values-on-reflection actually look a lot like humanity’s values-on-reflection, but they fail to achieve their values-on-reflection).
Yeah, I think that’s where we disagree. I think that humans are likely to achieve their values-on-reflection; I just don’t know what a human’s “values-on-reflection” would actually be (e.g. it could be that they want an authoritarian regime with them in charge).
It’s also possible that we have different concepts of values-on-reflection. E.g. maybe you mean that I have found my values-on-reflection only if I’ve cleared out all epistemic pits somehow and then thought for a long time with the explicit goal of figuring out what I value, whereas I would use a looser criterion. (I’m not sure what exactly.)
Yeah, what you described indeed matches my notion of “values-on-reflection” pretty well. So for example, I think a religious person’s values-on-reflection should include valuing logical consistency and coherent logical arguments (because they do implicitly care about those in their everyday lives, even if they explicitly deny it). This means their values-on-reflection should include having true beliefs, and thus should lead them to atheism. But I also wouldn’t generally trust religious people to update away from religion if they reflected a bunch.