What do people mean when they talk about a “long reflection”? The original usages suggest flesh-humans literally sitting around and figuring out moral philosophy for hundreds, thousands, or even millions of years, before deciding to do anything that risks value lock-in, but (at least) two things about this don’t make sense to me:
A world where we’ve reliably “solved” for x-risks well enough to survive thousands of years without also having meaningfully solved “moral philosophy” is probably physically realizable, but this seems like a pretty fine needle to thread from our current position. (I think if you have a plan for solving AI x-risk that looks like “get to ~human-level AI, pump the brakes real hard, and punt on solving ASI alignment” then maybe you disagree.)
I don’t think it takes today-humans a thousand years to come up with a version of indirect normativity (or CEV, or whatever) that actually just works correctly. I’d be somewhat surprised if it took a hundred, but maybe it’s actually very tricky. A thousand just seems crazy. A million makes it sound like you’re doing something very dumb, like trying to figure out every shard of each human’s values by hand because you don’t know how to automate the process.
Long reflection is a concrete baseline for indirect normativity. It’s straightforwardly meaningful, even if it’s unlikely to be possible or a good idea to run in base reality. From there, you iterate to do better.
Path dependence of long reflection could be addressed by considering many possible long reflection traces jointly, aggregating their judgements about each other to define which traces are more legitimate (as a fixpoint of some voting/preference setup), or how to influence the course of such traces to make them more legitimate. For example, a misaligned AI takeover within a long reflection trace makes it illegitimate, and preventing such takeovers is an intervention that improves a trace.
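As a toy illustration of the fixpoint idea (my own sketch; the judgement matrix, damping factor, and eigenvector-style iteration are illustrative assumptions, not anything specified above), one could let each trace score the legitimacy of the others and take the overall legitimacy scores to be the fixed point of an iteration in which traces judged legitimate get more weight in judging others:

```python
# Toy sketch: legitimacy of reflection traces as a fixpoint of their
# judgements about each other (PageRank-style eigenvector iteration).
import numpy as np

def legitimacy_fixpoint(judgements, damping=0.85, tol=1e-12, max_iter=10_000):
    """judgements[i, j] >= 0: how legitimate trace i considers trace j."""
    n = judgements.shape[0]
    # Normalize each trace's judgements into a unit "vote" over traces.
    votes = judgements / judgements.sum(axis=1, keepdims=True)
    scores = np.full(n, 1.0 / n)
    for _ in range(max_iter):
        # Traces currently judged legitimate get more say in judging others.
        new_scores = damping * (scores @ votes) + (1 - damping) / n
        if np.abs(new_scores - scores).sum() < tol:
            break
        scores = new_scores
    return scores

# Example: trace 2 contains a misaligned AI takeover, so the other traces
# judge it illegitimate and its own votes end up carrying little weight.
judgements = np.array([[1.0, 0.9, 0.01],
                       [0.9, 1.0, 0.01],
                       [0.5, 0.5, 1.0]])
print(legitimacy_fixpoint(judgements))  # trace 2 gets a much lower score
```

Nothing hinges on this particular iteration scheme; the point is only that “legitimacy as a fixpoint of the traces’ judgements about each other” is a well-posed computation, and the same scores could weight interventions that make traces more legitimate.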
“Locking in” preferences seems like something that should be avoided as much as possible, but creating new people or influencing existing ones is probably morally irreversible, and that applies to what happens inside long reflection as well. I’m not sure that “nonperson” modeling of long reflection is possible, i.e. that sufficiently good prediction of long traces of thinking doesn’t require modeling people well enough for them to qualify as morally relevant to a similar extent as concrete people performing that thinking in base reality. But here too considering many possible traces somewhat helps, making all possibilities real (morally valent) according to how much attention is paid to their details, which should follow their collectively self-defined legitimacy. In this frame, the more legitimate possible traces of long reflection become the utopia itself, rather than a nonperson computation planning it. Nonperson predictions of the reflection’s judgement might steer it a bit in advance of legitimacy or influence decisions, but possibly not much, lest they attain moral valence and start coloring the utopia through their content and not only their consequences.
On your second point, I think that MacAskill and Ord were more saying “It would be worth it to spend thousands of years figuring out moral philosophy / figuring out what to do with the cosmos, if that’s how long it takes to be ~sure we’ve reached the ‘correct’ answer before locking things in, on account of the astronomical waste argument” than “I literally predict it will take today-humans thousands of years to figure out moral philosophy, even if we make a serious and coordinated effort to do so.” Somewhat relatedly, quoting from the ‘Long Reflection Reading List’ I wrote earlier this year (fn. 4):
Original discussion of the long reflection indicated that it could be a lengthy process of 10,000 years or more. More recent discussion I’m aware of (which is nonpublic, hence no corresponding reading) i) takes seriously the possibility that the long reflection could last just weeks rather than years or millennia, and ii) notes that wall-clock time is probably not the most useful way to think about the length of reflection, given that the reflection process, if it happens at all, will likely involve many superfast AIs doing the bulk of the cognitive labor.
On your first point, I continue to be curious about your perspective. I basically agree with the following (written by Zach Stein-Perlman), but, based on what you said in your parentheses, it sounds like you view it as a bad plan?
The outline of the best [post-AGI] plan I’ve heard is build human-obsoleting AIs which are sufficiently aligned/trustworthy that we can safely defer[1] to them (before building wildly superintelligent AI). Assume it will take 5-10 years after AGI to build such systems and give them sufficient time. To buy time (or: avoid being rushed by other AI projects[2]), inform the US government and convince it to enforce nobody builds wildly superintelligent AI for a while (and likely limit AGI weights to allied projects with excellent security and control).
(I could be off, but it sounds like either you expect solving AI philosophical competence to come pretty much hand in hand with solving intent alignment (because you see them as similar technical problems?), or you expect not solving AI philosophical competence (while having solved intent alignment) to lead to catastrophe (thus putting us outside the worlds in which x-risks are reliably ‘solved’ for), perhaps in the way Wei Dai has talked about?)
We don’t need these human-obsoleting AIs to be able to implement CEV. We want to be able to defer to them on tricky wisdom-loaded questions like what should we do about the overall AI situation? They can ask us questions as needed.
To avoid being rushed by your own AI project, you also have to ensure that your AI can’t be stolen and can’t escape, so you have to implement excellent security and control.
“It would make sense to pay that cost if necessary” makes more sense than “we should expect to pay that cost”, thanks.
it sounds like you view it as a bad plan?
Basically, yes. I have a draft post outlining some of my objections to that sort of plan; hopefully it won’t sit in my drafts as long as the last similar post did.
(I could be off, but it sounds like either you expect solving AI philosophical competence to come pretty much hand in hand with solving intent alignment (because you see them as similar technical problems?), or you expect not solving AI philosophical competence (while having solved intent alignment) to lead to catastrophe (thus putting us outside the worlds in which x-risks are reliably ‘solved’ for), perhaps in the way Wei Dai has talked about?)
I expect whatever ends up taking over the lightcone to be philosophically competent. I haven’t thought very hard about the philosophical competence of whatever AI succeeds at takeover (conditional on that happening), or, separately, the philosophical competence of the stupidest possible AI that could succeed at takeover with non-trivial odds. I don’t think solving intent alignment necessarily requires that we have also figured out how to make AIs philosophically competent, or vice versa; I also haven’t thought about how likely we are to experience either disjunction.
I think solving intent alignment without having made much more philosophical progress is almost certainly an improvement to our odds, but is not anywhere near sufficient to feel comfortable, since you still end up stuck in a position where you want to delegate “solve philosophy” to the AI, but you can’t because you can’t check its work very well. And that means you’re stuck at whatever level of capabilities you have, and are still approximately a sitting duck waiting for someone else to do something dumb with their own AIs (like point them at recursive self-improvement).
I expect whatever ends up taking over the lightcone to be philosophically competent.
I agree that, conditional on that happening, this is plausible, but it’s also likely that some of the answers from such a philosophically competent being will be unsatisfying to us.
One example is that such a philosophically competent AI might tell you that CEV either doesn’t exist, or if it does is so path-dependent that it cannot resolve moral disagreements, which is actually pretty plausible under my model of moral philosophy.
A world where we’ve reliably “solved” for x-risks well enough to survive thousands of years without also having meaningfully solved “moral philosophy” is probably physically realizable, but this seems like a pretty fine needle to thread from our current position. (I think if you have a plan for solving AI x-risk that looks like “get to ~human-level AI, pump the brakes real hard, and punt on solving ASI alignment” then maybe you disagree.)
I don’t think it takes today-humans a thousand years to come up with a version of indirect normativity (or CEV, or whatever) that actually just works correctly. I’d be somewhat surprised if it took a hundred, but maybe it’s actually very tricky. A thousand just seems crazy. A million makes it sound like you’re doing something very dumb, like trying to figure out every shard of each human’s values by hand because you don’t know how to automate the process.
To answer these questions:
One possible answer is that something like CEV does not exist, and yet alignment is still solvable anyway for almost arbitrarily capable AI. This could well happen, and for me personally it’s honestly the most likely outcome of what happens by default.
There are arguments against the idea that CEV even exists or is well defined that are important to note, and we shouldn’t assume that technological progress equates with progress towards your preferred philosophy:
https://www.lesswrong.com/posts/Y7gtFMi6TwFq5uFHe/some-biases-and-selection-effects-in-ai-risk-discourse#hkoGD6Gwi9YKKZ6S2
https://www.lesswrong.com/posts/SqgRtCwueovvwxpDQ/valence-series-2-valence-and-normativity#2_7_3_Possible_implications_for_AI_alignment_discourse
https://joecarlsmith.com/2021/06/21/on-the-limits-of-idealized-values
And there might not be any real justifiable way to resolve disagreements between the philosophies/moralities, either, if there isn’t a way to converge to a single morality.