(edit: see disclaimers[1])
Creating superintelligence generally leads to runaway optimization.
Under the anthropic principle, we should expect there to be a ‘consistent underlying reason’ for our continued survival.[2]
By default, I’d expect the ‘consistent underlying reason’ to be a prolonged alignment effort in the absence of capabilities progress. However, this seems inconsistent with the observed progression from AI winters to a period of vast training runs and widespread technical interest in advancing capabilities.
That particular ‘consistent underlying reason’ is likely not the one which succeeds most often. The actual distribution would then have other, more common paths to survival.
The actual distribution could look something like this: [3]
Note that the yellow portion doesn’t imply that no effort is made to ensure the first ASI’s training setup produces a safe system, i.e. that we simply ‘get lucky’ by being on an alignment-by-default path.
I’d expect it to instead be the case that the ‘luck’/causal determinant came earlier, i.e. the initial capabilities breakthroughs being of a type which first produced non-agentic general intelligences instead of seed agents, and which inspired us to try to make sure the first superintelligence is non-agentic, too.
(This same argument can also be applied to other possible agendas that may not have been pursued if not for updates caused by early AGIs)
Disclaimer: This is presented as probabilistic evidence rather than as a ‘sure conclusion I believe’
Editing in a further disclaimer: This argument was a passing intuition I had. I don’t know if it’s correct. I’m not confident about anthropics. It is not one of the reasons which motivated me to investigate this class of solution.
Editing in a further disclaimer: I am absolutely not saying we should assume alignment is easy because we’ll die if it’s not. Given a commenter had this interpretation, it seems this was another case of my writing difficulty causing failed communication.
Rather than expecting to ‘get lucky many times in a row’, e.g. via capabilities researchers continually overlooking a human-findable method for superintelligence.
(The proportions over time here aren’t precise, nor are the included categories comprehensive; I put more effort into making this image easy to read and helping it convey the idea.)
Under the anthropic principle, we should expect there to be a ‘consistent underlying reason’ for our continued survival.
Why? It sounds like you’re anthropic updating on the fact that we’ll exist in the future, which of course wouldn’t make sense because we’re not yet sure of that. So what am I missing?
It sounds like you’re anthropic updating on the fact that we’ll exist in the future
The quote you replied to was meant to be about the past.[1]
(paragraph retracted due to unclarity)
Specifically, I think that (“we find a fully-general agent-alignment solution right as takeoff is very near” given “early AGIs take a form that was unexpected”) is less probable than (“observing early AGIs causes us to form new insights that lead to a different class of solution” given “early AGIs take a form that was unexpected”). Because I think that, and because I think we’re at the point where takeoff is near, it seems like some evidence that we’re on that second path.
This should only constitute an anthropic update to the extent you think more-agentic architectures would have already killed us
I do think that’s possible (I don’t have a good enough model to put a probability on it though). I suspect that superintelligence is possible to create with much less compute than is being used for SOTA LLMs. Here’s a thread with some general arguments for this.
Of course, you could claim that our understanding of the past is not perfect, and thus should still update
I think my understanding of why we’ve survived so far re: AI is far from perfect. For example, I don’t know what would have needed to happen for training setups which would have produced agentic superintelligence by now to have been found first, or (framed inversely) how lucky we needed to be to survive this far.
~~~
I’m not sure if this reply will address the disagreement, or if it will still seem from your pov that I’m making some logical mistake. I’m not actually fully sure what the disagreement is. You’re welcome to try to help me understand if one remains.
I’m sorry if any part of this response is confusing; I’m still learning to write clearly.
I originally thought you were asking why it’s true of the past, but then I realized we very probably agreed (in principle) in that case.
Everything makes sense except your second paragraph. Conditional on us solving alignment, I agree it’s more likely that we live in an “easy-by-default” world, rather than a “hard-by-default” one in which we got lucky or played very well. But we shouldn’t condition on solving alignment, because we haven’t yet.
Thus, in our current situation, the only way anthropics pushes us towards “we should work more on non-agentic systems” is if you believe “worlds where we still exist are more likely to have easy alignment-through-non-agentic-AIs”. Which you do believe, and I don’t. Mostly because I think that in almost no worlds have we been killed by misalignment at this point. Or, put another way, the developments in non-agentic AI we’re facing are still one regime change away from the dynamics that could kill us (and information in the current regime doesn’t extrapolate much to the next one).
Conditional on us solving alignment, I agree it’s more likely that we live in an “easy-by-default” world, rather than a “hard-by-default” one in which we got lucky or played very well.
(edit: summary: I don’t agree with this quote because I think logical beliefs shouldn’t update upon observing continued survival, since there is nothing else we can observe. It is not my position that we should assume alignment is easy because we’ll die if it’s not)
I think that language in discussions of anthropics is unintentionally prone to masking ambiguities or conflations, especially wrt logical vs indexical probability, so I want to be very careful writing about this. I think there may be some conceptual conflation happening here, but I’m not sure how to word it. I’ll see if it becomes clear indirectly.
One difference between our intuitions may be that I’m implicitly thinking within a many-worlds frame. Within that frame it’s actually certain that we’ll solve alignment in some branches.
So if we then ‘condition on solving alignment in the future’, my mind defaults to something like this: “this is not much of an update; it just means we’re in a future where the past was not a death outcome. Some of the pasts leading up to those futures had really difficult solutions, and some of them managed to find easier ones or get lucky. The probabilities of these non-death outcomes relative to each other have not changed as a result of this conditioning.” (I.e., I disagree with the top quote)
The most probable reason I can see for this difference is if you’re thinking in terms of a single future, where you expect to die.[1] In this frame, if you observe yourself surviving, it may seem[2] you should update your logical belief that alignment is hard (because P(continued observation|alignment being hard) is low, if we imagine a single future, but certain if we imagine the space of indexically possible futures).
Whereas I read it as only indexical, and am generally thinking about this in terms of indexical probabilities.
I totally agree that we shouldn’t update our logical beliefs in this way. I.e., that with regard to beliefs about logical probabilities (such as ‘alignment is very hard for humans’), we “shouldn’t condition on solving alignment, because we haven’t yet.” I.e., that we shouldn’t condition on the future not being mostly death outcomes when we haven’t averted them and have reason to think they are.
Maybe this helps clarify my position?
On another point:
the developments in non-agentic AI we’re facing are still one regime change away from the dynamics that could kill us
I agree with this, and I still found the current lack of goals over the world surprising, and worth trying to get as a trait of superintelligent systems.
(I’m not disagreeing with this being the most common outcome)
Though after reflecting on it more I (with low confidence) think this is wrong, and one’s logical probabilities shouldn’t change after surviving in a ‘one-world frame’ universe either.
For an intuition pump: consider the case where you’ve crafted a device which, when activated, leverages quantum randomness to kill you with probability (n-1)/n, where n is some arbitrarily large number. Given you’ve crafted it correctly, you make no logical update in the many-worlds frame because survival is the only thing you will observe; you expect to observe the 1/n branch.
In the ‘single world’ frame, continued survival isn’t guaranteed, but it’s still the only thing you could possibly observe, so it intuitively feels like the same reasoning applies...?
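(To make the arithmetic behind the two framings explicit, here is a minimal sketch in Python with made-up numbers. The hypothesis labels and the survival likelihoods are my own illustrative assumptions, not anything stated in the thread; the sketch only shows how the two conditioning choices differ, not which one is the right way to reason.)

```python
# Minimal sketch with made-up numbers (not from the thread): contrast a naive
# single-future Bayesian update on "we survived" with an update that also
# conditions on survival being the only outcome anyone is around to observe.
prior = {"hard-by-default": 0.5, "easy-by-default": 0.5}        # hypothetical prior
p_survive = {"hard-by-default": 0.01, "easy-by-default": 0.9}   # hypothetical likelihoods

def normalize(d):
    total = sum(d.values())
    return {k: v / total for k, v in d.items()}

# Framing 1: treat survival as ordinary evidence, posterior proportional to prior * P(survive | H).
naive = normalize({h: prior[h] * p_survive[h] for h in prior})

# Framing 2: P(we observe survival | H, we observe anything at all) = 1 for every H,
# so the posterior over the logical hypotheses equals the prior.
selection = normalize({h: prior[h] * 1.0 for h in prior})

print(naive)      # strongly favours "easy-by-default"
print(selection)  # unchanged from the prior
```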
Yes, but
This update is screened off by “you actually looking at the past and checking whether we got lucky many times or there is a consistent reason”. Of course, you could claim that our understanding of the past is not perfect, and thus should still update, only less so. Although to be honest, I think there’s a strong case for the past clearly showing that we just got lucky a few times.
It sounded like you were saying the consistent reason is “our architectures are non-agentic”. This should only constitute an anthropic update to the extent you think more-agentic architectures would have already killed us (instead of killing us in the next decade). I’m not of this opinion. And if I was, I’d need to take into account factors like “how much faster I’d have expected capabilities to advance”, etc.
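(As an aside on what ‘screened off’ means here, a minimal sketch with an invented toy distribution: once you condition on a detailed reading of the historical record, which already settles whether we survived, the bare fact of survival changes nothing about ‘lucky vs. consistent reason’. All numbers and labels below are hypothetical, purely for illustration.)

```python
# Hypothetical toy joint distribution over (H = why we're alive so far,
# D = what a careful look at the record shows, S = bare fact of survival).
# D already records whether we survived, so D determines S.
worlds = {
    ("got lucky",         "near misses in the record",    "alive"): 0.15,
    ("consistent reason", "near misses in the record",    "alive"): 0.05,
    ("got lucky",         "no close calls in the record", "alive"): 0.05,
    ("consistent reason", "no close calls in the record", "alive"): 0.45,
    ("got lucky",         "extinction in the record",     "dead"):  0.25,
    ("consistent reason", "extinction in the record",     "dead"):  0.05,
}

def p_lucky(given):
    """P(H = 'got lucky' | the condition `given` holds)."""
    num = sum(p for (h, d, s), p in worlds.items() if h == "got lucky" and given((h, d, s)))
    den = sum(p for w, p in worlds.items() if given(w))
    return num / den

# Conditioning on the detailed record alone...
on_record = p_lucky(lambda w: w[1] == "near misses in the record")
# ...versus the record plus the bare fact that we're alive:
on_record_and_alive = p_lucky(lambda w: w[1] == "near misses in the record" and w[2] == "alive")

print(on_record, on_record_and_alive)  # equal (0.75, 0.75): the record screens off bare survival
```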
(I think I misinterpreted your question and started drafting another response, will reply to relevant portions of this reply there)