This post is making a valid point (the time to intervene to prevent an outcome that would otherwise occur, is going to be before the outcome actually occurs), but I’m annoyed with the mind projection fallacy by which this post seems to treat “point of no return” as a feature of the territory, rather than your planning algorithm’s map.
(And, incidentally, I wish this dumb robot cult still had a culture that cared about appreciating cognitive algorithms as the common interest of many causes, such that people would find it more natural to write a post about “point of no return”-reasoning as a general rationality topic that could have all sorts of potential applications, rather than the topic specifically being about the special case of the coming robot apocalypse. But it’s probably not fair to blame Kokotajlo for this.)
The concept of a “point of no return” only makes sense relative to a class of interventions. A 1 kg ball is falling, accelerating downward at 9.8 m/s². When is the “point of no return” at which the ball has picked up enough speed that it’s no longer possible to stop it from hitting the ground?
The problem is underspecified as stated. If we add the additional information that your means of intervening is a net that can only trap objects carrying less than X kg⋅m/s of momentum, then we can say that the point of no return happens at X/9.8 seconds: after t seconds of free fall the 1 kg ball carries 9.8t kg⋅m/s of momentum, which exceeds the net’s capacity once t > X/9.8. But it would be weird to talk about “the second we ball risk reducers lose the ability to significantly reduce the risk of the ball hitting the ground” as if that were an independent, pre-existing fact that we could use to determine how strong a net we need to buy, because it depends on the net’s strength.
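To make that relativity concrete, here is a minimal Python sketch (my own illustration, not anything from the post); the function name and the net capacities are made up, and the net is idealized as absorbing at most a fixed amount of momentum.

```python
# Minimal sketch: the "point of no return" for the falling ball is a function
# of the intervention class (here, an idealized net's momentum capacity),
# not a pre-existing fact about the ball alone.

G = 9.8  # gravitational acceleration, m/s^2

def point_of_no_return(net_capacity: float, mass: float = 1.0) -> float:
    """Seconds of free fall after which a ball of `mass` kg carries more
    momentum (mass * G * t, in kg*m/s) than the net can absorb."""
    return net_capacity / (mass * G)

# Same ball, different (made-up) nets, different "points of no return".
for capacity in (4.9, 9.8, 49.0):  # net capacities in kg*m/s
    print(f"capacity {capacity:5.1f} kg*m/s -> PONR at t = {point_of_no_return(capacity):.2f} s")
```

Each net yields its own “point of no return” for the very same ball, which is the sense in which the PONR lives in the specification of the intervention rather than in the territory.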
Thanks! I think I agree with everything you say here except that I’m not annoyed. (Had I been annoyed by my own writing, I would have rewritten it...) Perhaps I’m not annoyed because while my post may have given the misleading impression that PONR was an objective fact about the world rather than a fact about the map of some agent or group of agents, I didn’t fall for that fallacy myself.
To be fair to my original post though, I did make it clear that the PONR is relative to a “we,” a group of people (or even a single person) with some amount of current influence over the future that could diminish to drastically less influence depending on how events go.
while my post may have given the misleading impression [...] I didn’t fall for that fallacy myself.
I reach for this “bad writing” excuse sometimes, and sometimes it’s plausible, but in general, I’m wary of the impulse to tell critics after the fact, “I agree, but I wasn’t making that mistake,” because I usually expect that if I had a deep (rather than halting, fragmentary, or inconsistent) understanding of the thing that the critic was pointing at, I would have anticipated the criticism in advance and produced different text that didn’t provide the critic with the opportunity, such that I could point to a particular sentence and tell the would-be critic, “Didn’t I already adequately address this here?”
Doesn’t the first sentence

Instead, it’s the point of no return—the day we AI risk reducers lose the ability to significantly reduce AI risk.

address this by explaining PONR as our ability to do something?
(I mean I agree that finding oneself reaching for a bad writing excuse is a good clue that there’s something you can clarify for yourself further; just, this post doesn’t seem like a case of that.)
(Thanks for this—it’s important that critiques get counter-critiqued, and I think that process is stronger when third parties are involved, rather than it just being author vs. critic.)
The reason that doesn’t satisfy me is that I expect the actual calculus of “influence” and “control” in real-world settings to be sufficiently complicated that there’s probably not going to be any usefully identifiable “point of no return”. On the contrary, if there were an identifiable PONR as a natural abstraction, I think that would be a surprising empirical fact about the world in need of deeper explanation—that the underlying calculus of influence would just happen to simplify that way, such that you could point to an event and say, “There—that’s when it all went wrong”, rather than there just being (say) a continuum of increasingly detailed possible causal graphs that you can compute counterfactuals with respect to (with more detailed graphs being more expensive to learn but granting more advanced planning capabilities).

If you’re pessimistic about alignment—and especially if you have short timelines like Daniel—I think most of your point-of-no-return-ness should already be in the past. When, specifically? I don’t see any reason to expect there to be a simple answer. You lost some measure when OpenAI launched; you lost some measure when Norbert Wiener didn’t drop everything to work on the alignment problem in 1960; you lost some measure when Samuel Butler and Charles Babbage turned out not to be the same person in our timeline; you lost some measure when the ancient Greeks didn’t discover natural selection …
The post does have a paragraph mentioning continuous loss of influence and already-lost influence in the past (“Of course, influence over the future might not disappear all on one day …”), but the reason this doesn’t satisfy me as a critic is that it seems to be treated as an afterthought (“We should keep these possibilities in mind as well”), rather than as the underlying reality to which any putative “PONR” would be a mere approximation. Instead, the rhetorical emphasis is on PONR as if it were an event: “The Date of AI Takeover Is Not the Day the AI Takes Over”. (And elsewhere, Daniel writes about “PONR-inducing tasks”.)
But in my philosophy, “the date” and “the day” of the title are two very different kinds of entities that are hard to talk about in the same sentence. The day AI takes over actually is a physical event that happens on some specific, definite date: nanobots disassemble the Earth, or whatever. That’s not subjective; the AI historian-subprocess of the future will record a definitive timestamp of when it happened. In contrast, “the date” of PONR is massively “subjective”, depending on further assumptions; the AI historian-subprocesses of the future will record some sort of summary of the decision-relevant results of a billion billion ancestor simulations, but the answer is not going to fit in a 64-bit timestamp.

Maybe to Daniel, this just looks like weirdly unmotivated nitpicking (“not super relevant to the point [he] was trying to make”)? But it feels like a substantive worldview difference to me.
I’ve read this twice and I’m still not sure whether I actually get your critique. My guess is you’re saying something like:
Daniel is taking the PONR too much as a thing; this leads him both to accidentally treat PONR as a specific point in time, and also to [?? mistake planning capability for “objective” feasibility ??]
I agree that the OP’s talking of PONR as a point in time doesn’t make sense; a charitable read is that it’s a toy model that’s supposed to help make clearer the difference between our ability to prevent X and X actually happening (like in the movie Armageddon; did we nuke the asteroid soon enough for it to miss Earth vs. has the asteroid actually impacted Earth). I agree that asking about “our planning capability” is vague and gives different answers depending on what counterfactuals you’re using; in an extreme case of “what could we feasibly do”, there’s basically no PONR because we always “could” just sit down at a computer and type in a highly speed-prior-compressed source code of an FAI.
the AI historian-subprocesses of the future will record some sort of summary of the decision-relevant results of a billion billion ancestor simulations, but the answer is not going to fit in a 64-bit timestamp.
It won’t be a timestamp, but it will contain information about humans’ ability to plan. To extract useful lessons from its experience of coming into power surrounded by potentially hostile weak AGIs, a superintelligence has to compare its own developing models across time. It went from not understanding its situation and not knowing what to do to take control from the humans, to yes understanding and knowing, and along the way it was relevantly uncertain about what the humans were able to do.
Anyway, the above feels like it’s sort of skew to the thrust of the OP, which I think is: “notice that your feasible influence will decrease well before the AGI actually kills you with nanobots, so planning under a contrary assumption will produce nonsensical plans”. Maybe I’m just saying: yes, it’s subjective how much we’re doomed at a given point, and yes, we want our reasoning to be in a sense grounded in stuff actually happening, but also, in order to usefully model in more detail what’s happening and what plans will work, we have to talk about stuff that’s intermediate in time and in abstraction between the nanobot end of the world and the here-and-now. The intermediate stuff then says something more specific about when and how much influence you’re losing or gaining.
I don’t think we disagree about anything substantive, and I don’t expect Daniel to disagree about anything substantive after reading this. It’s just—
I agree that the OP’s talking of PONR as a point in time doesn’t make sense; a charitable read is that [...]
I don’t think we should be doing charitable readings at yearly review time! If an author uses a toy model to clarify something, we want the post to say “As a clarifying toy model [...]” rather than making the readers figure it out.
If you’re pessimistic about alignment—and especially if you have short timelines like Daniel—I think most of your point-of-no-return-ness should already be in the past.
I unfortunately was not clear about this, but I meant to define it in such a way that this is false by definition—“loss of influence” is defined relative to the amount of influence we currently have. So even if we had a lot more influence 5 years ago, the PONR is when what little influence we have left mostly dries up. :)
I don’t think we should be doing charitable readings at yearly review time! If an author uses a toy model to clarify something, we want the post to say “As a clarifying toy model [...]” rather than making the readers figure it out.
If by some chance this post does make it to further stages of the review, I will heavily edit it, and I’m happy to e.g. add in “As a clarifying toy model...” among other changes.
Perhaps I should clarify then that I don’t actually think my writing was bad. I don’t think it was perfect, but I don’t think the post would have been significantly improved by me having a paragraph or two about how influence (and thus point-of-no-return) is a property of the map, not the territory. I think most readers, like me, knew that already. At any rate it seems not super relevant to the point I was trying to make.