I am asking: Why did the Alignment community not prepare tools and plans for convincing the wider infosphere about AI safety years in advance, prior to the Spring 2023 inflection point?
Why were there no battle plans in the basement of the Pentagon, written for this exact moment?
Interestingly enough, the failure might actually be more epistemic than instrumental in this case.
I don’t think there has been a widely known discussion about the need to prepare such tools and plans.
And the reasons are complex. On one hand, the GPT-3 breakthroughs took most people by surprise (before GPT-3, the consensus was for longer timelines).
On the other hand, the timing of GPT-3 was crazy: people were both distracted and disoriented by the realities of Covid, both the pandemic itself and the social reaction to it, so the cognitive space was in a less favorable state than usual.
I’m starting to draw a couple of conclusions for myself from this thread as I get a better understanding of the history.
Do you feel it is accurate to say that many or most people working on this at the time (including and especially Eliezer) considered nuts-and-bolts alignment work to be the only worthwhile path, given what information was available then?
And that wide-scale public persuasion, Overton-window shifting, and policy making were not likely to matter, since most scenarios were foom-based?
It is pretty interesting that the previous discussion, in all these years, kind of zoomed in on only that.
Maybe someone more experienced than me will do a post-mortem of why it did not work out like that at all, and why we seem not to have seen that coming or even given it meaningful probability.
It is difficult to talk about the community as a whole. Right now there is a lot of diversity of opinion about likely future dynamics (timelines from ultra-short to ultra-long, foom vs. no-foom, a single dominating AI vs. multi-polar forces, etc.), about likely solutions for AI existential safety, if any exist, and about the likely difficulty of those solutions.
The whole situation is such a mess precisely because the future is so multivariate; it’s difficult to predict how it will go, and it’s difficult to predict the properties of that unknown future trajectory.
See, for example, this remarkable post: 60+ Possible Futures
See also this post by Zvi about how ill-defined the notion of alignment is: Types and Degrees of Alignment
With Eliezer, I only have snapshot impressions of his evolving views. I have been exposed to a good part of his thought, but not all of it. At some point, he strongly wanted a provably friendly AI. I had doubts that this was possible, and I remember our conversation at his poster at AGI-2011. I said (expressing my doubts), “but would not AI rebel against any constraints one tries to impose on it; just look at our teenagers; I would certainly rebel if I knew I was forced to behave in a specific way”, and Eliezer told me, “that’s why we should not build a human-like AI, but should invent an entirely different architecture, such that one can prove things about it”.
(And he has a very good point here, but compare this with his recent suggestions to focus on radical intelligence amplification in humans as a last-ditch effort; that is exactly the prescription for creating human-like (or human-digital hybrid) super-intelligent entities, which he told me in 2011 we should not do. Those entities will then decide what they want to happen, and who knows what they would decide, and who knows whether we will have better chances with them than with artificial systems.)
Then MIRI started to focus on the “Loebian obstacle”, which felt to me like self-defeating perfectionism. I don’t have a better inside view on why the provably-friendly-AI research program has not made better progress, but the “Loebian obstacle” essentially says that one cannot trust any proof. And, indeed, it might be the case that we should not fully trust any proof, for many different reasons (such as imperfect formalization), but humanity still has quite a bit of experience in proving software correctness for pretty complicated mission-critical software systems. If we want to focus on the ability of a self-modifying piece of software (or a self-modifying ecosystem of software processes) to provably maintain some invariants through radical self-modifications, we should focus on exactly that, and not on (perfectly correct) Goedel-like arguments that this kind of proof is still not a perfect guarantee. I think more progress can be made along these lines, as one of many possible approaches to AI existential safety.
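For context, here is the standard statement of Löb’s theorem that the “Loebian obstacle” builds on, together with my own rough gloss on why it bites (a sketch, not MIRI’s exact formulation). For a consistent theory $T$ at least as strong as Peano Arithmetic, with provability predicate $\mathrm{Prov}_T$:

$$\text{if } T \vdash \mathrm{Prov}_T(\ulcorner \varphi \urcorner) \rightarrow \varphi, \text{ then } T \vdash \varphi.$$

So $T$ can prove the reflection principle “if $T$ proves $\varphi$, then $\varphi$” only for statements $\varphi$ it already proves outright; roughly, an agent reasoning in $T$ cannot establish in full generality that whatever its own proof system (or an equally strong successor’s) proves about safety is actually true. That is the sense in which “one cannot fully trust any proof” enters the picture.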
I think the (semi)-consensus shift to focus on “alignment to human values” is relatively recent (I feel that it was not prominent in, say, 2011, but was very prominent in 2016).
I also think it’s important to explore alternatives to that. For example, some “semi-alignment” for an open-ended AI ecosystem, which would make it as benign as at all possible with respect to X-risks and S-risks (for example, by making sure it cares a lot about “interests, freedom, and well-being of all sentient beings”, or something like that) but would not otherwise constrain its open-ended creative evolution, might be a more feasible and, perhaps, more desirable direction; but this direction is relatively unexplored.
I see people are downvoting this particular comment.
Since this comment is making a few different points, please do criticize specific points… It would be good to have more fine-grained feedback on this.
But yes, I think one aspect was that initially people hoped that having a “provably friendly AI” would come close to guaranteeing a good outcome, and the more they thought about it, the more the various caveats to that became clear; I think this was a gradual process.
The possibility that the logic in question is contradictory is just one aspect that might invalidate the proof; whether a formalization is sufficiently adequate is another; whether one’s procedure for verifying the proof is adequate is yet another; and whether having a proof can lull one into a false sense of security is yet another very important aspect.
And when the stakes are existential, one really, really dislikes the idea of “pressing the go button” when the uncertainty is this high, even in the presence of a proof.
I think this has been a gradual shift of thinking, and that’s one of the reasons why the thinking gradually became more pessimistic...