> I’m not asking [...] that they make such a commitment highly public or legally binding
That seems like a self-defeating concession. If, as per your previous post, we want to exploit the “everybody likes a winner” pattern, we want there to be someone doing the liking. We want an audience that judges what we are and aren’t allowed to do, and when other actors do and don’t have to listen to us; and this audience has to be someone whose whims influence those actors’ incentives.
The audience doesn’t need to be the general public, sure. It might be other ML researchers, the senior management of other AI Labs, or even just the employees of a particular AI Lab. But it needs to be someone.
A second point: there needs to be a concrete “winner”. The initiatives we’re pushing have to visibly come from some specific group, which would then accrue the reputation of a winner who can ram projects through. They can’t come from a vague homogeneous mass of “alignment researchers”. The hypothetical headlines[1] have to read “%groupname% convinces %list-of-AI-Labs% to implement %policy%”, not “%list-of-AI-Labs% agree to implement %policy%”. Else it won’t work.
> Such a commitment risks giving us a false sense of having addressed deceptive alignment
That’s my biggest concern with your object-level idea here, yep. I think this concern will transfer to any object-level idea for our first “clear win”, though: the implementation of any first clear win will necessarily not be up to our standards. Which is something I think we should just accept, and view that win as what it is: a stepping stone, of no particular value in itself.
Or, in other words: approximately all of the first win’s value would be reputational. And we need to make very sure that we capture all of that value.
I’d like us to develop a proper strategy/plan here, actually: a roadmap of the policies we want to implement, each one more ambitious than the last and only implementable because of the chain of victories preceding it, with some properly ambitious end-point like “all leading AI Labs take AI Risk as seriously as we think they should”.
That roadmap won’t be useful on the object level, obviously, but it should help develop intuitions for what “acquiring a winner’s reputation” actually looks like, and what it doesn’t.
In particular, each win should palpably expand our influence in terms of what we can achieve. Merely getting a policy implemented doesn’t count: we have to visibly prevail.
I think one could view the abstract idea of “coordination as a strategy for AI risk” as itself a winner, and the potential participants in future coordination as the audience: the more people believe that coordination can actually work, the more likely coordination is to happen. I’m not sure how much this should be taken into account.
[1] Not that we necessarily want actual headlines; see the point above about the general public not necessarily being the judging audience.