All very good questions. Let me try to answer them in order:
I plan to do 1/2/3 in parallel. I will start small (maybe the city of Berkeley, where I live) and increase scope as the technology is proven to work.
1 - human values are what people think they are. To the extent that different humans have incompatible values, the AI alignment problem is strongly unsolvable.
2.A - AI alignment certification would be sort of like a driver's license for AIs (a rough sketch of the lifecycle follows after 2.B). The AI needs to have a valid license in order to be hooked up to the internet and switched on. Every time the AI does something dishonorable, its license is revoked until mechanistic interpretability researchers can debug the issue.
2.B - I don't think the organization would be trusted to certify AIs right away. I think the organization would have to build up a lot of goodwill from doing good works, turn that into capital somehow, and then hire a world-class mechanistic interpretability team from elsewhere. Or the organization could simply contract out the mechanistic interpretability work to other, more experienced labs.
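To make 2.A a bit more concrete, here is a minimal sketch of the license lifecycle in Python. Everything in it (the class names, states, and methods) is hypothetical illustration, not an existing certification scheme or API:

```python
from dataclasses import dataclass, field
from enum import Enum, auto


class LicenseState(Enum):
    """Hypothetical states for an AI operating license."""
    VALID = auto()    # the AI may be hooked up to the internet and switched on
    REVOKED = auto()  # suspended pending a mechanistic-interpretability review


@dataclass
class AILicense:
    """Toy model of the certify / revoke / reinstate cycle described in 2.A."""
    model_id: str
    state: LicenseState = LicenseState.VALID
    incident_log: list[str] = field(default_factory=list)

    def report_incident(self, description: str) -> None:
        # Any dishonorable behavior revokes the license until the issue is debugged.
        self.incident_log.append(description)
        self.state = LicenseState.REVOKED

    def reinstate_after_review(self, reviewer: str, finding: str) -> None:
        # An interpretability team signs off before the license is restored.
        self.incident_log.append(f"resolved by {reviewer}: {finding}")
        self.state = LicenseState.VALID

    def may_operate(self) -> bool:
        return self.state is LicenseState.VALID
```

Whether revocation is triggered automatically or by a human review board is exactly the kind of design question that 2.B would hand off to a more experienced lab.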
There are whole realms of theory about how to reconcile orthogonal values.
On (1):
It seems incredibly unlikely to me that your organization is going to make it no longer true that people have incompatible values.
If “AI alignment” is taken to mean “the AI wants exactly the same things that humans want” and hence to imply “all humans want the same things” then, sure, mutually incompatible human values ⇒ no AI alignment. But I don’t think that’s what any reasonable person takes “AI alignment” to mean. I would consider that we’d done a pretty good job of “AI alignment” if, say, the state of the world 20 years after the first superhuman AI was such that for all times between now and then, (1) >= 75% of living humans (would) consider the post-AI state better than the pre-AI state and (2) <= 10% of living humans (would) consider the post-AI state much worse than the pre-AI state. (Or something along those lines.) And I don’t see why anything along these lines requires humans never to have incompatible values.
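For concreteness, here is a minimal sketch of how that success criterion could be checked against survey data. The 75% / 10% thresholds come from the paragraph above; the function names and the survey format are made up for illustration:

```python
def meets_bar(frac_better: float, frac_much_worse: float) -> bool:
    """The bar at a single point in time: >= 75% better off, <= 10% much worse off."""
    return frac_better >= 0.75 and frac_much_worse <= 0.10


def alignment_success(surveys: list[tuple[float, float]]) -> bool:
    """The bar has to hold at every surveyed time between now and 20 years post-AI."""
    return all(meets_bar(better, much_worse) for better, much_worse in surveys)


# Example: the criterion holds in each of three hypothetical survey years.
print(alignment_success([(0.80, 0.05), (0.78, 0.09), (0.82, 0.04)]))  # True
```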
But never mind that: I still don’t see how your coordination-market system could possibly make it no longer true that humans sometimes have incompatible values.
On (2):
I still don’t see how your proposed political-reform organization would be in any way suited to issuing “AI alignment certification”, if that were a thing. And, since you say “hire a world-class mechanistic interpretability team from elsewhere”, it sounds as if you don’t either. So I don’t understand why any of that stuff is in your post; it seems entirely irrelevant to the organization you’re actually hoping to build.
Well, fair enough I suppose. I was personally excited about the AI alignment piece, and thought that coordination markets would help with that.
Humans have always held, and will always hold, incompatible values. That’s why we feel the need to murder each other with such frequency. But, as Steven Pinker argues, while the absolute number of killings grows along with the population, the per-capita rate keeps falling. Maybe this will converge to a world in which a superhuman AI knows approximately what is expected of it. Maybe it won’t; I don’t know.