I enjoyed it, and think the ideas are important, but found it hard to follow at points
Some suggestions:
explain more fully why self-criticism allows one part to assert control
give more examples throughout, especially in the second half. Some paragraphs have no examples at all and are harder to understand
flesh out the examples to make them longer and more detailed
I enjoyed reading this, thanks.
I think your definition of solving alignment here might be too broad?
If we have superintelligent agentic AI that tries to help its user, but we end up missing out on the benefits of AI because of catastrophic coordination failures or because of misuse, then I think you’re saying we didn’t solve alignment, because we didn’t elicit the benefits?
You discuss this, but I prefer to separate out control and alignment: I wouldn’t count us as having solved alignment if we only elicit good behavior via intense/exploitative control schemes. So I’d adjust your definition of alignment with the extra requirement that we avoided takeover without using control schemes more intense than what is considered acceptable to apply to humans today. That’s a higher bar, and it separates the definition from the thing we ultimately care about (avoiding takeover and eliciting the benefits), but I think it’s a better definition.