Glanced through the comments and saw surprisingly positive responses, but I'm reluctant to wade into a book-length reading commitment based on that alone. Is the core of your ideas on alignment compressible enough to fulfil the compelling insight heuristic?
The core idea about alignment is described here: https://wwbmmm.github.io/asi-safety-solution/en/main.html#aligning-ai-systems
If you only care about alignment, you can read just Sections 6.1-6.3, which are not too long.
Cool, so essentially “Use weaker and less aligned systems to build more aligned and stronger systems, until you have very strong, very aligned systems”. This does seem like the kind of path where many of the remaining winning timelines lie, and the extra details you provide seem plausible as steps that might be useful.
There are two broad directions of concern with this, for me. One is captured well by “Carefully Bootstrapped Alignment” is organizationally hard, and is essentially: going slowly enough to avoid disaster is hard; without very strong institutional buy-in and culture, it’s easy to slip into using your powerful systems to go too fast, or to have people leave and take the ideas with them / have the ideas get stolen / etc., so that things go too fast elsewhere.
The next and probably larger concern is something like: if current-style alignment research doesn’t scale to radical superintelligence, and you need new and better-formalized paradigms for the values you imbue to last a billion steps of self-modification, as I think is reasonably likely, then it’s fairly likely that somewhere along the chain of weakly aligned systems one of them either makes a fatal mistake, or follows its best understanding of alignment in a way which doesn’t actually produce good worlds. If we don’t have a crisp understanding of what we want, then asking a series of systems, to which we haven’t been able to give that goal, to make research progress on finding it leaves free variables open in the unfolding process, and those seem likely to end up at extreme or unwanted values. Human steering helps, but only so much, and we need to figure out how to use that steering effectively in more concrete terms, because most ways of making it concrete have pitfalls.
A lot of my models are best reflected in various Arbital pages, such as Reflective Stability, Nearest unblocked strategy, Goodhart’s Curse, plus some LW posts like Why Agent Foundations? An Overly Abstract Explanation and Siren worlds and the perils of over-optimised search (which might come up in some operationalizations of pointing the system towards being aligned).
This seems like a solid attempt to figure out a path to a safe world. I think it needs a bunch of careful poking at, to make sure that when the details are nailed down there aren’t failure modes which will be dangerous, but I’m glad you’re exploring the space.
For the first issue, I agree that “Carefully Bootstrapped Alignment” is organizationally hard, but I don’t think improving the organizational culture is an effective solution: it is too slow, and humans often make mistakes. I think technical solutions are needed. For example, let an AI be responsible for safety assessment: when a researcher submits a job to the AI training cluster, this AI assesses the safety of the job, and if the job may produce a dangerous AI, the job is rejected. In addition, external supervision is also needed. For example, the government could stipulate that before an AI organization releases a new model, it must be evaluated by a third-party safety organization, and that all organizations with computing resources exceeding a certain threshold need to be supervised. There is more discussion on this in the section Restricting AI Development.
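To make the job-gating idea concrete, here is a minimal Python sketch. It is only an illustration of the shape of the mechanism: TrainingJob, assess_risk, and the two thresholds are hypothetical names and values I am introducing here, and a real safety assessor would be a trained model rather than the keyword stub below.

```python
from dataclasses import dataclass

# Illustrative thresholds, not proposed values.
RISK_THRESHOLD = 0.1        # reject jobs whose assessed risk exceeds this
SUPERVISION_FLOPS = 1e25    # above this compute, require third-party evaluation

@dataclass
class TrainingJob:
    submitter: str
    description: str
    estimated_flops: float

def assess_risk(job: TrainingJob) -> float:
    """Stand-in for the safety-assessment AI: estimate the probability
    that running this job produces a dangerous model."""
    # Placeholder heuristic; a real assessor would be a trained model.
    return 0.9 if "self-improving" in job.description.lower() else 0.01

def review_job(job: TrainingJob) -> str:
    if assess_risk(job) > RISK_THRESHOLD:
        return "rejected: assessed risk too high"
    if job.estimated_flops > SUPERVISION_FLOPS:
        return "held: third-party safety evaluation required before launch"
    return "approved"

print(review_job(TrainingJob("lab-a", "fine-tune a translation model", 1e22)))
print(review_job(TrainingJob("lab-b", "train a self-improving agent", 1e26)))
```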
For the second issue, you mentioned free variables. I think this is a key point. While we are not fully confident in the safety of AI, we should reduce the free variables as much as possible. This is why I proposed a series of AI Controllability Rules. These rules take priority over the goals: the AI should be trained to achieve its goals under the premise of complying with the rules. In addition, I think we should not place all our hopes on alignment. We should have more measures to deal with the situation where AI alignment fails, such as AI Monitoring and Decentralizing AI Power.
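As a rough sketch of what “rules take priority over goals” could look like at decision time, assuming a simple filter-then-rank decision procedure (the rules and goal below are toy stand-ins, not the actual AI Controllability Rules, and the book's proposal is about training rather than this runtime filter):

```python
from typing import Callable, Iterable, Optional

Rule = Callable[[str], bool]    # True if the candidate plan complies with the rule
Goal = Callable[[str], float]   # higher score = better achieves the goal

def choose_plan(plans: Iterable[str], rules: list[Rule], goal: Goal) -> Optional[str]:
    """Rules take strict priority over the goal: non-compliant plans are
    discarded before the goal score is consulted at all."""
    compliant = [p for p in plans if all(rule(p) for rule in rules)]
    if not compliant:
        return None             # refuse to act rather than break a rule
    return max(compliant, key=goal)

# Toy stand-in rules and goal, purely for illustration.
no_oversight_tampering: Rule = lambda plan: "disable oversight" not in plan
no_self_modification: Rule = lambda plan: "modify own weights" not in plan
goal_score: Goal = lambda plan: len(plan)   # placeholder objective

print(choose_plan(
    ["write the report", "disable oversight, then write the report"],
    [no_oversight_tampering, no_self_modification],
    goal_score,
))
```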