So far, we’ve only talked about one AI ending up aligned, or a handful ending up aligned at one particular time. However, that isn’t really the ultimate goal of AI alignment research. What we really want is for AI to remain aligned in the long run, as we (and AIs themselves) continue to build new and more powerful systems and/or scale up existing systems over time.
I think this suggests an interesting path where alignment by default might serve as a bridge to better alignment mechanisms: if it works and we can select for AIs that contain representations of human values, then in a slow takeoff scenario we could prioritize this so that, in the early phases, we at least have mostly-aligned AI that helps us build better alignment mechanisms (as opposed to these AIs simply building successors directly and hoping they maintain alignment with human values in the process).
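To make "select for AIs that contain representations of human values" slightly more concrete, here is a toy sketch of one way such selection could look in code: fit a linear probe on each candidate model's internal activations and keep the candidates where a value-related label is most easily decodable. Everything in it (the stand-in models, the labels, the probe, the threshold) is hypothetical and purely illustrative, not a real alignment procedure.

```python
# Toy sketch (not a real alignment procedure): probe each candidate model's
# internals for a linearly decodable "human-values" concept, and select the
# candidates where the probe generalizes best. All components are stand-ins.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

def probe_score(activations, value_labels):
    """Held-out accuracy of a linear probe for the value label."""
    X_train, X_test, y_train, y_test = train_test_split(
        activations, value_labels, test_size=0.3, random_state=0
    )
    probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    return probe.score(X_test, y_test)

# Stand-ins for candidate models: each "model" is just a fixed random feature map.
candidate_models = {f"model_{i}": rng.normal(size=(64, 16)) for i in range(5)}

# Hypothetical evaluation set: inputs plus labels for whether each scenario is
# judged consistent with human values (getting such labels is the hard part).
inputs = rng.normal(size=(500, 64))
value_labels = (inputs @ rng.normal(size=64) > 0).astype(int)

scores = {}
for name, weights in candidate_models.items():
    activations = np.tanh(inputs @ weights)  # the model's internal representation
    scores[name] = probe_score(activations, value_labels)

# "Select for" the candidates whose internals make the value concept easy to decode.
selected = [name for name, score in scores.items() if score >= 0.7]
print(scores, selected)
```

Of course, the real difficulty hides in the parts this sketch assumes away: getting trustworthy value labels, and knowing that a decodable representation is actually being used by the AI rather than merely present.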
I think of this as the Rohin trajectory, since he’s the main person I’ve heard talk about it. I agree it’s a natural approach to consider, though deceptiveness-type problems are a big potential issue.
Isn’t remaining aligned an example of robust delegation? If so, there have been both discussions and technical work on this problem before.
Yup, exactly right, though this version is a fair bit more involved than the simplified delegation scenarios we’ve seen in most of the theoretical work.