In my view, there are alignment strategies that are unlikely to pay off without significant time investment, but which have large expected payoffs. For example, work on defining agency seems to fit this category.
There are also alignment strategies that have incremental payoffs but still seem unsatisfactory. For example, we could focus on developing better AI boxing techniques that just might buy us a few weeks. Or we could identify likely takeover scenarios and build warnings for them.
There’s an analogy for this in self-driving cars. If you want to ship an impressive demo right away, you might rely on a lot of messy case handling, special road markings, mapping, and sensor arrays. If you want to solve self-driving in the general case, you’d probably be developing really good end-to-end ML models.
> In my view, there are alignment strategies that are unlikely to pay off without significant time investment, but which have large expected payoffs. For example, work on defining agency seems to fit this category.
Yup, that’s a place where I mostly disagree, and it is a crux. In general, I expect that the foundational progress which matters comes mostly from solving convergent subproblems (= subproblems which are a bottleneck for lots of different approaches). Every time progress is made on one of those subproblems, it opens up a bunch of new strategies, and therefore likely yields incremental progress. For instance, my work on abstraction was originally driven by thinking about agent foundations, but the Natural Abstraction Hypothesis is potentially relevant to more incremental strategies (like interpretability tools or retargeting the search).
Insofar as work on e.g. defining agency doesn’t address convergent subproblems, I’m skeptical that the work is on the right path at all; such work is unlikely to generalize robustly. After all, if a piece of work doesn’t address a shared bottleneck of a bunch of different strategy-variations, then it’s not going to be useful for very many strategy-variations.