In my view, there are alignment strategies that are unlikely to pay off without significant time investment, but which have large expected payoffs. For example, work on defining agency seems to fit this category.
Yup, that’s a place where I mostly disagree, and it is a crux. In general, I expect the foundational progress which matters mostly comes from solving convergent subproblems (= subproblems which are a bottleneck for lots of different approaches). Every time progress is made on one of those subproblems, it opens up a bunch of new strategies, and therefore likely yields incremental progress. For instance, my work on abstraction was originally driven by thinking about agent foundations, but the Natural Abstraction Hypothesis is potentially relevant to more incremental strategies (like interpretability tools or retargeting the search).
Insofar as work on e.g. defining agency doesn’t address convergent subproblems, I’m skeptical that the work is on the right path at all; such work is unlikely to generalize robustly. After all, if a piece of work doesn’t address a shared bottleneck of a bunch of different strategy-variations, then it’s not going to be useful for very many strategy-variations.