Alignment as Constraints
To find the most promising alignment research directions to pour resources into, we can go about it in three ways:
1. Constraints all alignment proposals should have
2. Constraints for current research directions
3. Constraints for new research directions
Constraints all alignment proposals should have
We can imagine the space of all possible research directions.
This space includes everything, even pouring resources into McDonald's HR Department. But we can add constraints to focus on the research directions more likely to help advance alignment.
If you can tell a story about how your proposal reduces x-risk from AI, then I am slightly more excited about it. But we can continue to add more constraints on top of that.
By constraining more and more, we can narrow down the space to search and (hopefully) avoid dead ends in research (a toy sketch of this filtering frame follows the questions below). This frame opens up a few questions:
What are the ideal constraints that all alignment proposals should have?
How can we get better at applying these constraints? For example, you can tell a better story by building a story that works, breaking it, and iterating on that process (Paul Christiano commonly suggests this). If we can successfully teach these mental movements, more researchers would waste less time.
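To make the constraint-narrowing frame concrete, here is a minimal toy sketch. It is my own illustration rather than anything from the post's cited proposals: the `Direction` attributes and example entries are made up. It treats research directions as items and constraints as predicate functions that filter the space.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Direction:
    """A candidate research direction with one illustrative attribute."""
    name: str
    reduces_xrisk_story: bool = False  # can we tell a story of how it reduces x-risk?

# A constraint is just a predicate over directions.
Constraint = Callable[[Direction], bool]

def narrow(space: List[Direction], constraints: List[Constraint]) -> List[Direction]:
    """Keep only the directions that satisfy every constraint."""
    return [d for d in space if all(c(d) for c in constraints)]

space = [
    Direction("McDonald's HR Department"),
    Direction("Boxed AI", reduces_xrisk_story=True),
    Direction("Interpretability", reduces_xrisk_story=True),
]

constraints: List[Constraint] = [
    lambda d: d.reduces_xrisk_story,  # "you can tell a story of how it reduces x-risk"
]

print([d.name for d in narrow(space, constraints)])
# ['Boxed AI', 'Interpretability']
```

Each additional constraint is just another predicate appended to the list, so "adding constraints" and "shrinking the search space" are literally the same operation here.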
Constraints for current research directions
We can also perform this constraint-narrowing on known research agendas (or known problems). A good example is the Arbital page on boxed AI, which clearly explains the difficulty of:
1. Building actually robust, air-gapped systems that can't affect the world except through select channels.
2. Still getting a pivotal act out of those limited channels.
Most proposals for a boxed AI are doomed to fail, but if a proposal competently accounts for (1) and (2) above (which can be considered additional constraints), then I am more excited about that research direction.
Doing a similar process for, e.g., interpretability, learning from human feedback, and agent foundations research would be very useful. An example would be:
1. Finding the constraints we want for interpretability, such as yielding knowledge of possible deception, of what goal the model is pursuing, of its understanding of human values, etc.
2. Tailoring this argument to research at Redwood, Anthropic, and OpenAI, and getting their feedback.
3. Repeating the process and writing up the results.
I expect to either convince them to change their research direction, be convinced myself, or find the cruxes and make bets/predictions if applicable.
This same process can be applied to alignment-adjacent fields like bias & fairness and task specification in robotics. The result would be "bias & fairness research, but it must handle these criticisms/constraints," which is easier to convince those researchers to adopt than asking them to switch to another alignment field.
This is also very scalable and parallelizable, since people can perform this process on their own research agendas or do separate deep dives into others' research.
Constraints for new research directions
The earlier sections were mostly about what not to work on, but don't tell you what specifically to work on (setting aside established research directions). Here is a somewhat constructive algorithm:
1. Break existing alignment proposals/concepts into "interesting" components.
2. Mix & match those components with all other components (most combinations will be trash, but some will be interesting); a toy sketch of this step follows the example below.
For example, Alex Turner’s power-seeking work could be broken down into:
1. Power-seeking
2. Instrumental convergence
3. Grounding concepts in formalized math
4. Deconfusion work
5. MDPs
6. Environment/graph symmetries
You could break the work down into different components, but these are the ones I found interesting. The way (3) is formalized can then be mixed and matched with other alignment concepts such as mesa-optimizers, deception, and interpretability, which are research directions I approve of.
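As a hedged illustration of the mix-and-match step, here is a toy sketch. The lists simply reuse the components and concepts named above, and the enumeration does none of the real work of judging which combinations are promising.

```python
from itertools import product

# Components of Alex Turner's power-seeking work, as listed above;
# any real decomposition would be richer and more carefully stated.
power_seeking_components = [
    "power-seeking",
    "instrumental convergence",
    "grounding concepts in formalized math",
    "deconfusion work",
    "MDPs",
    "environment/graph symmetries",
]

# Other alignment concepts to mix & match against (also from the text).
other_concepts = ["mesa-optimizers", "deception", "interpretability"]

# Step 2 of the algorithm: pair every component with every other concept.
# Most pairings will be trash; a human still has to judge which are interesting.
candidates = [f"{a} x {b}" for a, b in product(power_seeking_components, other_concepts)]

for candidate in candidates[:6]:
    print(candidate)
```

One of the resulting pairings, "grounding concepts in formalized math x deception," is roughly the kind of combination the paragraph above points at.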
For overall future work, we can:
1. Figure out the constraints we want all alignment proposals to have, and how to improve that process for the constraints we are confident in.
2. Improve current research directions by trying to break them and build them back up, getting feedback from experts in those fields (including alignment-adjacent fields like bias & fairness and task specification).
3. Find new research directions by breaking proposals into interesting components and mixing & matching them.
I’d greatly appreciate any comments or posts that do any of these three.