I think I’m something like 30% on the claim that ‘the highest-leverage point for alignment work is once we have models that are capable of alignment research; we should focus on maximising the progress we make at that point, rather than on making progress now or on making it to that point, since most of the danger comes after it’.
Things this maybe implies:
We should try to differentially advance models’ ability to do alignment research relative to other abilities (abilities required to be dangerous, or abilities required to accelerate capabilities)
For instance, trying to make really good datasets related to alignment, e.g. by paying humans to proliferate/augment all the alignment research and writing we have so far
Figuring out which combination of math/code/language/arXiv data etc. seems most conducive to alignment-relevant capabilities
More generally, researching how to develop models that are strong in some domains and handicapped in others
We should focus on getting just enough alignment to extract models’ alignment research capabilities
This might mean we only need to align:
Models that are not agentic and are not actively trying to deceive you
Models that are subhuman in many domains
If we think these models are going to be close to having agency, maybe we want to avoid RL or other finetuning that incentivizes the model to think about its environment or its human supervisors. Instead of RLHF, maybe we want techniques more like interpretability or extracting latent knowledge from the model’s representations? (A rough sketch of the latter idea follows after this list.)
We should think about how we can use powerful models to accelerate alignment
We should focus more on how we would recognise good alignment research than on how to produce it
For example, setups where you can safely train a fairly capable model according to some proposed alignment scheme, and see how well it works?
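To make the ‘extract latent knowledge from representations rather than RLHF’ bullet a bit more concrete, here is a minimal sketch of one version of that idea: train a linear probe on a model’s hidden states to read off whether it ‘believes’ a statement, without any finetuning of the model itself. Everything specific here is illustrative rather than a concrete proposal: GPT-2 as a stand-in model, a handful of made-up labelled statements, and a plain supervised logistic-regression probe (real work in this direction is more sophisticated, e.g. unsupervised probing).

```python
# Toy "probe the representations instead of finetuning" sketch.
# Illustrative assumptions: GPT-2 as a stand-in model, made-up
# true/false statements, a supervised logistic-regression probe.

import numpy as np
import torch
from sklearn.linear_model import LogisticRegression
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "gpt2"  # stand-in; any model exposing hidden states would do

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME)
model.eval()


def last_token_hidden_state(text: str) -> np.ndarray:
    """Final-layer hidden state of the last token of `text`."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs, output_hidden_states=True)
    # hidden_states[-1] has shape (batch, seq_len, hidden_dim)
    return outputs.hidden_states[-1][0, -1].numpy()


# Tiny illustrative dataset of statements with truth labels (made up here;
# the interesting question is whether a probe trained on easy-to-check
# statements transfers to ones humans can't easily verify).
statements = [
    ("The capital of France is Paris.", 1),
    ("The capital of France is Madrid.", 0),
    ("Water boils at 100 degrees Celsius at sea level.", 1),
    ("Water boils at 10 degrees Celsius at sea level.", 0),
    ("Two plus two equals four.", 1),
    ("Two plus two equals five.", 0),
]

X = np.stack([last_token_hidden_state(s) for s, _ in statements])
y = np.array([label for _, label in statements])

# Linear probe: read a "truth direction" out of the representations
# without updating the model's weights at all.
probe = LogisticRegression(max_iter=1000).fit(X, y)

test = "The capital of Japan is Tokyo."
p_true = probe.predict_proba(last_token_hidden_state(test).reshape(1, -1))[0, 1]
print(f"P(true) for {test!r}: {p_true:.2f}")
```

The property the bullet above cares about is that the probe only reads the representations: nothing in this loop optimises the model against human feedback or gives it a reason to model its supervisors.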