ATA is extremely neglected. The field of ATA is at a very early stage, and currently there does not exist any research project dedicated to ATA. The present post argues that this lack of progress is dangerous and that this neglect is a serious mistake.
I agree it’s neglected, but there is in fact at least one research project dedicated to at least designing alignment targets: the part of the formal alignment agenda dedicated to formal outer alignment, which is the design of math problems to which solutions would be world-saving. Our notable attempts at this are QACI and ESP (there was also some work on a QACI2, but it predates, and in my opinion is superseded by, ESP).
Those try to implement CEV in math. They only work for doing CEV of a single person or small group, but that’s fine: just do CEV of {a single person or small group} which values all of humanity/moral-patients/whatever getting their values satisfied instead of just that group’s values. If you want humanity’s values to be satisfied, then “satisfying humanity’s values” is not opposite to “satisfy your own values”, it’s merely the outcome of “satisfy your own values”.
I think I see your point. Attempting to design a good alignment target could lead to developing intuitions that would be useful for ATA. A project trying to design an alignment target might result in people learning skills that allow them to notice flaws in alignment targets proposed by others. Such projects can therefore contribute to the type of risk mitigation that I think is lacking. I think that this is true. But I do not think that such projects can be a substitute for an ATA project with a risk mitigation focus.
Regarding Orthogonal:
It is difficult for me to estimate how much effort Orthogonal spends on different types of work. But it seems to me that your published results are mostly about methods for hitting alignment targets. This also seems to me to be the case for your research goals. If you are successful, it seems to me that your methods could be used to hit almost any alignment target (subject to constraints related to finding individuals that want to hit specific alignment targets).
I appreciate you engaging on this, and I would be very interested in hearing more about how the work done by Orthogonal could contribute to the type of risk mitigation effort discussed in the post. I would, for example, be very happy to have a voice chat with you about this.