I think I agree with johnswentworth’s comment. I think there is a chance that equating genuinely useful ASI safety-related work with deceptive alignment could be harmful. To give another perspective, I would also add that I think your definition of deceptive alignment is very broad: broad enough to encompass areas of research that I find quite distinct (e.g. better training methods vs. better remediation methods), yet it still seems to exclude some other things that I think matter a lot for AGI safety. Some quick examples I thought of in a few minutes are:
Algorithms that fail less in general and are more likely to pass the safety-relevant tests we DO throw at them. This seems like a very broad category.
Containment
Resolving foundational questions about embedded agents, rationality, paradoxes, etc.
Forecasting
Governance, policy, law
Working on near-term problems like fairness/bias, recommender systems, and self-driving cars. These problems are often thought of as separate from AGI safety, but it seems very likely that working on them now will teach us lessons that are useful later.
That’s fair. I think my statement as presented is much stronger than I intended it to be. I’ll add some updates/clarifications at the top of the post later.
Thanks for the feedback!