Something I’d add to “plan for a wide variety of scenarios” is to look for solutions that do not refer to those scenarios. A solution involving [and here we test for deceptive alignment (DA)] is going to generalise badly (even assuming such a test could work), but so too will a solution involving [and here we test for DA, x, y, z, w...].
This argues for not generating all our problem scenarios ahead of time: it’s useful to hold some back as a test set. If the solution I devise after thinking only about x and y also works for z and w, then I have higher confidence in it than if I’d generated x, y, z, w before I started looking for a solution.
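To make the test-set analogy concrete, here is a toy Python sketch. Everything in it is hypothetical: the scenario labels, the two stand-in "solutions", and the `held_out_pass_rate` helper are only illustrations of the protocol (design against some scenarios, then check against ones generated afterwards), not a claim about how any real test would work.

```python
from typing import Callable, Iterable

Scenario = str
# A "solution" here is just a predicate: does it handle this scenario?
Solution = Callable[[Scenario], bool]


def held_out_pass_rate(solution: Solution, held_out: Iterable[Scenario]) -> float:
    """Fraction of held-out scenarios the solution handles."""
    scenarios = list(held_out)
    if not scenarios:
        raise ValueError("need at least one held-out scenario")
    return sum(solution(s) for s in scenarios) / len(scenarios)


# Hypothetical labels: x and y were known while designing the solution,
# z and w were only generated afterwards.
design_set = ["x", "y"]
held_out_set = ["z", "w"]


def scenario_specific(scenario: Scenario) -> bool:
    # Stand-in for a solution that explicitly refers to the scenarios
    # it was designed against ("and here we test for x, y").
    return scenario in design_set


def scenario_agnostic(scenario: Scenario) -> bool:
    # Stand-in for a solution that never refers to any particular scenario.
    # Assumed, for illustration only, to handle everything.
    return True


specific_rate = held_out_pass_rate(scenario_specific, held_out_set)
agnostic_rate = held_out_pass_rate(scenario_agnostic, held_out_set)
```

The held-out score is only informative because z and w played no part in the design; if all four scenarios had been listed up front, both stand-ins could have been tuned to pass them.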
For this reason, I’m not so pessimistic about putting a lot of effort into solving DA. I just wouldn’t want people to be thinking about DA-specific tests or DA-specific invariants.