I agree that the mismatch between "assumptions" and the "real world" makes formal certificates of real-world alignment largely intractable.
E.g. a broad assumption like "pure self-supervised learning does not exhibit strategic behaviour" (suitably formalised) is almost certainly not justifiable in the real world, but it would be a good starting point for reasoning about other alignment schemes.
My point is that the list of assumptions you have to make for each alignment approach could be an interesting metric to track. You end up with a table where alignment approaches are the rows and the candidate assumptions are the columns, with each row marking the subset of assumptions that approach needs. Approaches are then ranked by how grounded their necessary assumptions are (in aggregate), and progress is made by incrementally improving the proofs in ways that replace broad assumptions with more grounded ones.
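To make the table idea concrete, here is a rough sketch in Python. Everything in it is a placeholder: the approach names, the assumption names, and the groundedness scores are hypothetical, and scoring an approach by its weakest assumption is just one possible reading of "in aggregate" (a mean or a product would be equally defensible).

```python
# Hypothetical groundedness scores in [0, 1]: 0 = broad and unjustified,
# 1 = well grounded. Names and numbers are placeholders, not real assessments.
GROUNDEDNESS = {
    "ssl_not_strategic": 0.2,  # the broad assumption from the example above
    "assumption_b": 0.6,
    "assumption_c": 0.9,
}

# The table: one row per alignment approach, listing the assumptions
# (columns) that its proof currently relies on.
REQUIRED = {
    "approach_x": {"ssl_not_strategic", "assumption_b"},
    "approach_y": {"assumption_b", "assumption_c"},
}


def aggregate_groundedness(assumptions, scores):
    """Score a set of assumptions by its weakest member (an assumed choice)."""
    # An approach that needs no assumptions is treated as fully grounded.
    return min((scores[a] for a in assumptions), default=1.0)


def rank_approaches(required, scores):
    """Return approach names ordered from most to least grounded."""
    return sorted(
        required,
        key=lambda name: aggregate_groundedness(required[name], scores),
        reverse=True,
    )


if __name__ == "__main__":
    for name in rank_approaches(REQUIRED, GROUNDEDNESS):
        print(name, aggregate_groundedness(REQUIRED[name], GROUNDEDNESS))
```

"Progress" in this picture is an edit to a row: a better proof for `approach_x` that swaps `ssl_not_strategic` for a more grounded assumption raises that row's aggregate score.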
Will read the link