Executable textbooks are fun, but the odds that playing one cultivates the parts of alignment research that are hard or important seem slim to me.
Crucially for your comment, though: what work is “implicitly” doing? “Implicitly” is where most preparadigmatic work happens! This is loosely related to the false sense of security. Formalization starts when you admit your assumptions as premises; formal methods do not second-guess assumptions.
If I recall correctly, the Cardano team’s paper on agile formal methods may have had some comments on this, but I’m not 100% sure.
I agree that the mismatch between “assumptions” and the “real world” makes getting formal certificates of real-world alignment largely intractable.
E.g., if you make a broad assumption like “pure self-supervised learning does not exhibit strategic behaviour” (suitably formalised), that is almost certainly not justifiable in the real world, but it would be a good starting point for reasoning about other alignment schemes.
My point is, the list of assumptions you have to make for each alignment approach could be an interesting metric to track. You end up with a table where alignment approaches are the rows and the necessary assumptions are the columns. Alignment approaches are then ranked by how grounded their necessary subset of assumptions is (in aggregate), and progress is made by incrementally improving the proofs so that broad assumptions are replaced with more grounded ones.
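To make the bookkeeping concrete, here is a minimal sketch of that table in Python. Every approach name, assumption label, and groundedness score below is hypothetical, and the aggregation rule (a plain average) is just one choice among several; it is meant only to illustrate the rows-by-columns structure and the ranking step.

```python
from dataclasses import dataclass

# Groundedness of each assumption on a 0-1 scale (1 = well grounded,
# 0 = broad / unjustified). Labels and scores are placeholders, not
# real assessments.
ASSUMPTION_GROUNDEDNESS = {
    "no_strategic_behaviour_in_pure_SSL": 0.2,
    "reward_model_generalises_in_distribution": 0.5,
    "oversight_signal_is_uncorrupted": 0.4,
}

@dataclass
class AlignmentApproach:
    name: str
    required_assumptions: list[str]  # the columns this row depends on

    def aggregate_groundedness(self) -> float:
        # Aggregate by averaging; a stricter version might use min(),
        # since one ungrounded assumption can sink the whole argument.
        scores = [ASSUMPTION_GROUNDEDNESS[a] for a in self.required_assumptions]
        return sum(scores) / len(scores)

# Hypothetical rows of the table.
approaches = [
    AlignmentApproach("scheme_A", ["no_strategic_behaviour_in_pure_SSL"]),
    AlignmentApproach("scheme_B", ["reward_model_generalises_in_distribution",
                                   "oversight_signal_is_uncorrupted"]),
]

# Rank approaches by how grounded their necessary assumptions are, in aggregate.
for a in sorted(approaches, key=lambda x: x.aggregate_groundedness(), reverse=True):
    print(f"{a.name}: {a.aggregate_groundedness():.2f}")
```

Progress, in this framing, would show up as edits to the table: an assumption’s groundedness score rises as it gets better empirical or proof support, or a broad column is split into narrower, better-grounded ones.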
Will read the link.