Not an expert, but what if we took all our current proposed solutions to alignment, and the assumptions they implicitly make, and try and formalise them using some type of proof assistant? Could that at least be a useful pedagogical tool for understanding the current research landscape?
Executable textbooks are fun but the odds that playing one cultivates the parts of alignment research that are hard or important seem slim to me.
Crucially to your comment though, what work is “implicitly” doing? Implicitly is where most preparadigmatic work happens! This is loosely related to the false sense of security. Formalization starts when you admit your assumptions as a premise. Formal methods do not second guess assumptions.
I agree that mismatch between “assumptions” and “real world“ make getting formal certificates of real world alignment largely intractable.
E.g. if you make a broad assumption like “pure self supervised learning does not exhibit strategic behaviour” (suitably formalised), that is almost certainly not justifiable in the real world, but it would be a good starting point to reason about other alignment schemes.
My point is, the list of assumptions you have to make for each alignment approach could be an interesting metric to track. You end up with a table where alignment approaches are rows and the set of necessary assumptions are the columns. Alignment approaches are then ranked based on how grounded the necessary subset of assumptions are (in aggregate), and progress is made by incrementally improving the proofs in ways that replace broad assumptions with more grounded ones.
sounds pretty damn hard, but maybe research workers at google who worked on autoformalization (similar papers) might be able to help this happen at significant scale. if you, like me, find videos far easier to get through than text, here’s a talk on the topic that was given recently:
Intuitively, asking the LLM to translate from english to coq is a cousin of “ask the LLM to summarize the coq into english”, via security mindset. What assurances can we get about the translation process? This seems dicey and hard. I’m imagining that there’s an auditing game design out there that would give us some hope though.
most codegen techniques assume a margin of error until proven otherwise, that includes extraction from coq to ocaml or haskell and lots of program synthesis techniques I’ve heard about. Extraction (like that from coq to ocaml or haskell) is probably the furthest away from SGD I think you can get on “the codegen spectrum”.
I’m mostly interested in formal checking of statements not generated by strictly formal reasoning, though. the benefit of automatic formalization is to be sure a machine checked the interference pattern of implications of a claim I already knew I wanted to make. ie, the stronger version of fuzz testing.
But actually what I had in mind is something simpler that would not necessarily need such tools to be feasible. Basically akin to taking the main argument of each approach, as expressed in natural language, without worrying too much about all the baggage at finer levels of abstraction. But I guess this is not quite what the article is about ..
Not an expert, but what if we took all our current proposed solutions to alignment, and the assumptions they implicitly make, and try and formalise them using some type of proof assistant? Could that at least be a useful pedagogical tool for understanding the current research landscape?
Executable textbooks are fun but the odds that playing one cultivates the parts of alignment research that are hard or important seem slim to me.
Crucially to your comment though, what work is “implicitly” doing? Implicitly is where most preparadigmatic work happens! This is loosely related to the false sense of security. Formalization starts when you admit your assumptions as a premise. Formal methods do not second guess assumptions.
If I recall correctly the Cardano team’s paper on agile formal methods may have had some comments on this, but I’m not 100% sure.
I agree that mismatch between “assumptions” and “real world“ make getting formal certificates of real world alignment largely intractable.
E.g. if you make a broad assumption like “pure self supervised learning does not exhibit strategic behaviour” (suitably formalised), that is almost certainly not justifiable in the real world, but it would be a good starting point to reason about other alignment schemes.
My point is, the list of assumptions you have to make for each alignment approach could be an interesting metric to track. You end up with a table where alignment approaches are rows and the set of necessary assumptions are the columns. Alignment approaches are then ranked based on how grounded the necessary subset of assumptions are (in aggregate), and progress is made by incrementally improving the proofs in ways that replace broad assumptions with more grounded ones.
Will read the link
sounds pretty damn hard, but maybe research workers at google who worked on autoformalization (similar papers) might be able to help this happen at significant scale. if you, like me, find videos far easier to get through than text, here’s a talk on the topic that was given recently:
Intuitively, asking the LLM to translate from english to coq is a cousin of “ask the LLM to summarize the coq into english”, via security mindset. What assurances can we get about the translation process? This seems dicey and hard. I’m imagining that there’s an auditing game design out there that would give us some hope though.
yeah it only helps if you read the generated formal code. it’s almost useless otherwise.
most codegen techniques assume a margin of error until proven otherwise, that includes extraction from coq to ocaml or haskell and lots of program synthesis techniques I’ve heard about. Extraction (like that from coq to ocaml or haskell) is probably the furthest away from SGD I think you can get on “the codegen spectrum”.
fair point.
I’m mostly interested in formal checking of statements not generated by strictly formal reasoning, though. the benefit of automatic formalization is to be sure a machine checked the interference pattern of implications of a claim I already knew I wanted to make. ie, the stronger version of fuzz testing.
Thanks for the link!
But actually what I had in mind is something simpler that would not necessarily need such tools to be feasible. Basically akin to taking the main argument of each approach, as expressed in natural language, without worrying too much about all the baggage at finer levels of abstraction. But I guess this is not quite what the article is about ..