Quinn comments on Speedrunning 4 mistakes you make when your alignment strategy is based on formal proof

Quinn 20 Feb 2023 19:22 UTC
1 point
0
I think a coq bug or the type of mistakes that are easy to make in formalization are more likely than a transistor failure or a bad bit flip bubbling up through the stack, but still not what I’m talking about.

In the background, I’m thinking a little about this terry tao post, about how the process of grokking the proof (and being graceful with typos, knowing which typos are bad and which are harmless, and so on) is where the state of the art mathematics lies, not in the proof itself.

I was discussing your comment with a friend, who suggested what I’m calling factored cognition parallel auditing (FCPA): she asked why don’t we just divide the 1e6 lines of coq into 1000-line segments and send it out to 1000 experts each of whom make sense of their segment? The realities would be a little messier than linearly stitching together segments, but this basic setup (or something that emphasizes recursion, like a binary search flavored version) would I think buy us roughly 2-3x the assurance that RH true over just blindly trusting the type theory / laptop to execute Riemann.v in the coq session.

In the current comment, let’s assume a background of russell-like “enfeeblement” or christiano-like “losing control of the world/future in a slow burn”, filtering out the more yudkowsky-like threatmodels. Not to completely discount malicious injections in Riemann.v, but they don’t seem productive to emphasize. I will invoke the vague notions of “theorem intricacy” and “proof length”, but it should be possible to not read this footnote^[1] and still follow along with the goals of the current comment.

This isn’t really about RH, RH isn’t necessarily important, etc. I was actually alluding to / vaguely criticizing a few different ideas at once in the paragraph you quoted.
1. The proof assistants or applied type theory technologies will help agent foundations researchers progress and/or teach. I’m bearish on this largely because I think agent foundations researchers care more about the aforementioned Tao blogpost than about the Kevin Buzzard side of the aisle (recall that Kevin Buzzard is the guy leading the charge to put all math department operations on Lean, from undergrad coursework to the state of the art). I don’t think FCPA would help with Tao’s notion of mathematical sophistication and understanding, barring pretty drastic advances in HCI/BCI. If theorem intricacy and proof length escapes us in agent foundations, formalization doesn’t solve many of the problems that would present.
2. Alignment itself turns out to be deeply inelegant. In these worlds, we may be using a flavor of codegen to even write down the theorem/spec, not to mention the proofs. Let’s assume type theory is secure (no leak or unbox related attack surface) and trustworthy (no false positives or false negatives). Then, I don’t know how to beat FCPA at arbitrary theorem intricacy and proof size. Maybe a massive FCPA regime could “understand alignment” in an “institutions as distributed cognition” sense. I find this unsatisfying, but it may be the best we can do.
3. The LLM does alignment research for us, and communicates it to us via type theory (there are flavors of this approach that do not invoke type theory, and I think those are even worse). Even when we assume type theory is secure and trustworthy, we have the same problems I highlighted in the preceding list entries. But SGD and the programs it writes are untrusted, so the standard of grokking the old fashion way should be really high. Trusting an auditing process like FCPA to be adequate here feels really dicey to me.
I think we should be sensitive to “which parts of the proof are outsourced to SGD or the programs it writes?” as a potential area of research acceleration. I should be willing to be convinced that parts of the proof are harmless to automate or not pay close attention to, deferring to the type theory or some proof automation regime that may even involve SGD in some way.

If we’re relying on computational processes we don’t understand, we still get enfeeblement and we still risk losing control of the world/future over time even if type theory is a bajillion times more secure and trustworthy than SGD and the programs it writes.

I hope that makes my worldview more crisp.
1. ↩︎
  There’s not a widely agreed upon metric or proxy for theorem size, I’m intentionally using a wishy-washy word like “intricacy” to keep the reader from taking any given metric or proxy too seriously. I think the minimum viable quantitative view is basically AST size, and it’s ok to be unsatisfied about this. Proof length is sort of similar. A proof object is kind of like a lambda term, it’s size can be thought of as tree size. Both of these are not “basis independent”, each of these measurements will be sensitive to variation in syntactic choices, but that probably doesn’t matter. There’s also a kind of kolmogorov complexity point of view, which I think usually prefers to emphasize character count of programs over anything to do with the tree perspective. A particular kind of constructivist would say that theorem intricacy of $P$ is length of the shortest proof of $P$ , but I think there are benefits to separating statement from proof, but that would take me too far afield. A turing machine story would emphasize different things but not be qualitatively different.

Quinn comments on Speedrunning 4 mistakes you make when your alignment strategy is based on formal proof

If we’re relying on computational processes we don’t understand, we still get enfeeblement and we still risk losing control of the world/​future over time even if type theory is a bajillion times more secure and trustworthy than SGD and the programs it writes.

If we’re relying on computational processes we don’t understand, we still get enfeeblement and we still risk losing control of the world/future over time even if type theory is a bajillion times more secure and trustworthy than SGD and the programs it writes.