TL;DR:
Requesting feedback or advice on questions of what math to study (or what projects to engage in) in order to learn how to prove things about programs and to analyze the difficulty/possibility of proving things about programs. Ulterior motive: AI safety (and also progress towards a university degree).
Motivation, Goal, Background
To my current understanding, any endeavor to figure out how to program[1] safe AI would probably eventually[2] benefit a lot from an ability to prove various properties of programs:
The AI will have to self-modify, or create successor agents, i.e. (re)write (parts of) new AI-software; and we’d probably want the AI to be able to prove[3] that said AI-software satisfies various “safety properties” (which I would not yet even know how to specify).
I’d like to get closer to being able to design the above kind of AI, and to better understand (the limits of) proving things about programs. To my current tentative understanding, the knowledge I’m looking for probably lies mostly under the umbrella of type theory.
(Also, I’m currently studying for a degree at a university and have the opportunity to choose a project for myself; and I think learning about type theory, lambda calculus, and/or automated theorem proving might be the most useful way to use that opportunity. (Since I don’t see a way to pass deconfusion work off as a university project.))
Sub-goals, skills
To clarify the top-level goal here (viz. being able to design AIs that are able to prove stuff about AI-programs), I’ve listed below some (semi-)concrete examples of skills/abilities which I imagine would constitute progress towards said goal:
-
Given a type system S and some property P of programs/terms, be able to (dis)prove that some, all, or no terms of S have property P.
As simple concrete examples:Are programs in a given language guaranteed to terminate? (Do all terms in the given type system have normal forms?)
Given some function F: Can some term in the given type system implement F?
Does a given type system allow general recursion?
-
Be able to write a program R that, given encodings of S, P, and a term T of S,
finds a (dis)proof that T has property P (if doing so is possible). -
Given a property P of programs, and a type system S,
design a (minimal) type system S’ that admits terms/programs R that are able to prove (as above) properties P of some/all terms of S
or prove that no such type system S’ exists (for P and S).
-
Analyze the computational complexity of finding a (dis)proof of term T having property P, for various T and P.
-
Understand the different ways that program-properties can be specified, and how different specification methods might affect (e.g.) the complexity of proving those properties.
-
Be able to determine whether, when, and how
designing type systems would actually even be relevant/useful for building safe AI (sufficiently useful to merit spending time on it).
Questions
Problem: There appears to be a very wide variety of theory[4] that might (or might not) be useful to the above goals. I lack the knowledge to discern what theory would be useful. And so, to avoid wasting a lot of time on studying non-useful things, my questions are:
-
What would be good interventions for learning the above mentioned skills? (Ideally, the interventions would be something like studying a textbook, reading a bunch of papers, or programming a bare-bones theorem prover, or some other compact project that can be undertaken in 100h-200h of work; but any and all suggestions are very welcome.)
-
Are the skills/knowledge listed above actually useful to the stated top-level goal? (Are they even coherent?) Aside from the skills listed above, what would be some other useful skills or knowledge?
-
What is your assessment of the (non-)usefulness to AI safety of developing the above kinds of skills? What about the (non-)usefulness to AI safety of studying type theory more generally? [3:1]
-
Does the reasoning in this post make sense? What parts of this post feel most wrong or confused to you?
Update
I tried to read MIRI’s paper on Vingean reflection, and skimmed Proof-producing reflection for HOL, but didn’t understand much. Notable knowledge I seem to be missing includes things like
Modal logic, or provability logic, or some kind of proof theory
More than a superficial familiarity with Gödel’s theorems
Higher-order logic (I only have a basic familiarity with FOL).
Possibly: Model theory?
Now I’m wondering: How useful/necessary would it be to study those subjects? Would it be possible to approach the AI-proves-things-about-AI-software problem via type theory, without going much into the above? (Although I’m guessing that some kinds of incompleteness results à la Gödel would come up in type theory too.)
- ↩︎
To program safe seed AI “by hand”, as opposed to outsourcing the programming (/weight-finding) to e.g. gradient descent.
- ↩︎
I’m guessing there’s probably a large amount of deconfusion work that would need to be done before program-verification-stuff becomes applicable to AI safety.
- ↩︎↩︎
My (ignorant, fuzzy-intuition-based) guess is that in practice, to get over Vingean reflection / corrigibility problems, we’ll need to use some means other than trying to design an AI that can formally prove complicated properties of AI-software it writes for itself. Like figuring out how to design an AI that can derive probabilistic proofs about its mental software. (Something something Logical Induction?) This raises the question: Is studying type theory—or other approaches to agent-formally-proves-things-about-programs—a waste of time?
- ↩︎
“Types and Programming Languages” by Benjamin Pierce is a good textbook for learning about typesystems in general (both theory and implementation). While it doesn’t cover dependent types in detail (it only goes as far as F_Omega in that direction), it provides a solid background for further reading.
“Software Foundations” (available at https://softwarefoundations.cis.upenn.edu/) is about proving programs correct in Coq (with embeddings of various Hoare logics). VST (https://vst.cs.princeton.edu/) shows that the methodology in Software Foundation extends to proving properties about actual C programs.
MIRI’s papers “Proof-Producing Reflection for HOL” (https://intelligence.org/files/ProofProducingReflection.pdf) and “Definability of Truth in Probabilistic Logic” (https://intelligence.org/files/DefinabilityTruthDraft.pdf) are partial positive results on side-stepping some problems of self-reference in formal logic (the former via stratification of the logic, the latter via generalizing truth values to probabilities).
Thanks for the suggestions!
Software Foundations in particular looks highly apposite, and in a useful format, to boot.
TaPL was indeed one of the books I was considering reading. (Hesitating between that and Lambda Calculus and Combinators: an Introduction.) Gave it another point in my mental ranking.
Based on the little I understood of MIRI’s papers (the first of which also prompted me to read their paper on Vingean Reflection), they seem interesting, but currently inaccessible to me. Added an edit/update to my post.