My January alignment theory Nanowrimo

[Update: first post is up, here, second post is here, third post is here, fourth post, fifth post, and sixth post are shortforms.]

This is a quick announcement/commitment post:

I’ve been on the PIBBSS Horizon Scanning team (with Lauren Greenspan and Lucas Teixeira), where we have been reviewing some “basic-science-flavored” alignment and interpretability research and doing talent scouting (see this intro doc we have written so far, which we split off from an unfinished larger review). I have also been working on my own research. Aside from active projects, I’ve accumulated a bit of a backlog of technical writeups and shortforms in draft or “Slack discussion”-level form, with various levels of publishability.

This January, I’m planning to edit and publish some of these drafts as posts and shortforms on LW/the Alignment Forum. To keep myself accountable, I’m committing to publish at least 3 posts per week.

I’m planning to post about (a subset? superset? overlapping set? of) the following themes:

  1. Opinionated takes on a few research directions (I have drafts on polytopes, mode connectivity, and takes on proof vs. other kinds of “principled formalism without proofs”).

  2. Notes on grammars and, more generally, on how simpler rules and formal structures can combine into larger ones. This overlaps with a project I’m working on with collaborators, involving a notion of “analogistic circuits”: mechanisms that learn to generalize a complex rule “by analogy”, without ever encoding the structure itself.

  3. Joint with Lauren Greenspan and Lucas Teixeira: some additional bits of our review, with a focus on interpretability (and ways to think about assumptions and experiments).

  4. Joint with Lauren: some distillation and discussion of QFT methods in interpretability.

  5. Bayesian vs. SGD learning from various points of view. (Closely related to discussions with Kaarel Hänni, Lucius Bushnaq, and others.)

  6. Related to the above: Extensions of the “Low-Hanging-Fruit” prior post with Nina Panickssery, specifically focusing on non-learnability of parity, and a new notion of “training stories” (this is closely related to some other work we’ve done with Nina, as well as joint work with Louis Jaburi).

  7. ???

I am generally resistant to making announcements before doing writeups. In this case, though, I have thought for a while that these drafts might be useful to get out, but I have been blocked by not wanting to post unpolished things. I’ll be pointing at this announcement when posting this month for the following reasons:

  • I will appreciate the extra accountability.

  • Since I’m planning a kind of “nanowrimo” sprint, I’m using this as an excuse to post draft-quality writing (possibly with mistakes, bugs, etc.).

  • I’m hoping to treat this month as a test run of producing more short, imperfect, and slightly technical takes which straddle the line between distillation, hot takes, and original research (a very ambitious comparison point I have for the format is Terry Tao’s blog). Depending on the success and reception of this short project, I might do either more or less of this in the future.

  • I’m expecting to be wrong about some things, and hoping that more eyes and discussion on the work I and my collaborators have been thinking about will help me find mistakes quickly and debug my thinking more effectively.