What application do you have in mind? If you’re trying to reason about formal models without trying to completely rigorously prove things about them, then I think thinking of neural networks as stochastic systems is the way to go. Namely, you view training as a stochastic optimization problem that produces a weight-valued random variable, and then condition that variable on whatever knowledge about the weights/activations you assume is available. This can be done both in the Bayesian “thermostatic” sense, as a model of idealized networks, and in the sense of modeling NNs as SGD-like systems. Both approaches are explored explicitly (and give different results) in suitable high-width limits by the PDLT and Tensor Programs paradigms (the latter also looks at “true SGD” with non-negligible step size).
Here you should be careful about what you condition on: conditioning on exact knowledge of too much input-output behavior of course blows stuff up, so you should think of a way of coarse-graining, i.e. “choose a precision scale” :). My first go-to here would be to assume the tempered Boltzmann distribution on the loss, at an appropriate choice of temperature for what you’re studying.
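To make “tempered Boltzmann distribution on the loss” concrete, here is a minimal sketch (my own toy example, not taken from any of the papers mentioned): unadjusted Langevin sampling from p(w) ∝ exp(−L(w)/T) for a one-dimensional quadratic loss. The temperature T plays the role of the precision scale: lower T concentrates the weight distribution more sharply around the loss minimum.

```python
import numpy as np

rng = np.random.default_rng(0)

def grad_loss(w):
    # Toy quadratic loss L(w) = w**2 / 2, so the gradient is just w.
    return w

def sgld_sample(T, eps=0.01, burn=2000, n=50000):
    """Sample from the tempered Boltzmann density p(w) ~ exp(-L(w)/T)
    via the unadjusted Langevin update: a gradient step plus Gaussian
    noise whose scale is set by the temperature T."""
    w = 0.0
    out = np.empty(n)
    for step in range(burn + n):
        w += -eps * grad_loss(w) + np.sqrt(2 * eps * T) * rng.normal()
        if step >= burn:
            out[step - burn] = w
    return out

# Lower temperature concentrates the weights more tightly around the minimum:
# for this loss the stationary variance is approximately T.
print(sgld_sample(1.0).var())
print(sgld_sample(0.1).var())
```

At T = 1, with L the negative log-posterior, this is just Langevin sampling of the Bayesian posterior; lowering T interpolates toward a point estimate, which is one way to operationalize “choose a precision scale”.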
If you’re trying to do experiments, then I would suspect that a lot of the time you can just blindly throw whatever ML-ish tools you’d use in an underdetermined, “true inference” context and they’ll just work (with suitable choices of hyperparameters).
Statistical localization in disordered systems, and dreaming of more realistic interpretability endpoints
[epistemic status: half fever dream, half something I think is an important point to get across. Note that the physics I discuss is not my field though close to my interests. I have not carefully engaged with it or read the relevant papers—I am likely to be wrong about the statements made and the language used.]
A frequent discussion I get into in the context of AI is “what is an endpoint for interpretability”. I get into this argument from two sides:
arguing with interpretability purists, who say that the only way to get robust safety from interpretability is to mathematically prove that behaviors are safe and/or no deception is going on.
arguing with interpretability skeptics, who make the same assumption, that robust safety from interpretability would require proving that behaviors are safe and/or no deception is going on, and conclude that interpretability is therefore a dead end.
My typical response to this is that no, you’re being silly: imagine discussing any other phenomenon in this way. “The only way to show that the sun will rise tomorrow is to completely model the sun on the level of subatomic particles and prove that they will not spontaneously explode.” Or imagine asking a bridge safety expert to model every single particle and provably lower-bound the probability of the bridge losing structural coherence in some way not captured by bulk models.
But there’s a more fundamental intuition here, one that I started developing when I began trying to learn statistical physics. There are a few lossy ways of expressing it. One is to talk about renormalization: the renormalizability of suitable systems is a “theorem” of statistical mechanics, but it is not (and probably never will be) proven mathematically; in some sense, it feels much more like a “truly new flavor of axiom” than even complexity-theoretic conjectures like P vs. NP. But that’s still not quite it. There is a more general intuition, hard to get across (in particular for someone who, like me, is only a dabbler in the subject): some genuinely, incredibly complex and information-laden systems have “strong locality” properties, which are (insofar as the physical meaning of the word “provable” holds) both provable and very robust to changing and expanding the context.
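For readers who haven’t seen renormalization “work” as advertised, here is a tiny standard textbook instance of it (nothing specific to the discussion above): decimating every other spin of a 1D Ising chain provably yields the same model with a renormalized coupling β′ = ½ log cosh 2β, and brute-force enumeration confirms that correlations between the surviving spins are exactly preserved.

```python
import numpy as np
from itertools import product

def corr(beta, i, j, N):
    """Brute-force <s_i s_j> for an open 1D Ising chain of N spins."""
    Z = num = 0.0
    for s in product([-1, 1], repeat=N):
        w = np.exp(beta * sum(s[k] * s[k + 1] for k in range(N - 1)))
        Z += w
        num += w * s[i] * s[j]
    return num / Z

beta = 0.7
# Summing out a middle spin: 2*cosh(beta*(s1 + s3)) = const * exp(beta_p * s1 * s3),
# which gives the renormalized coupling below.
beta_p = 0.5 * np.log(np.cosh(2 * beta))

print(corr(beta, 0, 4, N=9))    # spins 4 apart in the fine chain
print(corr(beta_p, 0, 2, N=5))  # spins 2 apart in the coarse chain: same number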
For a while, I thought that this is just a vibe—a way to guide thinking, but not something that can be operationalized in a way that may significantly convince people without a similar intuition.
However, recently I’ve become more hopeful that an “explicitly formalizable” notion of robust interpretability may fall out of this language in a somewhat natural way.
This is closely related to recent discussions and writeups we’ve been doing with Lauren Greenspan on scale and renormalization in (statistical) QFT and connections to ML.
One direction to operationalize this is through the notion of “localization” in statistical physics, and in particular “Anderson localization”. The idea (if I understand it correctly) is that in certain disordered systems (think of a semiconductor: an “ordered” metal with a disordered system of “impurity atoms” sprinkled inside), you can prove a kind of screening property: from the point of view of the local dynamics near a particular spin, you can provably ignore spins far away from the point you’re studying (or rather, replace them by an “ordered” field that modifies the local dynamics in a fully controllable way). This idea of local interactions being “screened” from far-away details is ubiquitous. In a very large and very robust class of systems, interactions are purely local, except for mediation by a small number of hierarchical “smooth” couplings that see only high-level summary statistics of the “non-local” spins and treat them as a background; moreover, these locality properties are provable (insofar as we assume the extra “axioms” of thermodynamics), given some (once again, hierarchical and robustly adjustable) independence assumptions. There are a number of related principles here that (if I understand correctly) get used in similar contexts, sometimes interchangeably: one I liked is “local perturbations perturb locally” (“LPPL”) from this paper.
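The basic phenomenon is easy to see numerically. A minimal sketch (the standard 1D tight-binding Anderson model, with toy parameters of my choosing): without disorder the eigenstates of a chain are extended plane waves; turning on random on-site energies localizes them, which shows up as a jump in the inverse participation ratio (IPR ≈ 1/N for extended states, O(1) for localized ones).

```python
import numpy as np

rng = np.random.default_rng(0)

def mean_ipr(N=400, W=0.0):
    """Mean inverse participation ratio sum_i |psi_i|**4 over all eigenstates
    of a 1D chain with nearest-neighbour hopping and disorder strength W."""
    H = np.diag(rng.uniform(-W / 2, W / 2, N))                     # random on-site energies
    H += np.diag(np.ones(N - 1), 1) + np.diag(np.ones(N - 1), -1)  # hopping terms
    _, vecs = np.linalg.eigh(H)
    return np.sum(vecs ** 4, axis=0).mean()

print(mean_ipr(W=0.0))  # clean chain: ~ 1/N (extended states)
print(mean_ipr(W=3.0))  # disordered chain: much larger (localized states)
```

Localized eigenstates are the spectral face of the screening property: a local perturbation to such a system only appreciably affects amplitudes within a localization length of it.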
Note that in the above paragraph I did something I generally disapprove of: I am trying to extract and verbalize “vibes” from science that I don’t understand on a concrete level, and I am almost certainly getting a bunch of things wrong. But I don’t know of another way of gesturing in a “look, there’s something here and it’s worth looking into” way without doing this to some extent.
Now AI systems, just like semiconductors, are statistical systems with a lot of disorder. In particular, in a standard operationalization (as e.g. in PDLT), we can conceptualize neural nets as a field theory. There is a “vacuum theory” that depends only on the architecture, and adding new datapoints corresponds to adding particles. PDLT only studies a certain perturbative picture here, but it seems plausible that these techniques may extend to non-perturbative scales (and hope for this is a big part of the reason that Lauren and I have been thinking and writing about renormalization). In a “dream” version of such an extension, the datapoints would form a kind of disordered system, with ordered components, hierarchical relationships, and some assumption of inherent randomness outside of those relationships. A great aspect of “numerical” QFT, as applied in condensed matter models, is that you don’t need a really great model of the hierarchical relationships: sometimes you can just play around and turn on a handful of extra parameters until you find something that works. (Again, at the moment this is an imprecise interpretation of things I have not deeply engaged with.)
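One concrete, uncontroversial piece of the “vacuum theory depends only on the architecture” statement can be checked in a few lines: at random initialization, the output covariance (two-point function) of a wide one-hidden-layer ReLU net converges to an architecture-determined kernel, independent of the particular draw of weights. A minimal sketch (my own toy setup, with standard 1/sqrt(width) output scaling):

```python
import numpy as np

rng = np.random.default_rng(0)

def empirical_kernel(width, n_nets=2000):
    """Empirical 2x2 output covariance over random inits of a one-hidden-layer
    ReLU net on two orthonormal inputs: the 'vacuum' two-point function."""
    x1, x2 = np.array([1.0, 0.0]), np.array([0.0, 1.0])
    outs = np.empty((n_nets, 2))
    for k in range(n_nets):
        W1 = rng.normal(size=(width, 2))              # hidden-layer weights
        w2 = rng.normal(size=width) / np.sqrt(width)  # 1/sqrt(width) output scaling
        outs[k] = [w2 @ np.maximum(W1 @ x1, 0.0), w2 @ np.maximum(W1 @ x2, 0.0)]
    return outs.T @ outs / n_nets  # outputs have mean ~0, so this is the covariance

print(empirical_kernel(64))
print(empirical_kernel(512))  # wider net: same kernel, smaller finite-width noise
```

Both matrices approach the same arc-cosine kernel (diagonal 1/2, off-diagonal 1/(2π) for orthonormal inputs); in the PDLT picture, the finite-width corrections to this free “vacuum” theory are exactly where the interesting interactions live.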
Of course doing this makes some assumptions, but the assumptions are on the level of the data (i.e., particles), not the weights/model internals (i.e., fields, the place where we are worried about misalignment, etc.). And if you grant these assumptions and write down a “localization theorem” result, then plausibly the kind of statement you will get is something along the following lines:
More generally, the kind of information this picture would give is a kind of “local, provably robust interpretability”, where the text-completion behavior of a model is provably (under suitable “disordered system” assumptions) reducible to a collection of local circuits that depend on understandable phenomena at a few different scales. A guiding “complexity intuition” for me here is provided by the nontrivial but tractable grammar-task diagrams in the paper by Marks et al. (See pages 25-27, and note that the shape of these diagrams is typical of the shape of the non-renormalized interaction diagrams you see before applying renormalization to simplify a statistical system.)
An important caveat here is that in physical models of this type (and in pictures that include renormalization more generally), one does not make, or assume, any “fundamentality” assumptions. In many cases a number of alternative (but equivalent, once the “screening” is factored in) pictures exist, with various levels of granularity, elegance, etc. This can already be seen in the 2D Ising model (a simple magnet model), where the same behaviors can be understood either in a combinatorial “spin-to-spin interaction” way, which mirrors the “fundamental interpretability” desires of mechinterp, or through a “recursive screening out” picture that is more renormalization-flavored; the results are the same (to a very high level of precision), even when looking at very localized effects involving collections of a few spins. So the question of whether an interpretation is “fundamental” or uses the “right latents” is to a large extent obviated here; the world of thermodynamics is much more anarchical and democratic than the world of mathematical formalism and “elegant proof”, at least in this context.
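The “alternative but equivalent pictures” point is easy to make concrete in the even simpler 1D Ising chain (a toy stand-in for the 2D case, which needs more machinery): the “fundamental” combinatorial sum over all spin configurations and the transfer-matrix computation, which absorbs (“screens out”) one spin at a time, give identical correlation functions.

```python
import numpy as np
from itertools import product

beta, N = 0.5, 10

# Picture 1: "fundamental" sum over all 2^N spin configurations.
def corr_brute(i, j):
    Z = num = 0.0
    for s in product([-1, 1], repeat=N):
        w = np.exp(beta * sum(s[k] * s[k + 1] for k in range(N - 1)))
        Z += w
        num += w * s[i] * s[j]
    return num / Z

# Picture 2: transfer matrix, absorbing one spin at a time into a 2x2 matrix.
T = np.array([[np.exp(beta), np.exp(-beta)],
              [np.exp(-beta), np.exp(beta)]])
Sz = np.diag([1.0, -1.0])
u = np.ones(2)

def corr_tm(i, j):
    left = u @ np.linalg.matrix_power(T, i) @ Sz
    mid = np.linalg.matrix_power(T, j - i)
    right = Sz @ np.linalg.matrix_power(T, N - 1 - j) @ u
    Z = u @ np.linalg.matrix_power(T, N - 1) @ u
    return left @ mid @ right / Z

print(corr_brute(2, 6), corr_tm(2, 6))  # identical: both equal tanh(beta)**4
```

Neither picture is more “fundamental” than the other; they are different bookkeeping for the same physical content, which is the situation I would hope for with competing interpretations of a network.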
Having handwavily described a putative model, I want to quickly say that I don’t actually believe in this model. There are a bunch of things I probably got wrong, there are a bunch of other, better tools to use, and so on. But the point is not the model: it’s that this kind of stuff exists. There exist languages that show that arbitrarily complex, arbitrarily expressive behaviors are provably reducible to local interactions, where behaviors can be understood as clusters of hierarchical interactions that treat all but a few parts of the system at every point as “screened out noise”.
I think that if models like this are possible, then a solution to “the interpretability component of safety” is possible in this framework. If you have provably localized behaviors, then for example you have a good idea of where to look for deception: deception cannot occur on the level of very low-level local interactions, as they are too simple to express the necessary reasoning, and perhaps it can be carefully operationalized and tracked in the higher-level interactions.
As you’ve no doubt noticed, this whole picture is splotchy and vague. It may be completely wrong. But there also may be something in this direction that works. I’m hoping to think more about this, and very interested in hearing people’s criticisms and thoughts.