Traps of Formalization in Deconfusion

Introduction

The Depression of Deconfusion

It had made perfect sense to you. It just clicked, and clarified so much. But now you’re doubting again. Suddenly, your argument doesn’t feel so tight. Or maybe you described it to some colleague, who pointed out a weak joint. Or the meaning of your terms is disagreed upon, and so your point is either trivially wrong or trivially right.

Now you don’t have anything.

In truth, something probably came out of it all. You know that this line of reasoning fails, for one. Not that it’s of use to anyone except you, as you can’t even extract a crystallized reason for the failure. Maybe it salvageable, although you don’t see how. Maybe there’s no way to show what you endeavored to show, for the decidedly pedestrian reason that it’s false. But once again, how are you supposed to argue for that? Even the influence of this failure on what the terms should mean is unclear.

Melancholic, you think back to your PhD in Theoretical Computer Science, where problems abided by the laws of Mathematics. It didn’t made them easier to solve, for sure, but it offered more systematic way to check solutions; better guarantees of correctness; and the near certainty that no one could take it away from you.

Maybe that’s were you went wrong: you didn’t make things formal from the start. No wonder you got nowhere.

So you go back to your notebook, but this time equations, formulas and graphs take most of the space.

It’s a Trap!

As a deconfusion researcher, and as part of a notoriously un-paradigmatic field lacking a clear formalization, I feel like that regularly. From where I come from, math just looks more like research and actually doing work, instead of simply talking about stuff. And there is definitely a lot of value in formalization and its use to unveil confusing parts of what we’re investigating.

What I object against is the (often unconscious) belief that only the formalization matters, and that we should go as fast as possible from the pre-formal state into real math. Thinking that way (as I’m wont to do) leads to problems described in this post: favoring an inaccurate formalization over the informal real deal (and forgetting the later), as well as misapplying standards from Mathematics to deconfusion.

This is probably not an issue for everyone (as I discuss in the last section on equal and opposite advice), but I believe it can serve to anyone who like me comes from a Math/​Theoretical Computer Science background, or ends up believing they need to master heaps of math to contribute in any regard to Alignment.

Thanks to John S. Wentworth and Scott Garrabrant for discussions that helped me realize many of these points and articulate them.

Two Traps of Formalization

Obsessive Formalization

What I call obsessive formalization is this almost pathological need to clarify the muddy discussions we have into mathematical statements. In reasonable doses this impulse underlies the entire drive for deconfusion; yet it can easily go to far.

What this looks like is rushing to produce a very ad-hoc mathematical definition/​formula, and then make it the object of study. Once this is done, we’re back in the more trustworthy realm of formal science, and we can explore at will the properties of this definition, as well as what it entails.

The only issue is that we care about the idea to deconfuse, not the formalization! In the relief to be formal once again, it’s so easy to not check thoroughly whether the proposed definition indeed fits the concept it aimed to capture.

I’ve been guilty of this in my Behavorial Sufficient Statistics for Goal-Directedness post, as pointed out by John (our subsequent discussion pushed him to write a great post on the dangers of ad-hoc mathematical definitions – at least something useful came out of my mistake).

Another trouble with this approach: in the rush to formalize, we might forbid to tease constraints and insights from the applications (which, remember, drive deconfusion). Here to I have to atone for my sins, as I basically wrote an entire post to argue that I didn’t really need to look at applications for goal-directedness to deconfuse it (turn out I was wrong! -- new post incoming about this)

What it all boils down to is to not fall in love with any formalization (kill your darlings, as Stephen King said). By itself, this is enough to remove most of the problems I mentioned: without this bias, we actually search for falsifications, and go back again and again to the intuitions and applications motivating the deconfusion.

Unfortunately, it’s also really hard. For a certain group of people (me included), we get attached to any cool formalization we build. So I don’t have a panacea to offer you. Instead, the best I can give is a bunch of heuristics I got from great deconfusion researchers, and some I use myself:

  • Check whether you have an internal feeling of certainty about your formalization, and be extremely suspicious of it. Not that any formalization that isn’t completely argued for should be dismissed immediately. The point here is instead to compensate the overconfidence that a clean formalization gives you.

  • Make a pre-mortem. Imagine that your formalization is wrong, and look for where it fails. This works better if you manage to roleplay actually believing that it doesn’t work.

    • For example I recently looked for cases where my current formalization of goal-directedness gives a very close call between a goal and its opposite.

  • Refuse to go formal for some time. I got this heuristic from Scott Garrabrant, who explained to me how what he was doing involved the kind of abstraction and generalization common in category theory, with the difference that they do everything formally where he keep the ideas and problems vague and underdefined until everything kind of crystallize in its final formal shape.

    • Extensive definitions (lists of positive and negative examples) work nicely as intuitions pumps in a pre-formal setting. Trying to explain the concept to deconfuse without resorting to any math can also be enlightening.

  • Be application-specific. Instead of capturing the intuitions present in our confused understanding, John proposes to start with one of the applications and only focus on formalizing the concept for this specific purpose. Among other things, this means reverse-engineering what is needed for the application, the hard parts, and the constraints it entail. A later step is to attempt unification of the different formalization for the many applications.

  • Create many different toy model, if possible contradictory As an extension of the previous point, if you manage to create variously interesting and contradictory formalization, it becomes obvious that you can’t just choose one without further investigation and thinking. That’s also a great way to find where two intuitions disagree.

Inadapted Standards

Let’s say that you don’t fall head over heels for your definition, and keep your head cool. There is still a trap waiting for you: applying mathematical standards to the formalization, instead of deconfusion ones.

Complexity

The most obvious discrepancy is complexity: in my experience, most deconfusion only require basic mathematical objects. Concepts like graphs, probability distributions, a handful of geometry and analysis. None of this sounds exciting for someone aware of the fascinating world of modern mathematics: no complex and fascinating objects like simplicial complexes or sheaves.

See some of the most interesting recent formal deconfusions in AI Alignment:

(Potential counterexample: Infrabayesianism was formalized with measure theory and functional analysis, hardly basic math. I think this has to do with the mathematical complexity of the applications: Vanessa’s research is particularly formal, and so its requirement deconfusion might require more powerful structure. That being said, it’s still possible that Infrabayesianism could have been formalized through more basic means – I’m not a master enough of it to have an informed opinion).

A subtlety is that while complex mathematical objects rarely seem valuable in the formalization step, they can prove crucial in studying the result of this formalization. An analogy in mathematics is with the relatively simple and obvious statements of number theory which are studied through the subtleties of complex analysis.

Elegance

Next discrepancy: elegance. Anyone who practiced math or related disciplines has heard this word, usually in opposition with ad-hoc or messy. Elegance in the mathematical sense points at minimalism, doing much with a little, a general lack of clutter and details.

By all mean, if your formalization for deconfusion is of magnificent elegance, that’s great! What I object to is the requirement of elegance. Between elegance and accuracy, the deconfusion researcher, just like the scientist shall prefer the latter.

If an elegant model loses some crucial detail about the concept, it is a point against it being a complete formalization. Maybe this model can serve as a simplified version, but this discrepancy should always be kept in mind.

Generalization

Lastly, we have generalization. I’m using the word in a very intuitive sense: generalization as applying to more contexts and requiring less conditions. Interestingly, even in math I feel that overgeneralization tend to be often considered, if not a sin, at least a distraction.

In the same way, while we want a certain level of generalization in deconfusion, we rarely aim for literally the most general statement. If one has a formalization in terms of Euclidean spaces, in a setting where that’s the only kind of spaces that will appear, moving to topological spaces and looking for the exact axioms required goes a bit overboard.

I honestly don’t feel like generalization is too often an issue in published deconfusion. But this standard has a perverse effect on researchers and newcomers interested in the field: they end up believing they need the means to tackle the most general questions, where, as I argued above, the only requirement are quite simple undergraduate mathematics.

Equal And Opposite Advice

It’s customary around here to mention the Law of Equal and Opposite Advice in every post like this. I actually think I can do better, by explicitly calling out the group of people for which I believe my advice to be most detrimental: researchers with background in philosophy and social science (not economy).

Why? Because this post explores the trappings of wanting to formalize too much – of the mathematical pull on deconfusion. If you come from a field where formalization is rare or frowned upon, I instead encourage you to try to make more mathematical models. Your suspicion for formalization already saves you from some of my caveats; and I hope my discussion of complexity and generalization will encourage you to try to play with simple models.

Conclusion

Formalization provides an invaluable tool for deconfusion: making intuitions and ideas so concrete and precise than they can be judged and investigated without constantly changing faces. But formalization is only a tool, not the most important piece of the puzzle. In almost every case where formalization is prioritized, problems arise: we fall in love with our ad-hoc mathematical definitions; we forget to falsify our models; we apply standards of complexity, elegance and generalization from Mathematics instead of the ones for deconfusion.