This post is a great review of the Natural Abstractions research agenda, covering both its strengths and weaknesses. It provides a useful breakdown of the key claims, the mathematical results, and the applications to alignment, and it also offers reasonable criticism.
To the weaknesses mentioned in the overview, I would also add that the agenda needs more engagement with learning theory. Since the claim is that all minds learn the same abstractions, it seems necessary to look into the process of learning and see what kinds of abstractions can or cannot be learned, both in terms of sample complexity and in terms of computational complexity.
Some thoughts about natural abstractions inspired by this post:
The concept of natural abstractions seems closely related to my informally conjectured agreement theorem for infra-Bayesian physicalism. In a nutshell, two physicalist agents in the same universe with access to "similar" information should asymptotically arrive at similar beliefs (notably, this is false for Cartesian agents because of the different biases resulting from their different physical points of view).
A possible formalization of the agreement theorem, inspired by my richness of mathematics conjecture: Given two beliefs $\Psi$ and $\Phi$, we say that $\Psi \preceq \Phi$ when some conditioning of $\Psi$ on a finite set of observations produces a refinement of some conditioning of $\Phi$ on a finite set of observations (see the linked shortform for mathematical details). This relation is a preorder. In general, we can expect an agent to learn a sequence of beliefs of the form $\Psi_0 \prec \Psi_1 \prec \Psi_2 \prec \dots$ Here, the sequence can be over physical time, over time discount, or over a parameter such as "availability of computing resources" or "how much time the world allows you for thinking between decisions": the latter is the natural asymptotic for metacognitive agents (see also logical time). Given two agents, we get two such sequences $\{\Psi_i\}$ and $\{\Phi_i\}$. The agreement theorem can then state that for all $i \in \mathbb{N}$, there exists $j \in \mathbb{N}$ s.t. $\Phi_j \preceq \Psi_i$ (and vice versa). More precisely, this relation might hold only up to some known function $\epsilon(i,j)$ s.t. $\lim_{j \to \infty} \epsilon(i,j) = 0$.
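Written out as a display (a sketch of my own; the approximate relation $\preceq_{\epsilon}$ below is just shorthand for "$\preceq$ up to error $\epsilon$", not something defined in the linked shortform), the conjectured statement is

$$\forall i \in \mathbb{N}\ \exists j \in \mathbb{N}:\quad \Phi_j \preceq \Psi_i \qquad \text{(and symmetrically with } \Psi \text{ and } \Phi \text{ exchanged)},$$

or, in the approximate version,

$$\forall i \in \mathbb{N}\ \exists j \in \mathbb{N}:\quad \Phi_j \preceq_{\epsilon(i,j)} \Psi_i, \qquad \lim_{j \to \infty} \epsilon(i,j) = 0.$$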
The "agreement" in the previous paragraph is purely semantic: the agents converge to believing in the same world, but this doesn't say anything about the syntactic structure of their beliefs. This seems conceptually insufficient for natural abstractions. However, maybe there is a syntactic equivalent where the preorder $\preceq$ is replaced by morphisms in the category of some syntactic representations (e.g. string machines). It seems reasonable to expect that agents must use such representations to learn efficiently (see also frugal compositional languages).
In this picture, the graphical models used by John are a candidate for the frugal compositional language. I think this might not be entirely off the mark, but the real frugal compositional language is probably somewhat different.