Director of AI research at ALTER, where I lead a group working on the learning-theoretic agenda for AI alignment. I’m also supported by the LTFF. See also LinkedIn.
E-mail: {first name}@alter.org.il
This post proposes an approach to decision theory in which the notion of “actions” is emergent. Instead of having an ontologically fundamental notion of actions, the agent just has beliefs, and some of them are self-fulfilling prophecies. For example, the agent can discover that “whenever I believe my arm will move up/down, my arm truly moves up/down”, and then exploit this fact by moving the arm in the right direction to maximize utility. This works by having a “metabelief” (a mapping from beliefs to beliefs; my terminology, not the OP’s) and allowing the agent to choose its belief out of the metabelief fixed points.
The next natural question is then: can we indeed demonstrate that an agent will learn which part of the world it controls, under reasonable conditions? Abram implies that it should be possible if we only allow choice among attractive fixed points. He then bemoans the need for this restriction and tries to use ideas from Active Inference to fix it, with limited success. Personally, I don’t understand what’s so bad about staying with the attractive fixed points.
Unfortunately, this post avoids spelling out a sequential version of the decision theory, which would be necessary to actually establish any learning-theoretic result. However, I think that I see how it can be done, and it seems to support Abram’s claims. Details follow.
Let’s suppose that the agent observes two systems, each of which can be in one of two positions. At each moment of time, it observes an element of , where . The agent believes it can control one of and whereas the other is a fair coin. However, it doesn’t know which is which.
In this case, metabeliefs are mappings of type . Specifically, we have a hypothesis that asserts is controllable, a hypothesis that asserts is controllable and the overall metabelief is (say) .
The hypothesis is defined by
Here, , , , and is some “motor response function”, e.g. .
Similarly, is defined by
Now, let be an attractive fixed point of and consider some history . If the statistics of in seem biased towards whereas the statistics of in seem like a fair coin, then the likelihoods will satisfy and hence will be close to and therefore will be close to (since is an attractive fixed point). On the other hand, in the converse situation, the likelihoods will satisfy and hence will be close to . Hence, the agent effectively updates on the observed history and will choose some fixed point which controls the available degrees of freedom correctly.
Notice that all of this doesn’t work with repelling fixed points. Indeed, if we used then would have a unique fixed point and there would be nothing to choose.
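To make the attractive/repelling distinction concrete, here is a minimal numerical sketch. The logistic `motor_response` and its steepness `k` are hypothetical stand-ins chosen for illustration (the formula from the comment itself is not reproduced here), and the one-dimensional belief `p` is a simplification of the two-system setup. The sketch finds the fixed points of the map and classifies each one by the slope at that point.

```python
import math

def motor_response(p, k=8.0):
    """Hypothetical motor response: maps the believed probability p of
    'arm up' to the probability that the arm actually moves up."""
    return 1.0 / (1.0 + math.exp(-k * (p - 0.5)))

def bisect(g, lo, hi, iters=80):
    """Locate a sign change of g inside [lo, hi] by bisection."""
    lo_positive = g(lo) > 0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if (g(mid) > 0) == lo_positive:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def fixed_points(f, cells=997):
    """Fixed points of f on [0, 1], found as sign changes of f(p) - p.
    An odd number of grid cells keeps p = 0.5 off the grid."""
    g = lambda p: f(p) - p
    grid = [i / cells for i in range(cells + 1)]
    return [bisect(g, a, b) for a, b in zip(grid, grid[1:])
            if (g(a) > 0) != (g(b) > 0)]

def classify(f, p, h=1e-6):
    """A fixed point is attractive when |f'(p)| < 1, repelling otherwise."""
    slope = (f(p + h) - f(p - h)) / (2 * h)
    return "attractive" if abs(slope) < 1 else "repelling"

def iterate(f, p, steps=200):
    """Repeatedly apply f: beliefs drift toward an attractive fixed point."""
    for _ in range(steps):
        p = f(p)
    return p

points = fixed_points(motor_response)
labels = [classify(motor_response, p) for p in points]
```

With this particular choice the map has two attractive fixed points (a belief close to "down" and a belief close to "up") and a repelling one at 0.5. Iterating from any starting belief other than 0.5 lands on one of the attractive ones, which is what lets beliefs near such a point stay near it after a small perturbation.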
I find these ideas quite intriguing and am likely to keep thinking about them!
I feel that coherence arguments, broadly construed, are a reason to be skeptical of such proposals, but debating coherence arguments because of this seems backward. Instead, we should just be discussing your proposal directly. Since I haven’t read your proposal yet, I don’t have an opinion, but some coherence-inspired questions I would be asking are:
Can you define an incomplete-preferences AIXI consistent with this proposal?
Is there an incomplete-preferences version of RL regret bound theory consistent with this proposal?
What happens when your agent is constructing a new agent? Does the new agent inherit the same incomplete preferences?
This post tries to push back against the role of expected utility theory in AI safety by arguing against various ways to derive expected utility axiomatically. I have heard many such arguments before, and IMO they are never especially useful. This post is no exception.
The OP presents the position it argues against as follows (in my paraphrasing): “Sufficiently advanced agents don’t play dominated strategies, therefore, because of [theorem], they have to be expected utility maximizers, therefore they have to be goal-directed and [other conclusions]”. They then proceed to argue that there is no theorem that can make this argument go through.
I think that this entire framing is attacking a weak man. The real argument for expected utility theory is:
In AI safety, we are from the get-go interested in goal-directed systems because (i) we want AIs to achieve goals for us (ii) we are worried about systems with bad goals and (iii) stopping systems with bad goals is also a goal.
The next question is then: what is a useful mathematical formalism for studying goal-directed systems?
The theorems quoted in the OP are moderate evidence that expected utility has to be part of this formalism, because their assumptions resonate a lot with our intuitions for what “rational goal-directed behavior” is. Yes, of course we can still quibble with the assumptions (like the OP does in some cases), which is why I say “moderate evidence” rather than “completely watertight proof”, but given how natural the assumptions are, the evidence is good.
More importantly, the theorems are only a small part of the evidence base. A philosophical question is never fully answered by a single theorem. Instead, the evidence base is holistic: looking at the theoretical edifices growing up from expected utility (control theory, learning theory, game theory etc) one becomes progressively more and more convinced that expected utility correctly captures some of the core intuitions behind “goal-directedness”.
If one does want to present a convincing case against expected utility, quibbling with the assumptions of VNM or whatnot is an incredibly weak move. Instead, show us where the entire edifice of existing theory runs aground because of expected utility and how some alternative to expected utility can do better (as an analogy, see how infra-Bayesianism supplants Bayesian decision theory).
In conclusion, there are coherence theorems. But, more important than individual theorems are the “coherence theories”.
There are plenty of examples in fiction of greed and hubris leading to a disaster that takes down its own architects. The dwarves who mined too deep and awoke the Balrog, the creators of Skynet, Peter Isherwell in “Don’t Look Up”, Frankenstein and his Creature...
I kinda agree with the claim, but disagree with its framing. You’re imagining that peer pressure is something extraneous to the person’s core personality, which they want to resist but usually fail. Instead, the desire to fit in, to be respected, liked and admired by other people, is one of the core desires that most (virtually all?) people have. It’s approximately on the same level as e.g. the desire to avoid pain. So, people don’t “succumb to peer pressure”, they (unconsciously) choose to prioritize social needs over other considerations.
At the same time, the moral denouncing of groupthink is mostly a self-deception defense against hostile telepaths. With two important caveats:
Having “independent thinking” as part of the ethos of a social group is actually beneficial for that group’s ability to discover true things. While the members of such a group still feel the desire to be liked by other members, they also have the license to disagree without being shunned for it, and are even rewarded for interesting dissenting opinions.
Hyperbolic discounting seems to be real, i.e. human preferences are time-inconsistent. For example, you can be tempted to eat candy when one is placed in front of you, while also taking measures to avoid such temptation in the future. Something analogous might apply to peer pressure.
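The time-inconsistency in the second caveat can be illustrated with a toy calculation. This is a sketch assuming the standard hyperbolic form, value = amount / (1 + k · delay); the reward sizes, the times and the rates are made-up numbers, not data.

```python
def hyperbolic_value(amount, delay, k=1.0):
    """Present value under hyperbolic discounting: amount / (1 + k * delay)."""
    return amount / (1.0 + k * delay)

def exponential_value(amount, delay, rate=0.5):
    """Present value under exponential discounting, for comparison."""
    return amount * (1.0 - rate) ** delay

def preferred(value_fn, now):
    """Which reward looks better when evaluated at time `now`?
    Small reward: 5 units at t=10. Large reward: 10 units at t=13."""
    small = value_fn(5.0, 10 - now)
    large = value_fn(10.0, 13 - now)
    return "small-soon" if small > large else "large-late"

# Far in advance the hyperbolic discounter plans to wait for the large reward...
early_choice = preferred(hyperbolic_value, now=0)
# ...but once the small reward is immediately available, the ranking flips.
late_choice = preferred(hyperbolic_value, now=10)
```

The exponential discounter ranks the two rewards the same way at both times, because the ratio of the two discounted values does not depend on `now`. The hyperbolic one reverses its preference, which is why precommitment ("taking measures to avoid such temptation") makes sense for it.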
This remains the best overview of the learning-theoretic agenda to date. As a complementary pedagogic resource, there is now also a series of video lectures.
Since the article was written, there have been several new publications:
Gergely Szűcs’s article on interpreting quantum mechanics using infra-Bayesian physicalism.
My paper on linear infra-Bayesian bandits.
An article on infra-Bayesian haggling by my MATS scholar Hanna Gabor. This approach to multi-agent systems did not exist when the overview was written, and currently seems like the most promising direction.
An article on time complexity in string machines by my MATS scholar Ali Cataltepe. This is a rather elegant method to account for time complexity in the formalism.
In addition, some new developments were briefly summarized in short-forms:
A proposed solution for the monotonicity problem in infra-Bayesian physicalism. This is potentially very important since the monotonicity problem was by far the biggest issue with the framework (and as a consequence, with PSI).
Multiple developments concerning metacognitive agents (see also recorded talk). This framework seems increasingly important, but an in-depth analysis is still pending.
A conjecture about a possible axiomatic characterization of the maximin decision rule in infra-Bayesianism. If true, it would go a long way to allaying any concerns about whether maximin is the “correct” choice.
Ambidistributions: a cute new mathematical gadget for formalizing the notion of “control” in infra-Bayesianism.
Meanwhile, active research proceeds along several parallel directions:
I’m working towards the realization of the “frugal compositional languages” dream. So far, the problem is still very much open, but I obtained some interesting preliminary results which will appear in an upcoming paper (codename: “ambiguous online learning”). I also realized this direction might have tight connections with categorical systems theory (the latter being a mathematical language for compositionality). An unpublished draft was written by my MATS scholars on the subject of compositional polytope MDPs, hopefully to be completed some time during ’25.
Diffractor achieved substantial progress in the theory of infra-Bayesian regret bounds, producing an infra-Bayesian generalization of decision-estimation coefficients (the latter is a nearly universal theory of regret bounds in episodic RL). This generalization has important connections to Garrabrant induction (of the flavor studied here), finally sketching a unified picture of these two approaches to “computational uncertainty” (Garrabrant induction and infra-Bayesianism). Results will appear in an upcoming paper.
Gergely Szűcs is studying the theory of hidden rewards, starting from the realization in this short-form (discovering some interesting combinatorial objects beyond what was described there).
It remains true that there are more shovel-ready open problems than researchers, and hence the number of (competent) researchers is still the bottleneck.
Seems right, but is there a categorical derivation of the Wentworth-Lorell rules? Maybe they can be represented as theorems of the form: given an arbitrary Markov category C, such-and-such identities between string diagrams in C imply (more) identities between string diagrams in C.
This article studies a potentially very important question: is improving connectomics technology net harmful or net beneficial from the perspective of existential risk from AI? The author argues that it is net beneficial. Connectomics seems like it would help with understanding the brain’s reward/motivation system, but not so much with understanding the brain’s learning algorithms. Hence it arguably helps more with AI alignment than AI capability. Moreover, it might also lead to accelerating whole brain emulation (WBE) which is also helpful.
The author mentions 3 reasons why WBE is helpful:
We can let WBEs work on alignment.
We can more easily ban de novo AGI by letting WBEs fill its economic niche.
Maybe we can derive aligned superintelligence from modified WBEs.
I think there is another reason: some alignment protocols might rely on letting the AI study a WBE and use it for e.g. inferring human values. The latter might be viable even if actually running the WBE is too slow to be useful with contemporary technology.
I think that performing this kind of differential benefit analysis for various technologies might be extremely important, and I would be glad to see more of it on LW/AF (or anywhere).
This article studies a natural and interesting mathematical question: which algebraic relations hold between Bayes nets? In other words, if a collection of random variables is consistent with several Bayes nets, what other Bayes nets does it also have to be consistent with? The question is studied both for exact consistency and for approximate consistency: in the latter case, the joint distribution is KL-close to a distribution that’s consistent with the net. The article proves several rules of this type, some of them quite non-obvious. The rules have concrete applications in the authors’ research agenda.
Some further questions that I think would be interesting to study:
Can we derive a full classification of such rules?
Is there a category-theoretic story behind the rules? Meaning, is there a type of category for which Bayes nets are something akin to string diagrams and the rules follow from the categorical axioms?
Tbf, you can fit a quadratic polynomial to any 3 points. But triangular numbers are certainly an aesthetically pleasing choice. (Maybe call it “triangular voting”?)
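Both halves of that remark are easy to check. The sketch below uses Lagrange interpolation (one possible method among several) with exact rational arithmetic to fit a quadratic through three arbitrary points, and then confirms that fitting the first three triangular numbers reproduces the later ones. The sample points are arbitrary.

```python
from fractions import Fraction

def quadratic_through(points):
    """Return the unique polynomial of degree <= 2 through three points
    with distinct x-coordinates, as a callable (Lagrange form)."""
    (x0, y0), (x1, y1), (x2, y2) = [(Fraction(x), Fraction(y)) for x, y in points]

    def poly(x):
        x = Fraction(x)
        return (y0 * (x - x1) * (x - x2) / ((x0 - x1) * (x0 - x2))
                + y1 * (x - x0) * (x - x2) / ((x1 - x0) * (x1 - x2))
                + y2 * (x - x0) * (x - x1) / ((x2 - x0) * (x2 - x1)))

    return poly

def triangular(n):
    """n-th triangular number: 1 + 2 + ... + n."""
    return n * (n + 1) // 2

# Any three points with distinct x values are matched exactly.
arbitrary = [(1, 7), (2, -3), (5, 4)]
fit = quadratic_through(arbitrary)

# Fitting the first three triangular numbers reproduces all the later ones,
# since n(n+1)/2 is itself a quadratic.
tri_fit = quadratic_through([(n, triangular(n)) for n in (1, 2, 3)])
```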
I feel that this post would benefit from having the math spelled out. How is inserting a trader a way to do feedback? Can you phrase classical RL like this?
P(GPT-5 Release)
What is the probability that OpenAI will release GPT-5 before the end of 2025? “Release” means that a random member of the public can use it, possibly paid.
Does this require a product called specifically “GPT-5”? What if they release e.g. “OpenAI o2” instead, and there will never be something called GPT-5?
Number of Current Partners
(for example, 0 if you are single, 1 if you are in a monogamous relationship, higher numbers for polyamorous relationships)
This is a confusing phrasing. If you have 1 partner, it doesn’t mean your relationship is monogamous. A monogamous relationship is one in which there is a mutually agreed understanding that romantic or sexual interaction with other people is forbidden. Without this, your relationship is not monogamous. For example:
You have only one partner, but your partner has other partners.
You have only one partner, but you occasionally have one-night stands with other people.
You have only one partner, but both you and your partner are open to you having more partners in the future.
All of the above are not monogamous relationships!
I’ve been thinking along very similar lines for a while (my inside name for this is “mask theory of the mind”: consciousness is a “mask”). But my personal conclusion is very different. While self-deception is a valid strategy in many circumstances, I think that it’s too costly when trying to solve an extremely difficult high-stakes problem (e.g. stopping the AI apocalypse). Hence, I went in the other direction: trying to self-deceive little, and instead be self-honest about my[1] real motivations, even if they are “bad PR”. In practice, this means never making excuses to myself such as “I wanted to do A, but I didn’t have the willpower so I did B instead”, but rather owning the fact I wanted to do B and thinking how to integrate this into a coherent long-term plan for my life.
My solution to “hostile telepaths” is dividing other people into ~3 categories:
People that are adversarial or untrustworthy, either individually or as representatives of the system on behalf of which they act. With such people, I have no compunction to consciously lie (“the Jews are not in the basement… I packed the suitcase myself...”) or act adversarially.
People that seem cooperative, so that they deserve my good will even if not complete trust. With such people, I will be at least metahonest: I will not tell direct lies, and I will be honest about the circumstances in which I’m honest (i.e. reveal all relevant information). More generally, I will act cooperatively towards such people, expecting them to reciprocate. My attitude towards this group is that I don’t need to pretend to be something other than I am to gain cooperation, I can just rely on their civility and/or (super)rationality.
Inner circle: People that have my full trust. With them I have no hostile telepath problem because they are not hostile. My attitude towards this group is that we can resolve any difference by putting all the cards on the table and doing whatever is best for the group in aggregate.
Moreover, having an extremely difficult high-stakes problem is not just a strong reason to self-deceive less, it’s also a strong reason to become more truth-oriented as a community. This means that people with such a common cause should strive to put each other at least in category 2 above, tentatively moving towards 3 (with the caveat of watching out for bad actors trying to exploit that).
[1] While making sure to use the word “I” to refer to the elephant/unconscious-self and not to the mask/conscious-self.
Two thoughts about the role of quining in IBP:
Quines are non-unique (there can be multiple fixed points). This means that, viewed as a prescriptive theory, IBP produces multi-valued prescriptions. It might be the case that this multi-valuedness can resolve problems with UDT such as Wei Dai’s 3-player Prisoner’s Dilemma and the anti-Newcomb problem[1]. In these cases, a particular UDT/IBP (corresponding to a particular quine) loses to CDT. But, a different UDT/IBP (corresponding to a different quine) might do as well as CDT.
What to do about agents that don’t know their own source-code? (Arguably humans are such.) Upon reflection, this is not really an issue! If we use IBP prescriptively, then we can always assume quining: IBP is just telling you to follow a procedure that uses quining to access its own (i.e. the procedure’s) source code. Effectively, you are instantiating an IBP agent inside yourself with your own prior and utility function. On the other hand, if we use IBP descriptively, then we don’t need quining: Any agent can be assigned “physicalist intelligence” (Definition 1.6 in the original post, can also be extended to not require a known utility function and prior, along the lines of ADAM) as long as the procedure doing the assigning knows its source code. The agent doesn’t need to know its own source code in any sense.
I just read Daniel Boettger’s “Triple Tragedy And Thankful Theory”. There he argues that the thrival vs. survival dichotomy (or at least its implications on communication) can be understood as time-efficiency vs. space-efficiency in algorithms. However, it seems to me that a better parallel is bandwidth-efficiency vs. latency-efficiency in communication protocols. Thrival-oriented systems want to be as efficient as possible in the long-term, so they optimize for bandwidth: enabling the transmission of as much information as possible over any given long period of time. On the other hand, survival-oriented systems want to be responsive to urgent interrupts which leads to optimizing for latency: reducing the time it takes between a piece of information appearing on one end of the channel and that piece of information becoming known on the other end.
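The bandwidth/latency tension can be seen in a toy batching model. This is only a sketch under made-up assumptions: messages arrive one per time unit, sending a batch costs a fixed `overhead` plus `per_message` for each item, and all numbers are illustrative rather than taken from any real protocol.

```python
def channel_stats(batch_size, overhead=10.0, per_message=1.0, arrival_gap=1.0):
    """Toy channel that waits for a full batch and then sends it at once.
    Returns (throughput, worst_case_latency) in arbitrary units."""
    send_time = overhead + per_message * batch_size
    # Messages delivered per unit of sending time: grows with batch size
    # because the fixed overhead is amortized over more messages.
    throughput = batch_size / send_time
    # The first message in a batch waits for the rest to arrive, then for the send.
    worst_case_latency = arrival_gap * (batch_size - 1) + send_time
    return throughput, worst_case_latency

small_batches = channel_stats(batch_size=1)    # latency-optimized ("survival")
large_batches = channel_stats(batch_size=100)  # bandwidth-optimized ("thrival")
```

Under these assumptions, large batches deliver roughly ten times more messages per unit of sending time, while the worst-case wait for any single message grows from 11 to 209 time units.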
I believe that all or most of the claims here are true, but I haven’t written all the proofs in detail, so take it with a grain of salt.
Ambidistributions are a mathematical object that simultaneously generalizes infradistributions and ultradistributions. They are useful for representing how much power an agent has over a particular system: which degrees of freedom it can control, which degrees of freedom obey a known probability distribution and which are completely unpredictable.
Definition 1: Let $X$ be a compact Polish space. A (crisp) ambidistribution on $X$ is a function $Q: C(X) \to \mathbb{R}$ s.t.
(Monotonicity) For any $f, g \in C(X)$, if $f \le g$ then $Q(f) \le Q(g)$.
(Homogeneity) For any $f \in C(X)$ and $\lambda \ge 0$, $Q(\lambda f) = \lambda Q(f)$.
(Constant-additivity) For any $f \in C(X)$ and $c \in \mathbb{R}$, $Q(f + c) = Q(f) + c$.
Conditions 1+3 imply that $Q$ is 1-Lipschitz. We could introduce non-crisp ambidistributions by dropping conditions 2 and/or 3 (and e.g. requiring 1-Lipschitz instead), but we will stick to crisp ambidistributions in this post.
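The 1-Lipschitz claim follows in a few lines. The notation here is an assumption made for the derivation: $Q$ denotes the ambidistribution, $f, g$ are continuous real-valued functions on the space, and $\|\cdot\|_\infty$ is the supremum norm.

```latex
\begin{align*}
f &\le g + \|f - g\|_\infty
  && \text{(pointwise)} \\
Q(f) &\le Q\bigl(g + \|f - g\|_\infty\bigr)
  && \text{(monotonicity)} \\
     &= Q(g) + \|f - g\|_\infty
  && \text{(constant-additivity)}
\end{align*}
```

Swapping the roles of $f$ and $g$ gives the reverse inequality, so $|Q(f) - Q(g)| \le \|f - g\|_\infty$.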
The space of all ambidistributions on will be denoted .[1] Obviously, (where stands for (crisp) infradistributions), and likewise for ultradistributions.
Example 1: Consider compact Polish spaces and a continuous mapping . We can then define by
That is, is the value of the zero-sum two-player game with strategy spaces and and utility function .
Notice that in Example 1 can be regarded as a Cartesian frame: this seems like a natural connection to explore further.
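As a sanity check on Example 1, here is a small finite sketch. It assumes pure strategies and the max-min order of play, i.e. the functional that sends a utility function `f` to the maximum over the first player's strategies of the minimum over the second player's strategies of `f` applied to the outcome. The exact formula is not reproduced above, so this form, the names `A`, `B`, `X`, `F`, `Q`, and the toy outcome table are all assumptions. The code checks the three conditions of Definition 1 on random test functions.

```python
import random

A = [0, 1, 2]                 # maximizer's strategies (illustrative)
B = [0, 1]                    # minimizer's strategies (illustrative)
X = ["x0", "x1", "x2", "x3"]  # outcome space

# Hypothetical outcome map F: A x B -> X.
F = {(0, 0): "x0", (0, 1): "x1",
     (1, 0): "x2", (1, 1): "x2",
     (2, 0): "x3", (2, 1): "x0"}

def Q(f):
    """Maximin value of the game whose utility is f composed with F."""
    return max(min(f[F[a, b]] for b in B) for a in A)

def check_axioms(trials=1000, seed=0, eps=1e-9):
    """Test monotonicity, homogeneity and constant-additivity numerically."""
    rng = random.Random(seed)
    for _ in range(trials):
        f = {x: rng.uniform(-5, 5) for x in X}
        g = {x: f[x] + rng.uniform(0, 3) for x in X}   # g >= f pointwise
        lam, c = rng.uniform(0, 4), rng.uniform(-4, 4)
        if Q(f) > Q(g) + eps:                                       # monotonicity
            return False
        if abs(Q({x: lam * f[x] for x in X}) - lam * Q(f)) > eps:   # homogeneity
            return False
        if abs(Q({x: f[x] + c for x in X}) - (Q(f) + c)) > eps:     # constant-additivity
            return False
    return True

example_value = Q({"x0": 0.0, "x1": 1.0, "x2": 0.5, "x3": 2.0})
```

In the example the maximizer does best with strategy 1, which forces outcome `x2` regardless of what the minimizer does. That is the sense in which such an object records which degrees of freedom one side controls.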
Example 2: Let and be finite sets representing actions and observations respectively, and be an infra-Bayesian law. Then, we can define by
In fact, this is a faithful representation: can be recovered from .
Example 3: Consider an infra-MDP with finite state set , initial state and transition infrakernel . We can then define the “ambikernel” by
Thus, every infra-MDP induces an “ambichain”. Moreover:
Claim 1: is a monad. In particular, ambikernels can be composed.
This allows us to define
This object is the infra-Bayesian analogue of the convex polytope of accessible state occupancy measures in an MDP.
Claim 2: The following limit always exists:
Definition 3: Let be a convex space and . We say that occludes when for any , we have
Here, stands for convex hull.
We denote this relation . The reason we call this “occlusion” is apparent for the case.
Here are some properties of occlusion:
For any , .
More generally, if then .
If and then .
If and then .
If and for all , then .
If for all , and also , then .
Notice that occlusion has similar algebraic properties to logical entailment, if we think of as ” is a weaker proposition than ”.
Definition 4: Let be a compact Polish space. A cramble set[2] over is s.t.
is non-empty.
is topologically closed.
For any finite and , if then . (Here, we interpret elements of as credal sets.)
Question: If instead of condition 3, we only consider binary occlusion (i.e. require , do we get the same concept?
Given a cramble set , its Legendre-Fenchel dual ambidistribution is
Claim 3: Legendre-Fenchel duality is a bijection between cramble sets and ambidistributions.
The space is equipped with the obvious partial order: when for all . This makes into a distributive lattice, with
This is in contrast to which is a non-distributive lattice.
The bottom and top elements are given by
Ambidistributions are closed under pointwise suprema and infima, and hence is complete and satisfies both infinite distributive laws, making it a complete Heyting and co-Heyting algebra.
is also a De Morgan algebra with the involution
For , is not a Boolean algebra: and for any we have .
One application of this partial order is formalizing the “no traps” condition for infra-MDPs:
Definition 2: A finite infra-MDP is quasicommunicating when for any
Claim 4: The set of quasicommunicating finite infra-MDPs (or even infra-RDPs) is learnable.
Going to the cramble set representation, iff .
is just , whereas is the “occlusion hull” of and .
The bottom and the top cramble sets are
Here, is the top element of (corresponding to the credal set .
The De Morgan involution is
Definition 5: Given compact Polish spaces and a continuous mapping , we define the pushforward by
When is surjective, there are both a left adjoint and a right adjoint to , yielding two pullback operators :
Given and we can define the semidirect product by
There are probably more natural products, but I’ll stop here for now.
Definition 6: The polytopic ambidistributions are the (incomplete) sublattice of generated by .
Some conjectures about this:
For finite , an ambidistribution is polytopic iff there is a finite polytope complex on s.t. for any cell of , is affine.
For finite , a cramble set is polytopic iff it is the occlusion hull of a finite set of polytopes in .
and from Example 3 are polytopic.
One reason to doubt chaos theory’s usefulness is that we don’t need fancy theories to tell us something is impossible. Impossibility tends to make itself obvious.
This claim seems really weird to me. Why do you think that’s true? A lot of things we accomplished with technology today might seem impossible to someone from 1700. On the other hand, you could have thought that e.g. perpetuum mobile, or superluminal motion, or deciding whether a graph is 3-colorable in worst-case polynomial time, or transmitting information with a rate higher than Shannon-Hartley is possible if you didn’t know the relevant theory.
This post argues that, while it’s traditional to call policies trained by RL “agents”, there is no good reason for it and the terminology does more harm than good. IMO Turner has a valid point, but he takes it too far.
What is an “agent”? Unfortunately, this question is not discussed in the OP in any detail. There are two closely related informal approaches to defining “agents” that I like, one more axiomatic / black-boxy and the other more algorithmic / white-boxy.
The algorithmic definition is: An agent is a system that can (i) learn models of its environment (ii) use learned models to generate plans towards a particular goal (iii) execute these plans.
Under this definition, is an RL policy an “agent”? Not necessarily. There is a much stronger case for arguing that the RL algorithm, including the training procedure, is an agent. Indeed, such an algorithm (i) learns a model of the environment (at least if it’s model-based RL: if it’s model-free it might still do so implicitly, but it’s less clear) (ii) generates a plan (the policy) (iii) executes the plan (when the policy is executed, i.e. at inference/deployment time). Whether the policy in itself is an agent amounts to asking whether the policy is capable of in-context RL (which is far from obvious). Moreover, the case for calling the system an agent is stronger when it learns online and weaker (but not completely gone) when there is a separation into non-overlapping training and deployment phases, as is often done in contemporary systems.
The axiomatic definition is: An agent is a system that effectively pursues a particular goal in an unknown environment. That is, it needs to perform well (as measured by achieving the goal) when placed in a large variety of different environments.
With this definition we reach similar conclusions. An online RL system would arguably adapt to its environment and optimize towards achieving the goal (which is maximizing the reward). A trained policy will not necessarily do it: if it was trained in a particular environment, it can become completely ineffective in other environments!
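The contrast between an online learner and a frozen policy can be seen in a deliberately tiny example. This is a sketch with a two-armed deterministic bandit; the environments, the explore-then-commit learner and all names are hypothetical and chosen only for illustration.

```python
def run_frozen(policy_arm, env, steps):
    """A trained policy that always plays the arm it settled on in training."""
    return sum(env[policy_arm] for _ in range(steps))

def run_online(env, steps):
    """Explore-then-commit: try each arm once, then play the best one seen."""
    total, seen = 0.0, {}
    for t in range(steps):
        arm = t if t < len(env) else max(seen, key=seen.get)
        reward = env[arm]
        seen[arm] = reward
        total += reward
    return total

train_env = [1.0, 0.0]    # arm 0 pays off during training
shifted_env = [0.0, 1.0]  # payoffs are swapped at deployment

# "Training" the frozen policy: pick the best arm in the training environment.
frozen_arm = max(range(len(train_env)), key=train_env.__getitem__)

frozen_in_train = run_frozen(frozen_arm, train_env, steps=100)
frozen_in_shifted = run_frozen(frozen_arm, shifted_env, steps=100)
online_in_train = run_online(train_env, steps=100)
online_in_shifted = run_online(shifted_env, steps=100)
```

The online learner pays a one-step exploration cost but does well in both environments, while the frozen policy does well only in the environment it was trained on. Under the axiomatic definition, only the former behaves like an agent across this (very small) class of environments.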
Importantly, even an online RL system can easily fail at agentic-ness, depending on how good its learning algorithm is at dealing with distributional shift, nonrealizability etc. Nevertheless, the relation between agency and RL is pretty direct, more so than the OP implies.