Vanessa Kosoy

Karma: 9,229

Director of AI research at ALTER, where I lead a group working on the learning-theoretic agenda for AI alignment. I’m also supported by the LTFF. See also LinkedIn.

E-mail: {first name}@alter.org.il

Vanessa Kosoy Apr 9, 2025, 6:40 AM
LW: 6 AF: 4
−4
AF
in reply to: Cole Wyeth’s comment on: abramdemski’s Shortform
I think that in 2 years we’re unlikely to accomplish anything that leaves a dent in P(DOOM), with any method, but I also think it’s more likely than not that we actually have >15 years.
As to “completing” the theory of agents, I used the phrase (perhaps perversely) in the same sense that e.g. we “completed” the theory of information: the latter exists and can actually be used for its intended applications (communication systems). Or at least in the sense we “completed” the theory of computational complexity: even though a lot of key conjectures are still unproven, we do have a rigorous understanding of what computational complexity is and know how to determine it for many (even if far from all) problems of interest.
I probably should have said “create” rather than “complete”.

Vanessa Kosoy Apr 8, 2025, 5:49 PM
LW: 5 AF: 2
0
AF
in reply to: Cole Wyeth’s comment on: abramdemski’s Shortform
(Summoned by @Alexander Gietelink Oldenziel)
I don’t understand this comment. I usually don’t think of “building a safer LLM agent” as a viable route to aligned AI. My current best guess about how to create aligned AI is Physicalist Superimitation. We can imagine other approaches, e.g. Quantilized Debate, but I am less optimistic there. More importantly, I believe that we need to complete the theory of agents first, before we can have strong confidence about which approaches are more promising.
As to heuristic implementations of infra-Bayesianism, this is something I don’t want to speculate about in public, it seems exfohazardous.

Vanessa Kosoy Apr 8, 2025, 3:16 PM
2 points
0
in reply to: Lorxus’s comment on: Some Rules for an Algebra of Bayes Nets
> Somehow being able to derive all relevant string diagram rewriting rules for latential string diagrams, starting with some fixed set of equivalences?
What are “latential” string diagrams?
What does it mean that you can’t derive them all from a “fixed” set? Do you imagine some strong claim, e.g. that the set of rewriting rules is undecidable, or something else?
> Two Bayes nets are of the same Markov equivalence class when they have precisely the same set of conditionality relations holding on them (and by extension, precisely the same undirected skeleton).
Okay, so this is not what you care about? Maybe you are saying the following: Given two diagrams X,Y, we want to ask whether any distribution compatible with X is compatible with Y. We don’t ask whether the converse also holds. This is a certain asymmetric relation, rather than an equivalence.

Vanessa Kosoy Apr 8, 2025, 11:45 AM
2 points
0
in reply to: Lorxus’s comment on: Some Rules for an Algebra of Bayes Nets
I found the above comment difficult to parse.
> that’s not a thing that really happens?
What is the thing that doesn’t happen? Reading the rest of the paragraph only left me more confused.
> we don’t quite care about Markov equivalence class
What do you mean by “Markov equivalence class”?

Vanessa Kosoy Mar 25, 2025, 8:23 AM
LW: 2 AF: 2
0
AF
in reply to: Cole Wyeth’s comment on: Vanessa Kosoy’s Shortform
Thanks for this!
What I was saying up there is not a justification of Hurwicz’ decision rule. Rather, it is that if you already accept the Hurwicz rule, it can be reduced to maximin, and for a simplicity prior the reduction is “cheap” (produces another simplicity prior).
Why accept the Hurwicz’ decision rule? Well, at least you can’t be accused of a pessimism bias there. But if you truly want to dig deeper, we can start instead from an agent making decisions according to an ambidistribution, which is a fairly general (assumption-light) way of making decisions. I believe that a similar argument (easiest to see in the LF-dual cramble set representation) would allow reducing that to maximin on infradistributions (credal sets).
To make such an approach even more satisfactory, it would be good to add a justification for a simplicity ambi/infra-prior. I think this should be possible by arguing from “opinionated agents”: the ordinary Solomonoff prior is the unique semicomputable one that dominates all semicomputable measures, which decision-theoretically corresponds to something like “having preferences about as many possible worlds as we can”. Possibly, the latter principle can be formalized into something which ends up picking out an infra-Solomonoff prior (and, replacing “computability” by a stronger condition, some other kind of simplicity infra-prior).
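For readers unfamiliar with the Hurwicz rule itself, here is a minimal sketch of how it scores an action over a finite credal set. It only illustrates the rule and how its ranking can differ from maximin; it does not implement the reduction to maximin mentioned above. The function name, the `optimism` parameter and the numbers are illustrative assumptions.

```python
from fractions import Fraction


def hurwicz_value(expected_utilities, optimism):
    """Hurwicz criterion: weighted mix of best-case and worst-case expected utility.

    `expected_utilities` lists the expected utility of one action under each
    distribution of a (finite, hypothetical) credal set. `optimism` is the
    weight on the best case; optimism = 0 recovers maximin.
    """
    worst = min(expected_utilities)
    best = max(expected_utilities)
    return optimism * best + (1 - optimism) * worst


# Hypothetical expected utilities of two actions under two distributions.
action_a = [Fraction(1, 5), Fraction(4, 5)]
action_b = [Fraction(2, 5), Fraction(1, 2)]

maximin_a = hurwicz_value(action_a, Fraction(0))
maximin_b = hurwicz_value(action_b, Fraction(0))
hurwicz_a = hurwicz_value(action_a, Fraction(1, 2))
hurwicz_b = hurwicz_value(action_b, Fraction(1, 2))

print("maximin prefers b:", maximin_b > maximin_a)
print("Hurwicz (optimism 1/2) prefers a:", hurwicz_a > hurwicz_b)
```

In this toy case the two rules rank the actions in opposite order, which is the kind of behavior a reduction has to account for.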

Vanessa Kosoy Mar 24, 2025, 9:03 AM
LW: 2 AF: 2
0
AF
in reply to: Chris van Merwijk’s comment on: Compositional language for hypotheses about computations
You now understand correctly. The reason I switch to colored operads is to add even more generality. My key use case is when the operad consists of terms-with-holes in a programming language, in which case the colors are the types of the terms/holes.

Vanessa Kosoy Mar 23, 2025, 5:53 PM
LW: 14 AF: 5
0
AF
in reply to: Vanessa Kosoy’s comment on: Vanessa Kosoy’s Shortform
The following are my thoughts on the definition of learning in infra-Bayesian physicalism (IBP), which is also a candidate for the ultimate prescriptive agent desideratum.
In general, learning of hypotheses about the physical universe is not possible because of traps. On the other hand, learning of hypotheses about computable mathematics is possible in the limit of ample computing resources, as long as we can ignore side effects of computations. Moreover, learning computable mathematics implies approximating Bayesian planning w.r.t. the prior about the physical universe. Hence, we focus on this sort of learning.
We consider an agent comprised of three modules, that we call Simulator, Learner and Controller. The agent’s history consists of two phases. In the Training phase, the Learner interacts with the Simulator, and in the end produces a program for the Controller. In the Deployment phase, the Controller runs the program.
Roughly speaking:
- The Simulator is a universal computer whose function is performing computational experiments, which we can think of as “thought experiments” or coarse-grained simulations of potential plans. It receives commands from the Learner (which computations to run / threads to start/stop) and reports to the Learner the results. We denote the Simulator’s input alphabet by $I_{S}$ and output alphabet by $O_{S}$ .
- The Learner is the machine learning (training) module. The algorithm whose desideratum we’re specifying resides here.
- The Controller (as in “control theory”) is a universal computer connected to the agent’s external interface (environmental actions $A$ and observations $O$ ). It’s responsible for real-time interaction with the environment, and we can think of it as the learned policy. It is programmed by the Learner, for which purpose it has input alphabet $I_{C}$ .
We will refer to this as the SiLC architecture.
Let $H \subseteq □ Γ$ be our hypothesis class about computable mathematics. Let $ξ : Γ \to □ 2^{Γ}$ be our prior about the physical universe^[1]. These have to satisfy the coherence conditions
$\forall y \in Γ, α \in \operatorname{supp} ξ (y) : y \in α$
$\forall y, y^{'} \in Γ, θ \in ξ (y) : χ_{y^{'} \in α} θ \in ξ (y^{'})$
$\forall Θ \in H, θ \in Θ, s : Γ \to Γ, ϕ : Γ \to Δ^{c} Γ : ϕ ⪯ ξ ⟹ \mathrm{pr}_{Γ} χ_{\mathrm{el}^{Γ}} (s \times \mathrm{id}_{2^{Γ}})_{*} (θ ⋉ ϕ) \in Θ$
Here, $ϕ ⪯ ξ$ means that $\forall y \in Γ : ϕ (y) \in ξ (y)$ .
Together, these ensure that $Θ ⋉ ξ$ is a coherent IBP hypothesis. Notice that for any $ξ_{0} : Γ \to □ 2^{Γ}$ satisfying the first condition^[2], there is a unique minimal coherent $ξ : Γ \to □ 2^{Γ}$ s.t. $\forall y \in Γ : ξ_{0} (y) \subseteq ξ (y)$ . Moreover, given a coherent $ξ$ and any $Θ_{0} \in □ Γ$ , there is a unique minimal coherent $Θ \in □ Γ$ s.t. $Θ_{0} \subseteq Θ$ .
The duration of the Training phase will be denoted by $τ \in N$ ^[3]. We can think of it as “computational time”.
Let the source codes of the Learner (obtained by quining), the Simulator and the Controller respectively be denoted by
$D_{L} : Γ \times N \times (I_{S} \times O_{S})^{*} \to Δ_{Q} (I_{S} \times I_{C})$
$D_{S} : Γ \times I_{S}^{+} \to O_{S}$
$D_{C} : Γ \times I_{C}^{*} \times O^{*} \to A$
Here, the $N$ argument of $D_{L}$ corresponds to $τ$ and $Δ_{Q}$ is a probability distribution in which all probabilities are rational numbers^[4].
We assume that the simulator can indeed run any computation, and that any given halting computation would run fast for $τ ≫ 0$ . These are assumptions on $D_{S}$ (or, on some combination of (i) $D_{S}$ , (ii) the definition of $Γ$ , and (iii) the support of all $Θ \in H$ ) that we will not spell out here.
We will say that a policy is a mapping of type $(I_{S} \times I_{C} \times O_{S})^{*} \to Δ_{Q} (I_{S} \times I_{C})$ and a metapolicy is a mapping of type $N \times (I_{S} \times I_{C} \times O_{S})^{*} \to Δ_{Q} (I_{S} \times I_{C})$ .
Given any $D_{L}^{'} : Γ \times N \times (I_{S} \times I_{C} \times O_{S})^{*} \to Δ_{Q} (I_{S} \times I_{C})$ , we can compose it with $D_{S}$ and $D_{C}$ in the obvious way^[5] to yield
$D_{S} \otimes D_{L}^{'} \otimes D_{C} : Γ \times N \times (A \times O)^{*} \to Δ_{Q} A$
In particular, we can take $D_{L}^{'} = μ$ for some metapolicy $μ$ by postulating no dependence on the $Γ$ argument.
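The composition of the three modules can be sketched in code. This is a toy illustration under strong simplifying assumptions that are not in the text: all modules are deterministic (no $Δ_{Q}$), the $Γ$ argument is dropped, and the composed policy sees only observations. All function names and the string-valued alphabets are hypothetical.

```python
from typing import Callable, List, Tuple

# Hypothetical string-valued alphabets standing in for I_S, I_C, O_S, O and A.
History = List[Tuple[str, str, str]]  # (simulator input, controller input, simulator output)


def compose(
    simulator: Callable[[List[str]], str],
    learner: Callable[[int, History], Tuple[str, str]],
    controller: Callable[[List[str], List[str]], str],
    tau: int,
) -> Callable[[List[str]], str]:
    """Run the Training phase for tau steps, then hand the program to the Controller."""

    def policy(observations: List[str]) -> str:
        history: History = []
        for _ in range(tau):
            sim_in, ctrl_in = learner(tau, history)
            # The Simulator sees the whole (non-empty) sequence of its inputs so far.
            sim_out = simulator([s for s, _, _ in history] + [sim_in])
            history.append((sim_in, ctrl_in, sim_out))
        program = [c for _, c, _ in history]  # the g in I_C^tau
        return controller(program, observations)

    return policy


# Toy modules, purely for demonstration.
def toy_simulator(inputs: List[str]) -> str:
    return f"r{len(inputs)}"


def toy_learner(tau: int, history: History) -> Tuple[str, str]:
    last_report = history[-1][2] if history else "start"
    return f"q{len(history)}", last_report


def toy_controller(program: List[str], observations: List[str]) -> str:
    return f"{program[-1]}|{len(observations)}"


agent = compose(toy_simulator, toy_learner, toy_controller, tau=3)
action = agent(["o1", "o2"])
print(action)
```

The point of the sketch is only the data flow: the Learner never touches the environment, and the Controller never touches the Simulator.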
Denote by $P$ the set of all policies. Given metapolicy $μ$ and $π \in P$ , we define $μ^{π} : [τ + 1] \times (I_{S} \times I_{C} \times O_{S})^{*} \to Δ_{Q} (I_{S} \times I_{C})$ by
$μ^{π} (k, h) := \begin{cases} μ (k, h) & \text{if } k < τ \\ π (h) & \text{if } k = τ \end{cases}$
Given any $ν : [τ + 1] \times (I_{S} \times I_{C} \times O_{S})^{*} \to Δ_{Q} (I_{S} \times I_{C})$ , we say that $y \in Γ$ is a $ν^{τ}$ -consistent counterpossible when the following conditions hold:
- For all $k < τ$ and $h \in (I_{S} \times I_{C} \times O_{S})^{< k}$ , $D_{L} (y, k, h) = ν (k, h)$
- For all $k \leq τ$ and $h \in (A \times O)^{*}$ , $(D_{S} \otimes D_{L} \otimes D_{C}) (y, k, h) = (D_{S} \otimes ν \otimes D_{C}) (y, k, h)$
We denote by $C_{ν}^{τ} \subseteq Γ$ the set of $ν^{τ}$ -consistent counterpossibles.
A (deterministic) copolicy is a mapping of signature $I_{S}^{+} \to O_{S}$ . We denote the set of copolicies by $C$ . Given a policy $π$ and a copolicy $ν$ , we define $π^\otimes ν \in Δ (I_{S} \times I_{C} \times O_{S})^{*}$ in the obvious way. Given policies $π_{1}, π_{2}$ , we define their total variation distance^[6] to be
$d_{TV} (π_{1}, π_{2}) := \max_{ν \in C} d_{TV} (π_{1}^\otimes ν, π_{2}^\otimes ν)$
Given $Ξ \in □ (Γ \times 2^{Γ})$ , $f : Γ \times 2^{Γ} \to [0, \infty)$ , $τ \in N$ and metapolicy $μ$ , we will use the notation
$E_{Ξ} [f ∣∣ μ^{τ}] := \min_{π \in P} (E_{Ξ} [f \cdot χ_{C_{μ^{π}}^{τ}}] + d_{TV} (π, μ (τ)))$
Intuitively, $E_{Ξ} [f ∣∣ μ^{τ}]$ should be thought of as the counterfactual expectation of loss function $f$ assuming metapolicy $μ$ , while adding a “bonus” to account for “fair” treatment of randomization by the agent. More on that below.
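The effect of the “bonus” can be seen in a one-shot toy sketch. The assumptions here are not from the definition above: the minimum ranges over a small finite list of candidate policies rather than all of $P$, policies are plain distributions over two actions, and `detection_loss` is a hypothetical loss that is maximal whenever action `"b"` has any positive probability.

```python
from fractions import Fraction


def total_variation(p, q):
    """Total variation distance between two distributions on the same finite set."""
    return sum(abs(p[a] - q[a]) for a in p) / 2


def forgiving_loss(mu, candidates, loss):
    """min over candidate policies pi of loss(pi) + d_TV(pi, mu)."""
    return min(loss(pi) + total_variation(pi, mu) for pi in candidates)


def detection_loss(pi):
    # Hypothetical: any positive probability of "b" is eventually detected.
    return Fraction(1) if pi["b"] > 0 else Fraction(0)


mu = {"a": Fraction(99, 100), "b": Fraction(1, 100)}
pure_a = {"a": Fraction(1), "b": Fraction(0)}

raw = detection_loss(mu)
forgiving = forgiving_loss(mu, [mu, pure_a], detection_loss)
print(raw, forgiving)
```

The randomizing policy has maximal raw loss, but it is within total variation distance 1/100 of a zero-loss policy, so the forgiving value is small.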
Given a metapolicy $μ$ and $τ \in N$ , we define $E_{μ}^{τ} \subseteq 2^{Γ}$ by
$E_{μ}^{τ} := \{α \in 2^{Γ} ∣ \exists π \in P : C_{μ^{π}}^{τ} \cap α = \emptyset\}$
Intuitively, $E_{μ}^{τ}$ is the set of universe states for which at least one copy of the agent exists which followed the metapolicy $μ$ until computational time $τ$ .
Given a loss function $l : N \times 2^{Γ} \to [0, 1]$ ^[7] (which we allow to explicitly depend on computational time for greater generality), the learning condition on a metapolicy $μ^{*}$ and hypothesis $Θ \in H$ is
$\sum_{τ < N} E_{ξ_{*} Θ} [l (τ) ∣∣ μ^{* τ}] \leq \sum_{τ < N} (\min_{π \in P} E_{ξ_{*} Θ} [l (τ) \cdot χ_{C_{μ^{* π}}^{τ}}] + E_{ξ_{*} Θ} [χ_{E_{μ^{*}}^{τ}}] ϵ (Θ, N))$
Here, $ϵ$ is the “regret bound” function which should vanish in the $N \to \infty$ limit.
Some remarks on the particulars of this definition:
- There are several reasons we impose $(D_{S} \otimes D_{L} \otimes D_{C}) (y, k, h) = (D_{S} \otimes ν \otimes D_{C}) (y, k, h)$ rather than $D_{L} (y, k, h) = ν (k, h)$ :
  - First, we want to ignore the side effects of running computations on the Simulator (both causal side effects and “mindcrime”, i.e. the direct contribution of those computations to $l$ ). This is because taking side effects into account is usually inconsistent with the unlimited experimentation needed for learning.
  - Second, learning requires trusting the reports of the Simulator, which means we should only impose the policy on copies of $D_{L}$ that are actually connected to $D_{S}$ .
  - Third, we should also be able to trust the Controller, because otherwise we lose the semantic grounding of the agent’s external interface. (Even though this is not necessary for learning per se.)
- On the other hand, we impose $D_{L} (y, k, h) = ν (k, h)$ in the computational past because that’s valid additional information that doesn’t interfere with the learning or decision theory.
- The learning criterion treats computational time myopically, so that we won’t have to worry about traps in computational time.
- The reason we need randomization is, it’s often necessary for efficient learning. In the simplest non-trivial examples, we learn by IID sampling a distribution over computations (e.g. we simulate the interaction between a particular policy and our physical prior $ξ$ ). If we sampled deterministically instead, Murphy would be able to fool us by changing behavior precisely at the sampled instances.
- The reason we need $E_{Ξ} [f ∣∣ μ^{τ}]$ is, randomization only helps if low probability events can be ignored. However, if sufficiently many copies of the agent are instantiated, even a low probability event would be detectable. Hence, we use a “forgiving” metric that assigns low loss even to distributions that technically have high loss but are close to a different distribution with low loss.
- We can consider Newcombian problems where Omega makes decisions based on the agent’s action probabilities. I suspect that if Omega’s policy is Lipschitz in the agent policy, the behavior advised by the $E_{Ξ} [f ∣∣ μ^{τ}]$ counterfactual will converge to FDT-optimal in the limit of sufficiently many iterations.
- Both in the case of ignoring side effects of computations and in the case of the treatment of randomization, we can be accused of departing from priorism (“updatelessness”). However, I think that this departure is justified. In the original TDT paper, Yudkowsky addressed the “Omega rewards irrationality” objection by postulating that, a decision problem is “fair” when it only depends on the agent’s decisions rather than on how the agent makes those decisions. Here, we use the same principle: the agent should not be judged based on its internal thought process (side effects), and it should in some sense be judged based on its decisions rather than the probabilities assigned to those decisions.
- Also about priorism, this kind of agents will not endorse iterated-in-computational-time “logical” counterfactual mugging when the same coin is reused indefinitely, but will endorse it when a new coin is used every time, for an appropriate definition of “new” (or e.g. we switch to a new coin every $k$ rounds). Arguably, this solves the tension between priorism and learning observed by Demski. Formulating the precise criterion when Learning-IBP behavior converges to priorist / FDT-optimal is left for further research.
- The dependence of $ϵ (Θ, N)$ on $Θ$ should ultimately involve some kind of description complexity. However, it will also involve something in the vein of “what are the computational resource bounds, according to the belief $Θ$ , for running certain computations, selected for their importance in testing $Θ$ ”. In particular, we won’t require the agent to learn anything about non-halting computations. Indeed, any hypothesis about such computations will either assert a time bound on running the non-halting computations (in which case it is false) or will fail to assert any such bound, in which case its learning complexity is known to be infinite.
- We could make do without the $E_{ξ_{*} Θ} [χ_{E_{μ^{*}}^{τ}}]$ factor but that would make the learning criterion weaker. The presence of this factor means that, roughly speaking, regret should be low even conditional on the agent existing, which seems like a reasonable demand.
- Given an AI designed along these principles, we might worry about the impact of the side effects that are ignored. Maybe these can produce non-Cartesian daemons. However, during the Training phase, the algorithm has no access to external observation, which arguably makes it unlikely anything inside it can learn how to cyberattack. Moreover, during the Deployment phase, any reason for concern would be mediated by the particular algorithm the Controller runs (rather than the particulars of how it’s implemented), which is what we do take into account in our optimization. Finally, the agent might be able to self-modify to make itself safer: we can even intentionally give it the means to do so (as part of its action space $A$ ). This probably requires careful prior-shaping to work well.
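The remark above about randomization (Murphy fooling a deterministic sampler) can be illustrated with a toy sketch. Everything here is a hypothetical simplification: outcomes are 0/1 values attached to indexed computations, Murphy knows the deterministic schedule in advance and changes behavior exactly at those indices, and the randomized sampler uses a seeded generator only for reproducibility.

```python
import random


def murphy_outcomes(n, attacked):
    """Outcome 1 everywhere except at the indices Murphy chooses to change."""
    return [0 if i in attacked else 1 for i in range(n)]


def estimate(outcomes, indices):
    """Empirical mean of the outcomes at the sampled indices."""
    return sum(outcomes[i] for i in indices) / len(indices)


n, sample_size = 1000, 10

# A predictable schedule: Murphy changes behavior precisely at these indices.
deterministic_indices = list(range(sample_size))
outcomes = murphy_outcomes(n, set(deterministic_indices))

true_mean = sum(outcomes) / n
deterministic_estimate = estimate(outcomes, deterministic_indices)

# Random indices are drawn after Murphy has committed to the outcomes.
rng = random.Random(0)
random_indices = rng.sample(range(n), sample_size)
random_estimate = estimate(outcomes, random_indices)

print(true_mean, deterministic_estimate, random_estimate)
```

The deterministic estimate is maximally wrong even though Murphy changed only 1% of the outcomes, whereas a random sample is unlikely to land on the changed indices.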
1. ^
  This framework assumes all our hypotheses are disintegrable w.r.t. the factorization into $Γ$ and $2^{Γ}$ . It is an interesting question to understand whether we should or can relax this assumption.
2. ^
  For example, we can imagine $ξ_{0}$ to be a Solomonoff-like prior along the following lines. Every hypothesis comprising $ξ_{0}$ is defined by a Turing machine $M$ with access to two oracles representing $y, y^{'} \in Γ$ and two tapes of random and “ambiguous” bits respectively. $ξ_{0} (y)$ is defined by running $M$ with one oracle fielding queries about $y$ (i.e. given a program $P$ , we can request to know its counterpossible output $P^{y}$ ) and the other oracle fielding queries about some $y^{'}$ s.t. we want to decide whether $y^{'} \in α$ for $α \sim θ \in ξ_{0} (y)$ . $M$ is only allowed to return NO if there was at least one query to which the two oracles gave different answers.
3. ^
  We use the “duration” interpretation for simplicity, but more generally $τ$ can be some parameter controlling the computing resources available in the Training phase, and we can also allow the computing resources of the Controller to scale with $τ$ .
4. ^
  The reason we restrict to rational numbers is because we need a notion of computing the distribution. It is in principle possible to generalize further to computable numbers. On the other hand, it might be more realistic to constrain even further to e.g. dyadic rationals (which can be implemented via fair coinflips). We stick to $Q$ for simplicity.
5. ^
  We let the Learner interact with the Simulator for $τ$ timesteps, producing some output $g \in I_{C}^{τ}$ , and then run the Controller with $g$ as an input.
6. ^
  This is not technically a distance, since it is possible to have $d_{TV} (π_{1}, π_{2}) = 0$ even when $π_{1} \neq π_{2}$ , so long as $π_{1}$ and $π_{2}$ only disagree on histories that are inconsistent with these policies. Such $π_{1}$ and $π_{2}$ are morally equivalent.
7. ^
  We could also allow $l$ to have a $Γ$ argument, but then we would have to remove the $E_{ξ_{*} Θ} [χ_{E_{μ^{*}}^{τ}}]$ factor from the learning condition, because the choice of policy would matter intrinsically even if the agent doesn’t exist. Alternatively, we could modify the definition of $C_{μ}^{τ}$ to avoid that. Or perhaps use some normalization factor more complicated than $E_{ξ_{*} Θ} [χ_{E_{μ^{*}}^{τ}}]$ .
What links here?
- I “invented” semimeasure theory and all I got was imprecise probability theory by Cole Wyeth (Apr 3, 2025, 4:33 PM; 14 points)

Vanessa Kosoy Mar 23, 2025, 2:30 PM
LW: 2 AF: 2
0
AF
in reply to: Chris van Merwijk’s comment on: Compositional language for hypotheses about computations
No? The elements of an operad have fixed arity. When defining a free operad you need to specify the arity of every generator.

Vanessa Kosoy Mar 12, 2025, 3:02 PM
2 points
0
in reply to: Gurkenglas’s comment on: The Learning-Theoretic Agenda: Status 2023
I don’t think that undecidability of exact comparison (as opposed to comparison within any given margin of error) is necessarily a bug. However, if you really want comparison for periodic sequences, you can insist that the utility function is defined by a finite state machine. This is in any case already a requirement in the bounded compute version.

Vanessa Kosoy Feb 24, 2025, 12:20 PM
3 points
0
on: Gauging Interest for a Learning-Theoretic Agenda Mentorship Programme
So far, interest in the programme has been modest. I would appreciate hearing from people who either (i) deliberated whether to apply and decided against it or (ii) feel that they might meet the requirements but are not interested. Specifically, what held you back and what changes (if any) would persuade you to apply?

Vanessa Kosoy Feb 22, 2025, 9:22 AM
2 points
0
in reply to: Gurkenglas’s comment on: The Learning-Theoretic Agenda: Status 2023
First, it’s uncomputable to measure performance because that involves the Solomonoff prior. You can approximate it if you know some bits of Chaitin’s constant, but that brings a penalty into the description complexity.
Second, I think that saying that comparison is computable means that the utility is only allowed to depend on a finite number of time steps; this rules out even geometric time discount. For such utility functions, the optimal policy has finite description complexity, so g is upper bounded. I doubt that’s useful.

Vanessa Kosoy Feb 16, 2025, 6:18 PM
7 points
0
in reply to: Vinayak Pathak’s comment on: Gauging Interest for a Learning-Theoretic Agenda Mentorship Programme
I added some examples to the end of this post, thank you for the suggestion.

Vanessa Kosoy Jan 24, 2025, 11:00 AM
9 points
0
in reply to: quila’s comment on: Announcement: Learning Theory Online Course
Not sure these are the best textbooks, but you can try:
- “Naive Set Theory” by Halmos
- “Probability Theory” by Jaynes
- “Introduction to the Theory of Computation” by Sipser

Vanessa Kosoy Jan 23, 2025, 5:38 PM
LW: 8 AF: 4
0
AF
in reply to: harfe’s comment on: Vanessa Kosoy’s Shortform
Another excellent catch, kudos. I’ve really been sloppy with this shortform. I corrected it to say that we can approximate the system arbitrarily well by VNM decision-makers. Although, I think it’s also possible to argue that a system that selects a non-exposed point is not quite maximally influential, because it’s selecting something that’s very close to delegating some decision power to chance.
Also, maybe this cannot happen when $D$ is the inverse limit of finite sets? (As is the case in sequential decision making with finite action/observation spaces). I’m not sure.

Vanessa Kosoy Jan 23, 2025, 10:59 AM
LW: 3 AF: 2
0
AF
in reply to: Vanessa Kosoy’s comment on: Vanessa Kosoy’s Shortform
Example: Let $X = \{0, 1\}$ , and $D$ consist of the probability intervals $Θ_{0} := [0, \frac{2}{3}]$ , $Θ_{1} := [\frac{1}{3}, 1]$ and $Θ_{2} := [\frac{1}{3}, \frac{2}{3}]$ . Then, it is (I think) consistent with the desideratum to have $Θ^{*} = Θ_{2}$ .
Not only does interpreting $Θ^{*} = Θ_{2}$ require an unusual decision rule (which I will be calling “utility hyperfunction”), but applying any ordinary utility function to this example also yields a non-unique maximum. This is another point in favor of the significance of hyperfunctions.
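The non-uniqueness claim for the example above can be checked by brute force over a grid of utility functions. This is a sketch, not a proof: it assumes the maximin rule, represents each credal set by the endpoints of its interval of probabilities for outcome 1, and only tests utilities on a finite rational grid. The names `Theta_0`, `Theta_1`, `Theta_2` mirror the example; the helper functions are illustrative.

```python
from fractions import Fraction
from itertools import product

# Each credal set is an interval [lo, hi] of probabilities for outcome 1.
CREDAL_SETS = {
    "Theta_0": (Fraction(0), Fraction(2, 3)),
    "Theta_1": (Fraction(1, 3), Fraction(1)),
    "Theta_2": (Fraction(1, 3), Fraction(2, 3)),
}


def worst_case_utility(interval, u0, u1):
    """Minimum of (1 - p) * u0 + p * u1 over the interval; attained at an endpoint."""
    return min((1 - p) * u0 + p * u1 for p in interval)


def maximin_choices(u0, u1):
    """Names of all credal sets attaining the maximal worst-case expected utility."""
    values = {
        name: worst_case_utility(interval, u0, u1)
        for name, interval in CREDAL_SETS.items()
    }
    best = max(values.values())
    return {name for name, value in values.items() if value == best}


grid = [Fraction(n, 4) for n in range(-8, 9)]
results = [maximin_choices(u0, u1) for u0, u1 in product(grid, grid)]

theta2_ever_unique = any(choice == {"Theta_2"} for choice in results)
theta2_always_tied = all("Theta_2" in c and len(c) >= 2 for c in results)
print(theta2_ever_unique, theta2_always_tied)
```

On this grid, `Theta_2` is always among the maximin-optimal choices but is never the only one: it ties with `Theta_1` when outcome 1 is preferred, with `Theta_0` when outcome 0 is preferred, and with both when the utility is constant.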

Vanessa Kosoy Jan 23, 2025, 10:31 AM
LW: 4 AF: 3
0
AF
in reply to: harfe’s comment on: Vanessa Kosoy’s Shortform
You’re absolutely right, good job! I fixed the OP.

Vanessa Kosoy Jan 22, 2025, 1:46 PM
LW: 14 AF: 9
0
AF
in reply to: Vanessa Kosoy’s comment on: Vanessa Kosoy’s Shortform
TLDR: Systems with locally maximal influence can be described as VNM decision-makers.
There are at least 3 different motivations leading to the concept of “agent” in the context of AI alignment:
1. The sort of system we are concerned about (i.e. which poses risk)
2. The sort of system we want to build (in order to defend from dangerous systems)
3. The sort of systems that humans are (in order to meaningfully talk about “human preferences”)
Motivation #1 naturally suggests a descriptive approach, motivation #2 naturally suggests a prescriptive approach, and motivation #3 is sort of a mix of both: on the one hand, we’re describing something that already exists, on the other hand, the concept of “preferences” inherently comes from a normative perspective. There are also reasons to think these different motivations should converge on a single, coherent concept.
Here, we will focus on motivation #1.
A central reason why we are concerned about powerful unaligned agents is that they are influential. Agents are the sort of system that, when instantiated in a particular environment, is likely to heavily change this environment, potentially in ways inconsistent with the preferences of other agents.
Bayesian VNM
Consider a nice space^[1] $X$ of possible “outcomes”, and a system that can choose^[2] out of a closed set of distributions $D \subseteq Δ X$ . I propose that an influential system should satisfy the following desideratum:
The system cannot select $μ^{*} \in D$ which can be represented as a non-trivial lottery over other elements in $D$ . In other words, $μ^{*}$ has to be an extreme point of the convex hull of $D$ .
Why? Because a system that selects a non-extreme point leaves something to chance. If the system can force outcome $μ \in Δ X$ , or outcome $ν \in Δ X$ but chooses instead outcome $p μ + (1 - p) ν$ , for $p \in (0, 1)$ and $μ \neq ν$ , this means the system gave up on its ability to choose between $μ$ and $ν$ in favor of a $p$ -biased coin. Such a system is not “locally^[3] maximally” influential^[4].
[EDIT: The original formulation was wrong, h/t @harfe for catching the error.]
The desideratum implies that there is a convergent sequence of utility functions $\{u_{k} : X \to R\}_{k \in N}$ s.t.
- For every $k \in N$ , $E_{μ} [u_{k}]$ has a unique maximum $μ_{k}$ in $D$ .
- The sequence $μ_{k}$ converges to $μ^{*}$ .
In other words, such a system can be approximated by a VNM decision-maker within any precision. For finite $D$ , we don’t need the sequence, instead there is some $u : X \to R$ s.t. $μ^{*}$ is the unique maximum of $E_{μ} [u]$ over $D$ . This observation is mathematically quite simple, but I haven’t seen it made elsewhere (but I would not be surprised if it did appear in the decision theory literature somewhere).
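The finite case can be checked numerically on a small example. The sketch below is only an illustration under stated assumptions: $X$ has three outcomes, $D$ is a hypothetical list of five distributions (three vertices and two lotteries over them), and utilities are searched over a small grid rather than solved for. It confirms that on this grid only the extreme points ever arise as a unique expected-utility maximum.

```python
from fractions import Fraction as F
from itertools import product

# Hypothetical finite D over X = {0, 1, 2}; entries are probability vectors.
D = [
    (F(1), F(0), F(0)),
    (F(0), F(1), F(0)),
    (F(0), F(0), F(1)),
    (F(1, 2), F(1, 2), F(0)),     # a lottery over the first two elements
    (F(1, 3), F(1, 3), F(1, 3)),  # a lottery over the first three elements
]


def expected_utility(mu, u):
    return sum(p * v for p, v in zip(mu, u))


def unique_maximizers(distributions, utilities):
    """Indices of distributions that are the unique maximum of E_mu[u] for some u."""
    winners = set()
    for u in utilities:
        values = [expected_utility(mu, u) for mu in distributions]
        best = max(values)
        if values.count(best) == 1:
            winners.add(values.index(best))
    return winners


utilities = list(product([F(-1), F(0), F(1)], repeat=3))
winners = unique_maximizers(D, utilities)
print(sorted(winners))
```

The two lotteries never win uniquely because expected utility is linear: a mixture can at best tie with the components it mixes.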
Infra-Bayesian VNM?
Now, let’s say that the system is choosing out of a set of credal sets (crisp infradistributions) $D \subseteq □ X$ . I propose the following desideratum:
[EDIT: Corrected according to a suggestion by @harfe, original version was too weak.]
Let $\hat{D}$ be the closure of $D$ w.r.t. convex combinations and joins^[5]. Let $Θ^{*} \in □ X$ be selected by the system. Then:
- For any $Φ, Ψ \in \hat{D}$ and $p \in (0, 1)$ , if $Θ^{*} = p Φ + (1 - p) Ψ$ then $Φ = Ψ$ .
- For any $Φ \in \hat{D}$ , if $Φ \subseteq Θ^{*}$ then $Φ = Θ^{*}$ .
The justification is, a locally maximal influential system should leave the outcome neither to chance nor to ambiguity (the two types of uncertainty we have with credal sets).
We would like to say that this implies that the system is choosing according to maximin relative to a particular utility function. However, I don’t think this is true, as the following example shows:
Example: Let $X = \{0, 1\}$ , and $D$ consist of the probability intervals $Θ_{0} := [0, \frac{2}{3}]$ , $Θ_{1} := [\frac{1}{3}, 1]$ and $Θ_{2} := [\frac{1}{3}, \frac{2}{3}]$ . Then, it is (I think) consistent with the desideratum to have $Θ^{*} = Θ_{2}$ .
Instead, I have the following conjecture:
Conjecture: There exists some space $Y$ , some $ξ \in Δ Y$ and convergent sequence $\{u_{k} : Y \times X \to R\}_{k \in N}$ s.t.
$Θ^{*} = \lim_{k \to \infty} \arg\max_{Θ \in D} E_{y \sim ξ} [\min_{μ \in Θ} E_{x \sim μ} [u_{k} (y, x)]]$
As before, the maxima should be unique.
Such a “generalized utility function” can be represented as an ordinary utility function with a latent $Y$ -valued variable, if we replace $D$ with $D^{'} \subseteq □ (Y \times X)$ defined by
$D^{'} := \{ξ ⋉ Θ ∣ Θ \in D\}$
However, using utility functions constructed in this way leads to issues with learnability, which probably means there are also issues with computational feasibility. Perhaps in some natural setting, there is a notion of “maximally influential under computational constraints” which implies an “ordinary” maximin decision rule.
This approach does rule out optimistic or “mesomistic” decision-rules. Optimistic decision makers tend to give up on influence, because they believe that “nature” would decide favorably for them. Influential agents cannot give up on influence, therefore they should be pessimistic.
Sequential Decision-Making
What would be the implications in a sequential setting? That is, suppose that we have a set of actions $A$ , a set of observations $O$ , $X := (A \times O)^{ω}$ , a prior $ζ : (A \times O)^{*} \times A \to Δ O$ and
$D := \{ζ π ∣ π : O^{*} \to A\}$
In this setting, the result is vacuous because of an infamous issue: any policy can be justified by a contrived utility function that favors it. However, this is only because the formal desideratum doesn’t capture the notion of “influence” sufficiently well. Indeed, a system whose influence boils down entirely to its own outputs is not truly influential. What motivation #1 asks of us is to talk about systems that influence the world-at-large, including relatively “faraway” locations.
One way to fix some of the problem is, take $X := O^{ω}$ and define $D$ accordingly. This singles out systems that have influence over their observations rather than only their actions, which is already non-vacuous (some policies are not such). However, such a system can still be myopic. We can take this further, and select “long-term” influence by projecting onto late observations or some statistics over observations. However, in order to talk about actually “far-reaching” influence, we probably need to switch to the infra-Bayesian physicalism setting. There, we can set $X := 2^{Γ}$ , i.e. select for systems that have influence over physically manifest computations.
1. ^
  I won’t keep track of topological technicalities here, probably everything here works at least for compact Polish spaces.
2. ^
  Meaning that the system has some output, and different counterfactual outputs correspond to different elements of $D$ .
3. ^
  I say “locally” because it refers to something like a partial order, not a global scalar measure of influence.
4. ^
  See also Yudkowsky’s notion of efficient systems “not leaving free energy”.
5. ^
  That is, if $Ψ, Φ \in \hat{D}$ then their join (convex hull) $Ψ \lor Φ$ is also in $\hat{D}$ , and so is $p Ψ + (1 - p) Φ$ for every $p \in [0, 1]$ . Moreover, $\hat{D}$ is the minimal closed superset of $D$ with this property. Notice that this implies $\hat{D}$ is closed w.r.t. arbitrary infra-convex combinations, i.e. for any $Y$ , $K : Y \to \hat{D}$ and $Ξ \in □ Y$ , we have $K_{*} Ξ \in \hat{D}$ .

Vanessa Kosoy Jan 22, 2025, 12:27 PM
LW: 11 AF: 7
0
AF
on: Vanessa Kosoy’s Shortform
Master post for selection/coherence theorems. Previous relevant shortforms: learnability constraints decision rules, AIT selection for learning.

Vanessa Kosoy Jan 17, 2025, 10:09 AM
LW: 3 AF: 2
0
AF
in reply to: gwern’s comment on: What is the most impressive game LLMs can play well?
Do you mean that seeing the opponent make dumb moves makes the AI infer that its own moves are also supposed to be dumb, or something else?

Vanessa Kosoy Jan 16, 2025, 3:05 PM
LW: 3 AF: 2
1
AF
in reply to: Archimedes’s comment on: What is the most impressive game LLMs can play well?
Relevant link
