Research Lead at CORAL. Director of AI research at ALTER. PhD student in Shay Moran’s group in the Technion (my PhD research and my CORAL/ALTER research are one and the same). See also Google Scholar and LinkedIn.
E-mail: {first name}@alter.org.il
Hold my beer ;)
I see modeling vs. implementation as a spectrum more than a dichotomy. Something like:
On the “implementation” extreme you prove theorems about the exact algorithm you implement in your AI, s.t. you can even use formal verification to prove these theorems about the actual code you wrote.
Marginally closer to “modeling”, you prove (or at least conjecture) theorems about some algorithm which is vaguely feasible in theory. Some civilization might have used that exact algorithm to build AI, but in our world it’s impractical, e.g. because it’s uncompetitive with other AI designs. However, your actual code is conceptually very close to the idealized algorithm, and you have good arguments why the differences don’t invalidate the safety properties of the idealized model.
Further along the spectrum, your actual algorithm is about as similar to the idealized algorithm as DQN is similar to vanilla Q-learning. Which is to say, it was “inspired” by the idealized algorithm but there’s a lot of heavy lifting done by heuristics. Nevertheless, there is some reason to hope the heuristic aspects don’t change the safety properties.
On the “modeling” extreme, your idealized model is something like AIXI: completely infeasible and bears little direct resemblance to the actual algorithm in your AI. However, there is still some reason to believe real AIs will have similar properties to the idealized model.
More precisely, rather than a 1-dimensional spectrum, there are at least two parameters involved:
How close the object you make formal statements about is to the actual code of your AI, where “closeness” is measured by the strength of the arguments you have for the analogy, on a scale from “they are literally the same”, through solid theoretical and/or empirical evidence, to pure hand-waving/intuition.
How much evidence you have for the formal statements, on a scale from “I proved it within some widely accepted mathematical foundation (e.g. PA)” to “I proved vaguely related things, tried very hard but failed to disprove the thing and/or accumulated some empirical evidence”.
[EDIT: And a 3rd parameter is how justified/testable the assumptions of your model are. Ideally, you want these assumptions to be grounded in science. Some will likely be philosophical assumptions which cannot be tested empirically, but at least they should fit into a coherent holistic philosophical view. At the very least, you want to make sure you’re not assuming away the core parts of the problem.]
For the purposes of safety, you want to be as close to the implementation end of the spectrum as you can get. However, the model side of the spectrum is still useful as:
A backup plan which is better than nothing, more so if there is some combination of theoretical and empirical justification for the analogizing
A way to demonstrate threat models, as the OP suggests
An intermediate product that helps with checking that your theory is heading in the right direction, comparing different research agendas, and maybe even making empirical tests.
Sorry, I was wrong. By Löb’s theorem, all versions of are provably equivalent, so they will trust each other.
IIUC, fixed point equations like that typically have infinitely many solutions. So, you defined not one predicate, but an infinite family of them. Therefore, your agent will trust a copy of itself, but usually won’t trust variants of itself with other choices of fixed point. In this sense, this proposal is similar to proposals based on quining (as quining has many fixed points as well).
I don’t know what’s so bad about the “human male” bio. I might have swiped right on that one. (Especially if the profile had additional info that makes him sound interesting.)
TLDR: A new proposal for a prescriptive theory of multi-agent rationality, based on generalizing superrationality using local symmetries.
Hofstadter’s “superrationality” postulates that a rational agent should behave as if other rational agents are similar to itself. In a symmetric game, this implies the selection of a Pareto efficient outcome. However, it’s not clear how to apply this principle to asymmetric games. I propose to solve this by suggesting a notion of “local” symmetry that is much more ubiquitous than global (ordinary) symmetry.
We will focus here entirely on finite deterministic (pure strategy) games. Generalizing this to mixed strategies, and to infinite games more generally, is an important question left for the future.
Consider a game with a set of players, for each player a non-empty set of pure strategies and a utility function , where . What does it mean for the game to be “symmetric”? We propose a “soft” notion of symmetry that is only sensitive to the ordinal ordering between payoffs. While the cardinal value of the payoff will be important in the stochastic context, we plan to treat the latter by explicitly expanding the strategy space.
Given two games , , a premorphism from to is defined to consist of
A mapping
For each , a mapping
Notice that these mappings induce a mapping via
A premorphism is said to be a homomorphism if and are s.t. for any and , if then . Homomorphisms make games into a category in a natural way.
An automorphism of is a homomorphism from to that has an inverse homomorphism. An automorphism of is said to be flat when for any , if then is the identity mapping[1]. A flat symmetry of is a group together with a group homomorphism s.t. is flat for all .
Given a flat symmetry , we can apply the superrationality principle to reduce to the “smaller” quotient game . The players are i.e. orbits in the original player set. Given , we define the set of strategies for player in the game to be
This is non-empty thanks to flatness.
Observe that there is a natural mapping given by
Finally, the utility function of is defined by
It is easy to see that the construction of is functorial w.r.t. the category structure on games that we defined.
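To make the quotient construction concrete, here is a minimal sketch (in Python; the function names are mine and only illustrative, not part of the formalism) specialized to the simplest case: a two-player game with a common strategy set and the swap symmetry (exchange the players, identity on strategies). The quotient then has a single “orbit player” whose strategies are the common strategies and whose payoff is read off the diagonal, so superrationality amounts to maximizing the diagonal payoff.

```python
# A minimal sketch of the quotient construction, specialized to a two-player game
# with a common strategy set and the swap symmetry. The single "orbit player" of
# the quotient chooses a common strategy; its payoff is the diagonal payoff u1(s, s).

def quotient_of_swap_symmetric_game(strategies, u1, u2):
    """Return the quotient (single-player) game of the swap symmetry."""
    assert all(u1[(a, b)] == u2[(b, a)] for a in strategies for b in strategies), \
        "the player swap must actually be a symmetry of the game"
    return {s: u1[(s, s)] for s in strategies}

def superrational_choice(strategies, u1, u2):
    quotient = quotient_of_swap_symmetric_game(strategies, u1, u2)
    return max(quotient, key=quotient.get)

# Prisoner's Dilemma: the quotient game is {C: 2, D: 1}, so the superrational
# choice is C, i.e. the Cooperate-Cooperate outcome.
S = ["C", "D"]
u1 = {("C", "C"): 2, ("C", "D"): 0, ("D", "C"): 3, ("D", "D"): 1}
u2 = {(a, b): u1[(b, a)] for a in S for b in S}
print(superrational_choice(S, u1, u2))  # -> "C"
```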
We can generalize this further by replacing game homomorphisms with game quasimorphisms: premorphisms satisfying the following relaxed condition on and :
For any and , if then .
This is no longer closed under composition, so no longer defines a category structure[2]. However, we can still define flat quasisymmetries and the associated quotient games, and this construction is still functorial w.r.t. the original (not “quasi”) category structure. Moreover, there is a canonical homomorphism (not just a quasimorphism) from to , even when is a mere quasisymmetry.
The significance of quasisymmetries can be understood as follows.
The set of all games on a fixed set of players with fixed strategy sets naturally forms a topological space[3] . Given a group acting on the game via invertible premorphisms, the subset of where is a symmetry is not closed, in general. However, the subset of where is a quasisymmetry is closed. I believe this will be important when generalizing the formalism to infinite games.
What if a game doesn’t even have quasisymmetries? Then, we can look for a coarse-graining of the game which does have them.
Consider a game . For each , let be some set and a surjection. Denote . Given any , we have the game in which:
The set of players is .
For any , their set of strategies is . In particular, it implies that the set of outcomes satisfies .
For any , their utility function is .
If for any we choose some this allows us to define the coarse-grained game in which:
The set of players is .
For any , the set of strategies is .
For any and , the utility function is .
Essentially, we rephrased as an extensive-form game in which first the game is played and then, if the outcome was , the game is played. This is under the assumption that the expected outcome of game is .
It is also possible to generalize this construction by allowing multivalued .
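Here is a minimal sketch of the coarse-graining step (in Python; the interface is mine and only illustrative). It takes the surjections, plus a chosen “expected outcome” for each subgame, and returns the coarse-grained game.

```python
# A minimal sketch of the coarse-graining step. Each player's strategies are grouped
# via a surjection sigma_i; for each coarse outcome x we plug in a chosen "expected
# outcome" of the subgame G_x and read off the coarse-grained payoffs from it.

from itertools import product

def coarse_grain(strategy_sets, utilities, sigmas, subgame_outcome):
    """
    strategy_sets: dict player -> list of strategies
    utilities: dict player -> function(profile tuple, ordered by player) -> payoff
    sigmas: dict player -> function(strategy) -> coarse label
    subgame_outcome: function(coarse profile) -> chosen fine profile of the subgame
    Returns the coarse strategy sets and the coarse utilities.
    """
    players = sorted(strategy_sets)
    coarse_sets = {i: sorted({sigmas[i](s) for s in strategy_sets[i]}) for i in players}
    coarse_utils = {}
    for x in product(*(coarse_sets[i] for i in players)):
        s_star = subgame_outcome(x)  # chosen outcome of the subgame G_x
        for i in players:
            coarse_utils[(i, x)] = utilities[i](s_star)
    return coarse_sets, coarse_utils

# Toy usage: strategies {a1, a2, b1, b2}, coarse label = the letter, and (purely for
# illustration) the chosen subgame outcome picks the "1"-variant in each block.
S = {1: ["a1", "a2", "b1", "b2"], 2: ["a1", "a2", "b1", "b2"]}
u = {1: lambda prof: 1.0 if prof[0][0] == prof[1][0] else 0.0,
     2: lambda prof: 1.0 if prof[0][0] == prof[1][0] else 0.0}
sig = {1: lambda s: s[0], 2: lambda s: s[0]}
coarse_sets, coarse_utils = coarse_grain(S, u, sig, lambda x: tuple(c + "1" for c in x))
```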
Given a game , a local symmetry presolution (LSP) is recursively defined to be one of the following:
Assume there is s.t. for any , . Then, any is an LSP of . [Effectively single player game]
Let be a coarse-graining of and for any , let be an LSP of . Let be an LSP of . Define by . Then, is an LSP of .
Let be a quasisymmetry of and let be an LSP of . Then, is an LSP of .
It’s easy to see that an LSP always exists, because for any game we can choose a sequence of coarse-grainings whose effect is to make the players choose their strategies one by one (with full knowledge of the choices of previous players). However, not all LSPs are “born equal”. It seems appealing to prefer LSPs which use “more” symmetry. This can be formalized as follows.
The way an LSP is constructed defines the set of coherent outcomes , which is the set of outcomes compatible with the local symmetries[4]. We define it recursively as follows:
For the LSP of an effectively single-player game, we set .
For a coarse-graining LSP, for any , let be the set of coherent outcomes of and let be the set of coherent outcomes of . We then define . Here, is defined by .
For a quasisymmetry LSP, let be the set of coherent outcomes of . We then define .
We define a local symmetry solution (LSS) to be an LSP whose set of coherent outcomes is minimal w.r.t. set inclusion.
In the Prisoner’s Dilemma, the only LSS is Cooperate-Cooperate.
In Stag Hunt, the only LSS is Stag-Stag.
In Chicken, the only LSS is Swerve-Swerve. If we add a finite number of randomized strategies, it opens more possibilities.
In Battle of the Sexes, assuming a small payoff from going to your preferred place alone, the only LSS is: each player goes to their own preferred place. If we give each player a “coinflip” strategy, then I think the only LSS is both players using the coinflip.
Consider a pure cooperation game where both players have strategies and the payoff is 1 if they choose the same strategy and 0 if they choose different strategies. Then, any outcome is an LSS. I think that, if we add a “fair coinflip” strategy, any LSS has payoff at least . I believe the LSS payoff will approach the optimum if we (i) allow randomization (ii) make the game repeated with a long time horizon and (iii) add some constraints on the symmetries, but I’ll leave this for a future post.
Consider a pure cooperation game where both players have strategies , the payoff is 2 if both choose , 1 if both choose and 0 if they choose different strategies. Then, the only LSS is .
In any game, an LSS guarantees each player at least their deterministic-maximin payoff. In particular, in a two-player zero-sum game in which a pure Nash equilibrium exists, any LSS is a Nash equilibrium.
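As a sanity check of the last claim (an illustration, not a proof), here is a quick numeric verification (in Python; the payoff numbers are mine) that the claimed LSS outcomes in the Prisoner’s Dilemma, Stag Hunt and Chicken examples give each player at least their pure-strategy maximin payoff.

```python
# A quick numeric sanity check that the claimed LSS outcomes above give each player
# at least their pure-strategy maximin payoff.

def pure_maximin(u, strats, row_player):
    if row_player:
        return max(min(u[(a, b)] for b in strats) for a in strats)
    return max(min(u[(a, b)] for a in strats) for b in strats)

games = {
    # payoff of the row player at (row, col), and the claimed LSS outcome
    "Prisoner's Dilemma": ({("C", "C"): 2, ("C", "D"): 0, ("D", "C"): 3, ("D", "D"): 1}, ("C", "C")),
    "Stag Hunt":          ({("S", "S"): 4, ("S", "H"): 0, ("H", "S"): 3, ("H", "H"): 3}, ("S", "S")),
    "Chicken":            ({("Sw", "Sw"): 3, ("Sw", "St"): 1, ("St", "Sw"): 4, ("St", "St"): 0}, ("Sw", "Sw")),
}
for name, (u1, lss) in games.items():
    strats = sorted({a for a, _ in u1})
    u2 = {(a, b): u1[(b, a)] for a in strats for b in strats}  # symmetric game
    assert u1[lss] >= pure_maximin(u1, strats, row_player=True), name
    assert u2[lss] >= pure_maximin(u2, strats, row_player=False), name
print("all maximin checks passed")
```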
Note that flatness is in general not preserved under composition.
I believe this can be interpreted as a category enriched over the category of sets and partial functions, with the monoidal structure given by Cartesian product of sets.
An affine space, even.
Maybe we can interpret the set of coherent outcomes as a sharp infradistribution corresponding to the players’ joint belief about the outcome of the game.
Master post for multi-agency. See also.
In (non-monotonic) infra-Bayesian physicalism, there is a vaguely similar asymmetry even though it’s formalized via a loss function. Roughly speaking, the loss function expresses preferences over “which computations are running”. This means that you can have a “positive” preference for a particular computation to run or a “negative” preference for a particular computation not to run[1].
There are also more complicated possibilities, such as “if P runs then I want Q to run but if P doesn’t run then I’d rather that Q also doesn’t run”, or even preferences that are only expressible in terms of entanglement between computations.
It’s roughly on the right track, but here are some inaccuracies in your description that stood out to me:
There is no requirement that the “hidden state space” is finite. It is perfectly fine to consider a credal set which is not a polytope (i.e. not a convex hull of a finite set of distributions).
The point of how market prices are computed, missing from your description, is that they prevent any bettor from making unbounded earnings (essentially, by making them bet against each other). This is the same principle as Garrabrant induction. In particular, this implies that if any of our models is true then the market predictions will converge to lying inside the corresponding credal set.
The market predictions do not somehow assume that “the parts of the universe we don’t observe are out to get us”. Thanks to the pessimistic bettor, they do satisfy the “not too optimistic” condition, but that’s “not too optimistic” relative to the true environment.
Your entire description only talks about the “estimation” part, not about the “decision” part.
I think that there are two different questions which might be getting mixed up here:
Question 1: Can we fully classify all rules for which sets of Bayes nets imply other Bayes nets over the same variables? Naturally, this is not a fully rigorous question, since “classify” is not a well-defined notion. One possible operationalization of this question is: What is the computational complexity of determining whether a given Bayes net follows from a set of other Bayes nets? For example, if there is a set of basic moves that generates all such inferences then the problem is probably in NP (at least if the number of required moves can also be bounded).
Question 2: What if we replace “Bayes nets” by something like “string diagrams in a Markov category”? Then there might be fewer rules (because maybe some rules hold for Bayes nets but not in the more abstract setting).
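Regarding Question 1, here is a minimal sketch (in Python; the helper names are mine) of the special case of a single premise net. I believe “DAG X implies DAG Y” (every distribution that factorizes over X also factorizes over Y) can be checked by verifying that each local Markov independence of Y is a d-separation in X, using the standard moralized-ancestral-graph criterion, so this special case is polynomial-time; the interesting open part is what happens with several premise nets.

```python
# A sketch of the single-premise case: check whether DAG X implies DAG Y by testing
# each local Markov independence of Y as a d-separation query in X.

from itertools import combinations

def ancestors(dag, nodes):           # dag: dict node -> set of parents
    result, stack = set(nodes), list(nodes)
    while stack:
        for p in dag[stack.pop()]:
            if p not in result:
                result.add(p)
                stack.append(p)
    return result

def descendants(dag, v):
    children = {u: {w for w in dag if u in dag[w]} for u in dag}
    result, stack = set(), [v]
    while stack:
        for c in children[stack.pop()]:
            if c not in result:
                result.add(c)
                stack.append(c)
    return result

def d_separated(dag, xs, ys, zs):
    """Check xs ⊥ ys | zs via the moralized ancestral graph."""
    keep = ancestors(dag, set(xs) | set(ys) | set(zs))
    adj = {v: set() for v in keep}
    for v in keep:                    # moralize: node-parent edges plus parent-parent edges
        for p in dag[v]:
            adj[v].add(p); adj[p].add(v)
        for p, q in combinations(dag[v], 2):
            adj[p].add(q); adj[q].add(p)
    blocked = set(zs)                 # remove the conditioning set, look for a path
    stack, seen = [x for x in xs if x not in blocked], set()
    while stack:
        v = stack.pop()
        if v in ys:
            return False
        if v not in seen:
            seen.add(v)
            stack.extend(w for w in adj[v] if w not in blocked)
    return True

def implies(dag_x, dag_y):
    """Does every distribution compatible with dag_x factorize over dag_y?"""
    for v in dag_y:
        pa = dag_y[v]
        nondesc = set(dag_y) - {v} - descendants(dag_y, v) - pa
        if nondesc and not d_separated(dag_x, {v}, nondesc, pa):
            return False
    return True

chain = {"X": set(), "Y": {"X"}, "Z": {"Y"}}          # X -> Y -> Z
rev_chain = {"Z": set(), "Y": {"Z"}, "X": {"Y"}}      # Z -> Y -> X
collider = {"X": set(), "Z": set(), "Y": {"X", "Z"}}  # X -> Y <- Z
print(implies(chain, rev_chain))  # True: the two chains are Markov equivalent
print(implies(chain, collider))   # False: the collider requires X ⊥ Z, the chain doesn't
```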
Thank you <3
Any chance of more exposition for those of us less cognitively-inclined? =)
Read the paper! :)
It might seem long at first glance, but all the results are explained in the first 13 pages, the rest is just proofs. If you don’t care about the examples, you can stop on page 11. Naturally, I welcome any feedback on the exposition there.
I think that in 2 years we’re unlikely to accomplish anything that leaves a dent in P(DOOM), with any method, but I also think it’s more likely than not that we actually have >15 years.
As to “completing” the theory of agents, I used the phrase (perhaps perversely) in the same sense that e.g. we “completed” the theory of information: the latter exists and can actually be used for its intended applications (communication systems). Or at least in the sense we “completed” the theory of computational complexity: even though a lot of key conjectures are still unproven, we do have a rigorous understanding of what computational complexity is and know how to determine it for many (even if far from all) problems of interest.
I probably should have said “create” rather than “complete”.
(Summoned by @Alexander Gietelink Oldenziel)
I don’t understand this comment. I usually don’t think of “building a safer LLM agent” as a viable route to aligned AI. My current best guess about how to create aligned AI is Physicalist Superimitation. We can imagine other approaches, e.g. Quantilized Debate, but I am less optimistic there. More importantly, I believe that we need to complete the theory of agents first, before we can have strong confidence about which approaches are more promising.
As to heuristic implementations of infra-Bayesianism, this is something I don’t want to speculate about in public, it seems exfohazardous.
Somehow being able to derive all relevant string diagram rewriting rules for latential string diagrams, starting with some fixed set of equivalences?
What are “latential” string diagrams?
What does it mean that you can’t derive them all from a “fixed” set? Do you imagine some strong claim, e.g. that the set of rewriting rules is undecidable, or something else?
Two Bayes nets are of the same Markov equivalence class when they have precisely the same set of conditionality relations holding on them (and by extension, precisely the same undirected skeleton).
Okay, so this is not what you care about? Maybe you are saying the following: Given two diagrams X,Y, we want to ask whether any distribution compatible with X is compatible with Y. We don’t ask whether the converse also holds. This is a certain asymmetric relation, rather than an equivalence.
I found the above comment difficult to parse.
that’s not a thing that really happens?
What is the thing that doesn’t happen? Reading the rest of the paragraph only left me more confused.
we don’t quite care about Markov equivalence class
What do you mean by “Markov equivalence class”?
Thanks for this!
What I was saying up there is not a justification of Hurwicz’ decision rule. Rather, it is that if you already accept the Hurwicz rule, it can be reduced to maximin, and for a simplicity prior the reduction is “cheap” (produces another simplicity prior).
Why accept the Hurwicz decision rule? Well, at least you can’t be accused of a pessimism bias there. But if you truly want to dig deeper, we can start instead from an agent making decisions according to an ambidistribution, which is a fairly general (assumption-light) way of making decisions. I believe that a similar argument (easiest to see in the LF-dual cramble set representation) would allow reducing that to maximin on infradistributions (credal sets).
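For readers less familiar with the terminology, here is a toy illustration (in Python; the numbers and names are mine) of the two decision rules being compared, pure maximin vs. the Hurwicz rule, on a small decision table under Knightian uncertainty. It only illustrates the rules themselves, not the reduction argument.

```python
# A toy illustration of pure maximin vs. the Hurwicz rule on a small decision table.

def maximin_value(payoffs):
    return min(payoffs)

def hurwicz_value(payoffs, optimism):
    return optimism * max(payoffs) + (1 - optimism) * min(payoffs)

# Payoff of each action under three possible environments (no probabilities assigned).
actions = {"cautious": [1.0, 1.0, 1.0], "bold": [0.0, 0.5, 3.0]}

for name, score in [("maximin", maximin_value),
                    ("Hurwicz(0.5)", lambda p: hurwicz_value(p, 0.5))]:
    best = max(actions, key=lambda a: score(actions[a]))
    print(name, "->", best)  # maximin picks "cautious", Hurwicz(0.5) picks "bold"
```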
To make such an approach even more satisfactory, it would be good to add a justification for a simplicity ambi/infra-prior. I think this should be possible by arguing from “opinionated agents”: the ordinary Solomonoff prior is the unique semicomputable one that dominates all semicomputable measures, which decision-theoretically corresponds to something like “having preferences about as many possible worlds as we can”. Possibly, the latter principle can be formalized into something which ends up picking out an infra-Solomonoff prior (and, replacing “computability” by a stronger condition, some other kind of simplicity infra-prior).
You now understand correctly. The reason I switch to colored operads is to add even more generality. My key use case is when the operad consists of terms-with-holes in a programming language, in which case the colors are the types of the terms/holes.
The following are my thoughts on the definition of learning in infra-Bayesian physicalism (IBP), which is also a candidate for the ultimate prescriptive agent desideratum.
In general, learning of hypotheses about the physical universe is not possible because of traps. On the other hand, learning of hypotheses about computable mathematics is possible in the limit of ample computing resources, as long as we can ignore side effects of computations. Moreover, learning computable mathematics implies approximating Bayesian planning w.r.t. the prior about the physical universe. Hence, we focus on this sort of learning.
We consider an agent composed of three modules, which we call the Simulator, the Learner and the Controller. The agent’s history consists of two phases. In the Training phase, the Learner interacts with the Simulator, and in the end produces a program for the Controller. In the Deployment phase, the Controller runs the program.
Roughly speaking:
The Simulator is a universal computer whose function is performing computational experiments, which we can think of as “thought experiments” or coarse-grained simulations of potential plans. It receives commands from the Learner (which computations to run / threads to start/stop) and reports to the Learner the results. We denote the Simulator’s input alphabet by and output alphabet by .
The Learner is the machine learning (training) module. The algorithm whose desideratum we’re specifying resides here.
The Controller (as in “control theory”) is a universal computer connected to the agent’s external interface (environmental actions and observations ). It’s responsible for real-time interaction with the environment, and we can think of it as the learned policy. It is programmed by the Learner, for which purpose it has input alphabet .
We will refer to this as the SiLC architecture.
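To make the two-phase protocol concrete, here is a schematic sketch (in Python; the interfaces and names are mine, not part of the formalism) of how the three modules interact: the Learner drives the Simulator during Training and emits a program, which the Controller then runs against the environment during Deployment.

```python
# A schematic sketch of the SiLC protocol: the Learner drives the Simulator for T
# steps of "computational time" and emits a program; the Controller runs that
# program against the external environment.

from typing import Callable, Protocol

class Simulator(Protocol):
    def step(self, command: str) -> str: ...       # start/stop computations, report results

class Learner(Protocol):
    def query(self, last_report: str) -> str: ...  # next command for the Simulator
    def emit_program(self) -> Callable[[str], str]: ...  # the program handed to the Controller

def training_phase(learner: Learner, simulator: Simulator, T: int) -> Callable[[str], str]:
    report = ""
    for _ in range(T):
        command = learner.query(report)
        report = simulator.step(command)
    return learner.emit_program()

def deployment_phase(program: Callable[[str], str], environment, horizon: int) -> None:
    # `environment` is assumed to expose reset() -> observation and step(action) -> observation.
    observation = environment.reset()
    for _ in range(horizon):
        action = program(observation)              # the Controller runs the learned policy
        observation = environment.step(action)
```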
Let be our hypothesis class about computable mathematics. Let be our prior about the physical universe[1]. These have to satisfy the coherence conditions
Here, means that .
Together, these ensure that is a coherent IBP hypothesis. Notice that for any satisfying the first condition[2], there is a unique minimal coherent s.t. . Moreover, given a coherent and any , there is a unique minimal coherent s.t. .
The duration of the Training phase will be denoted by [3]. We can think of it as “computational time”.
Let the source codes of the Learner (obtained by quining), the Simulator and the Controller respectively be denoted by
Here, the argument of corresponds to and is a probability distribution in which all probabilities are rational numbers[4].
We assume that the Simulator can indeed run any computation, and that any given halting computation would run fast for . These are assumptions on (or, on some combination of (i) , (ii) the definition of , and (iii) the support of all ) that we will not spell out here.
We will say that a policy is a mapping of type and a metapolicy is a mapping of type .
Given any , we can compose it with and in the obvious way[5] to yield
In particular, we can take for some metapolicy by postulating no dependence on the argument.
Denote by the set of all policies. Given metapolicy and , we define by
Given any , we say that is a -consistent counterpossible when the following conditions hold:
For all and ,
For all and ,
We denote by the set of -consistent counterpossibles.
A (deterministic) copolicy is a mapping of signature . We denote the set of copolicies by . Given a policy and a copolicy , we define in the obvious way. Given policies , we define their total variation distance[6] to be
Given , , and metapolicy , we will use the notation
Intuitively, should be thought of as the counterfactual expectation of the loss function assuming metapolicy , while adding a “bonus” to account for “fair” treatment of randomization by the agent. More on that below.
Given a metapolicy and , we define by
Intuitively, is the set of universe states for which at least one copy of the agent exists which followed the metapolicy until computational time .
Given a loss function [7] (which we allow to explicitly depend on computational time for greater generality), the learning condition on a metapolicy and hypothesis is
Here, is the “regret bound” function which should vanish in the limit.
Some remarks on the particulars of this definition:
There are several reasons we impose rather than :
First, we want to ignore the side effects of running computations on the Simulator (both causal side effects and “mindcrime”, i.e. the direct contribution of those computations to ). This is because taking side effects into account is usually inconsistent with the unlimited experimentation needed for learning.
Second, learning requires trusting the reports of the Simulator, which means we should only impose the policy on copies of that are actually connected to .
Third, we should also be able to trust the Controller, because otherwise we lose the semantic grounding of the agent’s external interface. (Even though this is not necessary for learning per se.)
On the other hand, we impose in the computational past because that’s valid additional information that doesn’t interfere with the learning or decision theory.
The learning criterion treats computational time myopically, so that we won’t have to worry about traps in computational time.
The reason we need randomization is that it’s often necessary for efficient learning. In the simplest non-trivial examples, we learn by IID sampling a distribution over computations (e.g. we simulate the interaction between a particular policy and our physical prior ). If we sampled deterministically instead, Murphy would be able to fool us by changing behavior precisely at the sampled instances.
The reason we need is that randomization only helps if low-probability events can be ignored. However, if sufficiently many copies of the agent are instantiated, even a low-probability event would be detectable. Hence, we use a “forgiving” metric that assigns low loss even to distributions that technically have high loss but are close to a different distribution with low loss.
We can consider Newcombian problems where Omega makes decisions based on the agent’s action probabilities. I suspect that if Omega’s policy is Lipschitz in the agent policy, the behavior advised by the counterfactual will converge to FDT-optimal in the limit of sufficiently many iterations.
Both in the case of ignoring side effects of computations and in the case of the treatment of randomization, we can be accused of departing from priorism (“updatelessness”). However, I think that this departure is justified. In the original TDT paper, Yudkowsky addressed the “Omega rewards irrationality” objection by postulating that a decision problem is “fair” when it only depends on the agent’s decisions rather than on how the agent makes those decisions. Here, we use the same principle: the agent should not be judged based on its internal thought process (side effects), and it should in some sense be judged based on its decisions rather than the probabilities assigned to those decisions.
Also about priorism, this kind of agent will not endorse iterated-in-computational-time “logical” counterfactual mugging when the same coin is reused indefinitely, but will endorse it when a new coin is used every time, for an appropriate definition of “new” (or e.g. if we switch to a new coin every rounds). Arguably, this solves the tension between priorism and learning observed by Demski. Formulating the precise criterion for when Learning-IBP behavior converges to priorist / FDT-optimal is left for further research.
The dependence of on should ultimately involve some kind of description complexity. However, it will also involve something in the vein of “what are the computational resource bounds, according to the belief , for running certain computations, selected for their importance in testing ”. In particular, we won’t require the agent to learn anything about non-halting computations. Indeed, any hypothesis about such computations will either assert a time bound on running the non-halting computations (in which case it is false) or will fail to assert any such bound, in which case its learning complexity is known to be infinite.
We could make do without the factor but that would make the learning criterion weaker. The presence of this factor means that, roughly speaking, regret should be low even conditional on the agent existing, which seems like a reasonable demand.
Given an AI designed along these principles, we might worry about the impact of the side effects that are ignored. Maybe these can produce non-Cartesian daemons. However, during the Training phase, the algorithm has no access to external observation, which arguably makes it unlikely anything inside it can learn how to cyberattack. Moreover, during the Deployment phase, any reason for concern would be mediated by the particular algorithm the Controller runs (rather than the particulars of how it’s implemented), which is what we do take into account in our optimization. Finally, the agent might be able to self-modify to make itself safer: we can even intentionally give it the means to do so (as part of its action space ). This probably requires careful prior-shaping to work well.
This framework assumes all our hypotheses are disintegrable w.r.t. the factorization into and . It is an interesting question to understand whether we should or can relax this assumption.
For example, we can imagine to be a Solomonoff-like prior along the following lines. Every hypothesis comprising is defined by a Turing machine with access to two oracles representing and two tapes of random and “ambiguous” bits respectively. is defined by running with one oracle fielding queries about (i.e. given a program , we can request to know its counterpossible output ) and the other oracle fielding queries about some s.t. we want to decide whether for . is only allowed to return NO if there was at least one query to which the two oracles gave different answers.
We use the “duration” interpretation for simplicity, but more generally can be some parameter controlling the computing resources available in the Training phase, and we can also allow the computing resources of the Controller to scale with .
The reason we restrict to rational numbers is because we need a notion of computing the distribution. It is in principle possible to generalize further to computable numbers. On the other hand, it might be more realistic to constrain even further to e.g. dyadic rationals (which can be implemented via fair coinflips). We stick to for simplicity.
We let the Learner interact with the Simulator for timesteps, producing some output , and then run the Controller with as an input.
This is not technically a distance, since the distance between two different policies can vanish so long as they only disagree on histories that are inconsistent with these policies. Such policies are morally equivalent.
We could also allow to have a argument, but then we would have to remove the factor from the learning condition, because the choice of policy would matter intrinsically even if the agent doesn’t exist. Alternatively, we could modify the definition of to avoid that. Or perhaps use some normalization factor more complicated than .
Btw, what are some ways we can incorporate heuristics into our algorithm while staying on level 1-2?
We don’t know how to prove the required desiderata about the heuristic, but we can still reasonably conjecture them and support the conjectures with empirical tests.
We can’t prove or even conjecture anything useful-in-itself about the heuristic, but the way the heuristic is incorporated into the overall algorithm makes it safe. For example, maybe the heuristic produces suggestions together with formal certificates of their validity (see the sketch after this list). More generally, we can imagine an oracle-machine (where the heuristic is slotted into the oracle) about which we cannot necessarily prove something like a regret bound w.r.t. the optimal policy, but we can prove (or at least conjecture) a regret bound w.r.t. some fixed simple reference policy. That is, the safety guarantee shows that no matter what the oracle does, the overall system is not worse than “doing nothing”. Maybe, modulo weak provable assumptions about the oracle, e.g. that it satisfies a particular computational complexity bound.
[Epistemic status: very fresh idea, quite speculative but intriguing.] We can’t find even a guarantee like the above for a worst-case computationally bounded oracle. However, we can prove (or at least conjecture) some kind of an “average-case” guarantee. For example, maybe we have a high probability of safety for a random oracle. However, assuming a uniformly random oracle is quite weak. More optimistically, maybe we can prove safety even for any oracle that is pseudorandom against some complexity class C1 (where we want C1 to be as small as possible). Even better, maybe we can prove safety for any oracle in some complexity class C2 (where we want C2 to be as large as possible) that has access to another oracle which is pseudorandom against C1. If our heuristic is not actually in this category (in particular, C2 is smaller than P and our heuristic doesn’t lie in C2), this doesn’t formally guarantee anything, but it does provide some evidence for the “robustness” of our high-level scheme.
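Here is a minimal sketch (in Python; all names are illustrative) of the “certificates + safe fallback” pattern from the second item above: the heuristic is treated as an untrusted oracle whose suggestion is used only if an independent checker certifies it; otherwise the system falls back to a simple reference policy, so it is never worse than the reference policy plus certified improvements.

```python
# A minimal sketch of the "certificates + safe fallback" pattern: use the heuristic's
# suggestion only if an independent checker certifies it, otherwise fall back to a
# simple reference policy.

from typing import Callable, Optional, Tuple

def certified_policy(
    heuristic: Callable[[str], Tuple[str, Optional[object]]],  # state -> (suggestion, certificate)
    check_certificate: Callable[[str, str, object], bool],     # the part we can prove things about
    reference_policy: Callable[[str], str],                    # simple policy with known guarantees
) -> Callable[[str], str]:
    def act(state: str) -> str:
        suggestion, certificate = heuristic(state)
        if certificate is not None and check_certificate(state, suggestion, certificate):
            return suggestion            # certified suggestion: safe to use
        return reference_policy(state)   # otherwise, "not worse than doing nothing"
    return act
```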