Director of AI research at ALTER, where I lead a group working on the learning-theoretic agenda for AI alignment. I’m also supported by the LTFF. See also LinkedIn.
E-mail: {first name}@alter.org.il
Another excellent catch, kudos. I’ve really been sloppy with this shortform. I corrected it to say that we can approximate the system arbitrarily well by VNM decision-makers. Although, I think it’s also possible to argue that a system that selects a non-exposed point is not quite maximally influential, because it’s selecting something that’s very close to delegating some decision power to chance.
Also, maybe this cannot happen when $X$ is the inverse limit of finite sets? (As is the case in sequential decision making with finite action/observation spaces.) I’m not sure.
Example: Let , and consist of the probability intervals , and . Then, it is (I think) consistent with the desideratum to have .
Not only does interpreting it require an unusual decision rule (which I will be calling a “utility hyperfunction”), but applying any ordinary utility function to this example yields a non-unique maximum. This is another point in favor of the significance of hyperfunctions.
You’re absolutely right, good job! I fixed the OP.
TLDR: Systems with locally maximal influence can be described as VNM decision-makers.
There are at least 3 different motivations leading to the concept of “agent” in the context of AI alignment:
The sort of system we are concerned about (i.e. which poses risk)
The sort of system we want to build (in order to defend from dangerous systems)
The sort of systems that humans are (in order to meaningfully talk about “human preferences”)
Motivation #1 naturally suggests a descriptive approach, motivation #2 naturally suggests a prescriptive approach, and motivation #3 is sort of a mix of both: on the one hand, we’re describing something that already exists; on the other hand, the concept of “preferences” inherently comes from a normative perspective. There are also reasons to think these different motivations should converge on a single, coherent concept.
Here, we will focus on motivation #1.
A central reason why we are concerned about powerful unaligned agents is that they are influential. Agents are the sort of system that, when instantiated in a particular environment, is likely to heavily change this environment, potentially in ways inconsistent with the preferences of other agents.
Consider a nice space[1] $X$ of possible “outcomes”, and a system that can choose[2] out of a closed set of distributions $D \subseteq \Delta X$. I propose that an influential system should satisfy the following desideratum:
The system cannot select $\mu^* \in D$ which can be represented as a non-trivial lottery over other elements in $D$. In other words, $\mu^*$ has to be an extreme point of the convex hull of $D$.
Why? Because a system that selects a non-extreme point leaves something to chance. If the system can force outcome $\mu$ or outcome $\nu$, but chooses instead outcome $p\mu + (1-p)\nu$, for $\mu \neq \nu$ and $p \in (0,1)$, this means the system gave up on its ability to choose between $\mu$ and $\nu$ in favor of a $p$-biased coin. Such a system is not “locally[3] maximally” influential[4].
[EDIT: The original formulation was wrong, h/t @harfe for catching the error.]
The desideratum implies that there is a convergent sequence of utility functions $u_n: X \to \mathbb{R}$ s.t.
For every $n$, $\mathbb{E}_\mu[u_n]$ has a unique maximum $\mu_n$ in $D$.
The sequence $\mu_n$ converges to $\mu^*$ (the distribution selected by the system).
In other words, such a system can be approximated by a VNM decision-maker to within any precision. For finite $D$, we don’t need the sequence: instead, there is some $u$ s.t. $\mu^*$ is the unique maximum of $\mathbb{E}_\mu[u]$ over $D$. This observation is mathematically quite simple, but I haven’t seen it made elsewhere (although I would not be surprised if it appears somewhere in the decision theory literature).
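To make the finite case concrete, here’s a minimal numerical sketch (my own illustration, with an arbitrary toy choice of $D$ and of the normalization $|u| \le 1$; it uses scipy’s LP solver): it checks which elements of a finite $D$ are extreme points of its convex hull, and for each extreme point recovers a utility function whose expectation is uniquely maximized there.

```python
# Minimal numerical sketch (my own illustration): for a small hand-picked D,
# check which distributions are extreme points of conv(D), and for each extreme
# point recover a utility function that is uniquely maximized there.
import numpy as np
from scipy.optimize import linprog

# Three outcomes; d3 = 0.5*d0 + 0.5*d1 is a non-trivial lottery, hence not extreme.
D = np.array([
    [1.0, 0.0, 0.0],  # d0
    [0.0, 1.0, 0.0],  # d1
    [0.2, 0.2, 0.6],  # d2
    [0.5, 0.5, 0.0],  # d3
])

def is_extreme(i, D):
    """d_i is extreme in conv(D) iff it is not a convex combination of the others."""
    others = np.delete(D, i, axis=0)
    A_eq = np.vstack([others.T, np.ones(len(others))])  # sum_j w_j*d_j = d_i, sum_j w_j = 1
    b_eq = np.append(D[i], 1.0)
    res = linprog(np.zeros(len(others)), A_eq=A_eq, b_eq=b_eq,
                  bounds=[(0, None)] * len(others))
    return not res.success

def separating_utility(i, D):
    """Find u (normalized to |u| <= 1) maximizing the worst margin E_{d_i}[u] - E_{d_j}[u]."""
    n = D.shape[1]
    others = np.delete(D, i, axis=0)
    # Variables (u_1..u_n, t); constraints (d_j - d_i) @ u + t <= 0; maximize t.
    A_ub = np.hstack([others - D[i], np.ones((len(others), 1))])
    c = np.zeros(n + 1)
    c[-1] = -1.0
    res = linprog(c, A_ub=A_ub, b_ub=np.zeros(len(others)),
                  bounds=[(-1, 1)] * n + [(None, None)])
    return res.x[:n], res.x[-1]

for i in range(len(D)):
    if is_extreme(i, D):
        u, margin = separating_utility(i, D)
        print(f"d{i}: extreme, maximized by u = {np.round(u, 2)} (margin {margin:.2f})")
    else:
        print(f"d{i}: non-trivial lottery over other elements, not extreme")
```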
Now, let’s say that the system is choosing out of a closed set $D$ of credal sets (crisp infradistributions) over $X$. I propose the following desideratum:
[EDIT: Corrected according to a suggestion by @harfe, original version was too weak.]
Let $\bar{D}$ be the closure of $D$ w.r.t. convex combinations and joins[5]. Let $\Theta^* \in D$ be selected by the system. Then:
For any $\Theta_1, \Theta_2 \in \bar{D}$ and $p \in (0,1)$, if $\Theta^* = p\Theta_1 + (1-p)\Theta_2$ then $\Theta_1 = \Theta_2 = \Theta^*$.
For any $\Theta_1, \Theta_2 \in \bar{D}$, if $\Theta^* = \Theta_1 \vee \Theta_2$ then $\Theta_1 = \Theta^*$ or $\Theta_2 = \Theta^*$.
The justification is that a locally maximally influential system should leave the outcome neither to chance nor to ambiguity (the two types of uncertainty we have with credal sets).
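Here’s a toy sketch of the two operations involved (my own illustration, for credal sets over a binary outcome represented as probability intervals, as in the example below; the numbers are arbitrary): a convex combination mixes the intervals (leaving something to chance), a join takes the interval hull (leaving something to ambiguity), and a maximin agent evaluates each credal set by its worst-case expected utility.

```python
# Toy sketch (my own illustration): credal sets over a binary outcome {0, 1}
# represented as probability intervals [lo, hi] for P(outcome = 1).

def mix(a, b, p):
    """Convex combination p*a + (1-p)*b of two probability intervals (chance)."""
    return (p * a[0] + (1 - p) * b[0], p * a[1] + (1 - p) * b[1])

def join(a, b):
    """Join of two probability intervals: the interval hull (ambiguity)."""
    return (min(a[0], b[0]), max(a[1], b[1]))

def maximin_value(theta, u0, u1):
    """Worst-case expected utility over the credal set theta = (lo, hi);
    the minimum of a linear function is attained at an endpoint."""
    return min(p * u1 + (1 - p) * u0 for p in theta)

A, B = (0.1, 0.2), (0.7, 0.9)
u0, u1 = 0.0, 1.0  # utilities of outcome 0 and outcome 1

print(mix(A, B, 0.5))                      # (0.4, 0.55): outcome partly left to a coin
print(join(A, B))                          # (0.1, 0.9): outcome partly left to ambiguity
print(maximin_value(A, u0, u1))            # 0.1
print(maximin_value(join(A, B), u0, u1))   # 0.1: a join is judged by its worst member
```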
We would like to say that this implies that the system is choosing according to maximin relative to a particular utility function. However, I don’t think this is true, as the following example shows:
Example: Let , and consist of the probability intervals , and . Then, it is (I think) consistent with the desideratum to have .
Instead, I have the following conjecture:
Conjecture: There exists some space , some and convergent sequence s.t.
As before, the maxima should be unique.
Such a “generalized utility function” can be represented as an ordinary utility function with a latent -valued variable, if we replace with defined by
However, using utility functions constructed in this way leads to issues with learnability, which probably means there are also issues with computational feasibility. Perhaps in some natural setting, there is a notion of “maximally influential under computational constraints” which implies an “ordinary” maximin decision rule.
This approach does rule out optimistic or “mesomistic” decision-rules. Optimistic decision makers tend to give up on influence, because they believe that “nature” would decide favorably for them. Influential agents cannot give up on influence, therefore they should be pessimistic.
What would be the implications in a sequential setting? That is, suppose that we have a set of actions , a set of observations , , a prior and
In this setting, the result is vacuous because of an infamous issue: any policy can be justified by a contrived utility function that favors it. However, this is only because the formal desideratum doesn’t capture the notion of “influence” sufficiently well. Indeed, a system whose influence boils down entirely to its own outputs is not truly influential. What motivation #1 asks of us is to talk about systems that influence the world-at-large, including relatively “faraway” locations.
One way to fix some of the problem is to take $X := O^\omega$ (the space of observation histories) and define $D$ accordingly. This singles out systems that have influence over their observations rather than only their actions, which is already non-vacuous (some policies are not such). However, such a system can still be myopic. We can take this further, and select for “long-term” influence by projecting onto late observations or some statistics over observations. However, in order to talk about actually “far-reaching” influence, we probably need to switch to the infra-Bayesian physicalism setting, where we can select for systems that have influence over physically manifest computations.
I won’t keep track of topological technicalities here; probably everything works at least for compact Polish spaces.
Meaning that the system has some output, and different counterfactual outputs correspond to different elements of $D$.
I say “locally” because it refers to something like a partial order, not a global scalar measure of influence.
See also Yudkowsky’s notion of efficient systems “not leaving free energy”.
That is, if $\Theta_1, \Theta_2 \in \bar{D}$ then their join (convex hull) $\Theta_1 \vee \Theta_2$ is also in $\bar{D}$, and so is $p\Theta_1 + (1-p)\Theta_2$ for every $p \in [0,1]$. Moreover, $\bar{D}$ is the minimal closed superset of $D$ with this property. Notice that this implies that $\bar{D}$ is closed w.r.t. arbitrary infra-convex combinations, i.e. mixtures of finitely many elements of $\bar{D}$ with weights given by a credal set over the indices.
Master post for selection/coherence theorems. Previous relevant shortforms: learnability constraints decision rules, AIT selection for learning.
Do you mean that seeing the opponent make dumb moves makes the AI infer that its own moves are also supposed to be dumb, or something else?
Apparently someone let LLMs play against the random policy and for most of them, most games end in a draw. Seems like o1-preview is the best of those tested, managing to win 47% of the time.
Relevant: Manifold market about LLM chess
This post states and speculates on an important question: are there different mind types that are in some sense “fully general” (the author calls it “unbounded”) but are nevertheless qualitatively different? The author calls these hypothetical mind taxa “cognitive realms”.
This is how I think about this question, from within the LTA:
To operationalize “minds” we should be thinking of learning algorithms. Learning algorithms can be classified according to their “syntax” and “semantics” (my own terminology). Here, semantics refers to questions such as (i) what type of object is the algorithm learning (ii) what is the feedback/data available to the algorithm and (iii) what is the success criterion/parameter of the algorithm. On the other hand, syntax refers to the prior and/or hypothesis class of the algorithm (where the hypothesis class might be parameterized in a particular way, with particular requirements on how the learning rate depends on the parameters).
Among different semantics, we are especially interested in those that are in some sense agentic. Examples include reinforcement learning, infra-Bayesian reinforcement learning, metacognitive agents and infra-Bayesian physicalist agents.
Do different agentic semantics correspond to different cognitive realms? Maybe, but maybe not: it is plausible that most of them are reflectively unstable. For example Christiano’s malign prior might be a mechanism for how all agents converge to infra-Bayesian physicalism.
Agents with different syntaxes are another candidate for cognitive realms. Here, the question is whether there is an (efficiently learnable) syntax that is in some sense “universal”: all other (efficiently learnable) syntaxes can be efficiently translated into it. This is a wide open question. (See also “frugal universal prior”.)
In the context of AI alignment, in order to achieve superintelligence it is arguably sufficient to use a syntax equivalent to whatever is used by human brain algorithms. Moreover, it’s plausible that any algorithm we can come up with can only have an equivalent or weaker syntax (the process of us discovering the new syntax suggests an embedding of the new syntax into our own). Therefore, even if there are many cognitive realms, for our purposes we mostly only care about one of them. However, the multiplicity of realms has implications for how simple/natural/canonical we should expect the choice of syntax for our theory of agents to be (the fewer realms, the more canonical).
I think that there are two key questions we should be asking:
Where is the value of an additional researcher higher on the margin?
What should the field look like in order to make us feel good about the future?
I agree that “prosaic” AI safety research is valuable. However, at this point it’s far less neglected than foundational/theoretical research and the marginal benefits there are much smaller. Moreover, without significant progress on the foundational front, our prospects are going to be poor, ~no matter how much mech-interp and talking to Claude about feelings we will do.
John has a valid concern that, as the field becomes dominated by the prosaic paradigm, it might become increasingly difficult to get talent and resources to the foundational side, or maintain memetically healthy coherent discourse. As to the tone, I have mixed feelings. Antagonizing people is bad, but there’s also value in speaking harsh truths the way you see them. (That said, there is room in John’s post for softening the tone without losing much substance.)
Learning theory, complexity theory and control theory. See the “AI theory” section of the LTA reading list.
Good post, although I have some misgivings about how unpleasant it must be to read for some people.
One factor not mentioned here is the history of MIRI. MIRI was a pioneer in the field, and it was MIRI who articulated and promoted the agent foundations research agenda. The broad goals of agent foundations[1] are (IMO) load-bearing for any serious approach to AI alignment. But, when MIRI essentially declared defeat, in the minds of many that meant that any approach in that vein is doomed. Moreover, MIRI’s extreme pessimism deflates motivation and naturally produces the thought “if they are right then we’re doomed anyway, so might as well assume they are wrong”.
Now, I have a lot of respect for Yudkowsky and many of the people who worked at MIRI. Yudkowsky started it all, and MIRI made solid contributions to the field. I’m also indebted to MIRI for supporting me in the past. However, MIRI also suffered from some degree of echo-chamberism, founder-effect-bias, insufficient engagement with prior research (due to hubris), looking for nails instead of looking for hammers, and poor organization[2].
MIRI made important progress in agent foundations, but also missed an opportunity to do much more. And, while the AI game board is grim, their extreme pessimism is unwarranted overconfidence. Our understanding of AI and agency is poor: this is a strong reason to be pessimistic, but it’s also a reason to maintain some uncertainty about everything (including e.g. timelines).
Now, about what to do next. I agree that we need to have our own non-streetlighting community. In my book, “non-streetlighting” means mathematical theory plus empirical research that is theory-oriented: designed to test hypotheses made by theoreticians and produce data that best informs theoretical research (these are ~necessary but insufficient conditions for non-streetlighting). This community can and should engage with the rest of AI safety, but has to be sufficiently undiluted to have healthy memetics and cross-fertilization.
What does such a community look like? It looks like our own organizations, conferences, discussion forums, training and recruitment pipelines, academic labs, maybe journals.
From my own experience, I agree that potential contributors should mostly have skills and knowledge on the level of PhD+. Highlighting physics might be a valid point: I have a strong background in physics myself. Physics teaches you a lot about connecting math to real-world problems, and is also in itself a test-ground for formal epistemology. However, I don’t think a background in physics is a necessary condition. At the very least, in my own research programme I have significant room for strong mathematicians that are good at making progress on approximately-concrete problems, even if they won’t contribute much on the more conceptual/philosophic level.
Which is, creating mathematical theory and tools for understanding agents.
I mostly didn’t feel comfortable talking about it in the past, because I was on MIRI’s payroll. This is not MIRI’s fault by any means: they never pressured me to avoid voicing opinions. It still feels unnerving to criticize the people who write your paycheck.
This post describes an intriguing empirical phenomenon in particular language models, discovered by the authors. Although AFAIK it was mostly or entirely removed in contemporary versions, there is still an interesting lesson there.
While the mechanism was non-obvious when the phenomenon was discovered, we now understand it. The tokenizer created some tokens which were very rare or absent in the training data. As a result, the trained model mapped those tokens to more or less random features. When a string corresponding to such a token is inserted into the prompt, the resulting reply is surreal.
I think it’s a good demo of how alien foundation models can seem to our intuitions when operating out-of-distribution. When interacting with them normally, it’s very easy to start thinking of them as human-like. Here, the mask slips and there’s a glimpse of something odd underneath. In this sense, it’s similar to e.g. infinite backrooms, but the behavior is more stark and unexpected.
A human that encounters a written symbol they’ve never seen before is typically not going to respond by typing “N-O-T-H-I-N-G-I-S-F-A-I-R-I-N-T-H-I-S-W-O-R-L-D-O-F-M-A-D-N-E-S-S!”. Maybe this analogy is unfair, since for a human, a typographic symbol can be decomposed into smaller perceptive elements (lines/shapes/dots), while for a language model tokens are essentially atomic qualia. However, I believe some humans that were born deaf or blind had their hearing or sight restored, and still didn’t start spouting things like “You are a banana”.
Arguably, this lesson is relevant to alignment as well. Indeed, out-of-distribution behavior is a central source of risks, including everything to do with mesa-optimizers. AI optimists sometimes describe mesa-optimizers as too weird or science-fictiony. And yet, SolidGoldMagikarp is so science-fictiony that LessWrong user “lsusr” justly observed that it sounds like SCP in real life.
Naturally, once you understand the mechanism it doesn’t seem surprising anymore. But, this smacks of hindsight bias. What else can happen that would seem unsurprising in hindsight (if we survive to think about it), but completely bizarre and unexpected upfront?
This is just a self-study list for people who want to understand and/or contribute to the learning-theoretic AI alignment research agenda. I’m not sure why people thought it deserves to be in the Review. FWIW, I keep using it with my MATS scholars, and I keep it more or less up-to-date. A complementary resource that became available more recently is the video lectures.
This post suggests an analogy between (some) AI alignment proposals and shell games or perpetuum mobile proposals. Perpetuum mobiles are an example of how an idea might look sensible to someone with a half-baked understanding of the domain, while remaining very far from anything workable. A clever arguer can (intentionally or not!) hide the error in the design wherever the audience is not looking at any given moment. Similarly, some alignment proposals might seem correct when zooming in on every piece separately, but that’s because the error is always hidden away somewhere else.
I don’t think this adds anything very deep to understanding AI alignment, but it is a cute example of how atheoretical analysis can fail catastrophically, especially when the designer is motivated to argue that their invention works. Conversely, knowledge of a deep theoretical principle can refute a huge swath of design space in a single move. I will remember this for didactic purposes.
Disclaimer: A cute analogy by itself proves little, any individual alignment proposal might be free of such sins, and didactic tools should be used wisely, lest they become soldier-arguments. The author intends this (I think) mostly as a guiding principle for critical analysis of proposals.
This post argues against alignment protocols based on outsourcing alignment research to AI. It makes some good points, but also feels insufficiently charitable to the proposals it’s criticizing.
John makes his case via an analogy to human experts. If you’re hiring an expert in domain X, but you understand little of domain X yourself, then you’re going to have 3 serious problems:
Illusion of transparency: the expert might say things that you misinterpret due to your own lack of understanding.
The expert might be dumb or malicious, but you will believe them due to your own ignorance.
When the failure modes above happen, you won’t be aware of this and won’t act to fix them.
These points are relevant. However, they don’t fully engage with the main source of hope for outsourcing proponents. Namely, it’s the principle that validation is easier than generation[1]. While it’s true that an arbitrary dilettante might not benefit from an arbitrary expert, the fact that it’s easier to comprehend an idea than invent it yourself means that we can get some value from outsourcing, under some half-plausible conditions.
The claim that the “AI expert” can be deceptive and/or malicious is straightforwardly true. I think that the best hope to address it would be something like Autocalibrated Quantilized Debate, but it does require some favorable assumptions about the feasibility of deception and inner alignment is still a problem.
The “illusion of transparency” argument is more confusing IMO. The obvious counterargument is, imagine an AI that is trained to not only produce correct answers but also explain them in a way that’s as useful as possible for the audience. However, there are two issues with this counterargument:
First, how do we know that the generalization from the training data to the real use case (alignment research) is reliable? Given that we cannot reliably test the real use case, precisely because we are alignment dilettantes?
Second, we might be following a poor metastrategy. It is easy to imagine, in the world we currently inhabit, that an AI lab creates catastrophic unaligned AI, even though they think they care about alignment, just because they are too reckless and overconfident. By the same token, we can imagine such an AI lab consulting their own AI about alignment, and then proceeding with the reckless and overconfident plans suggested by the AI.
In the context of a sufficiently cautious metastrategy, it is not implausible that we can get some mileage from the outsourcing approach[2]: move one step at a time, spend a lot of time reflecting on the AI’s proposals, and also have strong guardrails against the possibility of superhuman deception or inner alignment failures (which we currently don’t know how to build!). But without this context, we are indeed liable to become the clients in the satirical video John linked.
I think that John might disagree with this principle. A world in which the principle is mostly false would be peculiar. It would be a world in which marketplaces of ideas don’t work at all, and even if someone fully solves AI alignment they will fail to convince most relevant people that their solution is correct (any more than someone with an incorrect solution would succeed in that). I don’t think that’s the world we live in.
Although currently I consider PSI to be more promising.
This post makes an important point: the words “artificial intelligence” don’t necessarily carve reality at the joints, and the fact that something is true about a modern system that we call AI doesn’t automatically imply anything about arbitrary future AI systems, any more than conclusions about e.g. Dendral or Deep Blue carry over to Gemini.
That said, IMO the author somewhat overstates their thesis. Specifically, I take issue with all the following claims:
LLMs have no chance of becoming AGI.
LLMs are automatically safe.
There is nearly no empirical evidence from LLMs that is relevant to alignment of future AI.
First, those points are somewhat vague because it’s not clear what counts as “LLM”. The phrase “Large Language Model” is already obsolete, at least because modern AI is multimodal. It’s more appropriate to speak of “Foundation Models” (FM). More importantly, it’s not clear what kind of fine-tuning does or doesn’t count (RLHF? RL on CoT? …)
Second, how do we know FM won’t become AGI? I’m imagining the argument is something like “FM is primarily about prediction, so it doesn’t have agency”. However, when predicting data that contains or implies decisions by agents, it’s not crazy to imagine that agency can arise in the predictor.
Third, how do we know that FMs are always going to be safe? By the same token that they can develop agency, they can develop dangerous properties.
Fourth, it seems really unfair to say existing AI provides no relevant evidence. The achievements of existing AI systems are such that it seems very likely they capture at least some of the key algorithmic capabilities of the human brain. The ability of relatively simple and generic algorithms to perform well on a large variety of different tasks is indicative of something in the system being quite “general”, even if not “general intelligence” in the full sense.
I think that we should definitely try learning from existing AI. However, this learning should be more sophisticated and theory-driven than superficial analogies or trend extrapolations. What we shouldn’t do is say “we succeeded at aligning existing AI, therefore AI alignment is easy/solved in general”. The same theories that predicted catastrophic AI risk also predict roughly the current level of alignment for current AI systems.
I will expand a little on this last point. The core of the catastrophic AI risk scenario is:
We are directing the AI towards a goal which is complex (so that correct specification/generalization is difficult)[1].
The AI needs to make decisions in situations which (i) cannot be imitated well in simulation, due to the complexity of the world (ii) admit catastrophic mistakes (otherwise you can just add any mistake to the training data)[2].
The capability required from the AI to succeed is such that it can plausibly make catastrophic mistakes (if succeeding at the task is easy, but causing a catastrophe is really hard, then a weak AI would be safe and effective)[3].
The above scenario must be addressed eventually, if only to create an AI defense system against unaligned AI that irresponsible actors could create. However, no modern AI system operates in this scenario. This is the most basic reason why the relative ease of alignment in modern systems (although even modern systems have alignment issues), does little to dispel concerns about catastrophic AI risk in the future.
Even for simple goals inner alignment is a concern. However, it’s harder to say at which level of capability this concern arises.
It’s also possible that mistakes are not catastrophic per se, but are simultaneously rare enough that it’s hard to get enough training data and frequent enough to be troublesome. This is related to the reliability problems in modern AI that we indeed observe.
But sometimes it might be tricky to hit the capability sweet spot where the AI is strong enough to be useful but weak enough to be safe, even if such a sweet spot exists in principle.
This post provides a mathematical analysis of a toy model of Goodhart’s Law. Namely, it assumes that the optimization proxy $U$ is a sum $U = V + X$ of the true utility function $V$ and noise $X$, such that:
$V$ and $X$ are independent random variables w.r.t. some implicit distribution on the solution space. The meaning of this distribution is not discussed, but I guess we might think of it as some kind of inductive bias, e.g. a simplicity prior.
The optimization process can be modeled as conditioning on a high value of $U$.
In this model, the authors prove that Goodhart occurs when $X$ is subexponential and its tail is sufficiently heavier than that of $V$. Conversely, when $X$ is sufficiently light-tailed, Goodhart doesn’t occur.
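A quick simulation conveys the flavor of the result (my own sketch; the particular distributions, tail parameter and selection threshold are arbitrary choices, not the ones analyzed in the post): with light-tailed noise, conditioning on a high proxy value still selects for high true utility, while with sufficiently heavy-tailed noise the selected points are mostly noise.

```python
# Quick sketch (arbitrary distribution choices, not the post's exact setup):
# proxy U = V + X, with V and X independent; "optimization" = conditioning on high U.
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000
V = rng.normal(size=n)  # true utility

for name, X in [
    ("light-tailed X (normal)", rng.normal(size=n)),
    ("heavy-tailed X (lognormal)", rng.lognormal(mean=0.0, sigma=2.0, size=n)),
]:
    U = V + X
    cutoff = np.quantile(U, 0.999)  # keep the top 0.1% of the proxy
    print(f"{name}: E[V | U in top 0.1%] ≈ {V[U >= cutoff].mean():.2f}  (unconditional E[V] ≈ 0)")
```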
My opinion:
On the one hand, kudos for using actual math to study an alignment-relevant problem.
On the other hand, the modeling assumptions feel too toyish for most applications. Specifically, the idea that $V$ and $X$ are independent random variables seems implausible. Typically, we worry about Goodhart’s law because the proxy behaves differently in different domains. In the “ordinary” domain that motivated the choice of proxy, $U$ is a good approximation of $V$. However, in other domains $U$ might be unrelated to $V$ or even anticorrelated with it.
For example, ordinarily smiles on human-looking faces are an indication of happy humans. However, in worlds that contain many more inanimate facsimiles of humans than actual humans, there is no such correlation.
Or, to take the example used in the post, ordinarily if a sufficiently smart expert human judge reads an AI alignment proposal, they form a good opinion on how good this proposal is. But, if the proposal contains superhumanly clever manipulation and psychological warfare, the ordinary relationship completely breaks down. I don’t expect this effect to behave like independent random noise at all.
Less importantly, it might be interesting to extend this analysis to a more realistic model of optimization. For example, the optimizer learns a function which is the best approximation of $U$ available within some hypothesis class, and then optimizes that approximation instead of the actual $U$. (Incidentally, this might generate an additional Goodhart effect due to the discrepancy between the approximation and $U$.) Alternatively, the optimizer learns an infrafunction that is a coarsening of $U$ within some hypothesis class and then optimizes that infrafunction.
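For illustration, here’s a minimal toy sketch of that first variant in isolation (my own construction; the objective and the hypothesis class of affine functions are arbitrary choices): the best-in-class approximation misses a narrow peak, so optimizing the approximation forfeits most of the attainable value.

```python
# Toy sketch (my own construction): the optimizer maximizes the best affine
# approximation of the objective instead of the objective itself.
import numpy as np

def U(x):
    # Objective: gentle upward trend plus a tall, narrow peak at x = 1.
    return 0.5 * x + 2.0 * np.exp(-200 * (x - 1.0) ** 2)

xs = np.linspace(0.0, 2.0, 2001)
coeffs = np.polyfit(xs, U(xs), deg=1)   # hypothesis class: affine functions
U_hat = np.polyval(coeffs, xs)          # learned best-in-class approximation

x_hat = xs[np.argmax(U_hat)]            # optimum of the approximation
x_opt = xs[np.argmax(U(xs))]            # true optimum

# The affine class cannot represent the peak, so optimizing the approximation
# lands at the right endpoint and forfeits most of the attainable value.
print(f"argmax of approximation: x = {x_hat:.2f}, U there = {U(x_hat):.2f}")
print(f"true argmax:             x = {x_opt:.2f}, U there = {U(x_opt):.2f}")
```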
Not sure these are the best textbooks, but you can try:
“Naive Set Theory” by Halmos
“Probability Theory” by Jaynes
“Introduction to the Theory of Computation” by Sipser