Epistemic Status
Unsure[1], partially noticing my own confusion. Hoping Cunningham’s Law can help resolve it.
Confusions About Arguments From Expected Utility Maximisation
Some MIRI people (e.g. Rob Bensinger) still highlight EU maximisers as the paradigm case for existentially dangerous AI systems. I’m confused by this for a few reasons:
Not all consequentialist/goal-directed systems are expected utility maximisers
E.g. humans
Some recent developments make me sceptical that VNM expected utility maximisation is a natural form of agency for generally intelligent systems
Wentworth’s subagents provide a model for inexploitable agents that don’t maximise a simple unitary utility function
The main requirement for subagents to be a better model than unitary agents is path-dependent preferences or hidden state variables
Alternatively, subagents natively admit partial orders over preferences (see the toy sketch after this list)
If I’m not mistaken, utility functions seem to require a (static) total order over preferences
This might be a very unreasonable ask; it does not seem to describe humans, animals, or even existing sophisticated AI systems
I think the strongest implication of Wentworth’s subagents is that expected utility maximisation is not the limit or idealised form of agency
Shard Theory suggests that agents trained via reinforcement learning[2] form value “shards”
Values are inherently “contextual influences on decision making”
Hence agents do not have a static total order over preferences (which is what a utility function implies), as which preferences are active depends on the context
Preferences are dynamic (change over time), and the ordering of them is not necessarily total
This explains many of the observed inconsistencies in human decision making
A multitude of value shards do not admit analysis as a simple unitary utility function
Reward is not the optimisation target
Reinforcement learning does not select for reward maximising agents in general
Reward “upweight[s] certain kinds of actions in certain kinds of situations, and therefore reward chisels cognitive grooves into agents”
I’m thus very sceptical that systems optimised via reinforcement learning to be capable in a wide variety of domains/tasks converge towards maximising a simple expected utility function
I am not aware that humanity actually knows training paradigms that select for expected utility maximisers
Our most capable/economically transformative AI systems are not agents and are definitely not expected utility maximisers
Such systems might converge towards general intelligence under sufficiently strong selection pressure but do not become expected utility maximisers in the limit
They do not become agents in the limit, and expected utility maximisation is a particular kind of agency
I am seriously entertaining the hypothesis that expected utility maximisation is anti-natural to selection for general intelligence
I’m not under the impression that systems optimised by stochastic gradient descent to be generally capable optimisers converge towards expected utility maximisers
The generally capable optimisers produced by evolution aren’t expected utility maximisers
I’m starting to suspect that “search like” optimisation processes for general intelligence do not in general converge towards expected utility maximisers
I.e. it may end up being the case that the only way to create a generally capable expected utility maximiser is to explicitly design one
And we do not know how to design capable optimisers for rich environments
We can’t even design an image classifier by hand
I currently disbelieve the strong orthogonality thesis translated to practice
While it may be in theory feasible to design systems at any intelligence level with any final goal
In practice, we cannot design capable optimisers.
For intelligent systems created by “search like” optimisation, final goals are not orthogonal to cognitive ability
Sufficiently hard optimisation for most cognitive tasks would not converge towards selecting for generally capable systems
In the limit, what do systems selected for playing Go converge towards?
I posit that said limit is not “general intelligence”
The cognitive tasks/domain on which a system was optimised for performance may instantiate an upper bound on the system’s general capabilities
You do not need much optimisation power to attain optimal performance in logical tic tac toe
Systems selected for performance at logical tic tac toe should be pretty weak narrow optimisers because that’s all that’s required for optimality in that domain
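To make the subagent point above concrete, here is a minimal toy sketch (my own illustration, not Wentworth’s formalism; all names are hypothetical): a committee of two subagents, each with its own utility function, that accepts a trade only if no subagent is made worse off. Its revealed preferences form only a partial order (many pairs of states are incomparable), yet it cannot be money-pumped, because every accepted trade weakly improves every subagent’s utility.

```python
# Toy "committee of subagents": accepts a trade only if every subagent weakly
# prefers the proposed state. Hypothetical illustration, not Wentworth's model.
from typing import Callable, Dict, List

State = Dict[str, float]  # e.g. {"apples": 2, "oranges": 2}

def make_committee(utilities: List[Callable[[State], float]]) -> Callable[[State, State], bool]:
    """Return a trade-acceptance rule: no subagent may be made worse off."""
    def accepts_trade(current: State, proposed: State) -> bool:
        return all(u(proposed) >= u(current) for u in utilities)
    return accepts_trade

def u_apples(s: State) -> float:   # subagent 1 only cares about apples
    return s["apples"]

def u_oranges(s: State) -> float:  # subagent 2 only cares about oranges
    return s["oranges"]

accepts = make_committee([u_apples, u_oranges])
start = {"apples": 2, "oranges": 2}

print(accepts(start, {"apples": 3, "oranges": 2}))  # True: a Pareto improvement
print(accepts(start, {"apples": 3, "oranges": 1}))  # False: incomparable states, no trade
print(accepts(start, {"apples": 1, "oranges": 3}))  # False: also incomparable

# Because every accepted trade weakly increases every subagent's utility, no
# sequence of accepted trades can cycle back to a strictly worse state, so the
# committee is inexploitable despite having only a partial order (no single
# utility function) over states.
```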
I don’t expect the systems that matter (in the par human or strongly superhuman regime) to be expected utility maximisers. I think arguments for AI x-risk that rest on expected utility maximisers are mostly disconnected from reality. I suspect that discussing the perils of expected utility maximisation in particular — as opposed to e.g. dangers from powerful (consequentialist?) optimisation processes — is somewhere between being a distraction and being actively harmful[3].
I do not think expected utility maximisation is the limit of what generally capable optimisers look like[4].
Arguments for Expected Utility Maximisation Are Unnecessary
I don’t think the case for existential risk from AI rests on expected utility maximisation. I kind of stopped alieving expected utility maximisers a while back (only recently have I synthesised explicit beliefs that reject the idea), but I still plan on working on AI existential safety, because I don’t see the core threat as resulting from expected utility maximisation.
The reasons I consider AI an existential threat mostly rely on:
Instrumental convergence for consequentialist/goal-directed systems
A system doesn’t need to maximise a simple utility function to be goal-directed (again, see humans)
Selection pressures for power seeking systems
Reasons
More economically productive/useful
Some humans are power seeking
Power seeking systems promote themselves/have better reproductive fitness
Human disempowerment is the immediate existential catastrophe scenario I foresee from power seeking
Bad game theoretic equilibria
This could lead towards dystopian scenarios in multipolar outcomes
Humans getting outcompeted by AI systems
Could slowly lead to extinction
I do not actually expect extinction near term, but it’s not the only “existential catastrophe”:
Human disempowerment
Various forms of dystopia
1. ^
I optimised for writing this quickly, so my language may be stronger/more confident than I actually feel. I may not have spent as much time accurately communicating my uncertainty as may have been warranted.
2. ^
Correct me if I’m mistaken, but I’m under the impression that RL is the main training paradigm we have that selects for agents.
I don’t necessarily expect that our most capable systems would be trained via reinforcement learning, but I think our most agentic systems would be.
3. ^
There may be a significant opportunity cost from diverting attention away from other, more plausible pathways to doom.
In general, I think exposing people to bad arguments for a position is a poor persuasive strategy as people who dismiss said bad arguments may (rationally) update downwards on the credibility of the position.
4. ^
I don’t necessarily think agents are that limit either. But as “Why Subagents?” shows, expected utility maximisers aren’t the limit of idealised agency.
My take is that the concept of expected utility maximization is a mistake. In Eliezer’s Coherent decisions imply consistent utilities, you can see the mistake where he writes:
Reflectively stable agents are updateless. When they make an observation, they do not limit their caring as though all the possible worlds where their observation differs do not exist.
As far as I know, every argument for utility assumes (or implies) that whenever you make an observation, you stop caring about the possible worlds where that observation went differently.
The original Timeless Decision Theory was not updateless, nor were any of the more traditional ways of thinking about decisions. Updateless Decision Theory and subsequent decision theories corrected this mistake.
Von Neumann did not notice this mistake because he was too busy inventing the entire field. The point where we discover updatelessness is the point where we are supposed to realize that all of utility theory is wrong. I think we failed to notice.
Ironically the community that was the birthplace of updatelessness became the flag for taking utility seriously. (To be fair, this probably is the birthplace of updatelessness because we took utility seriously.)
Unfortunately, because utility theory is so simple, and so obviously correct if you haven’t thought about updatelessness, it ended up being assumed all over the place, without tracking the dependency. I think we use a lot of concepts that are built on the foundation of utility without us even realizing it.
(Note that I am saying here that utility theory is a theoretical mistake! This is much stronger than just saying that humans don’t have utility functions.)
What should I read to learn about propositions like “Reflectively stable agents are updateless” and “utility theory is a theoretical mistake”?
Did you end up finding any resources related to this?
No.
re: #2, maybe https://www.lesswrong.com/posts/A8iGaZ3uHNNGgJeaD/an-orthodox-case-against-utility-functions
but it doesn’t seem like it’s quite what Scott is talking about here. I’d love to hear more from him.
I notice that I’m confused. I’ve recently read the paper “Functional decision theory...” and it’s formulated explicitly in terms of expected utility maximization.
FDT and UDT are formulated in terms of expected utility. I am saying that they advocate for a way of thinking about the world that makes it so that you don’t just Bayesian update on your observations and forget about the other possible worlds.
Once you take on this worldview, the Dutch books that made you believe in expected utility in the first place are less convincing, so maybe we want to rethink utility.
I don’t know what the FDT authors were thinking, but it seems like they did not propagate the consequences of the worldview into reevaluating what preferences over outcomes look like.
To ask for decisions to be coherent, there need to be multiple possible situations in which decisions could be made, coherently across these situations or not. A UDT agent that picks a policy faces a single decision in a single possible situation. There is nothing else out there for the decision in this situation to be coherent with.
The options offered for the decision could be interpreted as lotteries over outcomes, but there is still only one decision to pick one lottery among them all, instead of many situations where the decision is to pick among a particular smaller selection of lotteries, different in each situation. So asking for coherence means asking what the updateless agent would do if most policies could be suddenly prohibited just before the decision (but after its preference is settled), if it were to update on the fact that only particular policies remained as options, which is not what actually happens.
I am not sure if there is any disagreement in this comment. What you say sounds right to me. I agree that UDT does not really set us up to want to talk about “coherence” in the first place, which makes it weird to have it be formalized in terms of expected utility maximization.
This does not make me think intelligent/rational agents will/should converge to having utility.
I think coherence of unclear kind is an important principle that needs a place in any decision theory, and it motivates something other than pure updatelessness. I’m not sure how your argument should survive this. The perspective of expected utility and the perspective of updatelessness both have glaring flaws, respectively unwarranted updatefulness and lack of a coherence concept. They can’t argue against each other in their incomplete forms. Expected utility is no more a mistake than updatelessness.
Don’t updateless agents with suitably coherent preferences still have utility functions?
That depends on what you mean by “suitably coherent.” If you mean they need to satisfy the vNM independence axiom, then yes. But the point is that I don’t see any good argument why updateless agents should satisfy that axiom. The argument for that axiom passes through wanting to have a certain relationship with Bayesian updating.
Also, if by “have a utility function” you mean something other than “try to maximize expected utility,” I don’t know what you mean. To me, the cardinal (as opposed to ordinal) structure of preferences that makes me want to call something a “utility function” is about how to choose between lotteries.
Yeah by “having a utility function” I just mean “being representable as trying to maximise expected utility”.
Ah okay, interesting. Do you think that updateless agents need not accept any separability axiom at all? And if not, what justifies using the EU framework for discussing UDT agents?
In many discussions on LW about UDT, it seems that a starting point is that agent is maximising some notion of expected utility, and the updatelessness comes in via the EU formula iterating over policies rather than actions. But if we give up on some separability axiom, it seems that this EU starting point is not warranted, since every major EU representation theorem needs some version of separability.
You could take as an input parameter to UDT a preference ordering over lotteries that does not satisfy the independence axiom, but is a total order (or total preorder if you want ties). Each policy you can take results in a lottery over outcomes, and you take the policy that gives your favorite lottery. There is no need for the assumption that your preferences over lotteries is vNM.
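A minimal sketch of the structure described in the comment above (my own toy example; the mean-minus-variance ranking is just a hypothetical stand-in for “any total preorder over lotteries”, chosen because it deliberately violates the independence axiom): each policy induces a lottery over outcomes under the agent’s fixed prior, and the agent simply picks the policy whose lottery it ranks highest, with no claim that the ranking is expected utility.

```python
# Sketch: pick the policy whose induced lottery ranks highest under a total
# preorder that is *not* vNM (variance-penalised preferences violate independence).
from typing import Dict

Lottery = Dict[float, float]  # outcome value -> probability

def mean(lottery: Lottery) -> float:
    return sum(x * p for x, p in lottery.items())

def variance(lottery: Lottery) -> float:
    m = mean(lottery)
    return sum(p * (x - m) ** 2 for x, p in lottery.items())

def rank(lottery: Lottery) -> float:
    # Hypothetical non-EU preference: likes a high mean, dislikes spread.
    return mean(lottery) - 0.5 * variance(lottery)

# Each policy the agent could commit to induces a lottery over outcomes
# (computed from the agent's fixed prior over worlds, never updated).
policies: Dict[str, Lottery] = {
    "risky": {0.0: 0.5, 10.0: 0.5},  # mean 5, high variance
    "safe":  {4.0: 1.0},             # mean 4, no variance
}

best = max(policies, key=lambda name: rank(policies[name]))
print(best)  # "safe": preferred despite its lower expected payoff
```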
Note that I don’t think that we really understand decision theory, and have a coherent proposal. The only thing I feel like I can say confidently is that if you are convinced by the style of argument that is used to argue for the independence axiom, then you should probably also be convinced by arguments that cause you to be updateful and thus not reflectively stable.
Okay this is very clarifying, thanks!
If the preference ordering over lotteries violates independence, then it will not be representable as maximising EU with respect to the probabilities in the lotteries (by the vNM theorem). Do you think it’s a mistake then to think of UDT as “EU maximisation, where the thing you’re choosing is policies”? If so, I believe this is the most common way UDT is framed in LW discussions, and so this would be a pretty important point for you to make more visibly (unless you’ve already made this point before in a post, in which case I’d love to read it).
I think UDT is as you say. I think it is also important to clarify that you are not updating on your observations when you decide on a policy. (If you did, it wouldn’t really be a function from observations to actions, which is important to emphasize in UDT.)
Note that I am using “updateless” differently than “UDT”. By updateless, I mostly mean anything that is not performing Bayesian updates and forgetting the other possible worlds when it makes observations. UDT is more of a specific proposal. “Updateless” is more of a negative property, defined by a lack of updating.
I have been trying to write a big post on utility, and haven’t yet, and decided it would be good to give a quick argument here because of the question. The only posts I remember making against utility are in the geometric rationality sequence, especially this post.
Thanks, the clarification of UDT vs. “updateless” is helpful.
But now I’m a bit confused as to why you would still regard UDT as “EU maximisation, where the thing you’re choosing is policies”. If I have a preference ordering over lotteries that violates independence, the vNM theorem implies that I cannot be represented as maximising EU.
In fact, after reading Vladimir_Nesov’s comment, it doesn’t even seem fully accurate to view UDT taking in a preference ordering over lotteries. Here’s the way I’m thinking of UDT: your prior over possible worlds uniquely determines the probabilities of a single lottery L, and selecting a global policy is equivalent to choosing the outcomes of this lottery L. Now different UDT agents may prefer different lotteries, but this is in no sense expected utility maximisation. This is simply: some UDT agents think one lottery is the best, other might think another is the best. There is nothing in this story that resembles a cardinal utility function over outcomes that the agents are multiplying with their prior probabilities to maximise EU with respect to.
It seems that to get an EU representation of UDT, you need to impose coherence on the preference ordering over lotteries (i.e. over different prior distributions), but since UDT agents come with some fixed prior over worlds which is not updated, it’s not at all clear why rationality would demand coherence in your preference between lotteries (let alone coherence that satisfies independence).
Yeah, I don’t have a specific UDT proposal in mind. Maybe instead of “updateless” I should say “the kind of mind that might get counterfactually mugged” as in this example.
Do you expect learned ML systems to be updateless?
It seems plausible to me that updatelessness of agents is just as “disconnected from reality” of actual systems as EU maximization. Would you disagree?
No, at least probably not at the time that we lose all control.
However, I expect that systems that are self-transparent and can easily self-modify might quickly converge to reflective stability (and thus updatelessness). They might not, but I think the same arguments that might make you think they would develop a utility function can also be used to argue that they would develop updatelessness (and thus possibly also not develop a utility function).
I’m confused about the example you give. In the paragraph, Eliezer is trying to show that you ought to accept the independence axiom, cause you can be Dutch booked if you don’t. I’d think if you’re updateless, that means you already accept the independence axiom (cause you wouldn’t be time-consistent otherwise).
And in that sense it seems reasonable to assume that someone who doesn’t already accept the independence axiom is also not updateless.
I haven’t followed this very close, so I’m kinda out-of-the-loop… Which part of UDT/updatelessness says “don’t go for the most utility” (no-maximization) and/or “utility cannot be measured / doesn’t exist” (no-”foundation of utility”, debatably no-consequentialism)? Or maybe “utility” here means something else?
Are you just referring to the VNM theorems or are there other theorems you have in mind?
Note to self: it seems like the independence condition breaks for counterfactual mugging, assuming you think we should pay. Assume P is paying $50, N is not paying, and M is receiving $1 million if you would have paid in the counterfactual and zero otherwise. We have N > P but 0.5P + 0.5M > 0.5N + 0.5M, in contradiction with independence. The issue is that the value of M is not independent of the choice between P and N.
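Spelling the note above out (my own sketch, reusing the commenter’s labels P, N, M): the independence axiom says that mixing both sides of a preference with the same third lottery cannot reverse the preference, and the counterfactual-mugging intuition clashes with exactly that.

```latex
% Independence axiom:
A \succ B \;\Longrightarrow\; \alpha A + (1-\alpha)C \;\succ\; \alpha B + (1-\alpha)C
\quad \text{for all lotteries } C \text{ and all } \alpha \in (0,1].

% Counterfactual mugging with the labels above:
N \succ P
\qquad \text{but} \qquad
\tfrac{1}{2}P + \tfrac{1}{2}M \;\succ\; \tfrac{1}{2}N + \tfrac{1}{2}M,

% contradicting independence with A = N, B = P, C = M, \alpha = 1/2. The escape
% hatch is the one noted above: M is not a fixed prize, since its payoff
% ($1M or $0) depends on whether the agent would choose P or N, so the two
% mixtures are not really of the form "same C on both sides".
```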
Note that I am not saying here that rational agents can’t have a utility function. I am only saying that they don’t have to.
This is very surprising to me! Perhaps I misunderstand what you mean by “caring,” but: an agent who’s made one observation is utterly unable[1] to interact with the other possible-worlds where the observation differed; and it seems crazy[1] to choose your actions based on something they can’t affect; and “not choosing my actions based on X” is how I would define “not caring about X.”
Aside from “my decisions might be logically-correlated with decisions that agents in those worlds make (e.g. clone-prisoner’s-dilemma),” or “I am locked into certain decisions that a CDT agent would call suboptimal, because of a precommitment I made (e.g. Newcomb)” or other fancy decision-theoretic stuff. But that doesn’t seem relevant to Eliezer’s lever-coin-flip scenario you link to?
Here is a situation where you make an “observation” and can still interact with the other possible worlds. Maybe you do not want to call this an observation, but if you don’t call it an observation, then true observations probably never really happen in practice.
I was not trying to say that is relevant to the coin flip directly. I was trying to say that the move used to justify the coin flip is the same move that is rejected in other contexts, and so we should be open to the idea of agents that refuse to make that move, and thus might not have utility.
Ah, that’s the crucial bit I was missing! Thanks for spelling it out.
My personal take is that everything you wrote in this post is correct, and expected utility maximisers are neither the real threat, nor a great model for thinking about dangerous AI. Thanks for writing this up!
The key question I always focus on is: where do you get your capabilities from?
For instance, with GOFAI and ordinary programming, you have some human programmer manually create a model of the scenarios the AI can face, and then manually create a bunch of rules for what to do in order to achieve things. So basically, the human programmer has a bunch of really advanced capabilities, and they use them to manually build some simple capabilities.
“Consequentialism”, broadly defined, represents an alternative class of ways to gain capabilities, namely choosing what to do based on it having the desired consequences. To some extent, this is a method humans use, perhaps particularly the method the smartest and most autistic humans most use (which I suspect to be connected to LessWrong demographics, but who knows...). Utility maximization captures the essence of consequentialism; there are various other things, such as multi-agency, that one can throw on top of it, but those other things still mainly derive their capabilities from the core of utility maximization.
Self-supervised language models such as GPT-3 do not gain their capabilities from consequentialism, yet they have advanced capabilities nonetheless. How? Imitation learning, which basically works because of Aumann’s agreement theorem. Self-supervised language models mimic human text, and humans do useful stuff and describe it in text, so self-supervised language models learn the useful stuff that can be described in text.
Risk that arises purely from language models or non-consequentialist RLHF might be quite interesting and important to study. I feel less able to predict it, though, partly because I don’t know what the models will be deployed to do, or how much they can be coerced into doing, or what kinds of witchcraft are necessary to coerce the models into doing those things.
It seems possible to me that imitation learning and RLHF can bring us to the frontier of human abilities, so that we have a tool that can solve tasks as well as the best humans can. However, I don’t think it will be able to much exceed that frontier. This is still superhuman, because no human is as good as all the best humans at all the tasks. But it is not far-superhuman, even though I think being far-superhuman is possible, and a key part of it not being far-superhuman is that it cannot extend its capabilities. As such, I would expect consequentialism to be necessary for creating something that is far-superhuman.
I think many of the classical AI risk arguments apply to consequentialist far-superhuman AI.
If I understood your model correctly, GPT has capabilities because (1) humans are consequentialists, so they have capabilities, (2) GPT imitates human output, (3) which requires GPT to learn the underlying human capabilities.
From janus’s Simulators post:
I think the above quote from janus would add to (3) that it requires GPT to also learn the environment and the human-environment interactions, aside from just mimicking human capabilities. I know what you said doesn’t contradict this, but I think there’s a difference in emphasis, i.e. imitation of humans (or some other consequentialist) not necessarily being the main source of capability.
Generalizing this, it seems obviously wrong that imitation-learning-of-consequentialists is where self-supervised language models get their capabilities from? (I strongly suspect I misunderstood your argument or what you meant by capabilities, but just laying it out anyway.)
Like, LLM-style transformers pretrained on protein sequences get their “protein-prediction capability” purely from “environment generative-rule learning,” and none from imitation learning of a consequentialist’s output.
I think most of the capabilities on earth exist in humans, not in the environment. For instance if you have a rock, it’s just gonna sit there; it’s not gonna make a rocket and fly to the moon. This is why I emphasize GPT as getting its capabilities from humans, since there are not many other things in the environment it could get capabilities from.
I agree that insofar as there are other things in the environment with capabilities (e.g. computers outputting big tables of math results) that get fed into GPT, it also gains some capabilities from them.
I think they get their capabilities from evolution, which is a consequentialist optimizer?
I disagree especially with this, but I have not yet documented my case against it in a form I’m satisfied with.
That said, I do not endorse your case for how language models gain their capabilities. I don’t think of it as acquiring capabilities humans have.
I think of AI systems as the products of selection.
Consider a cognitive domain/task, and an optimisation process that selects systems for performance on that task. In the limit of arbitrarily powerful optimisation pressure, what do the systems so selected converge to?
For something like logical tic tac toe, the systems so produced will be very narrow optimisers and pretty weak ones, because very little optimisation power is needed to attain optimal performance on tic tac toe.
What about Go? The systems so produced will also be narrow optimisers, but vastly more powerful, because much more optimisation power is needed to attain optimal performance in Go.
I think the products of optimisation for the task of minimising predictive loss on sufficiently large and diverse datasets (e.g. humanity’s text corpus) converge to general intelligence.
And arbitrarily powerful optimisation pressure would create arbitrarily powerful LLMs.
I expect that LLMs can in principle scale far into the superhuman regime.
Could you expand on what you mean by general intelligence, and how it gets created/selected for by the task of minimising predictive loss on sufficiently large and diverse datasets like humanity’s text corpus?
This is the part I’ve not yet written up in a form I endorse.
I’ll try to get it done before the end of the year.
Expanded: Where do you get your capabilities from?
If AI risk arguments mainly apply to consequentialist (which I assume is the same as EU-maximizing in the OP) AI, and the first half of the OP is right that such AI is unlikely to arise naturally, does that make you update against AI risk?
Yes
Not quite the same, but probably close enough.
You can have non-consequentialist EU maximizers if e.g. the action space and state space are small and someone manually computed a table of the expected utilities. In that case, the consequentialism is in the entity that computed the table of expected utilities, not the entity that selects an action based on the table.
(Though I suppose such an agent is kind of pointless since you could as well just store a table of the actions to choose.)
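A minimal sketch of that first case (hypothetical states, actions, and numbers): the expected utilities are computed offline by some other entity, and the “agent” just looks them up and argmaxes, doing no consequentialist reasoning of its own.

```python
# "Non-consequentialist EU maximizer": the expected utilities were computed by
# some other entity; this agent only reads the table. Toy, hypothetical numbers.
PRECOMPUTED_EU = {
    ("sunny", "picnic"):  0.9,
    ("sunny", "stay_in"): 0.4,
    ("rainy", "picnic"):  0.1,
    ("rainy", "stay_in"): 0.7,
}

def act(state: str) -> str:
    """Pick the action with the highest precomputed expected utility."""
    actions = [a for (s, a) in PRECOMPUTED_EU if s == state]
    return max(actions, key=lambda a: PRECOMPUTED_EU[(state, a)])

print(act("rainy"))  # "stay_in"
# As noted just above, one could just as well store the chosen action per state;
# the consequentialism lives in whatever computed the table.
```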
You can also have consequentialists that are not EU maximizers if they are e.g. a collection of consequentialist EU maximizers working together.
I don’t think consequentialism is related to utility maximisation in the way you try to present it. There are many consequentialistic agent architectures that are explicitly not utility maximising, e.g. Active Inference, JEPA, ReduNets.
Then you seem to switch your response to discussing that consequentialism is important for reaching the far-superhuman AI level. This looks at least plausible to me, but first, these far-superhuman AIs could have a non-UM consequentialistic agent architecture (see above), and second, DragonGod didn’t say that the risk is necessarily from far-superhuman AIs (even though non-UM ones): I believe he argued for that here. It’s possible even that far-superhuman intelligence is not a thing at all (except for the speed of cognition and the size of memory), but the risks that he highlights: human disempowerment and dystopian scenarios, still absolutely stand.
JEPA seems like it is basically utility maximizing to me. What distinction are you referring to?
I keep getting confused about Active Inference (I think I understood it once based on an equivalence to utility maximization, but it’s a while ago and you seem to be saying that this equivalence doesn’t hold), and I’m not familiar with ReduNets, so I would appreciate a link or an explainer to catch up.
I was sort of addressing alternative risks in this paragraph:
If you’re saying “let’s think about a more general class of agents because EU maximization is unrealistic”, that’s fair, but note that you’re potentially making the problem more difficult by trying to deal with a larger class with fewer invariants.
If you’re saying “let’s think about a distinct but not more general class of agents because that will be more alignable”, then maybe, and it’d be useful to say what the class is, but: you’re going to have trouble aligning something if you can’t even know that it has some properties that are stable under self-reflection. An EU maximizer is maybe close to being stable under self-reflection and self-modification. That makes it attractive as a theoretical tool: e.g. maybe you can point at a good utility function, and then get a good prediction of what actually happens, relying on reflective stability; or e.g. maybe you can find nearby neighbors to EU maximization that are still reflectively stable and easier to align. It makes sense to try starting from scratch, but IMO this is a key thing that any approach will probably have to deal with.
I strongly suspect that expected utility maximisers are anti-natural for selection for general capabilities.
There’s naturality as in “what does it look like, the very first thing that is just barely generally capable enough to register as a general intelligence?”, and there’s naturality as in “what does it look like, a highly capable thing that has read-write access to itself?”. Both interesting and relevant, but the latter question is in some ways an easier question to answer, and in some ways easier to answer alignment questions about. This is analogous to unbounded analysis: https://arbital.com/p/unbounded_analysis/
In other words, we can’t even align an EU maximizer, and EU maximizers have to some extent already simplified away much of the problem (e.g. the problems coming from more unconstrained self-modification).
You seem to try to bail out EU maximisation as the model because it is a limit of agency, in some sense. I don’t think this is the case.
In classical and quantum derivations of the Free Energy Principle, it is shown that the limit is the perfect predictive capability of the agent’s environment (or, more pedantically: in classic formulation, FEP is derived from basic statistical mechanics; in quantum formulation, it’s more of being postulated, but it is shown that quantum FEP in the limit is equivalent to the Unitarity Principle). Also, Active Inference, the process theory which is derived from the FEP, can be seen as a formalisation of instrumental convergence.
So, we can informally outline the “stages of life” of a self-modifying agent as follows: general intelligence → maximal instrumental convergence → maximal prediction of the environment → maximal entanglement with the environment.
What you’ve said so far doesn’t seem to address my comments, or make it clear to me what the relevance of the FEP is. I also don’t understand the FEP or the point of the FEP. I’m not saying EU maximizers are reflectively stable or a limit of agency, I’m saying that EU maximization is the least obviously reflectively unstable thing I’m aware of.
I said that the limit of agency is already proposed, from the physical perspective (FEP). And this limit is not EU maximisation. So, methodologically, you should either criticise this proposal, or suggest an alternative theory that is better, or take the proposal seriously.
If you take the proposal seriously (I do): the limit appears to be “uninteresting”. A maximally entangled system is “nothing”; it’s perceptibly indistinguishable from its environment, for a third-person observer (let’s say, in Tegmark’s tripartite partition system-environment-observer). There is no other limit. Instrumental convergence is not the limit; a strong instrumentally convergent system is still far from the limit.
This suggests that unbounded analysis, “thinking to the limit” is not useful, in this particular situation.
Any physical theory of agency must ensure “reflective stability”, by construction. I definitely don’t sense anything “reflectively unstable” in Active Inference, because it’s basically the theory of self-evidencing, and wields instrumental convergence in service of this self-evidencing. Who wouldn’t “want” this, reflectively? Active Inference agents in some sense must want this by construction because they want to be themselves, as long as possible. However they redefine themselves, at that very moment they also want to be themselves (redefined). The only logical possibility out of this is to not want to exist at all at some point, i.e., commit suicide, which agents (e.g., humans) actually do sometimes. But conditional on them wanting to continue to exist, they are definitely reflectively stable.
I’m talking about reflective stability. Are you saying that all agents will eventually self modify into FEP, and FEP is a rock?
Reward is not Necessary: How to Create a Compositional Self-Preserving Agent for Life-Long Learning
My current take is that we don’t have good formalisms for consequentialist goal-directed systems that are weaker than expected utility maximization, and therefore we don’t really know how to reason about them. I think this is the main cause of the overemphasis on EUM.
For example, completeness as stated in the VNM assumptions is actually a really strong property. Aumann wrote a paper on removing completeness, but the utility function is no longer unique.
Speaking for myself, I sometimes use “EU maximization” as shorthand for one of the following concepts, depending on context:
The eventual intellectual descendant of EU maximization, i.e., the decision theory or theory of rationality that future philosophers will eventually accept as correct or ideal or normative, which presumably will have some kind of connection (even if only historical) to EU maximization.
The eventual decision procedure of a reflectively stable superintelligence.
The decision procedure of a very capable consequentialist AI, even if it’s not quite reflectively stable yet.
Hmm, I just did a search of my own LW content, and can’t actually find any instances of myself doing this, which makes me wonder why I was tempted to type the above. Perhaps what I actually do is, if I see someone else mention “EU maximization”, I mentally steelman their argument by replacing the concept with one of the three above, if any one of them would make a sensible substitution.
Do you have any actual examples of anyone talking about EU maximization lately, in connection with AI risk?
I note that EU maximization has this baggage of never strictly preferring a lottery over outcomes to the component outcomes, and your steelmen appear to me not to carry that baggage. I think that baggage is actually doing work in some people’s reasoning and intuitions.
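To make that piece of baggage explicit (a standard vNM fact, stated here as a quick sketch): under expected utility, the value of a lottery is a convex combination of the values of its outcomes, so it can never exceed the value of the best component outcome.

```latex
U(L) \;=\; \sum_i p_i\, u(o_i)
\qquad\Longrightarrow\qquad
\min_i u(o_i) \;\le\; U(L) \;\le\; \max_i u(o_i),
```

so an EU maximiser can never strictly prefer a lottery to every one of its component outcomes, which is the intuition the comment above is pointing at.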
Do you have any examples of this?
Hmm, examples are hard. Maybe the intuitions contribute to the concept of edge instantiation?
I think you are referring to the case where an agent wishes to be unpredictable in an adversarial situation, right? (I genuinely do not feel confident I understand what you said.)
If so, isn’t this lottery on a different, let’s say ontological, level, instead of the level of “lotteries” that define its utility?
I parsed the Rob Bensinger tweet I linked in the OP as being about expected utility maximisation when I read it, but others have pointed out that wasn’t necessarily a fair reading.
I think it depends on how you define expected utility. I agree that a definition that limits us only to analyzing end-state maximizers that seek some final state of the world is not very useful.
I don’t think that for non-trivial AI agents, the utility function should, or even can, be defined as a simple function over the preferred final state of the world, U:Ω→R.
This function does not take into account time and an intermediate set of predicted future states that the agent will possibly have preference over. The agent may have a preference for the final state of the universe but most likely and realistically it won’t have that kind of preference except for some special strange cases. There are two reasons:
a general agent likely won’t be designed as a maximizer of one single long-term goal (like making paperclips) but rather to be useful for humans across multiple domains, so it would care more about short-term outcomes, medium-term preferences, and the tasks “at hand”
the final state of the universe is generally known to us and will likely be known by a very intelligent general agent; even current GPT-3, if you ask it, knows that we will end up in the Big Freeze or the Big Rip, with the latter being more likely. An agent can’t really optimize for the end state of the universe, as there are not many actions that could change physics, and there is no way to reason about the end state except for general predictions that do not end well for this universe, whatever the agent does.
Any complex agent would likely have a utility function over possible actions, equal to the utility of the set of predicted futures after action A versus the set of predicted futures without action A (or over the differences between the worlds in those futures). By an action I mean possibly a set of smaller actions (a hierarchy of actions, e.g. plans or strategies); it might not be atomic. This cannot easily be computed directly, so most likely it would be compressed to a set of important predicted future events, at the level of abstraction the agent cares about, which approximates the future worlds with and without action A well enough.
This is also how we evaluate actions. We evaluate outcomes in the short and long terms. We also care differently depending on time scope.
I say this because most sensible “alignment goals”, like “please don’t kill humans”, are time-based. What does it mean not to kill humans? It is clearly not about the final state. Remember, Big Rip or Big Freeze. Maybe the AGI can kill some humans for a year and then no more, assuming the population will go up and some people get killed anyway, so it does not matter long-term? No, this is also not about the non-final but long-term outcome. Really it is a function of intermediate states: something like the integral over time of some function U′(dΩ), where dΩ is the delta between the outcomes with and without the action, which can be approximated and compressed into an integral of a function over events, up to some time T that is the maximal sensible scope.
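One way to write down what this comment seems to be gesturing at (my formalisation; the symbols ΔΩ_t and the horizon T are hypothetical notation, not the commenter’s): value an action by integrating a valuation of the predicted difference it makes, up to the longest time scope the agent cares about.

```latex
U(A) \;=\; \int_{0}^{T} U'\!\bigl(\Delta\Omega_t(A)\bigr)\, dt,
\qquad
\Delta\Omega_t(A) \;=\; \text{(predicted world at time } t \text{ with } A\text{)}
\;-\; \text{(predicted world at time } t \text{ without } A\text{)},
```

with the integral in practice approximated by a sum of U′ over a finite set of salient predicted events up to the horizon T.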
Most of the behaviors and preferences of humans are also time-scoped and time-limited, and take multiple future states into account, mostly short-scoped. I don’t think that alignment goals can even be expressed in terms of a simple end-goal (a preferred final state of the world), as the problem partially comes from the attitude of the end goal justifying the means, which is at the core of a utility function defined as U:Ω→R.
It seems plausible to me that even non-static human goals can be defined as utility functions over the set of differences in future outcomes (the difference between two paths of events). What is also obvious to me is that we as humans are able to modify our utility function to some extent, but not very much. Nevertheless, for humans the boundaries between the most baseline goals, preferences, and morality versus instrumentally convergent goals are blurry. We have a lot of heuristics and biases, so our minds work some things out more quickly and more efficiently than if we relied on intelligence, thinking, and logic alone. The cost is lower consistency, less precision, and higher variability.
So I find it useful to think about agents as maximizers of a utility function, but not one defined as a single final goal, outcome, or state of the world; rather, one that evaluates the difference between two ordered sets of events across different time scopes to calculate the utility of an action.
I also don’t think agents must initially be rationally stable with an unchangeable utility function. This is also a problem, as an agent can initially have a set of preferences with some hierarchy or weights, but it can also reason that some of these are incompatible with others, or that the hierarchy is not logically consistent, and it might seek to change it for the sake of consistency, to be fully coherent.
I’m not an AGI, clearly, but this is just how I think about morality right now. I learned that killing is bad. But I can still question “why don’t we kill?” and modify my worldview based on the answer (or maybe specify it in more detail in this matter). And it is a useful question, as it says a lot about edge cases including abortion, euthanasia, war, etc. The same might happen for rational agents, as they might update their utility function to be stable and consistent, maybe even questioning some of the learned parts of the utility function in the process. Yes, you can say that if you can change it then it was not your terminal goal. Nevertheless, I can imagine agents with no terminal core goals at all. I’m not even sure if we as humans have any core terminal goals (maybe except avoiding death and our own harm, in the case of most humans in most circumstances… but some overcome that, as Thích Quảng Đức did).