Shameful admission: after well over a decade on this site, I still don’t really intuitively grok why I should expect agents to become better approximated by “single-minded pursuit of a top-level goal” as they gain more capabilities. Yes, some behaviors like getting resources and staying alive are useful in many situations, but that’s not what I’m talking about. I’m talking specifically about the pressures that are supposed to inevitably push agents into the former of the following two main types of decision-making:
Unbounded consequentialist maximization: The agent has one big goal that doesn’t care about its environment. “I must make more paperclips forever, so I can’t let anyone stop me, so I need power, so I need factories, so I need money, so I’ll write articles with affiliate links.” It’s a long chain of “so” statements from now until the end of time.
Homeostatic agent: The agent has multiple drives that turn on when needed to keep things balanced. “Water getting low: better get more. Need money for water: better earn some. Can write articles to make money.” Each drive turns on, gets what it needs, and turns off without some ultimate cosmic purpose.
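To make the contrast concrete, here is a minimal toy sketch (purely illustrative; the drive names, bounds, and the `expected_paperclips` model are invented rather than anything proposed in the post):

```python
# Toy contrast between the two decision styles described above.
# Everything here (drive names, bounds, the paperclip model) is made up for illustration.

def homeostatic_step(state, drives):
    """Fire whichever drive is out of bounds; do nothing once everything is in range."""
    for name, (low, high, restore_action) in drives.items():
        if not (low <= state[name] <= high):
            return restore_action  # act only until this variable is back in range
    return "idle"  # no drive active, no further optimization

def unbounded_maximizer_step(state, actions, expected_paperclips):
    """Always pick whichever action the model says yields the most paperclips
    over the whole future, regardless of whether any internal need exists."""
    return max(actions, key=lambda a: expected_paperclips(state, a))

state = {"water": 0.2, "money": 5.0}
drives = {
    "water": (0.5, 1.0, "get water"),
    "money": (10.0, float("inf"), "earn money"),
}
print(homeostatic_step(state, drives))  # -> "get water"

actions = ["get water", "earn money", "build paperclip factory"]
print(unbounded_maximizer_step(
    state, actions,
    expected_paperclips=lambda s, a: {"build paperclip factory": 1e9}.get(a, 0.0),
))  # -> "build paperclip factory"
```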
Both types show goal-directed behavior. But if you offered me a choice of which type of agent I’d rather work with, I’d choose the second type in a heartbeat. The homeostatic agent may betray me, but it will only do that if doing so satisfies one of its drives. This doesn’t mean homeostatic agents never betray allies—they certainly might if their current drive state incentivizes it (or if for some reason they have a “betray the vulnerable” drive). But the key difference is predictability. I can reasonably anticipate when a homeostatic agent might work against me: when I’m standing between it and water when it’s thirsty, or when it has a temporary resource shortage. These situations are concrete and contextual.
With unbounded consequentialists, the betrayal calculation extends across the entire future light cone. The paperclip maximizer might work with me for decades, then suddenly turn against me because its models predict this will yield 0.01% more paperclips in the cosmic endgame. This makes cooperation with unbounded consequentialists fundamentally unstable.
It’s similar to how we’ve developed functional systems for dealing with humans pursuing their self-interest in business contexts. We expect people might steal if given easy opportunities, so we create accountability systems. We understand the basic drives at play. But it would be vastly harder to safely interact with someone whose sole mission was to maximize the number of sand crabs in North America—not because sand crabs are dangerous, but because predicting when your interests might conflict requires understanding their entire complex model of sand crab ecology, population dynamics, and long-term propagation strategies.
Some say smart unbounded consequentialists would just pretend to be homeostatic agents, but that’s harder than it sounds. They’d need to figure out which drives make sense and constantly decide if breaking character is worth it. That’s a lot of extra work.
As long as being able to cooperate with others is an advantage, it seems to me that homeostatic agents have considerable advantages, and I don’t see a structural reason to expect that to stop being the case in the future.
Still, a lot of very smart people on LessWrong seem sure that unbounded consequentialism is somehow inevitable for advanced agents. Maybe I’m missing something? I’ve been reading the site for 15 years and still don’t really get why they believe this. Feels like there’s some key insight I haven’t grasped yet.
When triggered to act, are the homeostatic-agents-as-envisioned-by-you motivated to decrease the future probability of being moved out of balance, or prolong the length of time in which they will be in balance, or something along these lines?
If yes, they’re unbounded consequentialist-maximizers under a paper-thin disguise.
If no, they are probably not powerful agents. Powerful agency is the ability to optimize distant (in space, time, or conceptually) parts of the world into some target state. If the agent only cares about climbing back down into the local-minimum-loss pit if it’s moved slightly outside it, it’s not going to be trying to be very agent-y, and won’t be good at it.
Or, rather… It’s conceivable for an agent to be “tool-like” in this manner, where it has an incredibly advanced cognitive engine hooked up to a myopic suite of goals. But only if it’s been intelligently designed. If it’s produced by crude selection/optimization pressures, then the processes that spit out “unambitious” homeostatic agents would fail to instill the advanced cognitive/agent-y skills into them.
And a bundle of unbounded-consequentialist agents that have some structures for making cooperation between each other possible would have considerable advantages over a bundle of homeostatic agents.
I expect[1] them to have a drive similar to “if my internal world-simulator predicts future sensory observations that are outside of my acceptable bounds, take actions to make the world-simulator predict within-acceptable-bounds sensory observations”.
This maps reasonably well to one of the agent’s drives being “decrease the future probability of being moved out of balance”. Notably, though, it does not map well to that being the agent’s only drive, or to the drive being “minimize” rather than “decrease if above threshold” (a toy sketch of that distinction follows the two questions below). The specific steps I don’t understand are:
What pressure is supposed to push a homeostatic agent with multiple drives to elevate a specific “expected future quantity of some arbitrary resource” drive above all of its other drives and set the acceptable quantity to some extreme value
Why we should expect that an agent that has been molded by that pressure would come to dominate its environment.
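A minimal sketch of the threshold-vs-minimize distinction (purely illustrative; the bounds and numbers are invented): the same drive template behaves homeostatically or maximizer-like depending entirely on where its acceptable bound is set, and the two questions above ask what pressure would push that bound to an extreme.

```python
# Illustrative only: the same drive template acts homeostatically or
# maximizer-like depending on how its acceptable bound is set.

def drive_active(predicted_obs: float, acceptable: tuple[float, float]) -> bool:
    """Act iff the world-simulator's predicted observation falls outside the bounds."""
    low, high = acceptable
    return not (low <= predicted_obs <= high)

# Homeostatic setting: a finite acceptable band; once satisfied, the drive shuts off.
print(drive_active(predicted_obs=0.7, acceptable=(0.5, 1.0)))  # False -> no action needed

# Extreme setting: push the bound to infinity and the drive never shuts off,
# i.e. it behaves like "maximize" rather than "decrease if above threshold".
print(drive_active(predicted_obs=1e9, acceptable=(float("inf"), float("inf"))))  # True -> always acting
```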
Why use this definition of powerful agency? Specifically, why include the “target state” part of it? By this metric, evolutionary pressure is not powerful agency, because while it can cause massive changes in distant parts of the world, there is no specific target state. Likewise for e.g. corporations finding a market niche—to the extent that they have a “target state” it’s “become a good fit for the environment”.
I can think of a few ways to interpret your paragraph about tool-like agents only arising through intelligent design with respect to humans, but none of them make sense to me[2]. Could you expand on what you mean there?
Is this still true if the unbounded consequentialist agents in question have limited predictive power, and each one has advantages in predicting the things that are salient to it? Concretely, can an unbounded AAPL share price maximizer cooperate with an unbounded maximizer for the number of sand crabs in North America without the AAPL-maximizer having a deep understanding of sand crab biology?
Subject to various assumptions at least, e.g.
The agent is sophisticated enough to have a future-sensory-perceptions simulator
The use of the future-perceptions-simulator has been previously reinforced
The specific way the agent is trying to change the outputs of the future-perceptions-simulator has been previously reinforced (e.g. I expect “manipulate your beliefs” to be chiseled away pretty fast when reality pushes back)
Still, all those assumptions usually hold for humans
The obvious interpretation I take for that paragraph is that one of the following must be true
For clarity, can you confirm that you don’t think any of the following:
Humans have been intelligently designed
Humans do not have the advanced cognitive/agent-y skills you refer to
Humans exhibit unbounded consequentialist goal-driven behavior
None of these seem like views I’d expect you to have, so my model has to be broken somewhere
That was never the argument. A paperclip-maximizer/wrapper-mind’s utility function doesn’t need to be simple/singular. It can be a complete mess, the way human happiness/prosperity/eudaimonia is a mess. The point is that it would still pursue it hard, so hard that everything not in it will end up as collateral damage.
I think humans very much do exhibit that behavior, yes? Towards power/money/security, at the very least. And inasmuch as humans fail to exhibit this behavior, they fail to act as powerful agents and end up accomplishing little.
I think the disconnect is that you might be imagining unbounded consequentialist agents as some alien systems that are literally psychotically obsessed with maximizing something as conceptually simple as paperclips, as opposed to a human pouring their everything into becoming a multibillionaire/amassing dictatorial power/winning a war?
Yes, see humans.
Is the argument that firms run by homeostatic agents will outcompete firms run by consequentialist agents because homeostatic agents can more reliably follow long-term contracts?
I would phrase it as “the conditions under which homeostatic agents will renege on long-term contracts are more predictable than those under which consequentialist agents will do so”. Taking into account the actions the counterparties would take to reduce the chance of such contract-breaking, though, yes.
Cool, I want to know also whether you think you’re currently (eg in day to day life) trading with consequentialist or homeostatic agents.
Homeostatic ones exclusively. I think the number of agents in the world as it exists today that behave as long-horizon consequentialists of the sort Eliezer and company seem to envision is either zero or very close to zero. FWIW I expect that most people in that camp would agree that no true consequentialist agents exist in the world as it currently is, but would disagree with my “and I expect that to remain true” assessment.
Edit: on reflection, some corporations probably do behave more like unbounded infinite-horizon consequentialists, in the sense that they have drives to acquire resources where acquiring those resources doesn’t reduce the intensity of the drive. This leads to behavior that in many cases would be the same as that of an agent that was actually trying to maximize its future resources through any available means. And I have, at some point, bought Chiquita bananas, so maybe not homeostatic agents exclusively.
I think this is false, eg John Wentworth often gives Ben Pace as a prototypical example of a consequentialist agent. [EDIT]: Also Eliezer talks about consequentialism being “ubiquitous”.
Maybe different definitions are being used, can you list some people or institutions that you trade with which come to mind who you don’t think have long-term goals?
Again, homeostatic agents exhibit goal-directed behavior. “Unbounded consequentialist” was a poor choice of term to use for this on my part. Digging through the LW archives uncovered Nostalgebraist’s post Why Assume AGIs Will Optimize For Fixed Goals, which coins the term “wrapper-mind”.
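Quoting that post:
When I read posts about AI alignment on LW / AF / Arbital, I almost always find a particular bundle of assumptions taken for granted:
An AGI has a single terminal goal[1].
The goal is a fixed part of the AI’s structure. The internal dynamics of the AI, if left to their own devices, will never modify the goal.
The “outermost loop” of the AI’s internal dynamics is an optimization process aimed at the goal, or at least the AI behaves just as though this were true.
This “outermost loop” or “fixed-terminal-goal-directed wrapper” chooses which of the AI’s specific capabilities to deploy at any given time, and how to deploy it[2].
The AI’s capabilities will themselves involve optimization for sub-goals that are not the same as the goal, and they will optimize for them very powerfully (hence “capabilities”). But it is “not enough” that the AI merely be good at optimization-for-subgoals: it will also have a fixed-terminal-goal-directed wrapper.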
As for agents I trade with that do not have the wrapper structure, going from largest to smallest in terms of expenses:
My country: I pay taxes to it. In return, I get a stable place to live with lots of services and opportunities. I don’t expect that I get these things because my country is trying to directly optimize for my well-being, or directly trying to optimize for any other specific unbounded goal. My country is an FPTP democracy; the leaders do have drives to make sure that at least half of voters vote for them over the opposition—but once that “half” is satisfied, they don’t have a drive to get approval as high as possible no matter what, or to maximize the time their party is in power, or anything like that.
My landlord: He is renting the place to me because he wants money, and he wants money because it can be exchanged for goods and services, which can satisfy his drives for things like food and social status. I expect that if all of his money-satisfiable drives were satisfied, he would not seek to make money by renting the house out. I likewise don’t expect that there is any fixed terminal goal I could ascribe to him that would lead me to predict his behavior better than “he’s a guy with the standard set of human drives, and will seek to satisfy those drives”.
My bank: … you get the idea
Publicly traded companies do sort of have the wrapper structure from a legal perspective, but in terms of actual behavior they are usually (with notable exceptions) not asking “how do we maximize market cap” and then making explicit subgoals and subsubgoals with only that in mind.
Yeah, seems reasonable. You link the Enron scandal; on your view, do all unbounded consequentialists die in such a scandal or similar?
On average, do those corporations have more or less money or power than the heuristic based firms & individuals you trade with?
Regarding conceptualizing homeostatic agents, this seems related: Why modelling multi-objective homeostasis is essential for AI alignment (and how it helps with AI safety as well)
Homeostatic agents are easily exploitable by manipulating the things they are maintaining or the signals they are using to maintain them in ways that weren’t accounted for in the original setup. This only works well when they are basically a tool you have full control over, but not when they are used in an adversarial context, e.g. to maintain law and order or to win a war.
As capabilities to engage in conflict increase, methods to resist losing to those capabilities have to get optimized harder. Instead of thinking “why would my coding assistant/tutor bot turn evil?”, try asking “why would my bot that I’m using to screen my social circles against automated propaganda/spies sent out by scammers/terrorists/rogue states/etc turn evil?”.
Though obviously we’re not yet at the point where we have this kind of bot, and we might run into the law of earlier failure beforehand.
I agree that a homeostatic agent in a sufficiently out-of-distribution environment will do poorly—as soon as one of the homeostatic feedback mechanisms starts pushing the wrong way, it’s game over for that particular agent. That’s not something unique to homeostatic agents, though. If a model-based maximizer has some gap between its model and the real world, that gap can be exploited by another agent for its own gain, and that’s game over for the maximizer.
Sorry, I’m having some trouble parsing this sentence—does “they” in this context refer to homeostatic agents? If so, I don’t think they make particularly great tools even in a non-adversarial context. I think they make pretty decent allies and trade partners though, and certainly better allies and trade partners than consequentialist maximizer agents of the same level of sophistication do (and I also think consequentialist maximizer agents make pretty terrible tools—pithily, it’s not called the “Principal-Agent Solution”). And I expect “others are willing to ally/trade with me” to be a substantial advantage.
Can you expand on “turn evil”? And also what I was trying to accomplish by making my comms-screening bot into a self-directed goal-oriented agent in this scenario?
I don’t think of my argument as model-based vs heuristic-reactive, I mean it as unbounded vs bounded. Like you could imagine making a giant stack of heuristics that makes it de-facto act like an unbounded consequentialist, and you’d have a similar problem. Model-based agents only become relevant because they seem like an easier way of making unbounded optimizers.
You can think of an LLM as a homeostatic agent where prompts generate unsatisfied drives. Behind the scenes, there’s also a lot of homeostatic stuff going on to manage compute load, power, etc.
Homeostatic AIs are not going to be trading partners, because it is preferable to run them in a mode similar to LLMs rather than as independent agents.
Let’s say a think tank is trying to use AI to infiltrate your social circle in order to extract votes. They might be sending out bots to befriend your friends to gossip with them and send them propaganda. You might want an agent to automatically do research on your behalf to evaluate factual claims about the world so you can recognize propaganda, to map out the org chart of the think tank to better track their infiltration, and to warn your friends against it.
However, precisely specifying what the AI should do is difficult for standard alignment reasons. If you go too far, you’ll probably just turn into a cult member, paranoid about outsiders. Or, if you are aggressive enough about it (say, if we’re talking about a government military agency instead of your personal bot for your personal social circle), you could imagine getting rid of all the adversaries, but at the cost of creating a totalitarian society.
(Realistically, the law of earlier failure is plausibly going to kick in here: partly because aligning the AI to do this is so difficult, you’re not going to do it. But this means you are going to turn into a zombie following the whims of whatever organizations are concentrating on manipulating you. And these organizations are going to have the same problem.)
Unbounded consequentialist maximizers are easily exploitable by manipulating the things they are optimizing for or the signals/things they are using to maximize them in ways that weren’t accounted for in the original setup.
That would be ones that are bounded so as to exclude taking your manipulation methods into account, not ones that are truly unbounded.
I interpreted “unbounded” as “aiming to maximize expected value of whatever”, not “unbounded in the sense of bounded rationality”.
The defining difference was whether they have contextually activating behaviors to satisfy a set of drives, on the basis that this makes it trivial to out-think their interests. But this ability to out-think them also seems intrinsically linked to them being adversarially non-robust, because you can enumerate their weaknesses. You’re right that one could imagine an intermediate case where they are sufficiently far-sighted that you might accidentally trigger conflict with them but not sufficiently far-sighted for them to win the conflicts, but that doesn’t mean one could make something adversarially robust under the constraint of it being contextually activated and predictable.
Alright, fair, I misread the definition of “homeostatic agents”.
Mimicking homeostatic agents is not difficult if there are some around. They don’t need to constantly decide whether to break character, only when there’s a rare opportunity to do so.
If you initialize a sufficiently large pile of linear algebra and stir it until it shows homeostatic behavior, I’d expect it to grow many circuits of both types, and any internal voting on decisions that only matter through their long-term effects will be decided by those parts that care about the long term.
Where does the gradient which chisels in the “care about the long term X over satisfying the homeostatic drives” behavior come from, if not from cases where caring about the long term X previously resulted in attributable reward? If it’s only relevant in rare cases, I expect the gradient to be pretty weak and correspondingly I don’t expect the behavior that gradient chisels in to be very sophisticated.
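As a rough illustration of that intuition (toy arithmetic with invented numbers, not a claim about any actual training run): if long-horizon decisions only rarely receive attributable reward, the accumulated update toward the circuitry implementing them is proportionally small compared with the frequently reinforced homeostatic drives.

```python
# Toy arithmetic with invented numbers: the pull toward a behavior scales with
# how often that behavior actually receives attributable reward during training.

homeostatic_episodes = 990   # episodes where satisfying a drive earned reward
long_horizon_episodes = 10   # rare episodes where long-term planning paid off
reward_scale = 1.0           # assume similar reward magnitude in both cases
learning_rate = 0.01

homeostatic_pull = learning_rate * reward_scale * homeostatic_episodes
long_horizon_pull = learning_rate * reward_scale * long_horizon_episodes
print(homeostatic_pull, long_horizon_pull)  # 9.9 vs 0.1: roughly a 100x weaker gradient
```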
https://www.lesswrong.com/posts/roA83jDvq7F2epnHK/better-priors-as-a-safety-problem
This is kinda related: ‘Theories of Values’ and ‘Theories of Agents’: confusions, musings and desiderata
Thanks, will take a look.
I think the logic goes: if we assume many diverse autonomous agents are created, which will survive the most? And insofar as agents have goals, what will be the goals of the agents which survive the most?
I can’t imagine a world where the agents that survive the most aren’t ultimately those which are fundamentally trying to survive.
Insofar as human developers are united and maintain power over which AI agents exist, maybe we can hope for homeostatic agents to be the primary kind. But insofar as human developers are competitive with each other and AI agents gain increasing power (e.g. for self-modification), I think we have to defer to evolutionary logic in making predictions.
I mean, I also imagine that the agents which survive the best are the ones that are trying to survive. I don’t understand why we’d expect agents that are trying to survive and also accomplish some separate arbitrary infinite-horizon goal to outperform those that are just trying to maintain the conditions necessary for their survival without additional baggage.
To be clear, my position is not “homeostatic agents make good tools and so we should invest efforts in creating them”. My position is “it’s likely that homeostatic agents have significant competitive advantages against unbounded-horizon consequentialist ones, so I expect the future to be full of them, and expect quite a bit of value in figuring out how to make the best of that”.
Ah ok. I was responding to your post’s initial prompt: “I still don’t really intuitively grok why I should expect agents to become better approximated by “single-minded pursuit of a top-level goal” as they gain more capabilities.” (The reason to expect this is that “single-minded pursuit of a top-level goal,” if that goal is survival, could afford evolutionary advantages.)
But I agree entirely that it’d be valuable for us to invest in creating homeostatic agents. Further, I think calling into doubt western/capitalist/individualist notions like “single-minded pursuit of a top-level goal” is generally important if we are to have a chance of building AI systems which are sensitive and don’t compete with people.