Happy to share my reasons/arguments:
I think I’m in part a goal-directed optimizer. I want to eventually offload all of the cognition involved in being a goal-directed optimizer to a superintelligence, as opposed to having some part of it being bottlenecked by my (suboptimal/slow/unstable/unsafe) biological brain. I think this describes or will probably describe many other humans.
Competitive pressures may drive people to do this even if they aren’t ready or wouldn’t want to in the absence of such pressures.
Some people (such as e/accs) seem happy to build any kind of AGI without regard for safety (or think AGIs will be automatically safe), and therefore may build a goal-directed optimizer, either because that’s the easiest kind of AGI to stumble upon, or because they’re copying human cognitive architecture or training methods as a shortcut to inventing/discovering new ones.
Even if no AI can ever be described as a “goal-directed optimizer”, larger systems composed of humans and AIs can probably be described as such, so they are worth studying from a broader “safety” perspective even if not a narrower “alignment” perspective.
Coherence-based arguments, which I also put some weight on (but perhaps less than others do).
I forgot to mention one more argument, namely that something like a goal-directed optimizer is my best guess of what a philosophically and technologically mature, reflectively stable general intelligence will look like, since it’s the only motivational structure we know that looks anywhere close to reflective stability.
I want to be careful not to overstate how close, or to rule out the possibility of discovering some completely different reflectively stable motivational structure in the future, but in our current epistemic state, reflective stability by itself already seems enough to motivate the theoretical study of goal-directed optimizers.
Thanks for sharing! :)
To clarify on my end: I think AI can definitely become an autonomous long-horizon planner, especially if we train it to be that.
That event may or may not have the consequences suggested by existing theory predicated on e.g. single-objective global utility maximizers; such theory makes predictions notably different from those of a shard-theoretic model of how agency develops. So I think there are important modeling decisions in choosing between ‘literal-minded genie’, ‘shard-based generalization’, and [whatever the truth actually is], even if each individual axiom sounds reasonable within any given theory. (I wrote this quickly, sorry if it isn’t clear)
Do you not think that a shard-based agent likely eventually turns into something like an EU maximizer (e.g. once all the shards work out a utility function that represents a compromise between their values, or some shard/coalition overpowers the others and takes control)? Or how do you see the longer-term outcome of shard-based agents? (I asked this question and a couple of others here, but none of the main shard-theory proponents engaged with it, perhaps because they didn’t see the comment?)
I do think that a wide range of shard-based mind-structures will equilibrate into EU optimizers, but I also think this is a somewhat mild statement. My stance is that utility functions represent a yardstick by which decisions are made. “Utility was made by the agent, for the agent” as it were—and not “the agent is made to optimize the utility.” What this means is:
Suppose I start off caring about dogs and diamonds in a shard-like fashion, with certain situations making me seek out dogs and care for them (in the usual intuitive way), and similarly for diamonds. However, there will be certain situations in which the dog-shard “interferes with” the diamond-shard, such that the dog-shard e.g. makes me daydream about dogs while doing my work and thereby do worse in life overall. If I didn’t engage in this behavior, then in general I’d probably be able to get more dog-caring and diamond-acquisition. So from the vantage point of this mind and its shards, it is subjectively better not to engage in such “incoherent” behavior, which is a strictly dominated strategy in expectation (i.e. it leads to fewer dogs and fewer diamonds).
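To make the “strictly dominated” point concrete, here is a toy numerical sketch; the payoff numbers and policy labels are invented purely for illustration, not taken from anywhere.

```python
# Toy illustration: the "daydream at work" behavior is strictly dominated in expectation.
# Each policy's payoff is (expected dog-caring, expected diamond-acquisition); numbers are made up.
policies = {
    "daydream about dogs while working": (3.0, 1.0),
    "focus at work, care for dogs later": (4.0, 2.5),
}

def strictly_dominates(a, b):
    """a is at least as good as b on every axis and strictly better on at least one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

for name_a, payoff_a in policies.items():
    for name_b, payoff_b in policies.items():
        if strictly_dominates(payoff_a, payoff_b):
            print(f"{name_a!r} strictly dominates {name_b!r}")

# Both the dog-shard and the diamond-shard do better under the dominating policy,
# so neither has reason to object to self-modifying away the dominated behavior.
```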
Therefore, given time and sufficient self-modification ability, these shards will want to equilibrate to an algorithm which doesn’t step on its own toes like this.
This doesn’t mean, of course, that these shards decide to implement a utility function whose results are absurd by the lights of the initial decision-making procedure. For example, tiling the universe (half with dog-squiggles, half with diamond-squiggles) would not be a desirable outcome under the initial decision-making process. Insofar as such an outcome could be foreseen as a consequence of making decisions via a proposed utility function, the shards would disprefer that utility function.[1]
So any utility function chosen should “add up to normalcy” when optimized, or at least be different in a way which is not foreseeably weird and bad by the initial shards’ reckoning. On this view, one would derive a utility function as a rule of thumb for how to make decisions effectively and (nearly) Pareto-optimally in relevant scenarios.[2]
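As a very rough sketch of what deriving such a rule-of-thumb utility function could look like (the linear form, the scenarios, and the candidate weights below are all illustrative assumptions on my part, not a claim about how real shards would do it):

```python
# Sketch: pick a simple candidate utility function that reproduces the initial shards'
# judgments on the foreseen scenarios. Scenario data and the linear form are invented.
foreseen_judgments = [
    # (outcome the shards prefer, outcome they disprefer); outcomes are (dogs, diamonds)
    ((2, 1), (1, 1)),
    ((1, 3), (1, 2)),
    ((3, 2), (2, 3)),
]

def utility(outcome, w_dogs, w_diamonds):
    dogs, diamonds = outcome
    return w_dogs * dogs + w_diamonds * diamonds

def respects_initial_judgments(w_dogs, w_diamonds):
    """Does this candidate utility reproduce the shards' rankings on foreseen cases?"""
    return all(
        utility(preferred, w_dogs, w_diamonds) > utility(dispreferred, w_dogs, w_diamonds)
        for preferred, dispreferred in foreseen_judgments
    )

# Search a small grid of weights; any candidate that passes "adds up to normalcy"
# on the anticipated scenarios (it guarantees nothing about unforeseen ones).
candidates = [(0.1, 0.9), (0.3, 0.7), (0.5, 0.5), (0.7, 0.3), (0.9, 0.1)]
acceptable = [c for c in candidates if respects_initial_judgments(*c)]
print(acceptable)  # -> [(0.7, 0.3), (0.9, 0.1)]
```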
(You can perhaps understand why, given this viewpoint, I am unconcerned/weirded out by Yudkowskian sentiments like “Unforeseen optima are extremely problematic given high amounts of optimization power.”)
[1] This elides any practical issues with self-modification, possible value drift from e.g. external sources, and so on. I think these don’t change the key conclusions here, though I think they do change conclusions for other questions.
[2] Again, if I’m imagining the vantage point of the dog+diamond agent, it wouldn’t want to waste tons of compute deriving a policy for weird situations it doesn’t expect to run into. The most important place to become more coherent is the expected on-policy future.
What do you think that algorithm (the one that “doesn’t step on its own toes”) will be? Why would it not be some explicit EU-maximization-like algorithm, with a utility function that fully represents both of their values? (At least eventually?) It seems like the best way to guarantee that the two shards will never step on each other’s toes ever again (no need to worry about running into unforeseen situations), and it also allows the agent to easily merge with other similar agents in the future (thereby avoiding stepping on even more toes); see the toy sketch below.
(Not saying I know for sure this is inevitable, as there could be all kinds of obstacles to this outcome, but it still seems like our best guess of what advanced AI will eventually look like?)
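For concreteness, here is a toy sketch of the kind of merge/compromise I have in mind; the weighted-sum form, the square-root utilities, and the bargaining weight are purely illustrative assumptions.

```python
# Toy sketch: two agents merge by adopting a single compromise utility function
# and then acting as one EU maximizer over it. Functional forms and weights are
# illustrative assumptions, not a claim about how real agents would bargain.

def u_agent_a(outcome):
    dogs, diamonds = outcome
    return dogs ** 0.5          # agent A values dogs, with diminishing returns

def u_agent_b(outcome):
    dogs, diamonds = outcome
    return diamonds ** 0.5      # agent B values diamonds, with diminishing returns

def merged_utility(outcome, w_a=0.5):
    """Compromise utility; w_a encodes agent A's relative bargaining power."""
    return w_a * u_agent_a(outcome) + (1 - w_a) * u_agent_b(outcome)

# Joint plans available to the merged agent: split a fixed budget of 4 effort units.
plans = [(dogs, 4 - dogs) for dogs in range(5)]

best_plan = max(plans, key=merged_utility)
print(best_plan)  # -> (2, 2): one shared objective, so no internal toe-stepping
```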
I agree with your statement that any utility function chosen should “add up to normalcy” when optimized, but what about:
Shards just making a mistake and picking a bad utility function. (The individual shards aren’t necessarily very smart and/or rational?)
The utility function being fine for the AI but not for us. (Would the AI’s shards’ values exactly match our own shards’, including relative power/influence, and if not, why would their utility function be safe for us?)
Competitive pressures forcing shard-based AIs to become more optimizer-like before they’re ready, or to build other kinds of more competitive but riskier AI, similar to how it’s hard for humans to stop our own AI arms race.
Yes, you’re helping me better understand your perspective, thanks. However, as indicated by my questions above, I’m still not sure why you think shard-based AI agents would be safe in general, and in particular (among other risks) why they wouldn’t turn into dangerous goal-directed optimizers at some point.