Related: Background and Core Concepts
I operationalised “strong coherence” as:
Informally: a system has immutable terminal goals.
Semi-formally: a system’s decision making is well described as an approximation of argmax over actions (or higher level mappings thereof) to maximise the expected value of a single fixed utility function over states (sketched more explicitly just below this list).
And contended that humans, animals (and learning-based agents more generally?) seem to instead have values (“contextual influences on decision making”).
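As a rough sketch of that semi-formal condition (my notation; the state $s$, action set $\mathcal{A}$, transition model $P$, and single fixed utility function $U$ over states are simply the ingredients the framing assumes), strong coherence says the system’s choices are well approximated by

$$a^* \;\approx\; \operatorname*{arg\,max}_{a \in \mathcal{A}} \; \mathbb{E}_{s' \sim P(\cdot \mid s,\, a)}\big[\,U(s')\,\big],$$

with one and the same $U$ applying in every context and at every point in the system’s lifetime.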
The shard theory account of value formation in learning-based agents is something like:
Value shards are learned computational/cognitive heuristics causally downstream of similar historical reinforcement events
Value shards activate more strongly in contexts similar to those where they were historically reinforced
And I think this hypothesis of how values form in intelligent systems could be generalised out of an RL context to arbitrary constructive optimisation processes[1]. The generalisation may be something like:
Decision making in intelligent systems is best described as “executing computations/cognition that historically correlated with higher performance on the objective functions a system was selected for performance on”[2].
This seems to be an importantly different type of decision making from expected utility maximisation[3]. For succinctness, I’d refer to systems of the above type as “systems with malleable values”.
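To illustrate the contrast, here is a minimal toy sketch (the contexts, shard names, and numbers are hypothetical placeholders of mine, not part of shard theory or of the earlier post):

```python
# Toy contrast: fixed-utility argmax vs. contextually activated "shards".
# All names and numbers are illustrative placeholders, not a model of any real agent.

ACTIONS = ["forage", "socialise", "rest"]

# "Strongly coherent" agent: one fixed ranking over outcomes, applied in every context.
FIXED_UTILITY = {"forage": 0.6, "socialise": 0.5, "rest": 0.4}

def eu_maximiser(context: str) -> str:
    # Context is ignored: the same fixed utility governs every decision.
    return max(FIXED_UTILITY, key=FIXED_UTILITY.get)

# "Malleable values" agent: heuristics whose influence depends on the current context,
# reflecting the contexts in which they were historically reinforced.
SHARD_ACTIVATIONS = {
    "food_shard":   {"hungry": 1.0, "at_home": 0.3, "with_friends": 0.2},
    "social_shard": {"hungry": 0.1, "at_home": 0.2, "with_friends": 1.0},
    "rest_shard":   {"hungry": 0.1, "at_home": 0.8, "with_friends": 0.1},
}
SHARD_VOTES = {"food_shard": "forage", "social_shard": "socialise", "rest_shard": "rest"}

def shard_agent(context: str) -> str:
    # The chosen action is whichever one the currently most-activated heuristics bid for;
    # no single context-free utility function is being maximised.
    bids = {action: 0.0 for action in ACTIONS}
    for shard, activations in SHARD_ACTIVATIONS.items():
        bids[SHARD_VOTES[shard]] += activations.get(context, 0.0)
    return max(bids, key=bids.get)

if __name__ == "__main__":
    for context in ["hungry", "at_home", "with_friends"]:
        print(context, "-> EU agent:", eu_maximiser(context), "| shard agent:", shard_agent(context))
```

The point of the toy: the first agent’s choice never varies because a single fixed utility governs every decision, while the second agent’s choice shifts with context because the influence of each learned heuristic does.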
The Argument
In my earlier post I speculated that “strong coherence is anti-natural”. To operationalise that speculation:
Premise 1: The generalised account of value formation is broadly accurate
At least intelligent systems in the real world form “contextually activated cognitive heuristics that influence decision making” as opposed to “immutable terminal goals”
Humans can program algorithms with immutable terminal goals in simplified virtual environments, but we don’t actually know how to construct sophisticated intelligent systems via design; we can only construct them as the product of search-like optimisation processes[4]
And intelligent systems constructed by search-like optimisation processes form malleable values instead of immutable terminal goals
I.e. real-world intelligent systems form malleable values
Premise 2: Systems with malleable values do not self-modify to have immutable terminal goals
Would you take a pill that would make you an expected utility maximiser[3]? I most emphatically would not.
If you accept the complexity of value and fragility of value theses, then self-modifying to become strongly coherent just destroys most of what the current you values.
For systems with malleable values, becoming “strongly coherent” is grossly suboptimal by their current values
A similar argument might extend to such systems constructing expected utility maximisers, were they given the option to do so
Conclusion 1: Intelligent systems in the real world do not converge towards strong coherence
Strong coherence is not the limit of effective agency in the real world
Idealised agency does not look like “(immutable) terminal goals” or “expected utility maximisation”
Conclusion 2: “strong coherence” does not naturally manifest in sophisticated real-world intelligent systems
Sophisticated intelligent systems in the real world are the product of search-like optimisation processes
Such optimisation processes do not produce intelligent systems that are strongly coherent
And those systems do not converge towards becoming strongly coherent as they are subjected to more selection pressure, “scaled up”, or otherwise amplified
1. ^ E.g.:
* Stochastic gradient descent
* Natural selection/other evolutionary processes
2. ^
3. ^ Of a single fixed utility function over states.
4. ^ E.g. I’m under the impression that humans can’t explicitly design an algorithm to achieve AlexNet accuracy on the ImageNet dataset.
I think the self-supervised learning that underpins neocortical cognition is a much harder learning task.
I believe that learning is the only way to create capable intelligent systems that operate in the real world, given our laws of physics.
There needs to be some process which, given a context, specifies what value shards should be created (or removed/edited) to work better in that context. It’s not clear we can’t think of this as constituting the system’s immutable goal in some sense, especially as it gets more powerful. That said, it would probably not be strongly coherent by your semi-formal definition.
I think you are onto something, with the implication that building a highly intelligent, learning entity with strong coherence in this sense is unlikely, and hence that getting it morally aligned in this fashion is also unlikely. Which isn’t that bad, insofar as plans for aligning it that way honestly did not look particularly promising.
Which is why I have been advocating for instead learning from how we teach morals to existing complex intelligent agents—namely, through ethical, rewarding interactions in a controlled environment that slowly allows more freedom.
We know how to do this; it does not require us to somehow define the core of ethics mathematically. We know it works. We know what setbacks look like, and how to tackle them. We know how to do this through human interactions the average person can perform and be trained in, rather than with code. It seems easier, more doable, and more promising in so many ways.
That doesn’t mean it will be easy or risk-free, and it still comes with a hell of a lot of problems based on the fact that AIs, even machine learning ones, are quite simply not human: they are not inherently social, they do not inherently have altruistic urges, and they do not inherently have empathic abilities. But I see a clearer path to dealing with that than to directly encoding an abstract ethics into an intelligent, flexible actor.
EDIT: I found out my answer is quite similar to this other one you probably read already.
I think not.
Imagine such a malleable agent’s mind as made of parts. Each part of the mind does something. There’s some arrangement of the things each part does, and how many parts do each kind of thing. We won’t ask right now where this organization comes from, but take it as given.
Imagine that—be it by chance or design—some parts were cooperating, while some were not. “Cooperation” means taking actions that bring about a consequence in a somewhat stable way, so something towards being coherent and consequentialist, although not perfectly so by any measure. The other parts would oftentimes work at cross purposes, treading on each other’s toes. “Working at cross purposes”, again, in other words means not being consequentialist and coherent; from the point of view of the parts, there may not even be a notion of “cross purposes” if there is no purpose.
By the nature of coherence, the ensemble of coherent and aligned parts would get to their purpose much more efficiently than the other parts are not-getting to that purpose and being a hindrance, assuming the purpose was reachable enough. This means that coherent agents are not just reflectively consistent, but also stable: once there’s some seed of coherence, it can win over the non-coherent parts.
It seems to me that humans are more coherent and consequentialist than other animals. Humans are not perfectly coherent, but the direction is towards more coherence. Actually, I’d expect that any sufficiently sophisticated bounded agent would not introspectively look coherent to itself if it spent enough time to think about it. Would the trend break after us?
Would you take a pill that made you a bit less coherent? Would you take a pill that made you a bit more coherent? (Not rhetorical questions.)
I think this fails to adequately engage with the hypothesis that values are inherently contextual.
Alternatively, the kind of cooperation you describe, where a subset of values consistently optimises the system’s outputs in a consequentialist manner towards a fixed terminal goal, is highly unrealistic for nontrivial terminal goals.
Shards “cooperating” manifest in a qualitatively different manner.
More generally, a problem with aggregate coherence hypotheses is that a core claim of shard theory is that the different shards are weighted differently in different contexts.
In general shards activate more strongly in particular contexts, less strongly in others.
So there is no fixed weight assigned to the shards, even when just looking at the subset of shards that cooperate with each other.
As such, I don’t think the behaviour of learning agents within the shard ontology can be well aggregated into a single fixed utility function over agent states.
Not even in any sort of limit of reflection or enhancement, because values within the shard ontology are inherently contextual.
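To make the incompatibility concrete with a minimal toy case (the contexts and states here are hypothetical): suppose that in context $c_1$ a system reliably steers towards state $x$ rather than state $y$, while in context $c_2$ it reliably steers towards $y$ rather than $x$. A single fixed utility function over states would have to satisfy both $U(x) > U(y)$ and $U(y) > U(x)$, which is impossible; the behaviour is only recoverable if utility is allowed to depend on context, which is exactly what the “single fixed utility function over states” framing rules out.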
Motivate this claim please.
Nope in both cases. I’d take pills to edit particular values[1] but wouldn’t directly edit my coherence in an unqualified fashion.
I’m way too horny, and it’s honestly pretty maladaptive and inhibits my ability to execute on values I reflectively endorse more.
I agree it’s unrealistic in some sense. That’s why I qualified “assuming the purpose was reachable enough”. In this “evolutionary” interpretation of coherence, there’s a compromise between the attainability of the goal and the cooperation needed to achieve it. Some goals are easier. So in my framework, where I consider humans the pinnacle of known coherence, I do not consider it valid to say that a rock is more coherent because it is very good at just being a rock. About realism, I consider humans very unlikely a priori (we seem to be alone), but once there are humans around, the important low-probability thing has already happened.
In this part of your answer, I am not sure whether you are saying “emerging coherence is forbidden in shard theory” or “I think emerging coherence is false in the real world”.
Answering “emerging coherence is forbidden”: I’m not sure, because I don’t know shard theory beyond what you are saying here, but “values are inherently contextual” does not mean your system is not flexible enough to allow implementing coherent values within it, even if they do not correspond to the things you labeled “values” when defining the system. It can be unlikely, which leads back to the previous item, which leads back to the disagreement about humans being coherent.
Answering “I think emerging coherence is false in the real world”: this leads back again to the disagreement about humans being coherent.
The crux! I said that purely out of intuition. I find this difficult to argue because, for any specific example I think of where I say “humans are more coherent and consequentialist than the cat here”, I imagine you replying “No, humans are more intelligent than the cat, and so can deploy more effective strategies for their goals, but these goals and strategies are still all sharded, maybe even more than in the cat”. Maybe the best argument I can make is: it seems to me humans have more of a conscious outer loop than other animals, with more power over the shards, and the additional consequentiality and coherence (weighted by task difficulty) are mostly due to this outer loop, not to a collection of more capable shards. But this is not a precise empirical argument.
I think you answered the question “would you take a pill, where the only thing you know about the pill is that it will ‘change your coherence’ without other qualifications, and without even knowing precisely what ‘coherence’ is?” Instead I meant to ask “how would the coherence-changing side effects of a pill you wanted to take for some other reason influence your decision?” It seems to me your note about why you would take a de-hornying pill points in the direction of making you more coherent. The next question would then be “of all the value-changing pills you can imagine yourself taking, how many increase coherence, and how many decrease it?”, and the next “where does the random walk in pill space bring you?”
This isn’t a universally held view. Someone wrote a fairly compelling argument against it here: https://sohl-dickstein.github.io/2023/03/09/coherence.html
For context: the linked post presents a well-designed survey of experts about the intelligence and coherence of various entities. The answers show a clear coherence-intelligence anti-correlation. The questions asked of the experts are:
Intelligence:
Coherence:
Of course there’s the problem of what people’s judgements of “coherence” are measuring. In considering possible ways of making the definition clearer, the post says:
It seems to me the kind of measure proposed for machine learning systems is at odds with the one for living beings. For ML, it’s “robustness to environmental changes”. For animals, it’s “spending all resources on survival”. For organizations, “spending all resources on the stated mission”. By the for-ML definition, humans, I’d say, win: they are the best entity at adapting, whatever their goal. By the for-animals definition, humans would lose completely. So these are strongly inconsistent definitions. I think the problem is fixing the goal a priori: you don’t get to ask “what is the entity pursuing, actually?”, but proclaim “the entity is pursuing survival and reproduction”, “the organization is pursuing what it says on paper”. Even though they are only speculative definitions, not used in the survey, I think they are evidence of confusion in the mind of whoever wrote them, and potentially in the survey respondents (alternative hypothesis: sloppiness, with “survival+reproduction” intended for most animals but not humans).
So, what did the experts read in the question?
Take two entities at opposite ends in the figure: the “single ant” (judged most coherent) and a human (judged least coherent).
..............
SINGLE ANT vs. HUMAN
ANT: A great heap, sir! I have a simple and clear utility function! Feed my mother the queen!
HUMAN: Wait, wait, wait. I bet you would stop feeding your queen as soon as I put you somewhere else. It’s not utility, it’s just learned patterns of behavior.
ANT: Oi, that’s not valid, sir! That’s cheating! You can do that just because you are more intelligent and powerful. And what would be your utility function, dare I ask?
HUMAN: Well, uhm, I value many things. Happiness, but sometimes also going through adversity; love; good food… I don’t know how to state my utility function. I just know that I happen to want things, and when I do, you sure can describe me as actually trying to get them, not just “doing the usual, and, you know, stuff happens”.
ANT: You are again conflating coherence with power! Truth is, many things make you powerless, like many things make me! You are big in front of me, but small in front of the universe! If I had more power, I’d be very, very good at feeding the queen!
HUMAN: As I see it, it’s you who’s conflating coherence with complexity. I’m complex, and I also happen to have a complex utility. If I set myself to a goal, I can do it even if it’s “against my nature”. I’m retargetable. I can be compactly described as goals separate from capabilities. If you magically became stronger and more intelligent, I bet you would be very, very bent on making tracks, super-duper gung-ho on touching other ants with your antennae in weird patterns you like, and so on. You would not get creative about it. Your supposed “utility” would shatter.
ANT: So you said yourself that if I became as intelligent as you, I’d shatter my utility, and so appear less coherent, like you are! Checkmate human!
HUMAN: Aaargh, no, you are looking at it all wrong. You would not be like me. I can recognize in myself all the patterns of shattered goals, all my shards, but I can also see beyond that. I can transcend. You, unevolved ant, magically scaled in some not-well-defined brute-force just-zooming sense, would be left with nothing in your mind but the small-ant shards, and insist on them.
ANT: What’s with that “not well defined etc.” nonsense? You don’t actually know! For all you know about how this works, scaling my mind could make me get bent on feeding the queen, not just “amplify” my current behaviors!
HUMAN: And conceding that possibility, would you not be more coherent then?
ANT: No way! I would be as coherent as now, just more intelligent!
HUMAN: Whatevs.
ANT: I’m super-self-consistent! I don’t care about anything but queen-feeding! I’ll happily sacrifice myself to that end! Actually, I’d not even let myself die happily, I’d die caring-for-the-queen-ly!
HUMAN: Uff, I bet my position will be misunderstood again, but anyway: I don’t know how to compactly specify my goals, I internally perceive my value as many separate pieces, so I can’t say I’m consistent in my value-seeking with a straight face. However, I’m positive that I can decide to suppress any of my value-pieces to get more whole-value, even suppress all of my value-pieces at once. This proves there’s a single consistent something I value. I just don’t know how to summarize or communicate what it is.
ANT: “That” “proves” you “value” the heck what? That proves you don’t just have many inconsistent goals, you even come equipped with inconsistent meta-goals!
HUMAN: To know what that proves, you have to look at my behavior, and my success at achieving goals I set myself to. In the few cases where I make a public precommitment, you have nice clear evidence I can ignore a lot of immediate desires for something else. That’s evidence for my mind-system doing that overall, even if I can’t specify a single, unique goal for everything I ever do at once.
ANT: If your “proof” works, then it works for me too! I surely try to avoid dying in general, yet I’ll die for the queen! Very inconsistent subgoals, very clear global goal! You’re at a net disadvantage because you cannot specify your goal, ant-human 2-1!
HUMAN: This is an artefact of you not being an actual ant but a rhetorical “ANT” implemented by a human. You are even simpler than a real ant, yet contained in something much larger and self-reflective. As a real ant, I expect you would have a more complicated global goal than what appears by saying “feed the queen”, and that you would not be able to self-reflect on the totality of it.
ANT: Sophistry! You are still recognizing the greater simplicity of the real-me goal, which makes me more consistent!
HUMAN: We always come to that. I’m more complex, not less consistent.
ANT: No cycles wasted, a single track, a single anthill, a single queen, that’s your favorite ant’s jingle!
HUMAN: Funny but no. Your inter-ant communications are totally inefficient. You waste tons of time wandering almost randomly, touching the other ants here and there, to get the emergent swarm behavior. I expect nanotechnology in principle could make you able to communicate via radio. We humans invented tech to make inter-human communication efficient in pursuit of our goals; you can’t, and your behaviors undermine each other.
ANT: All my allowed behaviors are not undermining! My mind is perfect, my body is flawed! Your mind undermines itself from the inside!
HUMAN: The question says “behaviors”, which I’d interpret as outward actions, but let’s concede the interpretation as internal behaviors of the mind. I know it’s speculative, but again, I expect real-ant to have a less clean mind-state than you make it appear, in proportion to its behavioral complexity.
ANT: No comment, apart from underlining “speculative”! Since you admitted to “suppressing your goals” before, isn’t that “undermining” at its fullest?
HUMAN: You said that of yourself too.
ANT: But you seemed to imply you have a lot more of these goals-to-suppress!
HUMAN: Again: my values are more complex, and your simplicity is in part an artefact.
............
The cruxes I see in the ant-human comparison are:
we reflect on ourselves, while we do not perceive ants as doing the same;
our value is more complex, and our intelligence allows us to do more complicated things to get it.
I think the experts mostly read “behavioral simplicity” and “simply stated goals” into the question, but not the “adaptability in pursuing whatever it’s doing” proposed later for ML systems. I’d argue instead that something being a “goal” rather than a “behavior” is captured by there being many different paths leading to it, and coherence is about preferring things in some order and so modifying your behavior to that end, rather than having a prefixed simple plan.
I can’t see how to clearly disentangle complexity, coherence, intelligence. Right now I’m confused enough that I would not even know what to think if someone from the future told me “yup, science confirms humans are definitely more/less coherent than ants”.
I don’t understand what “discount factor” to apply when deciding how coherent is…
… a more complex entity.
… an entity with more complex values.
… an entity with more available actions.
… an entity that makes more complicated plans.
What would be the implication of this complexity-discounted coherence notion, anyway? Do I want some “raw” coherence measure instead to understand what an entity does?
(A somewhat theologically inspired answer:)
Outside the dichotomy of values (in the shard-theory sense) vs. immutable goals, we could also talk about valuing something that is in some sense fixed, but “too big” to fit inside your mind. Maybe a very abstract thing. So your understanding of it is always partial, though you can keep learning more and more about it (and you might shift around, feeling out different parts of the elephant). And your acted-on values would appear mutable, but there would actually be a, perhaps non-obvious, coherence to them.
It’s possible this is already sort of a consequence of shard theory? In the way learned values would have coherences to accord with (perhaps very abstract or complex) invariant structure in the environment?
Oh, huh, this post was on the LW front page, and dated as posted today, so I assumed it was fresh, but the replies’ dates are actually from a month ago.
LessWrong has a bug that allows people to restore their posts to “new” status on the frontpage by moving them to draft and then back.
Uh, this seems bad and anti-social? It should either be made an explicit feature, or treated as a bug, in which case using it is defecting. @Ruby
I mean I think it’s fine.
I have not experienced the feature being abused.
In this case I didn’t get any answers the last time I posted it and ended up needing answers so I’m reposting.
Better than posting the entire post again as a new post and losing the previous conversation (which is what would happen if not for this feature).
Like what’s the argument that it’s defecting? There are just legitimate reasons to repost stuff and you can’t really stop users from reposting stuff.
FWIW, it was a mod that informed me of this feature.
If it’s a mod telling you with the implication that it’s fine, then yeah, it’s not defecting and is good. In that case I think it should be an explicit feature in some way!
I mean, I think it can be abused, and the use case where I was informed of it was a different one (making substantial edits to a post). I do not know that they necessarily approve of republishing for this particular use case.
But the alternative to republishing for this particular use case is just reposting the question as an entirely new post which seems strictly worse.
Of course there is also the alternative of not reposting the question. What’s possibly defecty is that maybe lots of people want their thing to have more attention, so it’s potentially a tragedy of the commons. Saying “well, just have those people who most want to repost their thing, repost their thing” could in theory work, but it seems wrong in practice, like you’re just opening up to people who don’t value others’ attention enough.
One could also ask specific people to comment on something, if LW didn’t pick it up.
A lot of LessWrong actually relies on just trusting users not to abuse the site/features.
I make judgment calls on when to repost keeping said trust in mind.
And if reposts were a nuisance people could just mass downvote reposts.
But in general, I think it’s misguided to try to impose a top-down moderation solution given that the site already relies heavily on user trust/judgment calls.
This repost hasn’t actually been a problem and is only being an issue because we’re discussing whether it’s a problem or not.
I reposted it because I didn’t get any good answers last time, and I’m currently working on a successor to this post and would really appreciate the good answers I didn’t get.
My claim is mostly that real-world intelligent systems do not have values that can be well described by a single fixed utility function over agent states.
I do not see this answer as engaging with that claim at all.
If you define utility functions over agent histories, then everything is an expected utility maximiser for the function that assigns positive utility to whatever action the agent actually took and zero utility to every other action.
I think such a definition of utility function is useless.
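To spell out the triviality (a sketch, in my notation, of the standard construction): let $a_1^*, \dots, a_T^*$ be the actions the agent actually took, and define over action histories $h = (a_1, \dots, a_T)$

$$U(h) \;=\; \sum_{t=1}^{T} \mathbf{1}\!\left[a_t = a_t^*\right].$$

Whatever the agent did, it trivially maximised (expected) $U$, so a histories-based utility function places no constraint on behaviour and has no predictive content.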
If, however, you define utility functions over agent states, then your hypothesis doesn’t engage with my claim at all. The reason real-world intelligent systems aren’t well described by utility functions isn’t that the utility function is too big to fit inside them, or that their knowledge is incomplete.
My claim is that no such utility function exists that adequately describes the behaviour of real-world intelligent systems.
I am claiming that there is no such mathematical object, no single fixed utility function over agent states that can describe the behaviour of humans or sophisticated animals.
Such a function does not exist.
Sorry, I guess I didn’t make the connection to your post clear. I substantially agree with you that utility functions over agent-states aren’t rich enough to model real behavior. (Except, maybe, at a very abstract level, a la predictive processing? (which I don’t understand well enough to make the connection precise)).
Utility functions over world-states—which is what I thought you meant by ‘states’ at first—are in some sense richer, but I still think inadequate.
And I agree that utility functions over agent histories are too flexible.
I was sort of jumping off to a different way to look at value, which might have some of the desirable coherence of the utility-function-over-states framing, but without its rigidity.
And this way is something like viewing ‘what you value’ or ‘what is good’ as something abstract, something to be inferred, out of the many partial glimpses of it we have in the form of our extant values.