Have you—or anyone, really—put much thought into the implications of these ideas for AI alignment?
If it’s true that modeling humans at the level of constitutive subagents renders a more accurate description of human behavior, then any true solution to the alignment problem will need to respect this internal incoherence in humans.
This is potentially a very positive development, I think, because it suggests that a human can be modeled as a collection of relatively simple subagent utility functions, which interact and compete in complex but predictable ways. This sounds closer to a gears-level portrayal of what is happening inside a human, in contrast to descriptions of humans as having a single convoluted and impossible-to-pin-down utility function.
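To make the contrast a bit more concrete, here's a purely illustrative toy sketch (the subagents, weights, and options in it are all made up for the example, not taken from any actual model of a human): a handful of simple subagent utility functions scoring the same options, with the overall choice emerging from their weighted interaction rather than from one monolithic utility function.

```python
# Purely illustrative: a "human" modeled as a few simple, competing subagent
# utility functions rather than one opaque monolithic utility function.
# Every subagent, weight, and option here is invented for the example.

from typing import Callable, Dict

Option = Dict[str, float]  # features of a candidate action

# Each subagent is a simple utility function over the same option features.
subagents: Dict[str, Callable[[Option], float]] = {
    "comfort":   lambda o: -o["effort"],
    "curiosity": lambda o: o["novelty"],
    "safety":    lambda o: -o["risk"],
}

# How much "say" each subagent currently has; in a real mind this would
# presumably shift with context.
weights = {"comfort": 1.0, "curiosity": 0.8, "safety": 1.5}

def combined_score(option: Option) -> float:
    """The overall evaluation emerges from the interaction of the subagents,
    not from any single one of them."""
    return sum(weights[name] * u(option) for name, u in subagents.items())

options: Dict[str, Option] = {
    "stay_home":  {"effort": 0.1, "novelty": 0.0, "risk": 0.0},
    "go_explore": {"effort": 0.6, "novelty": 0.9, "risk": 0.3},
}

best = max(options, key=lambda name: combined_score(options[name]))
print(best)  # which option "wins" depends on the current weights
```

Each individual piece is simple and inspectable; the complexity lives in how the pieces interact and in how the weights shift, which seems like the more gears-level place for it to live.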
I don’t know if you’re at all familiar with Mark Lippman’s Folding material and his ontology for mental phenomenology. My attempt to summarize his framework of mental phenomena is as follows: there are belief-like objects (expectations, tacit or explicit, complex or simple), goal-like objects (desirable states or settings or contexts), affordances (context-activated representations of the current potential action space) and intention-like objects (plans coordinating immediate felt intentions, via affordances, toward goal-states). All cognition is “generated” by the actions and interactions of these fundamental units, which I infer must be something like neurologically fundamental. Fish and maybe even worms probably have something like beliefs, goals, affordances and intentions. Ours are just bigger, more layered, more nested and more interconnected.
The reason I bring this up is that Folding was a bit of a kick in the head to my view on subagents. Instead of seeing subagents as being fundamental, I now see subagents as expressions of latent goal-like and belief-like objects, and the brain is implementing some kind of passive program that pursues goals and avoids expectations of suffering, even if you’re not aware you have these goals or these expectations. In other words, the sense of there being a subagent is your brain running a background program that activates and acts upon the implications of these more fundamental yet hidden goals/beliefs.
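In case it helps, here's a minimal sketch of the framing I have in mind (the class names and example contents are my own invention, not anything taken from the Folding material): the "subagents" below are not separate programs, just one generic background process applied to different latent belief-like and goal-like objects.

```python
# Toy sketch only: "subagents" as one generic background process applied to
# latent belief-like and goal-like objects. Names and contents are invented.

from dataclasses import dataclass

@dataclass
class Belief:          # belief-like object: a tacit expectation
    trigger: str       # context that activates it
    expectation: str   # what it predicts
    painful: bool      # whether the prediction is an expectation of suffering

@dataclass
class Goal:            # goal-like object: a desirable state or context
    desired_state: str

def background_process(context: str, beliefs: list[Belief], goals: list[Goal]) -> list[str]:
    """One passive program: avoid expected suffering, pursue goals.
    Run over different latent beliefs/goals, its behavior looks like
    different 'subagents'."""
    actions = []
    for b in beliefs:
        if b.trigger in context and b.painful:
            actions.append(f"avoid: {b.expectation}")     # looks like a 'protector'
    for g in goals:
        if g.desired_state not in context:
            actions.append(f"pursue: {g.desired_state}")  # looks like a 'seeker'
    return actions

beliefs = [Belief("criticism", "being humiliated like before", painful=True)]
goals = [Goal("feeling competent")]
print(background_process("meeting with criticism expected", beliefs, goals))
```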
None of this is at all in contradiction to anything in your Sequence. It’s more like a slightly different framing, where a “Protector Subagent” is reduced to an expression of a belief-like object via a self-protective background process. It all adds up to the same thing, pretty much, but it might be more gears-level. Or maybe not.
I definitely have some thoughts on the AI alignment implications, yes. Still working out exactly what they are. :-) As a few fragmented thoughts, here’s what I wrote in the initial post of the sequence:
In a recent post, Wei Dai mentioned that “the only apparent utility function we have seems to be defined over an ontology very different from the fundamental ontology of the universe”. I agree, and I think it’s worth emphasizing that the difference is not just “we tend to think in terms of classical physics but actually the universe runs on particle physics”. Unless they’ve been specifically trained to do so, people don’t usually think of their values in terms of classical physics, either. That’s something that’s learned on top of the default ontology.
The ontology that our values are defined over, I think, shatters into a thousand shards of disparate models held by different subagents with different priorities. It is mostly something like “predictions of receiving sensory data that has been previously classified as good or bad, the predictions formed on the basis of pattern matching to past streams of sensory data”. Things like intuitive physics simulators feed into these predictions, but I suspect that even intuitive physics is not the ontology over which our values are defined; clusters of sensory experiences are that ontology, with intuitive physics being a tool for predicting how to get those experiences. This is the same sense in which you might e.g. use your knowledge of social dynamics to figure out how to get into situations which have made you feel loved in the past, but your knowledge of social dynamics is not the same thing as the experience of being loved.
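If I were to turn that last distinction into a toy sketch (with all the features, numbers, and "world model" contents invented purely for illustration), it would look something like this: the value function only ever scores predicted experiences, while the intuitive-physics/social-dynamics knowledge is just the machinery used to predict which experiences an action would lead to.

```python
# Toy sketch: values defined over (predicted) sensory experiences, with a
# world model used only as a tool for predicting those experiences.
# All features, numbers, and the "world model" are invented for illustration.

def experience_value(experience: dict) -> float:
    """Values live here: over features of experiences previously
    classified as good or bad, not over physics-level states."""
    return 2.0 * experience["felt_loved"] - 1.5 * experience["felt_rejected"]

def world_model(action: str) -> dict:
    """Stand-in for intuitive physics / social-dynamics knowledge: it only
    predicts what experience an action leads to; it carries no value itself."""
    predictions = {
        "reach_out_to_friend": {"felt_loved": 0.8, "felt_rejected": 0.2},
        "stay_silent":         {"felt_loved": 0.1, "felt_rejected": 0.1},
    }
    return predictions[action]

def choose(actions: list[str]) -> str:
    return max(actions, key=lambda a: experience_value(world_model(a)))

print(choose(["reach_out_to_friend", "stay_silent"]))  # -> "reach_out_to_friend"
```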
Also, here’s what I recently wrote to someone during a discussion about population ethics:
I view the function of ethics/morality as two-fold:
1) My brain is composed of various subagents, each of which has different priorities or interests. One way of describing them would be to say that there are consequentialist, deontologist, virtue-ethicist, and egoist subagents, though that too seems potentially misleading. Subagents probably don’t really care about ethical theories directly; rather, they care about sensory inputs and experiences of emotional tone. In any case, they have differing interests and will often disagree about what to do. The _personal_ purpose of ethics is to come up with the kinds of principles that all subagents can broadly agree upon as serving all of their interests, to act as a guide for personal decision-making.
(There’s an obvious connection from here to moral parliament views of ethics, but in those views the members of the parliament are often considered to be various ethical theories—and like I mentioned, I do not think that subagents really care about ethical theories directly. Also, the decision-making procedures within a human brain differ substantially from those of a parliament. E.g. some subagents will get more voting power at times when the person is afraid or sexually aroused, and there need to be commonly agreed-upon principles which prevent temporarily-powerful agents from using their power to take actions which would then be immediately reversed when the balance of power shifted back.)
2) Besides disagreements between subagents within the same mind, there are also disagreements among people in a society. Here the purpose of ethics is again to provide common principles which people can agree to abide by; murder is wrong because the overwhelming majority of people agree that they would prefer to live in a society where nobody gets murdered.
You mention that person-affecting views are intractable as a solution to generating betterness-rankings between worlds. But part of what I was trying to gesture at when I said that the whole approach may be flawed, is that generating betterness-rankings between worlds does not seem like a particularly useful goal to have.
On my view, ethics is something like an ongoing process of negotiation about what to do, as applied to particular problems: trying to decide which kind of world is better in general and in the abstract seems to me like trying to decide whether a hammer or a saw is better in general. Neither is: it depends on what exactly is the problem that you are trying to decide on and its context. Different contexts and situations will elicit different views from different people/subagents, so the implicit judgment of what kind of a world is better than another may differ based on which contextual features of any given decision happen to activate which particular subagents/people.
Getting back to your suggested characterization of my position as “we ought to act as if something like a person-affecting view were true”—I would say “yes, at least sometimes, when the details of the situation seem to warrant it, or at least that is the conclusion which my subagents have currently converged on”. :) I once ( https://www.lesswrong.com/posts/R6KFnJXyk79Huvmrn/two-arguments-for-not-thinking-about-ethics-too-much ) wrote that:
> I’ve increasingly come to think that living one’s life according to the judgments of any formal ethical system gets it backwards—any such system is just a crude attempt at formalizing our various intuitions and desires, and they’re mostly useless in determining what we should actually do. To the extent that the things that I do resemble the recommendations of utilitarianism (say), it’s because my natural desires happen to align with utilitarianism’s recommended courses of action, and if I say that I lean towards utilitarianism, it just means that utilitarianism produces the fewest recommendations that would conflict with what I would want to do anyway.
Similarly, I can endorse the claim that “we should sometimes act as if the person-affecting view was true”, and I can mention in conversation that I support a person-affecting view. When I do so, I’m treating it as a shorthand for something like “my internal subagents sometimes produce similar judgments as the principle called ‘person-affecting view’ does, and I think that adopting it as a societal principle in some situations would cause good results (in terms of producing the kinds of behavioral criteria that both my and most people’s subagents could consider to produce good outcomes)”.
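To make point 1 of that population-ethics comment a bit more concrete, here's a purely illustrative toy sketch (the subagents, preferences, context effects, and the particular "principle" are all made up for the example): voting power shifts with context, and a commonly agreed principle blocks choices that would be immediately reversed once the balance of power shifted back.

```python
# Toy sketch of the internal-negotiation picture above. The subagents,
# preferences, context effects, and the reversal check are all invented here.

BASE_POWER = {"caution": 1.0, "ambition": 1.0, "comfort": 1.0}

def voting_power(context: str) -> dict:
    """Some subagents get more voting power in some contexts."""
    power = dict(BASE_POWER)
    if context == "excited":
        power["ambition"] *= 3.0
    return power

# Each subagent's (invented) preference for each action, in [-1, 1].
PREFERENCES = {
    "quit_job_today": {"caution": -0.8, "ambition": 0.9, "comfort": -0.3},
    "sleep_on_it":    {"caution": 0.5,  "ambition": 0.1, "comfort": 0.4},
}

def vote(action: str, power: dict) -> float:
    return sum(power[s] * PREFERENCES[action][s] for s in power)

def decide(context: str) -> str:
    power = voting_power(context)
    best = max(PREFERENCES, key=lambda a: vote(a, power))
    # Commonly agreed principle: a temporarily boosted coalition can't push
    # through something that would be reversed once power returns to baseline.
    if vote(best, BASE_POWER) < 0:
        return "sleep_on_it"
    return best

print(decide("excited"))  # the temporarily boosted choice gets vetoed by the principle
print(decide("calm"))     # the baseline balance of power decides
```

Obviously nothing like this is how the brain actually implements the negotiation; it's only meant to gesture at the kind of structure I have in mind.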
Also a bunch of other thoughts which partially contradict the above comments, and are too time-consuming to write in this margin. :)
Re: Folding, I started reading the document and found the beginning valuable, but didn’t get around to reading it to the end. I’ll need to read the rest, thanks for the recommendation. I definitely agree that this
> Instead of seeing subagents as being fundamental, I now see subagents as expressions of latent goal-like and belief-like objects, and the brain is implementing some kind of passive program that pursues goals and avoids expectations of suffering, even if you’re not aware you have these goals or these expectations. In other words, the sense of there being a subagent is your brain running a background program that activates and acts upon the implications of these more fundamental yet hidden goals/beliefs.
sounds very plausible. I think I was already hinting at something like that in this post, when I suggested that essentially the same subsystem (habit-based learning) could contain competing neural patterns corresponding to different habits, and treated those as subagents. Similarly, a lot of “subagents” could emerge from essentially the same kind of program acting on contradictory beliefs or goals… but I don’t know how I would empirically test one possibility over the other (unless reading the Folding document gives me ideas), so I’ll just leave that part of the model undefined.
I sort of started in this vicinity but then ended up somewhere else.