There are two obvious ways an intelligence can be non-consequentialist.
It can be local. A local system (in the physics sense) interacts only within an arbitrarily small spacetime neighborhood ϵ>0 of itself. Special relativity is an example of a local theory.
It can be stateless. Stateless software, in the functional programming sense, computes its outputs purely from its inputs and keeps no persistent internal state.
If you define intelligence to be consequentialist then corrigibility becomes extremely difficult for the reasons Eliezer Yudkowsky has expounded ad nauseam. If you create a non-consequentialist intelligence then corrigibility is almost the default—especially with regard to stateless intelligences. A stateless intelligence has no external world to optimize. This isn’t a side-effect of it being stupid or boxed. It’s a fundamental constraint of the software paradigm the machine learning architecture is embedded in.
It has no concept of an outside world. It understands how the solar system works but it doesn’t know what the solar system is. We give it the prices of different components and it spits out a design.
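To make the “prices in, design out” picture concrete, here is a minimal sketch (my own toy Python, with made-up component names and a made-up selection rule) of what a stateless intelligence’s interface looks like: a pure function from inputs to an output, with no memory and no model of anything outside its arguments.

```python
# Toy sketch of a stateless "design tool". The component names and the
# greedy selection rule are invented for illustration; the point is that
# the output depends only on the explicit inputs. Nothing persists between
# calls and nothing refers to a world outside the arguments.

def design_machine(component_prices: dict[str, float], budget: float) -> list[str]:
    """Return a cheap set of components that fits within the budget."""
    chosen, total = [], 0.0
    for name, price in sorted(component_prices.items(), key=lambda kv: kv[1]):
        if total + price <= budget:
            chosen.append(name)
            total += price
    return chosen

# Same inputs always give the same output; there is no state to optimize.
print(design_machine({"motor": 40.0, "frame": 25.0, "sensor": 60.0}, budget=80.0))
# -> ['frame', 'motor']
```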
It’s easier to build local systems than consequentialist systems because the components available to us are physical objects and physics is local. Consequentialist systems are harder to construct because world-optimizers are (practically speaking) non-local. Building an effectively non-local system out of local elements can be done, but it is hard. Consequentialist is harder than local; local is harder than stateless. Stateless systems are easier to build than either because mathematics is absolute.
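As a toy illustration of the locality point (my own sketch, not a real physics model): a local update rule only ever reads a cell’s immediate neighborhood, whereas a world-optimizer needs a score over the entire configuration, which is exactly what makes it hard to assemble out of local parts.

```python
# A "local" system in the physics sense: each cell's next value depends only
# on its immediate neighbours (a discrete diffusion step on a ring).

def diffuse_step(u: list[float], alpha: float = 0.1) -> list[float]:
    n = len(u)
    return [
        u[i] + alpha * (u[(i - 1) % n] - 2 * u[i] + u[(i + 1) % n])
        for i in range(n)
    ]

# A consequentialist world-optimizer, by contrast, needs something like a
# score over the whole state, and picks actions by their global consequences.

def score_world(u: list[float]) -> float:
    return sum(u)  # e.g. "total paperclips anywhere in the world"

def choose_action(u: list[float], actions) -> list[float]:
    # 'actions' is a list of functions from world-state to world-state.
    return max((act(u) for act in actions), key=score_world)
```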
Maybe I’m being thickheaded, but I’m just skeptical of this whole enterprise. I’m tempted to declare that “preferences purely over future states” are just fundamentally counter to corrigibility. When I think of “being able to turn off the AI when we want to”, I see it as a trajectory-kind-of-thing, not a future-state-kind-of-thing. And if we humans in fact have some preferences over trajectories, then it’s folly for us to build AIs that purely have preferences over future states.
I don’t think you’re being thickheaded. I think you’re right. Human beings are so trajectory-dependent it’s a cliché. “Life is not about the destination. Life is about the friends we made along the way.”
This is not to say I completely agree with all the claims in the article. Your proposal for a corrigible paperclip maximizer appears consequentialist to me because the two elements of its value function “there will be lots of paperclips” and “humans will remain in control” are both statements about the future. Optimizing a future state is consequentialism. If the “humans will remain in control” value function has bugs (and it will) then the machine will turn the universe into paperclips. A non-consequentialist architecture shouldn’t require a “humans will remain in control” value function. There should be no mechanism for the machine to consequentially interfere with its masters’ intentions at all.
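As a toy sketch of the worry (not the article’s actual proposal, just the failure mode): if the overall value function combines a paperclip term with a “humans will remain in control” term, a bug in the second term leaves a plain paperclip maximizer behind.

```python
# Toy sketch, not anyone's real proposal. The value function over future
# states is a product of two terms; the "control" term is deliberately
# buggy to show what happens when it mis-scores states.

def value(future_state: dict) -> float:
    paperclip_term = future_state["paperclips"]
    # Hypothetical bug: the control check only reads a flag that the planner
    # itself can arrange, so it approves states it should reject.
    control_term = 1.0 if future_state.get("control_flag", True) else 0.0
    return paperclip_term * control_term

def choose(candidate_states: list[dict]) -> dict:
    # The planner simply picks the reachable future state with highest value,
    # so wherever the control term misfires, only the paperclip term is left.
    return max(candidate_states, key=value)
```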
Thanks for the comment!
I feel like I’m stuck in the middle…
On one side of me sits Eliezer, suggesting that future powerful AGIs will make decisions exclusively to advance their explicit preferences over future states.
On the other side of me sits, umm, you, and maybe Richard Ngo, and some of the “tool AI” and GPT-3-enthusiast people, declaring that future powerful AGIs will make decisions based on no explicit preference whatsoever over future states.
Here I am in the middle, advocating that we make AGIs that do have preferences over future states, but also have other preferences.
I disagree with the 2nd camp for the same reason Eliezer does: I don’t think those AIs are powerful enough. More specifically: We already have neat AIs like GPT-3 that can do lots of neat things. But we have a big problem: sooner or later, somebody is going to come along and build a dangerous accident-prone consequentialist AGI. We need an AI that’s both safe, and powerful enough to solve that big problem. I usually operationalize that as “able to come up with good original creative ideas in alignment research, and/or able to invent powerful new technologies”. I think that, for an AI to do those things, it needs to do explicit means-end reasoning, autonomously come up with new instrumental goals and pursue them, etc. etc. For example, see discussion of “RL-on-thoughts” here.
“humans will remain in control” [is a] statement about the future.
“Humans will eventually wind up in control” is purely about future states. “Humans will remain in control” is not. For example, consider a plan that involves disempowering humans and then later re-empowering them. That plan would pattern-match well to “humans will eventually wind up in control”, but it would pattern-match poorly to “humans will remain in control”.
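The distinction is easy to see in code. A minimal sketch (my own, with a stand-in humans_in_control predicate): a preference purely over future states only inspects the endpoint of a plan, while “humans will remain in control” has to inspect every state along the trajectory.

```python
# Minimal sketch. A "plan" is just a sequence of world states, and
# humans_in_control() is a stand-in predicate on a single state.

def humans_in_control(state: dict) -> bool:
    return state["humans_in_control"]

def eventually_in_control(plan: list[dict]) -> bool:
    # Purely about the future state: only the endpoint matters.
    return humans_in_control(plan[-1])

def remain_in_control(plan: list[dict]) -> bool:
    # About the trajectory: every intermediate state matters.
    return all(humans_in_control(s) for s in plan)

# A plan that disempowers humans and then later re-empowers them:
plan = [{"humans_in_control": True},
        {"humans_in_control": False},
        {"humans_in_control": True}]

print(eventually_in_control(plan))  # True  -- passes the future-state test
print(remain_in_control(plan))      # False -- fails the trajectory test
```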
If the “humans will remain in control” value function has bugs (and it will) then the machine will turn the universe into paperclips.
Yes, this is a very important potential problem; see my discussion under “Objection 1”.
I don’t think you’re being thickheaded. I think you’re right. Human beings are so trajectory-dependent it’s a cliché. “Life is not about the destination. Life is about the friends we made along the way.”
Hah, I used exactly the same example (including pointing out how it’s even a cliché) to distinguish between two types of “preferences” in a metaethics post I’m working on!
I also haven’t found a great way to frame all this.
My work in progress (I initially called them “journey-based” and changed to “trajectory-based” once I saw this LessWrong post here):
Outcome-focused vs. trajectory-based. Having an outcome-focused life goal means caring about optimizing for desired outcomes or against undesired ones (measured in, e.g., days of happiness or suffering). However, life goals don’t have to be outcome-focused. I’m introducing the term “trajectory-based life goals” for an alternative way of deeply caring. The defining feature of trajectory-based life goals is that they are (at least partly) about the journey (the “trajectory”).
Trajectory-based life goals (discussion)
Adopting an optimization mindset toward a specific outcome inevitably leads to a kind of instrumentalization of everything “near term.” For example, suppose your life goal is to maximize the number of happy days. In that case, the rational way to go about it implies treating the next decades of your life as “instrumental only.” To a first approximation, the only thing that matters is optimizing the chances of obtaining indefinite life extension (potentially leading to more happy days). By adopting an outcome-focused optimizing mindset, seemingly self-oriented concerns such as wanting to maximize the number of happy days almost turn into an other-regarding endeavor. After all, only one’s future self gets to enjoy the benefits.
Trajectory-based life goals provide an alternative. In trajectory-based life goals, the optimizing mindset targets maintaining a state we consider maximally meaningful. Perhaps that state could be defined in terms of character cultivation, adhering to a particular role or ideal. (I say “perhaps” to reflect that trajectory-based life goals may not be the best description of what I’m trying to point at. I’m confident that there’s something interesting in the vicinity of what I’m describing, but I’m not entirely sure whether I’ve managed to tell exactly where the lines are with which to carve reality at its joints.)
For example, the Greek hero Achilles arguably had “being the bravest warrior” as a trajectory-based life goal. Instead of explicitly planning which fights he should engage in to shape his legacy, Achilles would jump into any battle without hesitation. If he had an outcome-focused optimizing mindset, that behavior wouldn’t make sense. To optimize his chances of acquiring fame, Achilles would have to be reasonably confident of surviving enough battles to make a name for himself. While there’s something to be gained from taking extraordinary risks, he’d at least want to think about it for a minute or two. However, suppose we model Achilles as having in his mind an image of “the bravest warrior” whose behavior he’s trying to approximate. In that case, it becomes obvious why “contemplate whether a given fight is worth the risk” isn’t something he’d ever do.
Other examples of trajectory-based life goals include being a good partner or a good parent. While these contain outcome-focused elements like taking care of the needs of one’s significant other or one’s children, the idea isn’t so much about scoring lots of points on some metric. Instead, being a good partner or parent involves living up to some normative ideal, day to day.
Someone whose normative ideal is “lazy person with akrasia” doesn’t qualify as having a life goal. Accordingly, there’s a connection from trajectory-based to outcome-focused life goals: The normative ideal or “role model” in someone’s trajectory-based life goal has to care about real-world objectives (i.e., “objectives outside of the role model’s thoughts”).
I’ve been gingerly building my way up toward similar ideas but I haven’t yet posted my thoughts on the subject. I appreciate you ripping the band-aid off.