I’m having some trouble phrasing this comment clearly, and I’m also not sure how relevant it is to the post except that the post inspired the thoughts, so bear with me...
It seems important to distinguish between several things that could vary with time, over the course of a plan or policy:
(1) What information is known.
This is related to Nate’s comment here: it is much more computationally feasible to specify a plan/policy if it’s allowed to contain terms that say “make an observation, then run this function on it to decide the next step,” rather than writing out a lookup table pairing every sequence of observations to the next action.
(2) What objective function is being maximized.
This is usually assumed (?) to be static in this kind of discussion, but in principle the objective could vary in response to future observations.
In principle, this is equivalent to a static objective function with terms for “how it would respond” to each possible sequence of observations (ignoring subtleties about orderings over world-states vs. world-histories). But this has exactly the same structure as the previous point: it’s more feasible to say “make an observation, then run this function to update the objective” than to unroll the same thing into a lookup table known entirely at the start. (A small code sketch of this contrast appears just after these three points.)
(3) Object-level features of the actions that are chosen.
Some of the properties under discussion, like Nate’s “lasing,” are about this stuff changing equivariantly / “in the right way” as other things change.
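To make the contrast in (1) and (2) concrete, here is a minimal sketch; the names (policy_fn, update_objective), the toy observation alphabet, and the horizon are all invented for illustration rather than taken from the post. The function form of a policy, or of an objective-update rule, stays small, while the unrolled lookup table over observation sequences grows exponentially with the horizon.

```python
from itertools import product

OBSERVATIONS = ["hot", "cold"]   # toy observation alphabet (assumption)
HORIZON = 10                     # toy planning horizon (assumption)

# Function form of a policy: "make an observation, then run this function
# on the history so far to decide the next step."
def policy_fn(history):
    return "cool_down" if history and history[-1] == "hot" else "wait"

# The same structural trick for a non-constant objective: an update rule
# that reacts to observations, rather than a table fixed at the start.
def update_objective(weights, observation):
    weights = dict(weights)
    weights["comfort"] = weights.get("comfort", 0.0) + (1.0 if observation == "hot" else -0.5)
    return weights

# Unrolled form: a lookup table pairing every observation sequence (up to
# the horizon) with the next action. It has sum over t <= T of |O|**t
# entries, i.e. it grows exponentially with the horizon T.
table = {
    seq: policy_fn(seq)
    for t in range(HORIZON + 1)
    for seq in product(OBSERVATIONS, repeat=t)
}
print(len(table))  # 2047 entries for |O| = 2, T = 10
```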
The recent discussions about consequentialism seem to be about the case where we have a task that takes a significant amount of real-world time, over which many observations (1) will be made with implications for subsequent decisions—but over which the objective (2) is approximately unchanging. This setup leads to various scary properties of what the policies actually do (3).
But, I don’t understand the rationale for focusing on this case where the objective (2) doesn’t change. (In the sense of “doesn’t change” specified above—that we can specify it simply over long time horizons, rather than incurring an exp(T) cost for unrolling its updates on observation sequences.)
One reason to care about this case is a hope for oracle AI, since oracle AI is something that receives “questions” (objectives simple enough for us to feel we understand) and returns “answers” (plans that may take time). This might produce a good argument that oracle AI is unsafe, but it doesn’t apply to systems with changing objectives.
In the case of human intelligence, it seems to me that (2) evolves not too much more slowly than (1), and becomes importantly non-constant for longer-horizon cases of human planning.
If I set myself a brief and trivial goal like “make the kitchen cleaner over the next five minutes,” I will spend those five minutes acting much like a clean-kitchen-at-all-costs optimizer, with all my subgoals pointing coherently in that direction (“wash this dish,” “pick up the sponge”). If I set myself a longer-term goal like “get a new job,” I may well find my preferences about the outcome have evolved substantially well before the task is complete.
This fact seems orthogonal to the fact that I am “good at search” relative to all known things that aren’t humans. Relative to all non-humans, I’m very good at finding policies that are high-EV for the targets I’m trying to hit. But my targets evolve over time.
Indeed, I imagine this is why the complexity of human value doesn’t create more of a problem for human action than it does. I don’t have a simply-specifiable constant objective with a term for “make people happy” (or whatever); I have an objective with an update rule that reacts to human feedback over time. The update rule may have been optimized for something on an evolutionary timescale, but it’s not obvious its application in an individual human can be modeled as optimizing anything.
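As a toy illustration of that last point (the action names, the weights, and the 0.7/0.3 update rule are all invented for the example, not a model anyone has proposed): each step greedily maximizes the agent’s current objective weights, but the weights themselves drift in response to feedback, so the run as a whole isn’t obviously maximizing any fixed, simply-specified objective.

```python
# Toy agent: greedy with respect to its *current* objective weights,
# while those weights are revised by a feedback-driven update rule.
ACTIONS = {
    "joke":       {"amusement": 1.0, "comfort": 0.0},
    "help":       {"amusement": 0.2, "comfort": 1.0},
    "do_nothing": {"amusement": 0.0, "comfort": 0.1},
}

def choose(weights):
    # Pick the action scoring highest under the current objective.
    return max(ACTIONS, key=lambda a: sum(weights[k] * v for k, v in ACTIONS[a].items()))

def update(weights, feedback):
    # Invented update rule: shift weight toward whatever the feedback rewards.
    return {k: 0.7 * w + 0.3 * feedback.get(k, 0.0) for k, w in weights.items()}

weights = {"amusement": 1.0, "comfort": 0.0}
for _ in range(5):  # repeated "please be more helpful" feedback
    action = choose(weights)
    weights = update(weights, {"comfort": 1.0})
    print(action, weights)
# The chosen action flips from "joke" to "help" after a few rounds, even though
# no single constant objective describes the whole run.
```

This is only meant to show the shape of the claim: the composite system is better described by its update rule than by any constant objective it maximizes.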
(For a case with an intelligence gap like the one between humans and an AGI, consider human treatment of animals. I’ve heard this brought up as an analogy for misaligned AI, and it’s an interesting one. But the shape of the problem is not “humans are good at search, and have an objective which omits ‘animal values,’ or includes them in the wrong way.” Sometimes people just decide to become vegan for ethical reasons! Sometimes whole cultures do.
This looks like a real case of individual values being updated, i.e. I don’t think the right model of someone who goes vegan at age 31 is “this person is maximizing an objective which gives them points for eating animals, but only until age 31, and negative points thereafter.”)
If we think of humans as a prototype case of an “inner optimizer,” with evolution the outer optimizer, we have to note that the inner optimizer doesn’t have a constant objective, even though the outer one does. The inner optimizer is very powerful, has the lasing property, and all of that, but it gets applied to a changing objective, which seems to produce qualitatively different results in terms of corrigibility, Goodhart, etc. The same thing could be true of an AGI, if it’s the product of something like gradient descent rather than a system with an internal objective we explicitly wrote. This is not strong evidence that it will be true, but it at least motivates asking the question.
(It seems noteworthy, here, that when people talk about the causes of human misery / “non-satisfaction of human values,” they typically point to things like scarcity, coordination problems, and society-level optimization systems with constant objectives. If we’re good at search, and human value is complex, why aren’t we constantly harming each other by executing incorrigibly on misaligned plans at an individual level? Something fitting this description no doubt happens, but it causes less damage than a naive application of AI safety theory would lead one to expect.)