Not sure if I disagree or if we’re placing emphasis differently.
I certainly agree that there are going to be places where we’ll need to use nice, clean concepts that are known to generalize. But I don’t think that the resolutions to problems 1 and 2 will look like nice clean concepts (like minimizing mutual information). It’s not just human values that are messy and contingent; even the pointer we want to use to gesture to those-things-we-want-to-treat-as-our-values is messy and contingent. I think of some of my intuitions as my “real values” and others as mere “biases” in a thoroughly messy way.
But back on the first hand again, what’s “messy” might be subjective. A good recipe for fitting values to me will certainly be simple and neat compared to the totality of information stored in my brain.
And I certainly want to move away from the framing that the way to deal with problems 1 and 2 is to say “Goodhart’s law says that any difference between the proxy and our True Values gets amplified… so we just have to find our True Values”—I think this framing leads one to look for solutions in the wrong way (trying to eliminate ambiguity, trying to find a single human-comprehensible model of humans from which the True Values can be extracted, mistakes like that). But this is also kind of a matter of perspective—any satisfactory value learning process can be evaluated (given a background world-model) as if it assigns humans some set of True Values.
I think even if we just call these things differences in emphasis, they can still lead directly to disagreements about (even slightly) meta-level questions, such as how we should build trust in value learning schemes.
It’s not just human values that are messy and contingent; even the pointer we want to use to gesture to those-things-we-want-to-treat-as-our-values is messy and contingent.
What’s the evidence for this claim?
When I look at e.g. nails, the economic value of a nail seems reasonably complicated. Yet the “pointers to nail value” which we use in practice—i.e. competitive markets and reputation systems—do have clean, robust mathematical formulations.
Furthermore, before the mid-20th century, I expect that most people would have expected that competitive markets and reputation systems were inherently messy and contingent. They sure do look messy! People confuse messiness in the map for messiness in the territory.
I think of some of my intuitions as my “real values” and others as mere “biases” in a thoroughly messy way.
… this, for instance, I think is probably a map-territory confusion. The line between “real values” and “biases” will of course look messy when one has not yet figured out the True Name. That does not provide significant evidence of messiness in the territory.
Personally, I made this mistake really hard when I first started doing research in systems biology in undergrad. I thought the territory of biology was inherently messy, and I actually had an argument with my advisor that some of our research goals were unrealistic because of inherent biological messiness. In hindsight, I was completely wrong; the territory of biology just isn’t that inherently messy. (My review of Design Principles of Biological Circuits goes into more depth on this topic.)
That said, the intuition that “the territory is messy” is responding to a real failure mode. The territory does not necessarily respect whatever ontology or model a human starts out with. People who expect a “clean” territory tend to be shocked by how “messy” the world looks when their original ontology/model inevitably turns out to not fit it very well. I think this is how people usually end up with the (sometimes useful!) intuition that the territory is messy.
Evidence & Priors
Note that the above mostly argued that the appearance of messiness is a feature of the map which yields little evidence about the messiness of the territory; even things with simple True Names look messy before we know those Names. But that still leaves unanswered two key questions:
Is there any way that we can get evidence of messiness of the territory itself?
What should our priors be regarding messiness in the territory?
One way to get positive evidence of messiness in the territory, for instance, is to see lots of smart people fail to find a clean True Name even with strong incentives to do so. Finding True Names is currently a fairly rare and illegible skill (there aren’t a lot of Claude Shannons or Judea Pearls), so we usually don’t have very strong evidence of this form in today’s world, but there are possible futures in which it could become more relevant.
On the other hand, one way to get evidence of lack of messiness in the territory, even in places where we haven’t yet found the True Names, is to notice that places which seem like canonical examples of very-probably-messy-territory repeatedly turn out to not be so messy. That was exactly my experience with systems biology, and is where my current intuitions on the matter originally came from.
Regarding priors, I think there’s a decent argument that claims of messiness in the territory are always wrong, i.e. a messy territory is impossible in an important sense. The butterfly effect is a good example here: perhaps the flap of a butterfly’s wings can change the course of a hurricane. But if the flap of any butterfly’s wings has a significant chance of changing the hurricane’s course, for each of the billions of butterflies in the world, then ignorance of just a few dozen wing-flaps wipes out all the information carried by the other wing-flaps; even if I measure the flaps of a million butterfly wings, this gives me basically-zero information about the hurricane’s course. (For a toy mathematical version of this, see here.)
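As an illustrative sketch (not the linked toy model, just an assumption-laden stand-in): suppose the hurricane’s course were a single bit equal to the parity of every wing-flap. Then measuring all but a few dozen flaps still leaves you at chance:

```python
import random

random.seed(0)

N_FLAPS   = 10_000   # wing-flap "bits" feeding into the outcome
N_MISSING = 30       # the few dozen flaps we failed to measure
N_TRIALS  = 5_000

hits = 0
for _ in range(N_TRIALS):
    observed = random.getrandbits(N_FLAPS - N_MISSING)  # flaps we measured
    missing  = random.getrandbits(N_MISSING)            # flaps we didn't
    outcome = (bin(observed).count("1") + bin(missing).count("1")) % 2  # parity of all flaps
    guess   = bin(observed).count("1") % 2                              # best guess from what we saw
    hits += (guess == outcome)

print(f"prediction accuracy from {N_FLAPS - N_MISSING} of {N_FLAPS} flaps: {hits / N_TRIALS:.3f}")
# ~0.5: ignorance of a handful of flaps makes knowledge of the rest worthless
```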
The point of this example is that this “messy” system is extremely well modeled across an extremely wide variety of epistemic states as pure noise, which is in some sense quite simple. (Obviously we’re invoking an epistemic state here, which is a feature of a map, but the existence of a very wide range of simple and calibrated epistemic states is a feature of the territory.) More generally, the idea here is that there’s a duality between structure and noise: anything which isn’t “simple structure” is well-modeled as pure noise, which itself has a simple True Name. Of course then we can extend it to talk about fractal structure, in which more structure appears as we make the model more precise, but even then we get simple approximations.
Anyway, that argument about nonexistence of messy territory is more debatable than the rest of this comment, so don’t get too caught up in it. The rest of the comment still stands even if the argument at the end is debatable.
It’s not clear to me that your metaphors are pointing at something in particular.
Revenue of a nail factory is a good proxy for the quality of the nails produced, but only within a fairly small bubble around our current world. You can’t make the factory-owner too smart, or the economy too irrational, or allow for too many technological breakthroughs to happen, or else the proxy breaks. If this was all we needed, then yes, absolutely, I’m sure there’s a similarly neat and simple way to instrumentalize human values—it’s just going to fail if things are too smart, or too irrational, or too far in the future.
Biology being human-comprehensible is an interesting topic, and suppose I grant that it is—that we could have comprehensible explanatory stories for every thing our cells do, and that these stories aren’t collectively leaving anything out. First off, I would like to note that such a collection of stories would still be really complicated relative to simple abstractions in physics or economics! Second, this doesn’t connect directly to Goodhart’s law. We’re just talking about understanding biology, without mentioning purposes to which our understanding can be applied. Comprehending biology might help us generalize, in the sense of being able to predict what features will be conserved by mutation, or will adapt to a perturbed environment, but again this generalization only seems to work in a limited range, where the organism is doing all the same jobs with the same divisions between them.
The butterfly effect metaphor seems like the opposite of biology. In biology you can have lots of little important pieces—they’re not individually redirecting the whole hurricane/organism, but they’re doing locally-important jobs that follow comprehensible rules, and so we don’t disregard them as noise. None of the butterflies have such locally-useful stories about what they’re doing to the hurricane, they’re all just applying small incomprehensible perturbations to a highly chaotic system. The lesson I take is that messiness is not the total lack of structure—when I say my room is messy, I don’t mean that the arrangement of its component atoms has been sampled from the Boltzmann distribution—it’s just that the structure that’s there isn’t easy for humans to use.
I’d like to float one more metaphor: K-complexity and compression.
Suppose I have a bit string of length 10^9, and I can compress it down to length 10^8. The “True Name hypothesis” is that the compression looks like finding some simple, neat patterns that explain most of the data and that we expect to generalize well, plus a lot of “diff” that’s the noisy difference between the simple rules and the full bitstring. The “fractal hypothesis” is that there are a few simple patterns that do some of the work, and a few less simple rules that do more of the work, and so on for as long as you have patience. The “total mess hypothesis” is that simple rules do a small amount of the work, and a lot of the 10^8 bits consists of big, highly interdependent programs that would output something very different if you flipped just a few bits. Does this seem about right?
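To make the first of those pictures concrete, here is a toy sketch (scaled-down numbers, purely illustrative): if the string is generated by a trivially simple rule plus sparse random corruption, the compressed form is essentially a tiny description of the rule plus a long, noisy diff.

```python
import math
import random

random.seed(0)

n       = 1_000_000  # toy stand-in for the 10^9-bit string
n_flips = 10_000     # sparse random corruption on top of the simple rule

# "Simple rule": bit i is i mod 2. The actual data is the rule plus random exceptions.
rule = [i % 2 for i in range(n)]
data = rule[:]
for i in random.sample(range(n), n_flips):
    data[i] ^= 1

# "True Name"-style encoding: a tiny description of the rule, plus a diff of exceptions.
diff      = [i for i in range(n) if data[i] != rule[i]]
rule_bits = 100                                  # generous budget for "bit i = i mod 2"
diff_bits = len(diff) * math.ceil(math.log2(n))  # ~20 bits per exception position

print(f"raw string:  {n} bits")
print(f"rule + diff: {rule_bits + diff_bits} bits, of which {diff_bits} are the noisy diff")
# Nearly all of the compressed length is noise; the generalizable structure is almost free.
```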
Revenue of a nail factory is a good proxy for the quality of the nails produced, but only within a fairly small bubble around our current world. You can’t make the factory-owner too smart, or the economy too irrational, or allow for too many technological breakthroughs to happen, or else the proxy breaks.
I think you missed the point of that particular metaphor. The claim was not that revenue of a nail factory is a robust operationalization of nail value. The claim was that a competitive nail market plus nail-maker reputation tracking is a True Name for a pointer to nail value—i.e. such a system will naturally generate economically-valuable nails. Because we have a robust mathematical formalization of efficient markets, we know the conditions under which that pointer-to-nail-value will break down: things like the factory owner being smart enough to circumvent the market mechanism, or the economy too irrational, etc.
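For concreteness (this is a gloss, not necessarily the exact formalization being gestured at): the canonical clean statement here is something like the first welfare theorem, whose hypotheses double as the list of breakdown conditions.

```latex
% First fundamental theorem of welfare economics (one standard form): if
% (x^*, p^*) is a competitive, price-taking equilibrium and preferences are
% locally non-satiated, then the allocation x^* is Pareto efficient:
\[
(x^*, p^*) \text{ a competitive equilibrium} \;\wedge\; \text{local non-satiation}
\;\Longrightarrow\; x^* \text{ is Pareto efficient.}
\]
% Each hypothesis names a way the pointer can break: market power (a factory
% owner smart enough to stop being a price-taker), missing markets or
% externalities, or preferences violating the rationality assumptions.
```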
The lesson I take is that messiness is not the total lack of structure—when I say my room is messy, I don’t mean that the arrangement of its component atoms has been sampled from the Boltzmann distribution—it’s just that the structure that’s there isn’t easy for humans to use.
I agree with this, and it’s a good summary of the takeaway of the butterfly effect analogy. In this frame, I think our disagreement is about whether “structure which isn’t easy for humans to use” is generally hard to use because the humans haven’t yet figured it out (but they could easily use it if they did figure it out) vs structure which humans are incapable of using due to hardware limitations of the brain.
Suppose I have a bit string of length 10^9, and I can compress it down to length 10^8. …
This is an analogy which I also considered bringing up, and I think you’ve analogized things basically correctly here. One important piece: if I can compress a bit string down to length 10^8, and I can’t compress it any further, then that program of length 10^8 is itself incompressible—i.e. it’s 10^8 random bits. As with the butterfly effect, we get a duality between structure and noise.
Actually, to be somewhat more precise: it may be that we could compress the length 10^8 program somewhat, but then we’d still need to run the decompressed program through an interpreter in order for it to generate our original bitstring. So the actual rule is something roughly like “any maximally-compressed string consists of a program shorter than roughly-the-length-of-the-shortest-interpreter, plus random bits” (with the obvious caveat that the short program and the random bits may not separate neatly).
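A quick sketch of why, in standard Kolmogorov-complexity terms (everything up to additive constants):

```latex
% Shortest programs are (nearly) incompressible. Let U be the universal machine
% and p a shortest program for x, so U(p) = x and |p| = K(x). There is a fixed
% "run-the-output" prefix r such that U(rq) = x whenever U(q) = p. Hence
\[
K(x) \;\le\; K(p) + |r|
\quad\Longrightarrow\quad
K(p) \;\ge\; |p| - |r| \;=\; |p| - O(1),
\]
% i.e. any residual compressibility of p is bounded by the length of a fixed
% interpreter, which is the rule stated above: a program roughly as short as the
% shortest interpreter, plus random bits.
```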
I think you’re saying: if a thing is messy, at least there can be a non-messy procedure / algorithm that converges to (a.k.a. points to) the thing. I think I’m with Charlie in feeling skeptical about this in regards to value learning, because I think value learning is significantly a normative question. Let me elaborate:
My genes plus 1.2e9 seconds of experience have built a fundamentally messy set of preferences, which are in some cases self-inconsistent, easily-manipulated, invalid-out-of-distribution, etc. It’s easy enough to point to the set of preferences as a whole—you just say “Steve’s preferences right now”.
In fact, one might eventually (I expect) be able to write down the learning algorithm, reward function, etc., that led to those preferences (but we won’t be able to write down the many petabytes of messy training data), and we’ll be able to talk about what the preferences look like in the brain. But still, you shouldn’t and can’t directly optimize according to those preferences, because they’re self-inconsistent, invalid-out-of-distribution, they might involve ghosts, etc.
So then we have a normative question: if “fulfill Steve’s preferences” isn’t a straightforward thing, then what exactly should the AGI do? Maybe we should ask Steve what value learning ought to look like? But maybe I say “I don’t know”, or maybe I give an answer that I wouldn’t endorse upon reflection, or in hindsight. So maybe we should have the AGI do whatever Steve will endorse in hindsight? No, that leads to brainwashing.
Anyway, it’s possible that we’ll come up with an operationalization of value learning that really nails down what we think the AGI ought to do. (Let’s say, for example, something like CEV but more specific.) If we do, to what extent should we expect this operationalization to be simple and elegant, versus messy? (For example, in my book, Stuart Armstrong research agenda v0.9 counts as rather messy.) I think an answer on the messier side is quite plausible. Remember, (1) this is a normative question, and (2) that means that the foundation on which it’s built is human preferences (about what value learning ought to look like), and (3) as above, human preferences are fundamentally messy because they involve a lifetime of learning from data. This is especially true if we don’t want to trample over individual / cultural differences of opinion about (for example) the boundary between advice (good) vs manipulation (bad).
(Low confidence on all this.)
It’s important to note that human preferences may be messy, but the mechanism by which we obtain them probably isn’t. I think the question really isn’t “What do I want (and how can I make an AI understand that)?” but rather “How do I end up wanting things (and how can I make an AI accurately predict how that process will unfold)?”
I don’t disagree with the first sentence (well, it depends on where you draw the line for “messy”).
I do mostly disagree with the second sentence.
I’m optimistic that we will eventually have a complete answer to your second question. But once we do have that answer, I think we’ll still have a hard time figuring out a specification for what we want the AI to actually do, for the reasons in my comment—in short, if we take a prospective approach (my preferences now determine what to do), then it’s hard because my preferences are self-inconsistent, invalid-out-of-distribution, they might involve ghosts, etc.; or if we take a retrospective approach (the AGI should ensure that the human is happy with how things turned out in hindsight), then we get brainwashing and so on.
I don’t really see how this is a problem. The AI should do something that is no worse, from me-now’s perspective, than whatever I myself would have done. Given that I am probably struggling to figure out how to balance my own inconsistent preferences, it doesn’t seem reasonable to me to expect an AI to do better.
I think also a mixture of prospective and retrospective makes most sense; every choice you make is a trade between your present and future selves, after all. So whatever the AI does should be something that both you-now and you-afterward would accept as legitimate.
Also, my inconsistent preferences probably all agree that becoming more consistent would be desirable, though they would disagree about how to do this; so the AI would probably try to help me achieve internal consistency in a way both me-before (all subagents) and me-after agree upon, through some kind of internal arbitration (helping me figure out what I want) and then act upon that.
And if my preferences involve things that don’t exist or that I don’t understand correctly, the AI may be able to extrapolate the closest real thing to my confused goal (improve the welfare of ghosts → take the utility functions of currently dead people more into account in making decisions), check whether both me-now and me-after would agree that this is reasonable, and then, if so, do that.
Again, we’re assuming for the sake of argument that there’s an AI which completely understands an adult human’s current preferences (which are somewhat inconsistent etc.), and how those preferences would change under different circumstances. We need a specification for what this AI should do right now.
If you’re arguing that there is such a specification which is not messy, can you write down exactly what that specification is? If you already said it, I missed it. Can you put it in italics or something? :)
(Your comment said that the AI “should” or “would” do this or that a bunch of times, but I’m not sure if you’re listing various different consequences of a single simple specification that you have in mind, or if you’re listing different desiderata that must be met by a yet-to-be-determined specification.)
(Again, in my book, Stuart Armstrong research agenda v0.9 counts as rather messy.)
I think out loud a lot. Assume nearly everything I say in conversations like this is desiderata I’m listing off the top of my head with no prior planning. I’m really not good at the kind of rigorous think-before-you-speak that is normative on LessWrong.
A really bad starting point for a specification which almost certainly has tons of holes in it: have the AI predict what I would do up to a given length of time in the future if it did not exist, and from there make small modifications to construct a variety of different timelines for similar things I might instead have done.
In each such timeline predict how much I-now and I-after would approve of that sequence of actions, and maximize the minimum of those two. Stop after a certain number of timelines have been considered and tell me the results. Update its predictions of me-now based on how I respond, and if I ask it to, run the simulation again with this new data and a new set of randomly deviating future timelines.
This would produce a relatively myopic (doesn’t look too far into the future) and satisficing (doesn’t consider too many options) advice-giving AI, which would not have agency of its own but would only help me find courses of action that I like better than whatever I would have done without its advice.
There are almost certainly tons of failure modes here, such as a timeline where my actions seem reasonable at first, but turn me into a different person who also thinks the actions were reasonable, but who otherwise wildly differs from me in a way that is invisible to me-now receiving the advice. But it’s a zeroth draft anyway.
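For what it’s worth, here is that zeroth draft rendered as code, just to pin down the moving parts. Every function in it (predict_default, perturb, approval_now, approval_after) is a hypothetical black box standing in for a capability the AI is assumed to have; nothing here is a real API, and this is a sketch of the idea rather than a workable specification.

```python
from dataclasses import dataclass
from typing import Callable, List

Timeline = List[str]  # a sequence of actions I might take

@dataclass
class AdvisorConfig:
    horizon: int = 30        # how far ahead to look (keeps the advisor myopic)
    n_timelines: int = 100   # how many alternatives to consider (satisficing, not exhaustive)

def advise(
    predict_default: Callable[[int], Timeline],   # what I'd do over `horizon` with no AI
    perturb: Callable[[Timeline], Timeline],      # a small modification of a timeline
    approval_now: Callable[[Timeline], float],    # how much me-now approves of a timeline
    approval_after: Callable[[Timeline], float],  # how much me-after-the-fact approves
    config: AdvisorConfig = AdvisorConfig(),
) -> Timeline:
    """Return the candidate timeline that maximizes the minimum of the two approvals."""
    baseline = predict_default(config.horizon)
    candidates = [baseline] + [perturb(baseline) for _ in range(config.n_timelines)]
    return max(candidates, key=lambda t: min(approval_now(t), approval_after(t)))

# The human reads the suggestion and reacts; the model behind `approval_now` gets
# updated from that reaction, and the loop re-runs on request with fresh perturbations.
```

Keeping the do-nothing baseline among the candidates, and maximizing the minimum of the two approvals, is what enforces both “no worse than what I’d have done myself” and “acceptable to both me-now and me-after.”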
(That whole thing there was another example of me thinking out loud in response to what you said, rather than anything preconceived. It’s very hard for me to do otherwise. I just get writer’s block and anxiety if I try to.)
Gotcha, thanks :) [ETA—this was in response to just the first paragraph]
I also edit my previous comments a lot after I realize there was more I ought to have said. Very bad habit—look back at the comment you just replied to please, I edited it before realizing you’d already read it! I really need to stop doing that...
Oh, it’s fine; plenty of people edit their comments after posting, including me, and I should be mindful of that by not replying immediately :-P As for the rest of your comment:
I think your comment has a slight resemblance to Vanessa Kosoy’s “Hippocratic Timeline-Driven Learning” (Section 4.1 here), if you haven’t already heard of that.
My suspicion is that, if one were to sort out all the details, including things like the AI-human communication protocol, such that it really works and is powerful and has no failure modes, you would wind up with something that’s at least “rather messy” (again, “rather messy” means “in the same messiness ballpark as Stuart Armstrong research agenda v0.9”) (and “powerful” rules out literal Hippocratic Timeline-Driven Learning, IMO).
places which seem like canonical examples of very-probably-messy-territory repeatedly turn out to not be so messy
May I ask for a few examples of this?
The claim definitely seems plausible to me, but I can’t help but think of examples like gravity or electromagnetism, where every theory to date has underestimated the messiness of the true concept. It’s possible that these aren’t really much evidence against the claim but rather indicative of a poor ontology:
People who expect a “clean” territory tend to be shocked by how “messy” the world looks when their original ontology/model inevitably turns out to not fit it very well.
However, it feels hard to differentiate (intuitively or formally) cases where our model is a poor fit from cases where the territory is truly messy. Without being able to confidently make this distinction, the claim that the territory itself isn’t messy seems a bit unfalsifiable. Any evidence of territories turning out to be messy could be chalked up to ill-fitting ontologies.
Hopefully, seeing more examples like the competitive markets for nails will help me better differentiate the two or, at the very least, help me build intuition for why less messy territories are more natural/common.
Thank you!
Is this argument robust in the case of optimization, though?
I’d think that optimization can lead to unpredictable variation that happens to correlate and add up in such a way as to have much bigger effects than noise would have.
It seems like it would be natural for the butterfly argument to break down in exactly the sorts of situations involving agency.
I expect there’s a Maxwell’s Demon-style argument about this, but I have yet to figure out quite the right way to frame it.