Some people say that, e.g., inner alignment failed for evolution in creating humans. For that claim of historical alignment difficulty to cash out, it feels like humans need to be “winners” of evolution in some sense; otherwise, species that don’t achieve as much agency as humans do would seem like a plausibly more relevant comparison to look at. This is a partial post, playing with the idea without really deciding anything definitive.
Here’s a sensible claim:
CLAIM A: “IF there’s a learning algorithm whose reward function is X, THEN the trained models that it creates will not necessarily explicitly desire X.”
This is obviously true, and every animal, including humans, serves as an example. For most animals it’s trivially true: most animals don’t even know what inclusive genetic fitness is, so obviously they don’t explicitly desire it.
So here’s a stronger claim:
CLAIM B: “CLAIM A is true even if the trained model is sophisticated enough to fully understand what X is, and to fully understand that it was itself created by this learning algorithm.”
This one is true too, and I think humans are the only example we have. I mean, the claim is really obvious if you know how algorithms work etc., but of course some people question it anyway, so it can be nice to have a concrete illustration.
(More discussion here.)
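To make Claim A concrete, here’s a minimal toy sketch (my own construction, purely hypothetical, not something from the discussion above): a hill-climbing “learning algorithm” whose reward function is X produces a trained model that scores well under X, yet the trained artifact is just a table of action preferences, with no representation of X anywhere inside it.

```python
# Hypothetical toy illustration of CLAIM A: the reward function lives in the
# training loop, not in the trained model.
import random

random.seed(0)

N_STATES, N_ACTIONS = 5, 3

def reward_X(state, action):
    """The outer objective X: reward 1 if the action matches state mod N_ACTIONS."""
    return 1.0 if action == state % N_ACTIONS else 0.0

def act(params, state):
    """The trained model: pick the action with the highest preference score."""
    return max(range(N_ACTIONS), key=lambda a: params[state][a])

def average_reward(params):
    return sum(reward_X(s, act(params, s)) for s in range(N_STATES)) / N_STATES

# "Learning algorithm": random hill climbing on a preference table, scored by X.
params = [[random.random() for _ in range(N_ACTIONS)] for _ in range(N_STATES)]
for _ in range(2000):
    s, a = random.randrange(N_STATES), random.randrange(N_ACTIONS)
    candidate = [row[:] for row in params]
    candidate[s][a] += random.gauss(0, 0.5)
    if average_reward(candidate) >= average_reward(params):
        params = candidate

print("average reward under X:", average_reward(params))  # usually close to 1.0
# The trained artifact is only `params`, a table of numbers. reward_X appears
# nowhere in it, so there is nothing in the model that could "explicitly desire" X.
```

Nothing hinges on the details of the toy; the point is just that the reward function shapes the parameters from the outside, so the trained model only comes to represent or “desire” X if such a representation happens to be selected for.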
Neither of those claims has anything to do with humans being the “winners” of evolution. I don’t think there’s any real alignment-related claim that does. Although, people say all kinds of things, I suppose. So anyway, if there’s really something substantive that this post is responding to, I suggest you try to dig it out.
The analogy of evolution failing to produce intelligent minds that value what evolution “values” is an intuition pump that does get used in explaining outer/inner alignment failures, and it’s part of why, in some corners, there’s a general backdrop of outer/inner alignment being seen as so hard.
It’s also used in arguments about the sharp left turn: the capabilities of an optimization process like humans outstripped their alignment to evolutionary objectives, and the worry is that an AI could do the same to us.
Both Eliezer Yudkowsky and Nate Soares use arguments that rely on evolution failing to instill its selection target in us, thus misaligning us with evolution:
https://www.lesswrong.com/posts/wAczufCpMdaamF9fy/my-objections-to-we-re-all-gonna-die-with-eliezer-yudkowsky#Yudkowsky_argues_against_AIs_being_steerable_by_gradient_descent_
https://www.lesswrong.com/posts/GNhMPAWcfBCASy8e6/a-central-ai-alignment-problem-capabilities-generalization
The OP talks about the fact that evolution produced lots of organisms on Earth, of which humans are just one example, and that if we view the set of all life, arguably more of it consists of bacteria or trees than humans. Then this comment thread has been about the question: so what? Why bring that up? Who cares?
Like, here’s where I think we’re at in the discussion:
Nate or Eliezer: “Evolution made humans, and humans don’t care about inclusive genetic fitness.”
tailcalled: “Ah, but did you know that evolution also made bacteria and trees?”
Nate or Eliezer: “…Huh? What does that have to do with anything?”
If you think that the existence on Earth of lots of bacteria and trees is a point that specifically undermines something that Nate or Eliezer said, then can you explain the details?
Oh, I was responding to something different, my apologies.
I wouldn’t go this far yet. E.g. I’ve been playing with the idea that the weighting under which humans “win” evolution is something like adversarial robustness. That just wasn’t a convincing enough weighting to be included in the OP. But if something like it turns out to be correct, then one could imagine that, e.g., humans get outcompeted by something that’s even more adversarially robust, which is basically the standard alignment problem.
Like I did not in fact interject in response to Nate or Eliezer. Someone asked me what triggered my line of thought, and I explained that it came from their argument, but I also said that my point was currently too incomplete.
Meta-level comment: I don’t think it’s good to dismiss original arguments immediately and completely.
Object-level comment, on “Neither of those claims has anything to do with humans being the ‘winners’ of evolution”:
I think it might be more complicated than that:
We need to define what “a model produced by a reward function” means; otherwise the claims are meaningless. If you made just a single update to the model (based on the reward function), calling it “a model produced by the reward function” would be meaningless, because no real optimization pressure was applied (see the toy sketch after these two points). So we do need to define some goal of optimization, which also determines who’s a winner and who’s a loser.
We need to argue that the goal is sensible, i.e. somewhat similar to a goal we might use while training our AIs.
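As a minimal, self-contained illustration of the optimization-pressure point (a toy of my own, purely hypothetical): after a single reward-driven update, a model’s behavior is still essentially its random initialization, so “produced by the reward function” says almost nothing about it; only after many updates has the reward function actually shaped it.

```python
# Hypothetical toy: one update vs. sustained optimization pressure.
import random

def make_model():
    # A "model" is just 10 preference weights; behavior = which weights are positive.
    return [random.gauss(0, 1) for _ in range(10)]

def reward(model):
    # Reward function: count how many weights are positive (max 10).
    return sum(w > 0 for w in model)

def update(model):
    # One hill-climbing step scored by the reward function.
    candidate = model[:]
    candidate[random.randrange(len(candidate))] += random.gauss(0, 0.5)
    return candidate if reward(candidate) >= reward(model) else model

random.seed(0)
model = make_model()
print("initial reward:", reward(model))          # around chance (~5)

print("after 1 update:", reward(update(model)))  # still around chance

shaped = model
for _ in range(500):
    shaped = update(shaped)
print("after 500 updates:", reward(shaped))      # typically near the maximum
```

In the one-update case the reward function barely constrains the model at all, which is the sense in which calling it “a model produced by the reward function” is close to meaningless.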
Here are some things we can try:
We can try defining all currently living species as winners. But is that sensible? Is it similar to a goal we would use while training our AIs? “Let’s optimize our models for N timesteps and then use all surviving models, regardless of any other metrics” ← I don’t think that’s sensible, especially if you use an algorithm which can introduce random mutations into the model. (A toy sketch of this option follows at the end of this comment.)
We can try defining species which avoided substantial changes for the longest time as winners. This seems somewhat sensible, because those species experienced the longest optimization pressure. But then humans are not the winners.
We can define any species that gained general intelligence as a winner. Then humans are the only winners. This is sensible for two reasons. First, with general intelligence, deceptive alignment is possible: if humans knew that Simulation Gods were optimizing organisms for some goal, humans could focus on that goal or kill all competing organisms. Second, many humans (in our reality) value creating AGI more than solving any particular problem.
I think the latter is the strongest counter-argument to “humans are not the winners”.
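For what it’s worth, here’s the toy sketch of the first option, “optimize for N timesteps and keep every survivor” (my own hypothetical construction, not anything from the thread): with random mutation and only a weak survival filter, the survivors typically still vary quite a bit on any other metric, which is the sense in which “all survivors are winners” barely constrains what you get.

```python
# Hypothetical toy: "run for N timesteps and call every survivor a winner."
import random

random.seed(1)

def fitness(genome):
    # Stand-in for "any other metric": fraction of 1s in the genome.
    return sum(genome) / len(genome)

# A population of random bit-string "species".
population = [[random.randint(0, 1) for _ in range(20)] for _ in range(50)]

N = 200
for _ in range(N):
    # Reproduction with random mutation.
    parent = random.choice(population)
    child = [bit ^ (random.random() < 0.05) for bit in parent]
    population.append(child)
    # Weak survival filter: only the single worst individual dies each step.
    population.remove(min(population, key=fitness))

scores = sorted(fitness(g) for g in population)
print("worst survivor:", scores[0], "best survivor:", scores[-1])
# The spread between worst and best survivor is the point: "everything still
# alive after N timesteps" picks out a mixed bag of "winners".
```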
Right, I think there are variants of it that might work out, but there’s also the aspect where some people argue that AGI will turn out to essentially be a bag of heuristics or similar, in which case inner alignment becomes less necessary because the heuristics achieve the outer goal even if they don’t do it as flexibly as they could.
Richard Kennaway asked why I would think along those lines, but the point of the OP isn’t to make an argument about AI alignment; it’s merely to think along those lines. Conclusions can come later, once I’m finished exploring it.