Here’s a sensible claim:
CLAIM A: “IF there’s a learning algorithm whose reward function is X, THEN the trained models that it creates will not necessarily explicitly desire X.”
This is obviously true, and every animal, including humans, serves as an example. For most animals it’s trivially true, because most animals don’t even know what inclusive genetic fitness is, so obviously they don’t explicitly desire it.
So here’s a stronger claim:
CLAIM B: “CLAIM A is true even if the trained model is sophisticated enough to fully understand what X is, and to fully understand that it was itself created by this learning algorithm.”
This one is true too, and I think humans are the only example we have. I mean, the claim is really obvious if you know how algorithms work etc., but of course some people question it anyway, so it can be nice to have a concrete illustration.
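To make CLAIM A concrete, here’s a minimal toy sketch (purely illustrative; the setup, names, and numbers are all made up): an outer loop selects policies by a reward function X, but the selected policy’s parameters only encode proxy drives, and nothing inside the policy represents or computes X.

```python
import random

random.seed(0)

def reward_X(policy):
    """Outer objective X ("fitness"): sugar-seeking and warmth-seeking both
    help, but only up to a cap, so the proxies are not identical to X."""
    calories = min(policy["seek_sugar"], 5.0)
    warmth = min(policy["seek_warmth"], 3.0)
    return calories + warmth

def mutate(policy):
    """One hill-climbing step: jitter the proxy weights a little."""
    return {k: v + random.gauss(0, 0.1) for k, v in policy.items()}

# The "model" is just two proxy drives; it has no slot for X at all.
policy = {"seek_sugar": 0.0, "seek_warmth": 0.0}

for _ in range(2000):
    candidate = mutate(policy)
    if reward_X(candidate) > reward_X(policy):  # selection on X happens out here
        policy = candidate

print(policy)  # proxy weights pushed up toward the caps: the policy "wants"
               # sugar and warmth; nowhere does it desire, or even encode, X
```

The analogy is loose, of course, but the structural point survives: selection on X shapes the model’s proxy drives without putting X itself anywhere inside the model.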
(More discussion here.)
Neither of those claims has anything to do with humans being the “winners” of evolution. I don’t think there’s any real alignment-related claim that does. Although, people say all kinds of things, I suppose. So anyway, if there’s really something substantive that this post is responding to, I suggest you try to dig it out.
The evolution analogy, in which evolution failed to produce intelligent minds that value what evolution “values”, is an intuition pump that does get used in explaining outer/inner alignment failures, and it’s part of why, in some corners, there’s a general backdrop of outer/inner alignment being seen as so hard.
It’s also used in the sharp left turn argument: the capabilities of an optimization process (humans) outstripped its alignment to evolutionary objectives, and the worry is that an AI’s capabilities could likewise outstrip its alignment to us.
Both Eliezer Yudkowsky and Nate Soares use arguments that rely on evolution failing to install its selection target inside us, thus misaligning us with evolution:
https://www.lesswrong.com/posts/wAczufCpMdaamF9fy/my-objections-to-we-re-all-gonna-die-with-eliezer-yudkowsky#Yudkowsky_argues_against_AIs_being_steerable_by_gradient_descent_
https://www.lesswrong.com/posts/GNhMPAWcfBCASy8e6/a-central-ai-alignment-problem-capabilities-generalization
The OP talks about the fact that evolution produced lots of organisms on Earth, of which humans are just one example, and that if we view the set of all life, arguably more of it consists of bacteria or trees than humans. Then this comment thread has been about the question: so what? Why bring that up? Who cares?
Like, here’s where I think we’re at in the discussion:
Nate or Eliezer: “Evolution made humans, and humans don’t care about inclusive genetic fitness.”
tailcalled: “Ah, but did you know that evolution also made bacteria and trees?”
Nate or Eliezer: “…Huh? What does that have to do with anything?”
If you think that the existence on Earth of lots of bacteria and trees is a point that specifically undermines something that Nate or Eliezer said, then can you explain the details?
Oh, I was responding to something different, my apologies.
Meta-level comment: I don’t think it’s good to dismiss original arguments immediately and completely.
Object-level comment:
I think whether humans are the “winners” of evolution might be more complicated than that:
We need to define what “a model produced by a reward function” means; otherwise the claims are meaningless. If you made just a single update to the model (based on the reward function), calling it “a model produced by the reward function” is meaningless, because no real optimization pressure was applied. So we do need to define some goal of optimization (which determines who’s a winner and who’s a loser).
We need to argue that the goal is sensible, i.e. somewhat similar to a goal we might use while training our AIs.
Here are some things we can try (a toy comparison of these definitions follows after the list):
We can try defining all currently living species as winners. But is it sensible? Is it similar to a goal we would use while training our AIs? “Let’s optimize our models for N timesteps and then use all surviving models regardless of any other metrics” ← I think that’s not sensible, especially if you use an algorithm which can introduce random mutations into the model.
We can try defining species which avoided substantial changes for the longest time as winners. This seems somewhat sensible, because those species experienced the longest optimization pressure. But then humans are not the winners.
We can define any species which gained general intelligence as winners. Then humans are the only winners. This is sensible for two reasons. First, with general intelligence deceptive alignment is possible: if humans knew that Simulation Gods were optimizing organisms for some goal, humans could focus on that goal or kill all competing organisms. Second, many humans (in our reality) value creating AGI more than solving any particular problem.
I think the latter is the strongest counter-argument to “humans are not the winners”.
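To spell out how much the choice of definition matters, here’s the toy comparison promised above (everything in it is hypothetical: the `Lineage` fields, the lineages listed, and the numbers are made up for illustration). The three definitions pick out different winners from the same fake history:

```python
from dataclasses import dataclass

@dataclass
class Lineage:
    name: str
    alive_at_end: bool           # definition 1: still around after N timesteps
    stable_timesteps: float      # definition 2: time survived without substantial change
    general_intelligence: bool   # definition 3: crossed the capability bar

# Fake history; only the orderings are meant to be suggestive.
history = [
    Lineage("bacteria",   alive_at_end=True,  stable_timesteps=1000.0, general_intelligence=False),
    Lineage("trees",      alive_at_end=True,  stable_timesteps=100.0,  general_intelligence=False),
    Lineage("humans",     alive_at_end=True,  stable_timesteps=0.1,    general_intelligence=True),
    Lineage("trilobites", alive_at_end=False, stable_timesteps=90.0,   general_intelligence=False),
]

winners_1 = [x.name for x in history if x.alive_at_end]            # all survivors
winners_2 = [max(history, key=lambda x: x.stable_timesteps).name]  # longest-stable lineage
winners_3 = [x.name for x in history if x.general_intelligence]    # general intelligence

print(winners_1)  # ['bacteria', 'trees', 'humans'] -- almost everything "wins"
print(winners_2)  # ['bacteria']                    -- humans are not the winners
print(winners_3)  # ['humans']                      -- humans are the only winners
```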