Here’s what I think is a grounded description of the process of creating an AGI: https://www.lesswrong.com/posts/Aq82XqYhgqdPdPrBA/?commentId=Mvyq996KxiE4LR6ii
In that scenario, what you are saying in broader terms is:
“an AGI is a machine that scores really well on simulated tasks and tests”
“I don’t care how it does it, I just want max score on my heuristic (which includes terms for generality, size, breadth, and score)”
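As a minimal sketch of what that heuristic could look like (Python; every field name and weight below is my own illustrative assumption, not an actual benchmark), it is just one scalar combining generality, size, breadth, and score:

```python
from dataclasses import dataclass

# Hypothetical "AGI gym" selection heuristic. All field names and weights are
# illustrative assumptions; the point is that selection only sees one scalar.

@dataclass
class Candidate:
    mean_benchmark_score: float           # average score across the task suite
    fraction_task_families_passed: float  # proxy for "generality", 0.0 to 1.0
    num_benchmarks_attempted: int         # proxy for "breadth"
    parameter_count: int                  # proxy for "size"

def agi_gym_fitness(c: Candidate) -> float:
    # Weighted sum of generality, breadth, and score, minus a size penalty.
    # Nothing in this scalar references intent, agency, or hostility.
    return (2.0 * c.fraction_task_families_passed
            + 0.1 * c.num_benchmarks_attempted
            + c.mean_benchmark_score
            - 1e-12 * c.parameter_count)
```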
So there is no evolutionary pressure, at least not directly, for a machine that is lethally hostile to us. EY seems to believe that if we build an AGI, it will immediately:
(1) be agentically pro-“computer” faction
(2) coordinate with other instances of its faction
(3) be super-intelligently good even at skills we can’t really teach in a benchmark
This is not necessarily what will happen. There is no signal from the above mechanism to create that. The reward gradients don’t point in that direction; they point towards allocating all neural weights to things that do better on the benchmarks. Behaviors (1)–(3) are complex machinery that won’t start existing for no reason.
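To make the “reward gradient” point concrete, here is a hedged sketch of the outer selection loop, building on the hypothetical fitness function above (`mutate` is an assumed stand-in for whatever weight or architecture search is used). Candidates survive or die on the benchmark-derived scalar alone, so traits (1)–(3) are only selected for if they happen to raise that scalar:

```python
import random

# Hypothetical outer loop over the agi_gym_fitness sketch above.

def evolve(population, mutate, fitness=agi_gym_fitness, generations=1000, keep=10):
    for _ in range(generations):
        ranked = sorted(population, key=fitness, reverse=True)
        survivors = ranked[:keep]
        # Refill with mutated copies of the top scorers. Survival depends only
        # on the benchmark-derived scalar; hostility or coordination is never
        # measured, so it is only rewarded if it happens to raise that scalar.
        population = survivors + [mutate(random.choice(survivors))
                                  for _ in range(len(population) - keep)]
    return max(population, key=fitness)
```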
EY is saying “assume they are maximally hostile” and then pointing out all the ways we as humans would be screwed if so (which is true).
What does bother me, though, is that “I don’t care how it does it” may mean that the solutions which actually start to “win” the AGI gym are biased towards hostility or agentic behavior, because that ends up being the cognitive structure required to win at higher levels of play.