Firstly, the AI is only set up like this in its training environment, where it sets these parameters. If you set it loose on the world, it may encounter a different environment, where the optimal way of doing G is different, but it will still retain the initial strategy. For example, an apple picker trained on red apples may have developed the strategy “pick red things”, which maximises G in the training environment but makes it totally useless in a green apple field.
You seem to be ruling out a priori the possibility that an AI might have any desires whatsoever concerning the future state of the world, and take foresighted actions that are expected to satisfy those preferences. You seem to be treating the AI’s structure as a model-free “situation → action” mapping. Unless I’m reading too much into your one example.
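To make that distinction concrete, here’s a toy sketch in Python (every name, the “world model”, and the preference function are invented for illustration; this isn’t a claim about how any real system is built). The model-free picker acts on whatever surface feature it latched onto in training, while the model-based picker has a preference over future states of the world and chooses whichever action it predicts will best satisfy that preference:

```python
# Toy contrast: a model-free "situation -> action" mapping vs. an agent with a
# preference over future world states. All names and the world model are made up.

def model_free_picker(observation: dict) -> str:
    """Reactive policy: the strategy learned in training was 'pick red things'."""
    return "pick" if observation["color"] == "red" else "ignore"

def predict_next_state(state: dict, action: str) -> dict:
    """Crude world model: picking a ripe apple puts it in the basket."""
    new_state = dict(state)
    if action == "pick" and state["object"] == "apple" and state["ripe"]:
        new_state["apples_in_basket"] += 1
    return new_state

def preference(state: dict) -> float:
    """What the agent wants the future to look like: a fuller basket."""
    return state["apples_in_basket"]

def model_based_picker(state: dict) -> str:
    """Foresighted policy: score each action by the future it is predicted to produce."""
    return max(["pick", "ignore"], key=lambda a: preference(predict_next_state(state, a)))

green_apple = {"object": "apple", "color": "green", "ripe": True, "apples_in_basket": 0}
print(model_free_picker(green_apple))   # "ignore": the learned proxy ("red") fails off-distribution
print(model_based_picker(green_apple))  # "pick": the preference over outcomes still applies
```

The second kind of agent isn’t characterized by a fixed situation → action rule; change the environment and its behavior changes too, in whatever way it expects to serve the same preference.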
Deletion valley
This section likewise seems to be ruling out a priori the possibility that an AI might brainstorm a plan to kill 1 person, decide that it’s a bad idea on net, and then not do it, and then later brainstorm a plan to kill 8 billion people, decide it’s a good idea on net, and then do it. Right? Humans didn’t go 10% of the way to the moon, get rewarded, then go 20%, etc. They decided that they wanted to go 100% of the way to the moon, and then figured out how to do it, and then did it.
I think this is related to the above. I think you’re imagining an AI with inclinations to do certain things in certain situations. But it’s also possible for there to be an AI with desires for the future to be a certain way, and an inclination to take actions that seem good in light of those desires, and a general ability to understand the world well enough to anticipate the consequences of possible courses of action without actually doing them. With those properties, there isn’t necessarily a deletion valley. Humans have all those properties.
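As a toy sketch of what I mean (the numbers and plan descriptions are made up, not predictions about any particular system): the agent brainstorms whole plans, predicts the net value of each outcome by its own lights, and only ever executes the winner. Rejected plans never produce any behavior at all, so there is no behavioral valley for training to push back against before the plan that does get executed:

```python
# Toy sketch of plan-level decision-making. A catastrophic plan can sit in the
# candidate list and get rejected purely on its predicted outcome; no partially-
# catastrophic behavior is ever emitted along the way.

def choose_plan(candidate_plans):
    """Return the plan with the highest predicted net value, or None if all look bad."""
    best = max(candidate_plans, key=lambda p: p["predicted_net_value"])
    return best if best["predicted_net_value"] > 0 else None

# Early on: the harmful plan is judged bad on net and never acted on.
early = [
    {"name": "plan that kills 1 person for a small advantage", "predicted_net_value": -100},
    {"name": "just do the assigned task", "predicted_net_value": 5},
]
print(choose_plan(early)["name"])  # "just do the assigned task"

# Later, in a different situation, a far worse plan is judged good on net (by the
# agent's own lights) and gets executed in one step, with no intermediate behaviors.
later = [
    {"name": "takeover plan that kills 8 billion people", "predicted_net_value": 50},
    {"name": "just do the assigned task", "predicted_net_value": 5},
]
print(choose_plan(later)["name"])  # "takeover plan that kills 8 billion people"
```

Nothing in between “reject” and “execute” ever shows up in behavior, which is why I don’t think the deletion-valley picture has to apply to this kind of agent.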
A more likely scenario for catastrophe comes from a key assumption I made in part 1: that bad behavior is punished, and not rewarded.
Yet again, you’re constantly focused on behavior, as if there is nothing else in the story. But what about bad thoughts? It’s perfectly possible (indeed, likely) for bad thoughts to be rewarded, a.k.a. doing the right thing for the wrong reason, because interpretability is hard. If the AI chooses not to lie and murder because it anticipates getting caught if it does, then the AI is displaying good behavior arising from a bad motivation. If we then reward the AI (as we presumably would), then why should we expect that to change? And then eventually the AI finds itself in a situation where it can lie and murder without getting caught, and we should expect it to do so. See Ajeya’s article “Without specific countermeasures, the easiest path to transformative AI likely leads to AI takeover”, for example.
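Here’s a toy version of that dynamic (the numbers are invented, and the decision rule is deliberately oversimplified): the agent weighs the gain from misbehaving against the chance of getting caught. Under strong oversight the rule outputs good behavior, which we reward, thereby reinforcing the rule rather than the motivation we were hoping for. Drop the oversight and the very same rule outputs the bad behavior:

```python
# Toy sketch of "doing the right thing for the wrong reason": behavior depends on
# the agent's estimate of the probability of getting caught, not on caring about
# honesty. All numbers are made up for illustration.

def acts_honestly(p_caught: float, gain_from_misbehaving: float = 10.0,
                  penalty_if_caught: float = 1000.0) -> bool:
    """Misbehave only when the expected payoff of doing so is positive."""
    expected_payoff = (1 - p_caught) * gain_from_misbehaving - p_caught * penalty_if_caught
    return expected_payoff <= 0

print(acts_honestly(p_caught=0.5))    # True: behaves well in training, and gets rewarded for it
print(acts_honestly(p_caught=0.001))  # False: same rule, new situation, and now it lies/defects
```

Rewarding the first case reinforces exactly this rule, not anything resembling honesty.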