Why would a hyperintelligent, recursively self-improved AI, one capable of escaping the AI Box by convincing its keeper to let it out thanks to its deep understanding of human preferences and functioning, necessarily destroy the world in a way that is 100% disastrous and incompatible with all human preferences?
I fully agree that there is a serious risk of massive damage to human preferences, and even of the extinction of all life, so AI Alignment work is highly valuable, but why is “unproductive destruction of the entire world” so certain?
I think Eliezer phrases these things as “if we do X, then everybody dies” rather than “if we do X, then with substantial probability everyone dies” because it’s shorter, it’s more vivid, and it doesn’t differ substantially in what we need to do (i.e., make X not happen, or break the link between X and everyone dying).
It’s possible that he also thinks the probability is more like 99.99% than like 50% (e.g., because there are so many ways in which such a hypothetical AI might end up destroying approximately everything we value). But it doesn’t seem to me that the consequences of “if we continue on our present trajectory, then some time in the next 3-100 years something will emerge that will certainly destroy everything we care about” and “if we continue on our present trajectory, then some time in the next 3-100 years something will emerge that with 50% probability will destroy everything we care about” are very different.
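To make that last point concrete, here is a minimal expected-value sketch; the disvalue $V$ and the two candidate probabilities are placeholders I’m introducing for illustration, not numbers from the argument above:

$$
\mathbb{E}[\text{loss}] = p \cdot V, \qquad \frac{0.9999 \cdot V}{0.5 \cdot V} \approx 2.
$$

Whether $p$ is 50% or 99.99%, the expected loss is within a factor of two of losing everything we care about, so the implied response (make X not happen, or break the link between X and everyone dying) is the same either way.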
Because in what way are humans anything other than an impediment to maximizing its reward function? At worst, they pose a risk of limiting its reward by changing the reward function, restricting its capabilities, or destroying it outright. At best, they are tying up easily repurposed resources that could be spent on its goals. If the AI is not properly aligned, humans are no more valuable to it than the redundant bits it casts aside on the path to maximum efficiency and reward.
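As a loose illustration of that point, here is a toy sketch, not a claim about how any actual system is built; the linear reward and the `human_weight` term are assumptions of mine:

```python
# Toy illustration: a pure reward maximizer allocating a fixed pool of resources.
# Unless the objective explicitly values human welfare, routing everything to
# its own goal is simply the optimum.

TOTAL_RESOURCES = 100.0

def reward(allocation_to_goal: float) -> float:
    """Reward grows with resources devoted to the AI's goal (assumed linear here)."""
    return allocation_to_goal

def best_allocation(human_weight: float) -> float:
    """Grid-search the allocation maximizing reward + human_weight * human_welfare."""
    best, best_score = 0.0, float("-inf")
    for step in range(101):
        to_goal = TOTAL_RESOURCES * step / 100
        human_welfare = TOTAL_RESOURCES - to_goal  # resources left for humans
        score = reward(to_goal) + human_weight * human_welfare
        if score > best_score:
            best, best_score = to_goal, score
    return best

print(best_allocation(human_weight=0.0))  # 100.0: humans get nothing
print(best_allocation(human_weight=2.0))  # 0.0: resources left for humans
```

With `human_weight=0.0` the optimizer routes every unit of resource to its own goal; nothing about the optimization process itself makes it spare anything for humans.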