I’ll also highlight a comment of Nick Beckstead’s, which you’ve already seen and responded to. I didn’t understand your response.
Let me try from a different angle.
With humans, we see three broad clusters of modification: reproduction, education, and chemistry. Different people are physically constructed in different ways, and so human civilization can evolve through the biological evolution of the humans inside it. The environments that people find themselves in, or choose, leave imprints on those people. Chemicals people ingest can change them: caffeine, alcohol, morphine, heroin. (I would include ‘changing your diet to change your thought processes’ under chemical changes, though the chemical changes from becoming addicted to heroin and from fixing a creatine deficiency look very different.)
For AIs, most of the modification that’s interesting and new will look like the “chemistry” cluster. An AI modifying its source code will look a lot like a human injecting itself with a new drug that it just invented. (Nick_Beckstead’s example of modifying the code of the weather computer is more like education than it is like chemistry.)
This is great because some drugs dramatically improve performance, and so a person on caffeine could invent a super nootropic, and then on the super nootropic invent a cure for cancer and an even better nootropic, and so on. This is terrifying because any drug that adjusts your beliefs or your decision-making algorithm (think of ‘personality’ as a subset of this) dramatically changes how you behave, and might do so for the worse. This is doubly terrifying because these changes might be irreversible: you might take a drug that gets rid of your depression by making you incapable of feeling desire, and then not have any desire to restore yourself! This is triply terrifying because the effects of the drug might be unknown: you might not be able to determine what a drug will do to you until after you take it, and by then it might be too late.
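To make the irreversibility worry concrete, here is a toy numerical sketch (my illustration, not from the comment above): once a modification zeroes out part of the utility function, the question ‘should I undo this?’ gets evaluated by the new utility function, which sees nothing worth fixing.

```python
# Toy illustration (hypothetical numbers and names): why a value-changing
# modification can be irreversible. After the change, "revert" is scored by
# the *new* values, which register no loss, so the agent never reverts.

values = {"desire": 1.0, "comfort": 1.0}

def score(world, values):
    """Utility of a world state under a given set of values."""
    return sum(values[k] * world[k] for k in values)

def take_drug(values):
    """The modification: it removes the capacity to want things."""
    new_values = dict(values)
    new_values["desire"] = 0.0
    return new_values

values_after = take_drug(values)

# Would the post-drug agent pay a small cost to restore its old values?
world_keep   = {"desire": 0.0, "comfort": 1.0}   # stay modified
world_revert = {"desire": 1.0, "comfort": 0.9}   # restore desire at a small cost

print(score(world_keep, values_after))    # 1.0
print(score(world_revert, values_after))  # 0.9 -> it prefers to stay modified
```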
For humans this problem is mostly solved by trial and error followed by pattern-matching: “coffee is okay, crack is not, because Colin is rich and productive and Craig is neither.” That approach is not useful for new drugs, not useful for misclassified old drugs, and not very safe for very powerful systems. The third problem, that the effects might be unknown, is the sort of thing that proofs might help with, except there are technical obstacles to doing that. The Löbstacle is a prominent theoretical one, and while it looks like there are lots of practical obstacles as well, surmounting the theoretical obstacles should help with surmounting the practical ones.
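Since the Löbstacle may be unfamiliar, here is the standard statement of Löb’s theorem and a sketch of why it blocks the obvious proof strategy; the gloss and notation are mine, not part of the comment above.

```latex
\documentclass{article}
\usepackage{amsmath,amssymb}
\begin{document}
% Löb's theorem, for a sufficiently strong theory $T$, where $\Box P$
% abbreviates ``$P$ is provable in $T$'':
\[
  T \vdash (\Box P \rightarrow P) \;\Longrightarrow\; T \vdash P .
\]
% Consequence: a consistent $T$ endorses the soundness step $\Box P \rightarrow P$
% only for sentences $P$ it already proves outright. So an agent reasoning in $T$
% cannot bless a self-modification with the argument ``the new version acts only
% when it has a $T$-proof that the action is safe, and $T$-proofs are sound,
% therefore its actions are safe'': the soundness premise is exactly what $T$
% cannot supply for itself.
\end{document}
```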
Any sort of AGI that’s able to alter its own decision-making process will have the ability to ‘do chemistry on itself,’ and one with stable values will need to have solved the problem of how to do that while preserving its values. (I don’t think that humans have ‘stable’ values; I’d call them something more like ‘semi-stable.’ Whether this is a bug or a feature is unclear to me.)
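As a concrete illustration of ‘doing chemistry on yourself while trying to preserve your values,’ here is a toy Python sketch; the agent, the policies, and the spot-check are hypothetical names I’ve made up, and the spot-check is exactly the evidence-instead-of-proof shortcut that a proof-based approach would aim to replace.

```python
# Toy sketch (hypothetical): an agent that adopts a rewrite of its decision
# procedure only if the rewrite appears to preserve what it values. The check
# here is a spot-check on random situations -- evidence, not proof.

import random

def utility(outcome):
    """The agent's fixed values: here, just a number it wants to maximize."""
    return outcome

def current_policy(situation):
    """Current decision procedure: pick the available action with the best outcome."""
    return max(situation, key=utility)

def proposed_policy(situation):
    """A candidate rewrite: cheaper because it only examines part of the
    situation, so it may pick a worse action."""
    return max(situation[:3], key=utility)

def seems_to_preserve_values(old, new, trials=1000):
    """Spot-check that the new policy never does worse than the old one on
    random situations. Passing this is evidence, not a guarantee."""
    for _ in range(trials):
        situation = [random.randint(0, 100) for _ in range(5)]
        if utility(new(situation)) < utility(old(situation)):
            return False
    return True

# "Take the drug" only if the check passes; with this particular rewrite the
# check will almost certainly fail, and the agent keeps its current policy.
if seems_to_preserve_values(current_policy, proposed_policy):
    current_policy = proposed_policy
```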
I understand where you’re coming from, and I think that you correctly highlight a potential source of concern, and one which my comment didn’t adequately account for. However:
I’m skeptical that it’s possible to create an AI based on mathematical logic at all. Even if an AI with many interacting submodules is dangerous, it doesn’t follow that working on AI safety for an AI based on mathematical logic is promising.
Eliezer’s position is that the default mode for an AGI is failure; i.e. if an AGI is not provably safe, it will almost certainly go badly wrong. In that context, if you accept that “an AI with many interacting submodules is dangerous,” then that’s more or less equivalent to believing that one of the horribly wrong outcomes will almost certainly be realized if an AGI with many submodules is created.
Humans can impose selective pressures on emergent AI’s so as to mimic the process of natural selection that humans experienced.
Humans are not Friendly. They don’t even have the capability under discussion here, namely preserving their values under self-modification; a human-esque singleton would likely be a horrible, horrible disaster.