they would then only need a slight preponderance of virtue over vice
This assumes that morality has only one axis, which I find highly unlikely. I would expect the seed to quickly radicalize, becoming more good along the axes where the seed is good, and more evil along the axes where it is evil. Under this model, if the seed comes up good on 51% of randomly chosen axes, I would expect the aligned AI to be good on only 51% of axes as well.
Assuming the axes do interact, and do so inconveniently (for instance, if evil has higher evolutionary fitness, or if self-destruction becomes trivially easy at high capability levels), then an error along any one axis could break the entire system.
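To make the arithmetic concrete, here is a toy sketch (the independence of the axes and the 51% figure are illustrative assumptions, not anything established above): if safety requires coming up good on every axis, a slight per-axis preponderance of virtue buys almost nothing.

```python
# Toy model: n independent moral axes, each coming up "good" with
# probability 0.51. If an error along any one axis can break the
# whole system, we need every axis to come up good at once.
p_good_per_axis = 0.51

for n_axes in (1, 5, 10, 50):
    p_all_good = p_good_per_axis ** n_axes
    print(f"{n_axes:3d} axes -> P(good on every axis) = {p_all_good:.2e}")
```

At ten axes the chance of coming up good on all of them is already down around 0.1%, so under these (admittedly crude) assumptions the slight preponderance does very little work.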
Also, if I do grant the single-axis premise, then I would expect the reverse to hold as well: a seed with a slight preponderance of vice over virtue should yield an AI that stays slightly evil.
One might therefore wish to only share code for the ethical part of the AI
This assumes you can discern the ethical part and that the ethical part is separate from the intelligent part.
Even given that, I still expect massive resources to be diverted from morality towards intelligence: (a) because people want power, and people with secrets are stronger than those without; and (b) because people don’t trust black boxes, and will want to know what’s inside one before it kills them.
Hence civilization would just reinvent the same secrets over and over again, until the time limit runs out.
they would then only need a slight preponderance of virtue over vice
This assumes that morality has only one axis, which I find highly unlikely.
This is a good and important point. A more realistic discussion of aligning with an idealized human agent might consider personality traits, cognitive abilities, and intrinsic values as among the properties of the individual agent that are worth optimizing, and that’s clearly a multidimensional situation in which changes along different dimensions can interact, even in confounding ways.
So perhaps I can make my point more neutrally as follows. There is both variety and uniformity among human beings, regarding properties like personality, cognition, and values. A process like alignment, in which the corresponding properties of an AI are determined by the properties of the human being(s) with which it is aligned, might increase this variety in some ways, or decrease it in others. Then, among the possible outcomes, only certain ones are satisfactory, e.g. an AI that will be safe for humanity even if it becomes all-powerful.
The question is how selective one must be in choosing whom to align the AI with. In his original discussions of this topic, back in the 2000s, Eliezer argued that this is not an important issue, compared to identifying an alignment process that works at all. He gave Al Qaeda terrorists as a contemporary example: with a good enough alignment process, you could start with them as the human prototype and still get a friendly AI, because they have all the basic human traits, and for a good enough alignment process that should be enough to reach a satisfactory outcome. On the other hand, with a bad alignment process, you could start with the best people we have and still get an unfriendly AI.
One might therefore wish to only share code for the ethical part of the AI
This assumes you can discern the ethical part and that the ethical part is separate from the intelligent part.
Well, again we face the fact that different software architectures and development methodologies will lead to different situations. Earlier, it was that some alignment methodologies will be more sensitive to initial conditions than others. Here it’s the separability of intelligence from ethics, or of problem-solving ability from the choice of problems to be solved. There are definitely some AI designs where the two can be cleanly separated, such as an expected-utility maximizer with an arbitrary utility function. On the other hand, it looks very hard to pull apart these two things in a language model like GPT-3.
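For instance, here is a minimal sketch of that clean separation (the toy domain, the names, and the utility functions are mine, purely for illustration): the search machinery below never changes, and the “ethics” lives entirely in the plug-in utility function.

```python
import random
from typing import Callable, Iterable

def expected_utility_maximizer(
    actions: Iterable[str],
    sample_outcome: Callable[[str], float],  # stochastic world model
    utility: Callable[[float], float],       # the separable "ethical part"
    n_samples: int = 2000,
) -> str:
    """Pick the action whose Monte Carlo estimate of E[utility(outcome)] is highest."""
    def estimate(action: str) -> float:
        return sum(utility(sample_outcome(action)) for _ in range(n_samples)) / n_samples
    return max(actions, key=estimate)

# Toy world model: each action yields a noisy payoff.
def sample_outcome(action: str) -> float:
    mean, sd = {"safe": (1.0, 1.0), "risky": (2.0, 5.0)}[action]
    return random.gauss(mean, sd)

# Two different "ethics"; the maximizer itself is untouched.
risk_neutral = lambda x: x
risk_averse = lambda x: min(x, 1.5)  # capped upside penalizes high variance

print(expected_utility_maximizer(["safe", "risky"], sample_outcome, risk_neutral))  # usually "risky"
print(expected_utility_maximizer(["safe", "risky"], sample_outcome, risk_averse))   # usually "safe"
```

Swapping the utility function changes what gets optimized without touching a line of the optimizer, which is exactly the separability that a GPT-3-style model lacks.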
My notion was that the risk of sharing code is greatest for algorithms that are powerful general problem solvers with no internal inhibitions regarding the problems they solve; and that the code most worth sharing is “ethical code” that protects against bad runaway outcomes by acting as an ethical filter.
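A sketch of that architecture, with hypothetical names (the banned-terms rule is a deliberately crude stand-in for real ethical code): the unfiltered solver is the dangerous artifact to withhold, and the filter is the layer one might safely publish.

```python
from typing import Callable, Optional

def ethically_filtered_solver(
    solve: Callable[[str], str],                 # powerful, uninhibited general solver
    is_permissible: Callable[[str, str], bool],  # the shareable "ethical code"
    problem: str,
) -> Optional[str]:
    """Run the solver, but release its output only if the ethical filter approves."""
    solution = solve(problem)
    return solution if is_permissible(problem, solution) else None

def solve(problem: str) -> str:
    return f"plan for: {problem}"  # stand-in for a powerful general solver

def is_permissible(problem: str, solution: str) -> bool:
    banned = ("weapon", "pathogen")
    return not any(term in problem or term in solution for term in banned)

print(ethically_filtered_solver(solve, is_permissible, "design a bridge"))  # released
print(ethically_filtered_solver(solve, is_permissible, "design a weapon"))  # vetoed: None
```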
But even more than that, I would emphasize that the most important thing is just to solve the problem of alignment in the most important case, namely autonomous superhuman AI. So long as that isn’t figured out, we’re gambling on our future in a big way.