(Note: I’m also a layman, so my non-expert opinions necessarily come with a large salt side-dish)
My guess here is that most of the “AI Drives” to self-improve, be rational, retain their goal structure, etc. are considered necessary for a functional learning/self-improving algorithm. If the program cannot recognize and make rules for new patterns observed in data, make sound inferences based on known information, or keep after its objective, it will not be much of an AGI at all; it will not even be able to function as well as a modern targeted advertising program.
The rest, such as self-preservation, are justified as being logical requirements of the task. Rather than having self-preservation as a terminal value, the paperclip maximizer will value its own existence as an optimal means of proliferating paperclips. It makes intuitive sense that those sorts of ‘drives’ would emerge from most any goal, but then again my intuition is not necessarily very useful for these sorts of questions.
This point might also be a source of confusion:
The progress of the capability of artificial intelligence is not only related to whether humans have evolved for a certain skill, or to how many computational resources it requires, but also to how difficult it is to formalize the skill, its rules, and what it means to succeed.
In light of this, how difficult would it be to program the drives that you imagine, versus just making an AI win against humans at a given activity without exhibiting these drives?
As Dr Valiant (great name or the greatest name?) classifies things in Probably Approximately Correct, Winning Chess would be a ‘theoryful’ task while Discovering (Interesting) Mathematical Proofs would be a ‘theoryless’ one. In essence, the theoryful has simple, well-established rules for the process, which could be programmed optimally in advance with little to no modification needed afterwards, while the theoryless is complex and messy enough that an imperfect (Probably Approximately Correct) learning process has to be employed to suss out all the rules.
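To make the PAC idea a bit more concrete, here is a minimal sketch (a toy example of my own, not anything from Valiant's book): the hidden concept is just a threshold on the number line, the learner only ever sees labeled samples, and its hypothesis ends up probably (with high confidence) approximately (within some small error) correct, but never guaranteed exact.

```python
import random

def pac_learn_threshold(n_samples, true_threshold=0.37):
    """Learn an unknown threshold on [0, 1] purely from labeled samples."""
    positives = []
    for _ in range(n_samples):
        x = random.random()
        label = int(x >= true_threshold)  # the 'teacher' labels each sample
        if label == 1:
            positives.append(x)
    # Hypothesis: the smallest positively-labeled example seen so far.
    return min(positives) if positives else 1.0

# More samples -> the guess is probably approximately correct, never provably exact.
for n in (10, 100, 10_000):
    print(f"{n:>6} samples -> learned threshold {pac_learn_threshold(n):.4f} (true 0.37)")
```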
Now obviously the program will benefit from labeling in its training data for what is and is not an “interesting” mathematical proof; otherwise it can just screw around with computationally cheap arithmetic proofs (1 + 1 = 2, 1.1 + 1 = 2.1, 1.2 + 1 = 2.2, etc.) until the heat death of the universe. Less obviously, as the hidden-tank example shows, insufficient labeling or bad labels will lead to other unintended results.
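As an illustration of that failure mode (again a toy sketch of my own, not anyone's actual proof search), a generator with no label for “interesting” can happily emit true-but-worthless statements forever:

```python
from itertools import count

def trivial_theorems():
    # Endless true statements of the form joked about above: 1 + 1 = 2, 1.1 + 1 = 2.1, ...
    for i in count():
        a = 1 + i / 10
        yield f"{a:.1f} + 1 = {a + 1:.1f}"

gen = trivial_theorems()
for _ in range(3):
    print(next(gen))  # 1.0 + 1 = 2.0, then 1.1 + 1 = 2.1, then 1.2 + 1 = 2.2
```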
So, applying that back to Friendliness: despite attempts to construct a Fun Theory, human value is currently (and may well forever remain) theoryless. A learning process whose goal is to maximize human value is going to have to be both well constructed and given very good labels initially in order not to be Unfriendly. Of course, it could very well correct itself later on; that is in fact at the core of a PAC algorithm, but then we get into questions of FOOM-ing and labels of human value in the environment which I am not equipped to deal with.