It seems very plausible to me that alignment targets in practice will evolve out of things like the OpenAI Model Spec. If anyone has suggestions for how to improve that, please DM me.
I interpret your comment as a prediction regarding where new alignment target proposals will come from. Is this correct?
I also have a couple of questions about the linked text:
How do you define the difference between explaining something and trying to change someone’s mind? Consider the case where Bob is asking a factual question. An objectively correct straightforward answer would radically change Bob’s entire system of morality, in ways that the AI can predict. A slightly obfuscated answer would result in far less dramatic changes. But those changes would be in a completely different direction (compared to the straightforward answer). Refusing to answer, while being honest about the reason for refusal, would send Bob into a tailspin. How certain are you that you can find a definition of Acceptable Forms of Explanation that holds up in a large number of messy situations along these lines? See also this.
And if you cannot define such things in a solid way, how do you plan to define "benefit humanity"? PCEV was an effort to define "benefit humanity". And PCEV has been found to suffer from at least one difficult-to-notice problem. How certain are you that you can find a definition of "benefit humanity" that does not suffer from some difficult-to-notice problem?
PS:
Speculation regarding where novel alignment target proposals are likely to come from is very welcome. It is a prediction about something that will probably be fairly observable fairly soon, and it is directly relevant to my work. So I am always happy to hear this type of speculation.