I interpret your comment as a prediction regarding where new alignment target proposals will come from. Is this correct?
I also have a couple of questions about the linked text:
How do you define the difference between explaining something and trying to change someone’s mind? Consider the case where Bob is asking a factual question. An objectively correct straightforward answer would radically change Bob’s entire system of morality, in ways that the AI can predict. A slightly obfuscated answer would result in far less dramatic changes. But those changes would be in a completely different direction (compared to the straightforward answer). Refusing to answer, while being honest about the reason for refusal, would send Bob into a tailspin. How certain are you that you can find a definition of Acceptable Forms of Explanation that holds up in a large number of messy situations along these lines? See also this.
And if you cannot define such things in a solid way, how do you plan to define "benefit humanity"? PCEV was an effort to define "benefit humanity". And PCEV has been found to suffer from at least one difficult-to-notice problem. How certain are you that you can find a definition of "benefit humanity" that does not suffer from some difficult-to-notice problem?
PS:
Speculation regarding where novel alignment target proposals are likely to come from is very welcome. It is a prediction about things that will probably be fairly observable fairly soon, and it is directly relevant to my work. So I am always happy to hear this type of speculation.