I expect a failure mode where people exploit ambiguity and norm-laden concepts to convince themselves of happy fairy tales. I should think more about this and write a comment.
Just wanted to point out that this is already something we need to worry about all the time in alignment. Calling them training stories doesn’t create such a failure mode; it makes the failure mode obvious to people like you and me, who are wary of narrative explanations in science.
Yes. I have the intuition that training stories will make this problem worse. But I don’t think my intuition on this matter is trustworthy (what experience do I have to base it on?) so don’t worry about it. We’ll try it and see what happens.
(to explain the intuition a little bit: With inner/outer alignment, any would-be AGI creator will have to face up to the fact that they haven’t solved outer alignment, because it’ll be easy for a philosopher to find differences between the base objective they’ve programmed and True Human Values. With training stories, I expect lots of people to be saying more sophisticated versions of “It just does what I meant it to do, no funny business.”)