Yes. I have the intuition that training stories will make this problem worse. But I don’t think my intuition on this matter is trustworthy (what experience do I have to base it on?) so don’t worry about it. We’ll try it and see what happens.
(To explain the intuition a little: with inner/outer alignment, any would-be AGI creator will have to face up to the fact that they haven’t solved outer alignment, because it’ll be easy for a philosopher to find differences between the base objective they’ve programmed and True Human Values. With training stories, I expect lots of people to be saying more sophisticated versions of “It just does what I meant it to do, no funny business.”)