I appreciate the effort and strong-upvoted this post because I think it’s following a good methodology of trying to build concrete gears-level models and concretely imagining what will happen. But I also think this is really very much not what I expect to happen, and that in my model of the world it is quite deeply confused about how this will go (mostly by vastly overestimating the naturalness of the diamond abstraction, underestimating convergent instrumental goals and associated behaviors, and relying too much on the shard abstraction). I don’t have time to write a whole response, but in the absence of a “disagreevote” on posts am leaving this comment.
Thanks. Am interested in hearing more at some point.
I also want to note that insofar as this extremely basic approach (“reward the agent for diamond-related activities”) is obviously doomed for reasons the community already knew about, then it should be vulnerable to a convincing linkpost comment which points out a fatal, non-recoverable flaw in my reasoning (like: “TurnTrout, you’re ignoring the obvious X and Y problems, linked here:”). I’m posting this comment as an invitation for people to reply with that, if appropriate![1]
And if there is nothing previously known to be obviously fatal, then I think the research community moved on too quickly by assuming the frame of inner/outer alignment. Even if this proposal has a new fatal flaw, that implies the perceived old fatal flaws (like “the agent games its imperfect objective”) were wrong / only applicable in that particular frame.
ETA: I originally said “devastating” instead of “convincing.” To be clear: I am looking for courteous counterarguments focused on truth-seeking, and not optimized for “devastation” in a social sense.
That’s not to say you should have supplied it. I think it’s good for people to say “I disagree” if that’s all they have time for, and I’m glad you did.