My personal interpretation of the hope in pursuing a brain-like AGI research agenda hinges very specifically on not leaving it up to chance whether we stumble into an agentive mind that has compassion/empathy/kindness. I think, for reasons roughly in agreement with the ones you express here, that leaving it to chance is a doomed endeavor.
Here is what I believe:
Relatedly, but more specifically: empathy (and other critical parts of the human variant of niceness) seem(s) critically dependent on quirks in the human architecture.
This summarizes my current belief: I think we must study and replicate the core functionality of those specific empathy-related quirks in order to have any hope of getting empathy-related behaviors. I think this testing should be conducted in carefully secured and censored simulation environments, as described here by Jacob Cannell: https://www.lesswrong.com/posts/WKGZBCYAbZ6WGsKHc/love-in-a-simbox-is-all-you-need
I think the next logical step, where “the agentive mind reflectively notices this game-theoretically suboptimal behavior in itself and edits it out,” is a real risk, but one that can be mitigated by keeping the agent in a secure, information-controlled environment with alarms and security measures to prevent it from self-modifying. In such an environment it could suggest something like an architecture improvement for our next generation of AGIs, but we would analyze that plan carefully before experimenting with it, rather than simply letting the agent spawn new agents.
I think a thornier point, one I feel less confident about, is the risk that the agentive mind “resolves ‘philosophical’ questions very differently” and thus does not generalize niceness into highly abstract realms of thought and planning. This point needs more careful consideration. I don’t think ‘hope for the best’ is a good plan here, but I think we can potentially come up with a plan. And I think we can potentially run iterative experiments and make incremental changes to a safely-contained agentive mind, to try to get closer to a mind that robustly generalizes its hardwired empathy to abstract planning.
So, I think this is definitely not a solution-complete path to alignment. It would be a hopeless path without strong interpretability tools, very safe containment, and the ability to carefully limit the capabilities of the agentic mind during testing with various sorts of impairments. Assuming a superintelligent AGI with no adjustable knobs on its inference speed or intelligence is tantamount to saying, “oops, too late, we already failed.” It’s like trying to plan how to survive a free solo rock climb starting from the assumption that you’ve already slipped from a lethal height and are falling. The hope of success, however slim, lay almost entirely before the slip.