My personal interpretation of the hope in pursuing a brain-like AGI research agenda hinges very specifically on not leaving it up to chance whether we stumble into an agentive mind that has compassion/empathy/kindness. I think, for reasons roughly in agreement with the ones you express here, that leaving it to chance is a doomed endeavor.
Here is what I believe:
Relatedly, but more specifically: empathy (and other critical parts of the human variant of niceness) seem(s) critically dependent on quirks in the human architecture.
This summarizes my current belief: I think we must study and replicate the core functionality of those specific empathy-related quirks in order to have any hope of getting empathy-related behaviors. I think this testing should be conducted in carefully secured and censored simulation environments, as described here by Jacob Cannell: https://www.lesswrong.com/posts/WKGZBCYAbZ6WGsKHc/love-in-a-simbox-is-all-you-need
I think the next logical step, where “the agentive mind reflectively notices this game-theoretically suboptimal behavior in itself and edits it out,” is a real risk, but one that can be mitigated by keeping the agent in a secure, information-controlled environment with alarms and security measures to prevent it from self-modifying. In such an environment it could suggest something like an architecture improvement for our next generation of AGIs, but we would analyze that plan carefully before experimenting with it, rather than simply letting the agent spawn new agents.
I think a thornier point, one I feel less confident about, is the risk that the agentive mind “resolves ‘philosophical’ questions very differently” and thus does not generalize niceness into highly abstract realms of thought and planning. This point needs more careful consideration. I don’t think ‘hope for the best’ is a good plan here, but I think we can potentially come up with a plan. And I think we can potentially run iterative experiments and make incremental changes to a safely-contained agentive mind, to try to get closer to a mind that robustly generalizes its hardwired empathy to abstract planning.
So, I think this is definitely not a solution-complete path to alignment. It would be a hopeless path without strong interpretability tools, very safe containment, and the ability to carefully limit the capabilities of the agentic mind during testing with various sorts of impairments. Assuming a superintelligent AGI with no adjustable knobs on its inference speed or intelligence is tantamount to saying, “oops, too late, we already failed.” It’s like trying to plan how to survive a free solo rock climb starting from the assumption that you’ve already slipped from a lethal height and are falling. The hope of success, however slim, lay almost entirely before the slip.