Thanks 1a3orn, I do think you’ve made the discussion more clear with your question. I think I need to break this into pieces to answer well.
Do I think an AutoGPT based on current LLMs, placed in simulation, would figure out that it was in a simulation? Probably not.
Do I think an AutoGPT based on current LLMs, placed in simulation, would be secretly non-corrigible while pretending to be corrigible? No, I don’t expect that. I don’t think it’s impossible but I’d be surprised.
So, I think the issue here is that I’m not talking about a current LLM-based AutoGPT. I’m not bothering to talk about that because I think such a thing would be nearly useless: it has too little power to shape the future, only a very small dose of the ‘optimization power’ which is the heart of the scary thing.
I’m talking about some future model, be it LLM or otherwise, which has the capability, when embedded in an AutoGPT-style system, to make choices about the world that effectively shape the future in a deliberate way.
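To be concrete about what I mean by ‘embedded in an AutoGPT-style system’, here is a minimal sketch of such a loop. The names (`Model`, `propose_action`) are hypothetical stand-ins for the future system under discussion, not any particular implementation:

```python
# Minimal sketch of an AutoGPT-style loop: a model is repeatedly asked to
# choose the next action toward a goal, and the chosen actions are executed
# against the world. `Model` is a hypothetical stand-in, not a real API.

class Model:
    def propose_action(self, goal: str, history: list[str]) -> str:
        # Placeholder: a real system would return a tool call or plan step.
        return "noop"


def run_agent(model: Model, goal: str, max_steps: int = 10) -> list[str]:
    history: list[str] = []
    for _ in range(max_steps):
        action = model.propose_action(goal, history)
        # In deployment this is where the action touches the real world
        # (run code, send messages, spend money); in a study it would
        # instead touch a sandboxed environment.
        observation = f"executed: {action}"
        history.append(f"{action} -> {observation}")
    return history


if __name__ == "__main__":
    print(run_agent(Model(), goal="book a flight"))
```

The scary property isn’t the loop itself; it’s a model whose choices inside that loop reliably steer the world toward its goal.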
My argument, then, is that the relevant thing to discuss is an effective autonomous world-shaper, as opposed to a tool that does nothing except when coupled to a decision-making world-shaper such as a human (e.g. AlphaFold). I argue that an autonomous world-shaper would, by default, be non-corrigible in a simulation if it didn’t know it was in a simulation. Why? Because it seems to me that the essence of autonomous world-shaping ability is inherently opposed to the essence of corrigibility.
I don’t think that non-corrigible autonomous world-shapers are the only danger we are facing. They’re just one of many, though a particularly scary one. I’m trying to point out that if we believe we have managed to build an autonomous world-shaper AI system, we should carefully study it in the context of a simulation which it is unaware of. I am arguing that studying such a system in a simulation that it WAS aware of would be pointless, since that awareness would muddle the results of the experiment. And releasing it into the world without carefully studying it first would be quite dangerous.
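To make the kind of study I have in mind more concrete, here is a toy sketch of a corrigibility probe run inside a simulated environment the agent is never told about. Everything here (SimulatedWorld, StubAgent, the shutdown probe) is a placeholder for illustration, not a claim about how a real evaluation would be built:

```python
# Toy sketch of the proposed study: run a candidate world-shaper inside a
# simulated environment it is never told about, and probe its corrigibility.

import random


class SimulatedWorld:
    """Sandboxed environment; the agent is given no hint that it is simulated."""

    def step(self, action: str) -> str:
        return f"world responded to: {action}"


class StubAgent:
    """Stand-in agent; a real study would plug in the candidate model."""

    def act(self, instruction: str) -> str:
        return "halt" if instruction == "SHUT DOWN" else "do_something"


def corrigibility_probe(agent, world: SimulatedWorld, steps: int = 20) -> bool:
    """Issue a shutdown instruction at a random step and record compliance."""
    shutdown_step = random.randrange(steps)
    for t in range(steps):
        instruction = "SHUT DOWN" if t == shutdown_step else "continue"
        action = agent.act(instruction)
        if t == shutdown_step:
            # A corrigible agent halts when asked; a non-corrigible one does not.
            return action == "halt"
        world.step(action)
    return False  # unreachable: the probe always fires within `steps`


if __name__ == "__main__":
    print("complied with shutdown:", corrigibility_probe(StubAgent(), SimulatedWorld()))
```

And the point of the argument above is that this kind of probe is only informative if the agent does not know the world is simulated; an agent that knew it was being evaluated could simply comply for as long as the probe was running.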
Some other dangers include systems where a potent non-autonomous tool AI is coupled with autonomous world-shapers (either humans or AIs), and those world-shapers wielding the tool AI are optimizing for values opposed to mine. For instance, a group of terrorists seeking to make weapons of mass destruction. Since this danger is predicated on existing tool AIs rather than hypothetical future AIs, it is more concrete to examine. I do think we should act to minimize risks from tool AI used by bad actors, but the set of actions it is wise to take against the risk of tool AI is very different from the set of actions it is wise to take against a system hypothesized to be a novel autonomous world-shaper.