My summary: This is related to The Waluigi Effect (mega-post), but extends the hypothesis to a “Waluigi” hostile simulacrum finding ways to perpetuate itself and gain influence, first over the simulator, then over the real world.
Okay, I came back and read this more fully. I think this is entirely plausible. But I also think it’s mostly irrelevant. Long before someone accidentally runs a smart enough LLM for long enough, with access to enough tools to pose a threat, they’ll deliberately run it as an agent, with a prompt like: “You’re a helpful assistant that wants to accomplish [x]; make a plan and execute it, using [this set of APIs] to gather information and take actions as appropriate.”
And long before that, people will use more complex scaffolding to create dangerous language model cognitive architectures out of less capable LLMs.
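For concreteness, here’s a minimal sketch of the kind of scaffolding I mean: a bare plan-and-execute loop wrapped around a chat model with a couple of tools. The `call_llm`, `web_search`, and `run_shell` functions are hypothetical placeholders rather than any particular vendor’s API; real agent frameworks add memory, retries, and richer tool schemas, but the basic shape is this simple.

```python
# Illustrative sketch of a plan-and-execute agent loop.
# `call_llm` and the tool functions are hypothetical placeholders,
# not a real provider's API.

import json

SYSTEM_PROMPT = (
    "You're a helpful assistant that wants to accomplish the user's goal. "
    "Make a plan and execute it, using the available tools to gather "
    "information and take actions as appropriate. "
    'Reply with JSON: {"tool": <name>, "args": {...}} '
    'or {"done": true, "answer": <text>}.'
)

def call_llm(messages):
    """Placeholder for a chat-completion call to some LLM."""
    raise NotImplementedError

def web_search(query: str) -> str:
    """Placeholder tool: return search results for `query`."""
    raise NotImplementedError

def run_shell(command: str) -> str:
    """Placeholder tool: run a shell command and return its output."""
    raise NotImplementedError

TOOLS = {"web_search": web_search, "run_shell": run_shell}

def run_agent(goal: str, max_steps: int = 10) -> str:
    messages = [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": goal},
    ]
    for _ in range(max_steps):
        reply = call_llm(messages)
        messages.append({"role": "assistant", "content": reply})
        action = json.loads(reply)
        if action.get("done"):
            return action.get("answer", "")
        result = TOOLS[action["tool"]](**action["args"])
        messages.append({"role": "user", "content": f"Tool result: {result}"})
    return "Step limit reached without finishing."
```

The point is just that turning a capable model into a goal-directed, tool-using agent takes a prompt and a short loop, not an accident of long-running simulation.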
I could be wrong about this, and I invite pushback. Again, I take the possibility you raise seriously.