Here’s a perhaps dangerous plan to save the world:
1. Have a very powerful LLM, or a more general AI in the simulators class. Make sure that we don’t go extinct during its training (e.g., because some agentic simulacrum somehow takes over during training. I’m not sure if this is possible, but I figured I’d mention it anyway).
2. Find a way to systematically remove the associated waluigis from the superposition that arises when a generic LLM (or simulator) is prompted to simulate a benevolent, aligned, and agentic character.
3. Elicit this agentic benevolent simulacrum in the super-powerful LLM and apply the technique to remove waluigis. The simulacrum must have strong agentic properties to be able to perform a pivotal act. It will, e.g., generate actions according to an aligned goal, and its prompts might be translations of sensory input streams. Give this simulacrum-agent ways to easily act in the world, just in case.
And here’s a story:
Humanity manages to apply the plan above, but there’s a catch. They can’t find a way to eliminate waluigis definitively from the superposition, only a way to make them decidedly unlikely, and more and more unlikely with each prompt. Perhaps the probability of the benevolent god turning into a waluigi falls over time, converging to a relatively small number (e.g., 0.1) over an infinite amount of time.
But there’s a complication: there are different kinds of possible waluigis. Some of them cause extinction, but most of them invert the sign of the actions of the benevolent god-simulacrum, causing S-risk.
A shadowy sect of priests called “negU” finds a theoretical way to reliably elicit extinction-causing waluigis, and tries to do so. The heroes uncover their plan to destroy humanity, and ultimately win. But they realize the shadowy priests have a point, and in a flash of ultimate insight one of the heroes realizes how to collapse all waluigis to an amplitude of 0. The end. [OK, I admit this ending with the flash of insight sucks, but I’m just trying to illustrate some points here.]
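As a side note on the catch in the story: one way to read "more and more unlikely with each prompt, converging to about 0.1" is that the per-prompt chance of flipping into a waluigi shrinks fast enough that the total chance of ever flipping stays bounded. Here is a minimal sketch of that reading. The geometric decay, the starting value 0.01, and the function name are all my own assumptions for illustration, not anything derived from how simulators actually behave.

```python
def cumulative_flip_probability(first_hazard, decay, steps):
    """Chance that at least one flip has happened after `steps` prompts.

    Assumption (illustrative only): the per-prompt flip chance starts at
    `first_hazard` and is multiplied by `decay` after every prompt, and
    flips at different prompts are independent.
    """
    survive = 1.0          # probability that no flip has happened so far
    hazard = first_hazard  # flip chance at the current prompt
    for _ in range(steps):
        survive *= 1.0 - hazard
        hazard *= decay
    return 1.0 - survive


if __name__ == "__main__":
    # Per-prompt hazards 0.01, 0.009, 0.0081, ... sum to 0.01 / (1 - 0.9) = 0.1,
    # so the cumulative flip chance can never exceed 0.1.
    for steps in (1, 10, 100, 1000):
        print(steps, cumulative_flip_probability(0.01, 0.9, steps))
```

Under these assumptions the cumulative chance levels off just below 0.1 instead of creeping toward 1. If the per-prompt chance merely fell toward a fixed positive number, the flip would eventually become near certain, so the story needs the faster kind of decay.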
--------------------
I’m interested in comments. Does the plan fail in obvious ways? Are some elements in the story plausible enough?