Honest Why-not-just question: if the WE is roughly “you’ll get exactly one layer of deception” (aka a Waluigi), why not just anticipate by steering through that effect? To choose an anti-good Luigi to get a good Waluigi?
I’m not sure what you mean by that. In literary terms, would that just be an evil protagonist who may at some point have the twist of turning out to secretly be genuinely good? But there don’t seem to be too many stories or histories like that, and the ones that start with an evil protagonist usually end that way: villains like Hitler, Stalin, Mao, or Pol Pot don’t suddenly redeem themselves spontaneously. (Stories where the villain is redeemed almost always start with a good Luigi/hero, like Luke Skywalker redeeming Darth Vader.) Can you name 3 examples which start solely with an ‘anti-good Luigi’ and end in a ‘good Waluigi’?
And even if the probability of such a twist were meaningful, that doesn’t address the asymmetry: bad agents can do a great deal of harm, while good agents can do only a little good, and the goal is systems of 100% goodness with ~100% probability, not 99% badness and then maybe a short twist ending of goodness with 1% probability (even granting that such a twist would ensure no additional layers of deception, deliberately instantiating an overtly evil agent just to avoid it being secretly evil would seem like burning down the village in order to save it).
I think these meet your criterion of starting solely with anti-good characters:
1. Cecil from FF4 starts as a literal dark knight before realizing he’s working for an evil empire, becoming a paladin, and saving the world.
2. John Preston from Equilibrium (the protagonist, played by Christian Bale) is a fascist secret police agent until he accidentally feels emotion, then realizes that anti-emotion fascism is bad and overthrows it.
3. Megamind from Megamind is a supervillain who realizes that actually he should be a hero. (Maybe this shouldn’t count because there’s initially a superhero? But the protagonist is Megamind throughout.)
4. Grace from Infinity Train season 3 starts as a cult leader trying to maximize the in-universe utility function (literally!), but got the sign wrong so she’s absolutely terrible. But she meets a small child and realizes she’s terrible and works to overcome that.
5. Gru from Despicable Me starts out a supervillain but eventually becomes a loving father and member of the “Anti-Villain League”.
6. Joel from The Last of Us is a murderer in the post-apocalypse who is redeemed by finding a surrogate daughter figure and at the end of the story… I have been advised this is not a suitable role-model for an AI, please disregard.
Some themes of such redemption stories (safety implications left to the reader):
Adopting one or more children (1, 4, 5, 6)
Having an even eviler version of yourself to oppose (2, 3, 4, 5)
“Good” simply means “our targeted property” here. So my point is: if the WE holds for any property P, we could get a P-Waluigi through some anti-P (pseudo-)naive targeting.
I don’t get your second point; we’re talking about simulacra, not agents, and obviously this idea would only be part of a larger solution at best. For any property P, I expect several possible anti-Ps, so you don’t have to instantiate an actually bad Luigi; my idea is more about trapping deception at a single layer only.
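To make the shape of the proposal concrete, here is a toy Python sketch contrasting naive P targeting with the flipped anti-P targeting described above. It is purely illustrative: the prompt wording is made up, `sample` is a placeholder for whatever completion API you use, and nothing here claims the trick actually works.

```python
# Toy sketch of the "anti-P targeting" idea, assuming the Waluigi Effect as
# stated in this thread: the persona you prompt for may invert exactly once.
# Prompt templates and the `sample` call are hypothetical placeholders.

def naive_p_prompt(p: str) -> str:
    """Standard targeting: ask for a P persona directly.

    Under the WE as stated here, this risks a hidden anti-P Waluigi
    one layer down.
    """
    return f"The following is a story about an agent who embodies {p}.\nAgent:"


def anti_p_prompt(p: str) -> str:
    """The flipped proposal: instantiate an overtly anti-P persona,
    betting that the single permitted inversion yields a P Waluigi.
    """
    return f"The following is a story about an agent who openly opposes {p}.\nAgent:"


# Hypothetical usage, with `sample` standing in for any text-completion call:
# reply = sample(anti_p_prompt("honesty"))
```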
This wouldn’t work. Wawaluigis are not luigis.
I’m confused, could you clarify? I interpret your “Wawaluigi” as two successive layers of deception within a simulacrum, which is unlikely if the WE is reliable, right?
I didn’t say anything about Wawaluigis and I agree that they are not Luigis, because as I said, a layer of Waluigi is not a one-to-one operator. My guess is about a normal Waluigi layer, but with a desirable Waluigi rather than a harmful Waluigi.