OK, imagine that I make an AI that works like this: a copy of Satan is instantiated, his preferences over possible outputs are ranked into percentiles, and sentences are then randomly sampled from the 2nd-5th percentile of that ranking. Then that copy of Satan is destroyed.
Is the “Satan Reverser” AI misaligned?
Is it “inner misaligned”?
It’s not valid to say that there is no different inner motivation when there could be one. It might be powerless and unimportant in practice, but it can still be a thing. The argument that it’s powerless and unimportant in practice is distinct from the argument that it doesn’t make conceptual sense as a distinct construction. If this distinct construction is there, we should ask how much influence it gets and aim to measure that. Given how little decades of neuroscience have managed on the analogous question for brains, that measurement is a somewhat hopeless endeavor in the medium term.
OK, but as a matter of terminology, is a “Satan Reverser” misaligned because it contains a Satan?
I don’t have a clear sense of the terminology around the edges, or much motivation to care about it once the burden of nuance needed to use a term correctly stops the term from being helpful for communication. I sketched how I think about the situation. Which words I, you, or someone else would use to talk about it is a separate issue.