[Eli’s personal notes. Feel free to comment or ignore.]
My summary of Eliezer’s overall view:
1. I don’t see how you can get cognition to “stack” like that, short of running a Turing machine made up of the agents in your system. But if you do that, then we throw alignment out the window.
2. There’s this strong X-and-only-X problem.
If our agents are perfect imitations of humans, then we do solve this problem. But having perfect imitations of humans is a very high bar that depends on already having a very powerful superintelligence. And now we’re just passing the buck. How is that extremely powerful superintelligence aligned?
If our agents are not perfect imitations, it seems like we have no guarantee of X-and-only-X.
This might still work, depending on the exact ways in which the imitation deviates from the subject, but most of the plausible ways seem like they don’t solve this problem.
And regardless, even if it only deviated in ways that we think are safe, we would want some guarantee of that fact.