Which part specifically are you referring to as being overly complicated? What I take to be the primary assertions of the post are:
1. Simulacra may themselves conduct simulation, and advanced simulators could produce vast webs of simulacra organized as a hierarchy.
2. Simulating an agent is not fundamentally different from creating one in the real world.
3. Due to instrumental convergence, agentic simulacra might be expected to engage in resource acquisition. This could take the shape of ‘complexity theft’ as described in the post.[1]
4. The Löbian Obstacle accurately describes why an agent cannot obtain a formal guarantee by inspecting the design of its successor agent (a brief sketch follows this list).
5. For a simulator to be safe, all simulacra need to be aligned, unless we can establish an upper bound of the form “programs at or below this complexity are too simple to be dangerous,” in which case we would only need to consider simulacra above that complexity.
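For readers unfamiliar with point 4, here is a rough sketch of why Löb's theorem blocks the naive "trust whatever the successor proves" strategy. This is my own paraphrase in standard provability-logic notation, not something taken from the post; T stands for whatever formal system the parent agent reasons in.

```latex
% Sketch of the Löbian Obstacle. \Box_T \varphi abbreviates "T proves \varphi".
% Löb's theorem (standard statement, for T extending Peano Arithmetic):
\[
  T \vdash \bigl(\Box_T \varphi \rightarrow \varphi\bigr)
  \;\Longrightarrow\;
  T \vdash \varphi .
\]
% A parent agent reasoning in T that wants to accept any action its successor
% proves safe (in T, or a stronger system) would need the reflection schema,
% for every safety-relevant sentence \varphi:
\[
  T \vdash \Box_T \varphi \rightarrow \varphi .
\]
% By Löb's theorem, each instance of that schema already forces T \vdash \varphi,
% so a consistent T can only endorse it for sentences it proves anyway.
% Hence the parent cannot obtain a blanket formal guarantee about its successor
% by inspecting proofs carried out in its own (or a stronger) system.
```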
I’ll try to justify my approach with respect to one or more of these claims, and if I can’t, I suppose that would give me strong reason to believe the method is overly complicated.
I am disagreeing with the underlying assumption that it’s worthwhile to create simulacra of the sort that satisfy point 2. I expect an AI reasoning about its successor not to simulate it with perfect fidelity; it is much more practical to use approximations, which makes the reasoning process different from actually instantiating the successor.
I expect agentic simulacra to arise without being intentionally simulated: agents are generally useful for solving prediction problems, and across millions of predictions (as would be expected of a product on the order of ChatGPT or its future successors) agentic simulacra are likely to occur. Even if these agents are only approximations, predicting the behavior of an approximated agent can still lead to its preferences being satisfied in the real world (as described in the Hubinger post).
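To make the "millions of predictions" point slightly more concrete, here is a back-of-the-envelope illustration; the per-prediction probability p and the prediction count N are assumed, illustrative numbers rather than figures from the post, and the independence assumption is a simplification.

```latex
% If each prediction independently gives rise to an agentic simulacrum with
% some small probability p, the chance of at least one over N predictions is
\[
  1 - (1 - p)^{N} \;\approx\; 1 - e^{-pN}.
\]
% Illustrative values: p = 10^{-6} and N = 10^{7} give
\[
  1 - e^{-10} \approx 0.99995,
\]
% so even very rare agentic simulacra become near-certain at deployment scale.
```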
The problem I’m interested in is how to ensure that all subsequent agentic simulacra (whether they arise intentionally or otherwise) are safe, which seems difficult to verify formally due to the Löbian Obstacle.
[1] This doesn’t have to be resource acquisition, just any negative action that we could reasonably expect a rational agent to pursue.