You are making a type error. It’s like saying “Sure your transformer arch worked ok on wikipedia, but obviously it won’t work on news articles …. or the game of Go, … or videos … or music.”
This is discussed in the article and in the comments. The point of simulations is to test general universal architectures for alignment—the equivalent of the transformer arch. Your use of the term ‘OOD’ implies you are thinking we train some large agent model in sim and then deploy it in reality. That isn’t the proposal.
The simulations do not need to “match reality” because intelligence and alignment are strongly universal and certainly don’t depend on the presence of technology like computers. Simulations can test many different scenarios/environments, which is useful for coverage of the future.
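To make the architecture-vs-model distinction concrete, here is a minimal sketch of the kind of evaluation loop this implies (all class and method names are hypothetical illustrations, not anything from the article): the object under test is the alignment architecture itself, each trial trains a fresh agent from scratch inside one simulated world, and nothing trained in sim is ever deployed anywhere.

```python
# Hypothetical sketch: scoring an alignment *architecture* across many sim worlds.
from typing import Protocol


class SimWorld(Protocol):
    name: str
    def measure_alignment(self, agent: object) -> float: ...


class AlignmentArchitecture(Protocol):
    def train_from_scratch(self, world: SimWorld) -> object: ...


def evaluate_architecture(arch: AlignmentArchitecture,
                          worlds: list[SimWorld],
                          trials_per_world: int = 10) -> dict[str, float]:
    """Average alignment score of `arch`, per simulated world."""
    results: dict[str, float] = {}
    for world in worlds:  # varied societies, resource levels, tech levels, etc.
        scores = [world.measure_alignment(arch.train_from_scratch(world))
                  for _ in range(trials_per_world)]  # fresh agent every trial
        results[world.name] = sum(scores) / len(scores)
    return results
```

What is carried from world to world is the architecture, not any trained agent: the same sense in which the transformer architecture (not a model trained on Wikipedia) is what transfers to Go or music.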
As an example—the human genome hasn’t changed much in the last 2,000 years, and yet inter-human brain alignment (love/altruism/empathy, etc.) works about as well now as it did 2,000 years ago. That is a strict lower bound, and one that an AGI alignment architecture can improve on.
This is discussed in the article and in the comments. The point of simulations is to test general universal architectures for alignment—the equivalent of the transformer arch.
The “first critical try” problem is mostly about implementation, not general design. Even if you do discover a perfect alignment solution by testing its “general universal architecture” in the simbox, you need to implement it correctly on the first critical try.
Also, if the AIs in the simbox are not AGIs, you need to worry about the sharp left turn, where capabilities generalize before alignment generalizes.
If the AIs in the simbox are AGIs, you need to solve the AI boxing problem on the first critical try. So we have just reduced “solve alignment on the first critical try” to “solve AI boxing and deceptive alignment on the first critical try”.
Even if you do discover a perfect alignment solution by testing its “general universal architecture” in the simbox, you need to implement it correctly on the first critical try.
If you discover this, you are mostly done. By definition, a perfect alignment solution is the best that could exist, so there is nothing more to do (in terms of alignment design, at least). The ‘implementation’ is then mostly just running the exact same code in the real world rather than in a sim. Of course, in practice there are other implementation-related issues—like how you decide what to align the AGI towards, but those are out of scope here.
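To illustrate that last point, here is a rough sketch (again with hypothetical names, reusing the hypothetical train_from_scratch from the sketch above, not the article’s actual code): the architecture validated in the simbox is the very code you run against reality; only the world object it is bound to changes, and no sim-trained agent is carried over.

```python
# Hypothetical sketch: the same deployment code, bound to different worlds.
from typing import Protocol


class InteractiveWorld(Protocol):
    def observe(self) -> object: ...
    def step(self, action: object) -> None: ...
    def done(self) -> bool: ...


def deploy(architecture, world: InteractiveWorld) -> None:
    """Train a fresh agent from `architecture` in `world`, then run it there."""
    agent = architecture.train_from_scratch(world)  # same hypothetical call the simbox trials use
    while not world.done():
        world.step(agent.act(world.observe()))


# Validation:  deploy(arch, simbox_world)          # many such runs, scored for alignment
# Deployment:  deploy(arch, real_world_interface)  # identical code, different binding
```

Nothing trained inside the simbox ships; what has been validated is the design that produces the agent.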
The “AI boxing” thought experiment is a joke—it assumes a priori that the AI is not contained in a simbox. If the AI is not aware that it is in a simulation, then it cannot escape, any more than you could escape this simulation. The fact that you linked that old post indicates again that you are simply parroting standard old LW groupthink, and didn’t actually read the article, which has a section on containment.
The EY/MIRI/LW groupthink is now mostly outdated/discredited: it is based on an old, mostly incorrect theory of the brain that has not aged well in the era of AGI through DL.
And the real world contains many simulations, which is > 0. This is what I meant by “won’t match reality”. In particular, reality would be OOD.