What would the Platonically ideal version of this result be (assuming it isn’t just a fluke of CNNs or something)?
Something along the lines of “For any PSPACE-complete zero-sum game, an adversary X with access to a simulator of agent Y has a {complexity function} advantage, in the sense that X can defeat Y using only {total compute to train Y}/{complexity function} compute”?
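A rough sketch of how that might be written down (the symbols $f$, $C_{\mathrm{train}}$, and the win-probability threshold $\varepsilon$ are placeholders I’m inventing for illustration, not part of any actual result):

$$
\exists X:\quad C_{\mathrm{train}}(X) \;\le\; \frac{C_{\mathrm{train}}(Y)}{f(|G|)} \quad\text{and}\quad \Pr[\,X \text{ defeats } Y\,] \;\ge\; 1 - \varepsilon,
$$

for every PSPACE-complete zero-sum game $G$ and every trained agent $Y$, where $X$ is given query access to a simulator of $Y$ and $f$ is the conjectured advantage function.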
Assuming we could prove such a theorem (which we almost certainly can’t, since we can’t even prove P≠NP), what would the implications for AI Alignment be?
“Don’t give your adversary a simulator of yourself they can train against” seems like an obvious one.
Maybe there’s some alternative formulation that yields “don’t worry, we can always shut the SAI down”?