What would the Platonically ideal version of this result be (assuming it isn’t just a fluke of CNNs or something)?
Something along the lines of “For any PSPACE-complete zero-sum game, an adversary X with access to a simulator of agent Y has a {complexity function} advantage, in the sense that X can defeat Y using only {total compute to train Y}/{complexity function} compute”?
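A rough sketch of how that might be written down (the symbols $f$, $C_{\mathrm{train}}$, and the win-probability threshold $\varepsilon$ are placeholders I’m inventing for illustration, not part of any actual result):

$$
\exists X:\quad C_{\mathrm{train}}(X) \;\le\; \frac{C_{\mathrm{train}}(Y)}{f(|G|)} \quad\text{and}\quad \Pr[\,X \text{ defeats } Y\,] \;\ge\; 1 - \varepsilon,
$$

for every PSPACE-complete zero-sum game $G$ and every trained agent $Y$, where $X$ is given query access to a simulator of $Y$ and $f$ is the conjectured advantage function.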
Assuming we could prove such a theorem (which we almost certainly can’t, since we can’t even prove P≠NP), what would the implications for AI Alignment be?
“Don’t give your adversary a simulator of yourself they can train against” seems like an obvious one.
Maybe there’s some alternative formulation that yields “don’t worry, we can always shut the SAI down”?