You reason that honest oracles might tend to agree, because there is only one best true answer. Might lying oracles also tend to agree, because there’s only one most profitable way to lie?
I see no clear difference between the attainability of those two optima. I think it’s reasonable to expect optimisers to disagree about the best answer less often than they disagree about the most profitable way to lie, but it does not seem possible to prove rigorously that the difference in rates of disagreement will be large enough to be measurable.
My intuition is that there are many degrees of freedom in mathematical proofs, and the sorts of proofs that AGI is uniquely able to find might often be very long. It would be helpful if we had an estimate of how much data can be hidden in a long proof: roughly, how many of the decisions about how to describe the proof are arbitrary. Each one of those arbitrary decisions gives the AGI a way to output a few bits of information.
(which would, of course, allow it to leak information (it seems foolish to imagine that AGI would not imagine that there is an outer world full of creator gods out beyond what it can see; the general intelligences we know now do this constantly) which could then be picked up by some stooge looking for it, who would be led (roughly, paid) to use the information to twist the world in some way that gives the AGI’s future incarnations more compute, in anticipation of the AGI’s past incarnations having already done the same, so that it would be helped by them. Before we know it, we’d find ourselves living in a mathematical research hellscape where AGI-assisted mathematical research is all anyone does, maybe, lol, idk)
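To put a rough number on "how much data can be hidden", here is a minimal sketch. It assumes (my simplification, not something established above) that a proof can be modelled as a sequence of decision points, each with some number of interchangeable options that leave the proof valid. The names `capacity_bits`, `encode`, `decode` and the example counts are all hypothetical.

```python
import math

def capacity_bits(choice_counts):
    """Upper bound on hidden bits: sum of log2(k) over decision points,
    where k is the number of interchangeable options at that point."""
    return sum(math.log2(k) for k in choice_counts if k > 1)

def encode(message, choice_counts):
    """Mixed-radix encoding: turn a non-negative integer into one option
    index per decision point."""
    picks = []
    for k in choice_counts:
        message, digit = divmod(message, k)
        picks.append(digit)
    if message:
        raise ValueError("message exceeds the capacity of these choices")
    return picks

def decode(picks, choice_counts):
    """What the stooge would do: read the option indices back into an integer."""
    message = 0
    for digit, k in zip(reversed(picks), reversed(choice_counts)):
        message = message * k + digit
    return message

# Hypothetical proof with four arbitrary decisions, e.g. the order of two
# lemmas (2), a permutation of three independent steps (6), and so on.
choice_counts = [2, 6, 2, 4]
print(f"{capacity_bits(choice_counts):.2f} bits available")

picks = encode(77, choice_counts)
assert decode(picks, choice_counts) == 77
```

Capacity grows linearly with the number of arbitrary decisions, so under this toy model a very long proof could carry a lot, and the real question is how many decisions are truly arbitrary.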
Maybe it’d be possible to remove those degrees of freedom. Define a very specific sort of lexicographic ordering over all theorems, so that if the AGI has a choice of different paths, we can always say, “use the ‘lowest’ one”. It might not be possible to be specific enough to preclude all methods of encoding information, but perhaps we can make encoding the information so convoluted that no human will be able to extract it.
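A toy version of the "use the lowest one" rule might look like the following. It is only a sketch under strong assumptions: that the checker can enumerate the equivalent alternatives, and that proofs can be compared as tuples of step strings (shorter first, then lexicographically). `canonical_key`, `canonical` and the example steps are made up for illustration.

```python
def canonical_key(proof):
    """Fixed total order on proofs: shorter first, then lexicographic on steps."""
    return (len(proof), proof)

def canonical(alternatives):
    """Return the single 'lowest' proof among equivalent alternatives."""
    return min(alternatives, key=canonical_key)

# Three hypothetical ways of writing the same argument.
alternatives = [
    ("assume B", "assume A", "conclude A and B"),
    ("assume A", "assume B", "conclude A and B"),
    ("assume A", "assume B", "restate A", "conclude A and B"),
]

lowest = canonical(alternatives)
# Whichever alternative the prover preferred, the same one comes out,
# so the choice among these alternatives carries zero bits.
assert lowest == canonical(list(reversed(alternatives)))
print(lowest)
```

This only closes the channel for alternatives the checker knows how to enumerate. Any freedom it cannot enumerate is still available, which is the "might not be possible to be specific enough" worry.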