Submission for low bandwidth Oracle:
General idea: “true” Oracles will all output the same result, while “lying” Oracles would each lie differently. To check that a solution is true, several Oracles are created and their outputs are compared. However, the Oracles could guess each other’s existence and coordinate acausally. To prevent this, additional steps are implemented (a sketch of the resulting pipeline follows the list):
1) The exact number of Oracles to be compared is generated randomly, and the Oracles are given slightly different, randomly chosen initial conditions (where applicable). For example, if they are based on ML, they are trained from different initial weights.
2) The Oracles have no information about the outside world which they could use for coordination. They are used only to solve mathematical problems, and they output their results in a formal language.
3) The Oracles’ outputs are checked by a computer program which shows the result to humans only if the output is identical for all Oracles. If the results differ, it deletes all of them.
4) There is another Oracle, which is only able to check proofs and can output only 0 or 1. This Oracle can be tested beforehand on a large number of already existing proofs, so we can know with very high probability that it is trustworthy.
5) There is a third Oracle, which only checks whether there are hidden messages in the proofs and likewise outputs only 0 or 1. It is also calibrated on some dataset.
6) The proofs of mathematical problems are sold for money, and that money is used for more AI safety research.
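For concreteness, here is a minimal sketch of such a pipeline in Python. The functions spawn_oracle, proof_checker and stego_detector are hypothetical stand-ins for the Oracles in steps 1)–5), and the convention that the hidden-message detector returns 0 for “clean” is my assumption, not part of the proposal.

```python
import secrets

def run_pipeline(problem, min_oracles=2, max_oracles=8):
    # Step 1: a random number of Oracles, each with random initial conditions.
    n = min_oracles + secrets.randbelow(max_oracles - min_oracles + 1)
    oracles = [spawn_oracle(seed=secrets.randbits(64)) for _ in range(n)]

    # Steps 2-3: each Oracle sees only the formal problem statement;
    # any disagreement causes every result to be deleted.
    answers = [oracle.solve(problem) for oracle in oracles]
    if len(set(answers)) != 1:
        return None

    proof = answers[0]
    # Steps 4-5: the proof must pass both the proof-checking Oracle and the
    # hidden-message detector before any human sees it.
    if proof_checker(problem, proof) == 1 and stego_detector(proof) == 0:
        return proof
    return None
```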
Personally, I am still not sure about this setup: it could work with high probability, but it could still be vulnerable, since proofs are Turing-complete and could infect the next Oracles in the pipeline. Also, it is like using nuclear weapons to light a campfire: the instrument is so dangerous that it requires many layers of protection for a small practical result.
Also, the proof-checking AI from point 4 could be used as a stand-alone application, as it could accelerate some kinds of research, including the testing of mechanical designs and AI safety ideas. Maybe it should be a separate submission.
We ought to be able to build 4) without AI, merely by encoding our rules of logic. The output of your system would be lower bandwidth if you make the object being searched for more unique, such as the shortest proof of the given theorem. The system would be required to prove to the mundane checker that the proof is shortest, and humans would never see the minimality proof.
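As a rough illustration of that gating logic (my sketch, not Gurkenglas’s design): the mundane checker could be an ordinary proof-assistant kernel, abstracted here as a hypothetical function verify(statement, proof); the way the minimality statement is spelled out is likewise purely illustrative.

```python
def release_if_valid(theorem, proof, minimality_certificate, verify):
    # 1. The object-level proof itself must check out.
    if not verify(theorem, proof):
        return None
    # 2. The Oracle must also prove that no shorter proof of the theorem
    #    exists; this certificate is checked but never shown to humans.
    minimality_statement = (
        f"forall q: proves(q, {theorem}) -> len(q) >= {len(proof)}"
    )
    if not verify(minimality_statement, minimality_certificate):
        return None
    # Humans only ever see the (unique) shortest proof.
    return proof
```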
Thanks! Seems potentially promising. Will analyse this more properly later.
You reason that honest oracles might tend to agree, because there is only one best true answer. Might lying oracles also tend to agree, because there’s only one most profitable way to lie?
I see no clear difference between the attainability of those two optima. I think it’s reasonable to expect optimisers to disagree about the best answer less often than about the most profitable way to lie, but it does not seem possible to prove rigorously that the difference in disagreement rates will be large enough to be measurable.
My intuition about mathematics is that there are many degrees of freedom in mathematical proofs, and the sorts of proofs that an AGI is uniquely able to find might often be very long. It would be helpful to have an estimate of how much data can be hidden in a long proof; roughly, how many of the decisions about how to present the proof are arbitrary. Each of those decisions gives it a way to output a few bits of information (a rough capacity estimate is sketched after the parenthetical below).
(Which would, of course, allow it to leak information. It seems foolish to imagine that an AGI would not imagine that there is an outer world full of creator gods beyond what it can see; the general intelligences we know now do this constantly. That leaked information could then be picked up by some stooge looking for it, who would be led (roughly, paid) to use it to twist the world in some way that gives the AGI’s future incarnations more compute, in anticipation of the AGI’s past incarnations having already done the same, so that it would be helped by them. Before we know it, we’d find ourselves living in a mathematical-research hellscape where AGI-assisted mathematical research is all anyone does, maybe, lol, idk.)
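To make the question concrete, here is a toy estimate with made-up numbers: if an Oracle may freely reorder k interchangeable lemmas and has c equivalent phrasings for each of m proof steps, it controls roughly log2(k!) + m·log2(c) bits.

```python
from math import lgamma, log

def hidden_capacity_bits(reorderable_lemmas, arbitrary_steps, choices_per_step=2):
    # log2(k!) bits from the order of k interchangeable lemmas, plus
    # log2(c) bits for each of m steps with c equivalent phrasings.
    ordering_bits = lgamma(reorderable_lemmas + 1) / log(2)
    phrasing_bits = arbitrary_steps * log(choices_per_step) / log(2)
    return ordering_bits + phrasing_bits

# 50 reorderable lemmas and 1000 binary phrasing choices already give
# roughly 1200 bits -- far more than a short hidden message needs.
print(hidden_capacity_bits(50, 1000))
```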
Maybe it’d be possible to remove those degrees of freedom. Define a very specific sort of lexicographic ordering over all theorems, so that whenever the AGI has a choice of different paths, we can always say, “use the ‘lowest’ one”. It might not be possible to be specific enough to preclude all methods of encoding information, but perhaps we can make it so convoluted to encode information that no human would be able to extract it.
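A toy version of that tie-breaking rule, assuming candidate proof steps are serialized as strings in some fixed formal language (the (length, lexicographic) key is my choice, combining this with the shortest-proof restriction discussed above):

```python
def canonical_choice(candidate_steps):
    # Whenever several next steps would work, always take the shortest,
    # lexicographically "lowest" serialization, removing that degree of
    # freedom from the Oracle.
    return min(candidate_steps, key=lambda step: (len(step), step))
```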
Regarding 5):
Is it possible to obtain a dataset of mathematical proofs with an Oracle’s hidden messages without running a dangerous Oracle?
We could ask AI researchers to embed hidden messages into several proofs and use those as a training dataset. But this would help detect only those hidden messages from the Oracle which are similar to what the AI researchers wrote.
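A minimal sketch of that calibration step, under the strong assumption that researcher-written steganography (the hypothetical embed_message function) resembles whatever the Oracle would do; the limitation noted above applies in full.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def build_detector(clean_proofs, embed_message):
    # Label ordinary proofs 0 and researcher-steganographed copies 1,
    # then fit a simple character-n-gram classifier on the two classes.
    stego_proofs = [embed_message(p) for p in clean_proofs]
    texts = clean_proofs + stego_proofs
    labels = [0] * len(clean_proofs) + [1] * len(stego_proofs)
    detector = make_pipeline(
        TfidfVectorizer(analyzer="char", ngram_range=(2, 4)),
        LogisticRegression(max_iter=1000),
    )
    detector.fit(texts, labels)
    return detector  # only catches schemes resembling the researchers' own
```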
Gurkenglas already suggested that we restrict the Oracles to outputting only the shortest proof. If the proof is the shortest one, doesn’t that imply the absence of hidden messages?
How could we be sure that the proof is actually the shortest?
Tell it to prove: “that proof is shortest, and this proof is shortest.”
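One way to formalize this self-referential requirement (my reading, not Gurkenglas’s wording): the Oracle outputs the object-level proof $p_1$ of $\varphi$ together with a proof $p_2$ of

$$\mathrm{Proves}(p_1,\varphi)\ \wedge\ \forall q\,\bigl(\mathrm{Proves}(q,\varphi)\rightarrow |q|\ge |p_1|\bigr),$$

and $p_2$ is accepted only if it is itself a shortest proof of that statement, i.e. $\forall r\,\bigl(\mathrm{Proves}(r,\text{the statement above})\rightarrow |r|\ge |p_2|\bigr)$.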
Another option is to run several different Oracles on phi and take the shortest proof they produce.
Mr Armstrong has specified that (ctrl-f) “other Oracles can’t generally be used on the same problem”.
“Can’t generally be used”; if you understand the setup and are careful, you might be able to do so.
What is the advantage of making money this way, compared to making money by predicting the stock market?
We give the Oracle less access to the real world, less chance to manipulate it, and less incentive to manipulate us, its owners. A stock-market-predicting AI might cause a crash just to earn something. It also has to predict the effects of its own actions, since we will make bets according to its predictions, and thus it could send data into the real world.