An Idea For Corrigible, Recursively Improving Math Oracles
Math oracles are a special type of AI which answers questions about math: proving theorems, or finding mathematical objects with desired properties and proving that they have them. While a superintelligent math oracle could easily be used to construct a dangerous agent-like AI (for example, by asking it to find an object which is a utility maximizer), it could also be used to empower an FAI research team by solving mathematically well-specified parts of the problem and proving that the AI has the properties it is supposed to have.
There is one glaring problem with a superintelligent math oracle: safely constructing one. To be effective enough, a math oracle would need to identify and solve subproblems, and it would benefit enormously from improving its own problem-solving capability. If constructed carelessly, this would lead to an AI which wanted to take over computers and other resources and turn them to studying the problem it’s interested in.
I propose a structure for a math oracle which I claim can recursively self-improve, but which will not try to take over resources, trick its human operators, or improve itself faster than the intended rate, and which will switch itself off at each self-improvement step and at regular intervals. The overall structure is that you start with a math oracle and repeatedly ask “what would be a better math oracle?”, where “better” is defined in terms of what would happen when the candidate is run on an idealized mathematical model of a computer (rather than on the physics-embedded computer that will actually be used).
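To make the intended control flow concrete, here is a minimal Python sketch of that outer loop. It is illustrative only: `ask`, `check_proof`, and `make_question` are hypothetical placeholders, and the real proposal states the strength requirement about an idealized machine model rather than about anything executed here.

```python
# Illustrative sketch only: `ask` and `check_proof` are hypothetical
# placeholders for the real question-asking interface and proof verifier.

def ask(oracle, question, t):
    """Run `oracle` on `question` for at most t steps of the idealized model;
    return an (answer, proof) pair, or None if it halts without one."""
    raise NotImplementedError

def check_proof(statement, proof):
    """Formal proof verifier; raises if `proof` does not prove `statement`."""
    raise NotImplementedError

def bootstrap(initial_oracle, make_question, strength_schedule, t):
    """Repeatedly ask the current oracle for a better oracle.

    'Better' means: higher answering strength when run on the idealized
    machine model, plus a proof that the proposed successor shuts itself
    down. Each step ends with the current oracle halting; the operators
    check the proof and only then start the successor.
    """
    oracle = initial_oracle
    for target_strength in strength_schedule:
        question = make_question(target_strength, t)
        result = ask(oracle, question, t)        # current oracle halts by step t
        if result is None:                       # no successor found in time: stop
            return oracle
        successor, proof = result
        check_proof(question(successor), proof)  # verified before it is ever run
        oracle = successor
    return oracle
```

The key point of the sketch is that a proposed successor is only ever treated as a mathematical object until its proof checks out.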
$\mathrm{Eval}(p, q, t)$ is the state of an idealized computer running program $p$ on question $q$ after $t$ steps.

We say that a program $p$ shuts down at time $t$ if $\mathrm{Eval}(p, q, t) = \mathrm{HALTED}$ (the machine-state is halted).

$V(\phi, \pi)$ is a proof verifier which returns true if $\pi$ is a valid proof of $\phi$, and false otherwise.

Questions are properties for which you want to find an object which exhibits the property, together with a proof that it does. So a pair $(a, \pi)$ is an answer to a question $q$ if $V(q(a), \pi) = \mathrm{true}$.

An oracle is a program which attempts to answer questions. We say that an oracle program $O$ answers a question $q$ in time $t$ if $\mathrm{Eval}(O, q, t)$ has output $(a, \pi)$ with $V(q(a), \pi) = \mathrm{true}$ and $O$ shuts down at some time $t' \le t$. An oracle program has a strength $S(O, D, t)$ with respect to a probability distribution $D$ over questions and an amount of time $t$, which is the probability that it will answer a question drawn from $D$ in time $t$.

With these definitions in place, we can ask an oracle with strength $s_1$ to make a better oracle with strength $s_2 > s_1$, and to guarantee that it shuts down when finished, by inputting the question $q^{*}(O') := \big(S(O', D, t) \ge s_2\big) \wedge \big(\forall q,\ O' \text{ shuts down on } q \text{ by time } t\big)$.
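The following Python sketch restates these definitions under toy assumptions: `evaluate` and `verify` are hypothetical stand-ins for the idealized machine and the proof checker, and `strength` is a Monte Carlo estimate of $S(O, D, t)$ rather than the exact probability used in the formal question.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional

@dataclass
class MachineState:
    halted: bool                   # True iff the machine-state is HALTED
    output: Optional[Any] = None   # the (answer, proof) pair, if one was written

def evaluate(program: Any, question: Callable, steps: int) -> MachineState:
    """Eval(p, q, t): the state of the *idealized* computer after running
    `program` on `question` for `steps` steps. Hypothetical placeholder."""
    raise NotImplementedError

def verify(statement: Any, proof: Any) -> bool:
    """V(phi, pi): True iff `proof` is a valid proof of `statement`.
    Hypothetical placeholder for a formal proof checker."""
    raise NotImplementedError

def answers(oracle: Any, question: Callable, t: int) -> bool:
    """O answers q in time t: it has halted by step t with output (a, proof)
    such that `proof` proves q(a)."""
    state = evaluate(oracle, question, t)
    if not (state.halted and state.output is not None):
        return False
    a, proof = state.output
    return verify(question(a), proof)

def strength(oracle: Any, sample_question: Callable[[], Callable], t: int,
             samples: int = 1000) -> float:
    """Monte Carlo estimate of S(O, D, t): the probability that the oracle
    answers a question drawn from D within time t."""
    hits = sum(answers(oracle, sample_question(), t) for _ in range(samples))
    return hits / samples

def better_oracle_question(s2: float, sample_question: Callable[[], Callable],
                           t: int) -> Callable:
    """The self-improvement question: a predicate on candidate programs O'.
    In the proposal this is a formal statement (strength >= s2 on the
    idealized model, plus guaranteed shutdown) to be *proven*, not the
    sampling estimate used here for illustration."""
    def q(candidate: Any) -> bool:
        return strength(candidate, sample_question, t) >= s2
    return q
```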
This oracle, when asked to produce successors in this way, will not try to produce successors that probe for weaknesses to escape a box, because instead of optimizing for how well the successor would do in an actual physically embedded system, it is optimizing for how well the successor would do in a counterfactual idealized system. This is not a guarantee, however; it might be that it is easier to produce a system which does probe for such weaknesses, because this behavior arises out of some general principles. This can be somewhat improved by tweaking the strength metric to favor proofs arrived at earlier; this would incentivize it to make an oracle which does not waste time and which does not make use of the extra time between finding a proof of something and reaching the deadline.
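One way that tweaked metric might look, as a sketch under the assumption that earlier answers are discounted geometrically (the discount scheme and the `first_answer_step` helper are illustrative, not part of the proposal):

```python
def first_answer_step(oracle, question, t):
    """Earliest step at which the idealized run of `oracle` on `question`
    has written a verified (answer, proof) pair, or None if no such step
    exists within t steps. Hypothetical placeholder."""
    raise NotImplementedError

def discounted_strength(oracle, sample_question, t, samples=1000, gamma=0.999):
    """A strength variant that favors proofs found earlier: an answer first
    produced at step t_a scores gamma ** t_a instead of 1. Nothing the
    oracle does after finding a proof can raise its score, so there is no
    payoff for using the slack between the proof and the deadline."""
    total = 0.0
    for _ in range(samples):
        t_a = first_answer_step(oracle, sample_question(), t)
        if t_a is not None:
            total += gamma ** t_a
    return total / samples
```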
This is in the vein of the approach I call “epistemic boxing”. The probability distribution over questions corresponds to the external-inputs part of the stochastic model of the box.
Thanks to Anna Salamon for the idea of making an AI which cares about what happens in a counterfactual ideal world, rather than the real world with the transistors in it, as a corrigibility strategy. I haven’t yet been able to find a way to make that idea work for an agent/utility maximizer, but it inspired the idea of doing the same thing in an oracle.
You could have an agent that cares about what an idealised counterfactual human would think about its decisions (if the idealised human had a huge amount of time to think them over). Compare with Paul Christiano’s ideas.
Now, this isn’t safe, but it’s at least something you might be able to play with.
Ah, this is essentially the same as some ideas in https://agentfoundations.org/item?id=622
In practice, the development of such oracles will almost certainly go through weak AI systems that are already useful for theorem proving. I believe these weak systems will be enough to switch over to provably secure kernels, compilers, sandboxing, etc., after which it is much harder for an oracle to break out (it would require hardware bugs).