Submission (CF). Use a counterfactual oracle to send a message to ourselves with a time delay. We choose an episode length $T$ and a set of possible messages $M$. The oracle outputs a time $t_o < T$ and a message $m_o \in M$. We are automatically sent $m_o$ at time $t_o$ unless erasure occurs, in which case at the end of the episode we input a time $t^*$ and a message $m^*$ which we wish we had received at $t^*$. The oracle's reward is then $-(t^* - t_o)^2$ if $m_o = m^*$ and $-\infty$ otherwise. We do not find out that erasure has occurred until the end of the episode, ensuring that events before $t_o$ are the same as in the counterfactual world.
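To make the scoring rule concrete, here is a minimal Python sketch of the reward computed in an erased episode. The function name `oracle_reward` and the use of floats for times and strings for messages are assumptions for illustration only; nothing here is an actual oracle implementation.

```python
import math


def oracle_reward(t_o: float, m_o: str, t_star: float, m_star: str) -> float:
    """Reward for an episode in which erasure occurred.

    t_o, m_o       : time and message output by the oracle
    t_star, m_star : time and message we enter at the end of the episode
    """
    # A wrong message is penalized without bound.
    if m_o != m_star:
        return -math.inf
    # A correct message is penalized by the squared timing error.
    return -((t_star - t_o) ** 2)
```

With a single possible message (the alert case), only the timing term matters, so the oracle is rewarded purely for guessing when we would have wanted the alert.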
Here are some examples, which can work with a single possible message, used as an alert. EDIT: I made the examples slightly more concrete.
Set the episode length equal to the planned duration of a research program and send an alert when continuing the program would produce unworthwhile or dangerous results. The time delay ensures we keep the useful results obtained before $t_o$, whereas receiving a warning at the start of the program could change the outcome. Of course, this relies on us being able to recognize dangerous results by the end of the episode and remaining in contact with the oracle until then.
Use a second counterfactual oracle to predict a series of future stock prices. If its erasure event does not occur, run a time-delayed oracle with the same episode length. We then use the predicted series to make investments. If the time-delayed oracle's erasure event occurs, we input the time at which we think our actions made the remaining predictions too inaccurate to use. When we receive an alert from the time-delayed oracle, we can discard the remaining predictions and/or rerun the oracles.
EDIT: Here are some potentially useful modifications we can make to the oracle:
Have the oracle accept input at any time during the episode, possibly multiple times. If erasure occurs, the last message entered during the episode is used as the target. For example, as a safety measure we could enter a default message targeted at the start of the episode, which stands unless we subsequently update it.
We can allow the episode to be ended early. Other than happening earlier, the scoring mechanism remains the same. If erasure did not occur, nothing happens when we try to end the episode (we would then know that the oracle has made a bad prediction). We might wish to end the episode early if we were waiting for a specific event, or in extreme situations in which we think we will lose access to the oracle.
Instead of the oracle predicting the message and time delay at the start of the episode, have it output $m_o$ at any time $t_o < T$. Erasure/shutdown now also happens at $t_o$, while scoring occurs at the end of the episode as before. This lets the oracle actively make use of observations from the start of the episode until $t_o$, increasing its predictive power (if the oracle is sufficiently powerful, this should not affect its output).
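The first two modifications can be sketched as a small bookkeeping class. This is only an illustration under stated assumptions: the `Episode` class, its method names, and the choice to return `None` when no target was ever entered are all hypothetical, since the text above does not specify that case.

```python
import math
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Episode:
    length: float                                # episode length T
    erased: bool                                 # whether the erasure event occurred
    target: Optional[Tuple[float, str]] = None   # latest (t*, m*) entered so far
    ended_at: Optional[float] = None             # set if the episode is ended early

    def set_target(self, now: float, t_star: float, m_star: str) -> None:
        """Accept input at any time; the last entry overrides earlier ones."""
        if self.ended_at is not None or now > self.length:
            raise RuntimeError("episode is over")
        self.target = (t_star, m_star)

    def end_early(self, now: float) -> bool:
        """End the episode early. Nothing happens if erasure did not occur."""
        if not self.erased or self.ended_at is not None:
            return False
        self.ended_at = min(now, self.length)
        return True

    def score(self, t_o: float, m_o: str) -> Optional[float]:
        """Same scoring rule as before, applied to the last target entered."""
        if not self.erased or self.target is None:
            return None  # assumption: no score without erasure or without a target
        t_star, m_star = self.target
        if m_o != m_star:
            return -math.inf
        return -((t_star - t_o) ** 2)
```

For example, calling `set_target(0.0, 0.0, "alert")` at the start records the default safety message, and a later `set_target(4.0, 6.0, "alert")` replaces it, so an oracle output of `(5.0, "alert")` would score `-1.0` in an erased episode.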