Probably not too original but I haven’t seen it clearly written anywhere.
There are several ways to amplify imitators with different safety-performance tradeoffs. This is something to consider when designing IDA-type solutions.
Amplifying by objective time: The AI is predicting what the user(s) will output after thinking about a problem for a long time. This method is the strongest, but also the least safe. It is the least safe because a malign AI might exist in the future, which affects the prediction, creating an attack vector through which a future malign AI can infiltrate the present world. We can try to defend by adding a button for “malign AI is attacking”, but that still leaves us open to surprise takeovers in which there is no chance to press the button.
Amplifying by subjective time: The AI is predicting what the user(s) will output after thinking about a problem for a short time, where at the start they are given the output of a similar process that ran for one iteration fewer. So, this simulates a “groundhog day” scenario in which the humans wake up in the same objective time period over and over, without memory of the previous iterations but with a written legacy. This is weaker than amplifying by objective time, because learning the previous results is overhead, and illegible intuitions might be hard to transmit. It is safer than amplifying by objective time, but if there is some probability of a malign AI being created within the short time period, there is still an attack vector. The malign-AI leakage of this method is roughly proportional to the subjective time of the simulation times the present rate of malign AI takeover, whereas in amplification by objective time the leakage is proportional to the subjective time of the simulation times some average future rate of malign AI takeover. However, by the time we are able to create this benign AI, the present rate of malign AI takeover might itself be considerable.
Amplifying by probability: We allow the user(s) to choose “success” or “failure” (or some continuous metric) after completing their work, and make the AI skew its distribution of predictions toward success. This is similar to amplifying by subjective time without any transmission of information; it is weaker and about as safe. The potential advantage is lower sample complexity: the AI only needs a reliable distribution over outcomes following the initial state, not following subsequent states.
Amplifying by parallelization: The AI is predicting the output of many copies of the user working together, with strictly defined interfaces between the copies, over a time period similar to real time. For example, we can imagine a hierarchical organization where each person gives subtasks to their subordinates. We can then simulate such an organization with a copy of some subset of users in each role. To do this, the AI only needs to learn what a given subset of users would do given a particular task from their supervisors and particular results from their subordinates (a toy sketch of this recursion is given below). This method is weaker than the previous methods since it requires that the task at hand can be parallelized. But it is also the safest, since the rate of malign AI takeover is only amplified by O(1) compared to the background. [EDIT: Actually, it’s not safer than subjective time, because the AI would sample the external world independently for each node in the organization. To avoid this, we would need to somehow define a correspondence between the outcome sets of the worlds in which the user was queried at different nodes, and I don’t know how to do this.]
A complete solution can try to combine all of these methods by simulating a virtual organization whose members can control which method is applied at every point. This way they can strive for the optimal risk-performance balance: parallelize everything that can be parallelized, amplify by time the tasks that cannot, adjust the subjective/objective time balance based on research into malign AI timelines, etc.
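To make the parallelization scheme more concrete, here is a minimal Python sketch of the recursive query structure. The organization shape and the helpers `predict_user_output` and `decompose_task` are illustrative assumptions, not part of the proposal; in particular, deciding how to split a task would really be part of the predicted user's own output, and is only factored out here to keep the sketch short.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Node:
    """One role in the simulated organization: a copy of some subset of users."""
    subordinates: List["Node"] = field(default_factory=list)


def simulate_org(
    node: Node,
    task: str,
    predict_user_output: Callable[[str, List[str]], str],
    decompose_task: Callable[[str, int], List[str]],
) -> str:
    """Predict the output of `node` on `task`.

    The imitator only ever answers one kind of query: what would this
    (subset of) user(s) produce, given a task from their supervisor and
    the results returned by their subordinates? The interfaces between
    copies are strictly defined by the organization's structure.
    """
    # Split the task into one subtask per subordinate (no subtasks at a leaf).
    # In the actual scheme this split would itself be part of the user's output.
    subtasks = decompose_task(task, len(node.subordinates))
    # Recursively predict each subordinate's result on its subtask.
    sub_results = [
        simulate_org(child, subtask, predict_user_output, decompose_task)
        for child, subtask in zip(node.subordinates, subtasks)
    ]
    # Predict what the user at this node does with the task and the results.
    return predict_user_output(task, sub_results)
```

The point of the sketch is that each query to the imitator concerns only a single node over roughly real time, which is why, naively, the background rate of malign AI takeover seemed to be amplified only by O(1) (before the caveat in the edit above).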
More observations about this attack vector (“attack from counterfactuals”). I focus on “amplifying by subjective time”.
The harder the takeoff, the more dangerous this attack vector: During every simulation cycle, the ability to defend against a simulated malign AI depends on the power of the defense system at the beginning of the cycle[1]. On the other hand, the capability of the attacker depends on its power at the end of the cycle. Therefore, if power grows very fast, this is bad news for the defender. Conversely, if power grows very slowly, the defender wins, because defense is probably easier.
Inner control of the anchor makes the system safer: Given a fixed objective-time anchor, the more cycles the simulation goes through, the higher its exposure to the threat, since improving the defense system takes time even if you have all the knowledge. Therefore, we want to give the simulated user the ability to move the objective-time anchor forward in time, in order to benefit from improvements in the defense system.
Additional information about the external world makes the system safer: Suppose our subjective probability of a malign AI successfully attacking during a single simulation cycle is 0.1%. Then performing 1000 cycles naively yields a failure probability of about 63%. However, this is only the case if attacks on different cycles are independent random variables. If the AI knows as much about the possibility of such an attack as we do, this is a justified assumption. But if the AI knows more, then, from our subjective perspective, these variables are positively correlated. In particular, if the AI knows with certainty whether a successful attack would occur or not, the failure probability over 1000 cycles goes back to 0.1%. Therefore, it is beneficial to give the AI extra information about the world, just so it can estimate the probability of attack with more confidence.
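As a sanity check on these numbers, here is a minimal calculation; the 0.1% per-cycle figure is just the hypothetical value from above.

```python
p_cycle = 0.001   # subjective probability of a successful attack in one cycle
n_cycles = 1000

# If attacks in different cycles are independent from our perspective,
# the probability that at least one succeeds compounds over the cycles:
p_fail_independent = 1 - (1 - p_cycle) ** n_cycles
print(f"independent cycles:   {p_fail_independent:.1%}")   # ~63.2%

# If the AI knows with certainty whether an attack would succeed, then from
# our perspective the cycles are perfectly correlated: either all of them
# are safe or all of them are compromised, so the risk stays at p_cycle.
p_fail_correlated = p_cycle
print(f"perfectly correlated: {p_fail_correlated:.1%}")    # 0.1%
```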
[1] More precisely, it is somewhat better than this: if we are at some late cycle in a sequence of cycles anchored to the same point in objective time, the simulated user can use knowledge generated in previous cycles to improve the defense system during the current cycle. So, early cycles contribute more to the risk.
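To illustrate the footnote numerically, here is a toy model in which the per-cycle attack probability decays geometrically because each cycle inherits the previous cycles' defensive knowledge; the starting probability and decay factor are made-up values, and the cycles are treated as independent for simplicity.

```python
p0, decay, n_cycles = 0.001, 0.95, 1000  # illustrative values only

survival = 1.0
cumulative = []
for i in range(n_cycles):
    # Cycle i has attack probability p0 * decay**i: later cycles are safer
    # because the written legacy includes better defenses.
    survival *= 1 - p0 * decay ** i
    cumulative.append(1 - survival)

total_risk = cumulative[-1]
print(f"total risk over {n_cycles} cycles: {total_risk:.2%}")
# Most of that risk is incurred early, e.g. the share from the first 50 cycles:
print(f"share from first 50 cycles: {cumulative[49] / total_risk:.0%}")
```

With these made-up numbers, roughly 90% of the total risk comes from the first 50 of the 1000 cycles, which is the sense in which early cycles dominate.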