[Meta: Even low-effort engagement, like “known + keyword” or “you misunderstood everything; read <link>” or “go on talking / thinking” is highly appreciated. Stacks grow from the bottom to the top today, unlike x86 or threads on the internet]
------------
Iterative amplification schemes work by having each version i+1 trained by the previous iteration i; and, whenever version i fails at finding a good answer (low confidence in the prediction), punting the question to i−1, until it reaches the human overseer at i=0, which is the ground truth for our purposes. At the same time, the distribution D_i of relevant questions widens at each iteration, as capabilities improve: For example, a very stupid clippy will need to deal with simple linear optimization problems in its supply chain; a very smart clippy will need to figure out a proper plan for world domination.
Now, in order to correctly punt questions down the call-stack, each level i, when faced with a problem q, must decide whether it can solve it (providing training data for level i+1), or whether it must punt downwards. More precisely, level i must recognize whether q is inside the distribution D_{i−1} it was trained with (resp. D_i, the distribution of problems it can generate / handle post-training).
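A minimal sketch of this punt-down dynamic, as I understand it (everything here — the agent interface, the confidence threshold, the dummy answers — is a made-up placeholder for illustration, not part of Paul’s actual scheme):

```python
import random

class DummyAgent:
    """Stand-in for iteration i of the scheme (illustration only)."""
    def __init__(self, level):
        self.level = level

    def predict(self, question):
        # Returns a candidate answer plus a confidence score in [0, 1].
        return f"answer from level {self.level}", random.random()

def answer(question, agents, human_oracle, threshold=0.9):
    """Walk down the call stack: each level answers only if it is confident,
    otherwise it punts the question to the level below; level 0 is the human."""
    for i in range(len(agents) - 1, 0, -1):
        candidate, confidence = agents[i].predict(question)
        if confidence >= threshold:
            return candidate, i          # level i judged q to be inside D_i
        # low confidence: punt down to level i - 1
    return human_oracle(question), 0     # ground truth from the human overseer

agents = [None] + [DummyAgent(i) for i in range(1, 5)]   # index 0 reserved for the human
print(answer("optimize the supply chain", agents, human_oracle=lambda q: "human answer"))
```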
False positives (underconfidence) lead to increased load on the human overseers / capability loss. False negatives (overconfidence) lead to safety failures: If they happen during training, then the machine learns wrong values / ethics / whatever it should learn; if they happen in production, then they lead to bad actions (I am assuming that no one would run the AI unsupervised, but supervision will, by necessity, be pretty light).
This is a problem even with a very shallow call-stack (few iterations), but it gets (possibly exponentially!) worse with deeper stacks: After all, if even a single false-negative mistake occurs along the chain, we have a safety incident; naively, for a fixed false-negative rate at each level, the chance of getting through the chain without a mistake shrinks exponentially in the depth. In other words, we would need to ensure that the infinite series ∑_i log(1−p_i) converges, where p_i is the false-negative rate at level i.
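To make the arithmetic explicit (a minimal derivation, assuming the per-level mistakes are independent):

```latex
% p_i = false-negative rate at level i; levels assumed to fail independently.
P(\text{no false negative along a depth-}n\text{ chain}) \;=\; \prod_{i=1}^{n} (1 - p_i),
\qquad
\log P \;=\; \sum_{i=1}^{n} \log(1 - p_i).
% Keeping P bounded away from 0 as n grows requires \sum_i \log(1 - p_i) > -\infty,
% equivalently \sum_i p_i < \infty; a constant rate p gives P = (1 - p)^n \to 0.
```

So a fixed per-level false-negative rate is not good enough; the rates have to shrink as the stack deepens.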
Something that may make this problem either simpler or harder is the fact that, once a question has been bumped down a couple of times, it can be very far out-of-distribution. This makes it much easier to get mediocre reliability (the questions bumped down from very high levels look very weird to the low levels), but, at the same time, the low levels are really, really not equipped for dealing with these weird questions. In the clippy example, Paul’s scheme would totally allow the high-capability clippy to convince the low levels into accepting its proposal by exploiting their weaknesses (adversarial examples).
In different words: The entire scheme needs really good anomaly detection. Anomaly detection is a machine-learning problem, right? So why not learn it along the way?
The problem is that, while training the anomaly detector at level i, we cannot sample from the distribution D_{i+N}, because we simply don’t know it yet. As we run amplification, we extend both the capability of answering questions and the range of questions that come up.
Note that this would not be a problem if the extensions of the supports of the distributions D_i were very mild, compared to the larger domain Ω_i on which the anomaly detection of level i is reliable-by-training. This looks like a very dangerous assumption on the “mind-space” accessible to our training procedure, for the “problem-space” at hand and the specific training procedure. In other words: No universality of safety amplification, if it relies on this assumption. Please link me to relevant discussions if this assumption has been justified anywhere.
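As a toy illustration (purely illustrative: the Gaussians, the 3-sigma rule and the shifted distributions are all made up, and the whole point is that in reality we cannot sample the shifted distributions at training time):

```python
import numpy as np

rng = np.random.default_rng(0)

# D_i: what the level-i anomaly detector is actually trained / validated on.
train = rng.normal(loc=0.0, scale=1.0, size=10_000)
mu, sigma = train.mean(), train.std()

def is_anomalous(x):
    """Naive detector: flag anything more than 3 sigma from the training mean."""
    return np.abs(x - mu) / sigma > 3.0

# A mild widening of the support (roughly D_{i+1})...
mild_shift = rng.normal(loc=1.0, scale=1.5, size=10_000)
# ...versus a distribution far outside anything the detector was validated on (D_{i+N}).
far_shift = rng.normal(loc=8.0, scale=1.0, size=10_000)

print("flag rate on D_i:       ", is_anomalous(train).mean())      # ~0.003 by construction
print("flag rate on mild shift:", is_anomalous(mild_shift).mean())
print("flag rate on far shift: ", is_anomalous(far_shift).mean())
```

The only reason we can print the last two numbers here is that the toy lets us cheat and sample the shifted distributions; during real amplification training we cannot, which is exactly the gap the “mild extension” assumption is supposed to paper over.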
A different approach would be to not run the scheme in a hierarchical way, but instead to feed it into itself. This reintroduces the problem of “drift”, i.e. instability: The system will acquire a specific culture, and we can only pray (i.e. must assume / ensure) that desirable states are an attractor, for the specific training procedure. I fear that this is extra-hard to check, because meta-stable systems look pretty stable until you increase capability: Bad scaling.
A third approach to solving this problem would be to create, separately, a reliable anomaly detector. This may very well be an AI-complete problem, but it may be simpler than alignment! Further, robust anomaly detection (recognizing that data points are outside of the training distribution) might be an avenue for differential progress that matters more for safety / alignment than for capabilities.
Do you know other approaches for this problem?
------------
When is this harmless?
Well, this is entirely harmless if our problem domain contains only simple questions, where verification of ground truth is dirt cheap. For example, NP questions (like all of mathematics! Proof verification is cheap, after all). This observation should give us a bad feeling, because an AI that is really good at mathematics / programming and bad at everything else is one of the most dangerous situations we could imagine (magical function optimizers, aka mindless AIXI genies, being the only more hopeless case I can think of). On the other hand, typical NP questions don’t scale down: It is currently entirely infeasible to use machine learning for theorem proving, simply because useful transformations are exponentially rare in the space of possible ones (I am aware of some papers using the Mizar library; while the neural net + theorem prover beat the unaided prover, I was less than impressed by the results).
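For concreteness, here is what “verification is dirt cheap” means in the NP sense: checking a proposed certificate is a single linear pass over the instance, even when finding one is hard. (Minimal sketch; the formula and assignment are made up.)

```python
def check_sat(clauses, assignment):
    """Verify a candidate assignment for a CNF formula in O(total clause length).

    clauses: list of clauses, each a list of signed ints (DIMACS style):
             3 means variable 3 must be True, -3 means variable 3 must be False.
    assignment: dict mapping variable -> bool.
    """
    return all(
        any(assignment[abs(lit)] == (lit > 0) for lit in clause)
        for clause in clauses
    )

# Finding a satisfying assignment is NP-hard in general; checking one is trivial:
clauses = [[1, -2], [2, 3], [-1, -3]]
print(check_sat(clauses, {1: True, 2: True, 3: False}))   # True
```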
For problem domains that feel more like EXPTIME, this is more likely to be a problem: Say, training to play games like Go. There, we can play against our ancestors in order to judge performance, and gain access to some kind of ground truth. Unfortunately, (1) strength is not linearly ordered: You can clearly have situations where A beats B beats C beats A, and (2) if we wanted to optimize “strength against perfect play”, aka min-max, then we don’t have access to a perfect opponent during training. AFAIK it is usual for training-through-amplification of game AI to develop “fads”, i.e. cheesy tactics, along the way; sometimes these recur cyclically. This is also observed in the metagame of many multiplayer video games. I have a feeling that the Go successes tell us a lot about how MCTS is amazingly stable against cheesy tactics; and who knows how much tweaking DeepMind had to do until they got the amplification stable.
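The non-transitivity in (1) is easy to exhibit with a toy table of head-to-head win rates (made-up numbers):

```python
# P(row beats column), toy numbers: each agent beats the next one in the cycle.
win_rate = {
    ("A", "B"): 0.7,
    ("B", "C"): 0.7,
    ("C", "A"): 0.7,
}

# "X is stronger than Y" if X wins more than half of their games.
beats = [(x, y) for (x, y), p in win_rate.items() if p > 0.5]
print(beats)   # [('A', 'B'), ('B', 'C'), ('C', 'A')] -- a cycle, so no linear order of strength
```

So “play against your ancestors” gives you some ground truth, but not the ground truth that min-max optimization would actually want.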
Now, safety amplification / value learning has a much, much harder problem: The ground truth is only accessible through examples / very expensive oracle queries (which might be fundamentally unsafe at very high levels of capability: Don’t let human operators talk to an unaligned, too-clever AI).
------------
Post-script: Writing this down in clear words made me update slightly against Paul’s amplification schemes eventually growing into a solution. I still think that Paul’s line of research is damn cool and promising, so I’m more playing devil’s advocate here. The possible differential gain for capability in NP problems versus harder-than-NP alignment for this kind of amplification procedure made me slightly more pessimistic about our prospects in general. Moreover, it makes me rather skeptical about whether amplification is a net win for safety / alignment in the differential-progress view. I want to look more into anomaly detection now, for fun, my own short-term profit, and long-term safety.
------------
> Iterative amplification schemes work by having each version i+1 trained by the previous iteration i; and, whenever version i fails at finding a good answer (low confidence in the prediction), punting the question to i−1, until it reaches the human overseer at i=0, which is the ground truth for our purposes.
There is a dynamic like this in amplification, but I don’t think this is quite what happens.
In particular, the AI at level i-1 generally isn’t any more expensive than the AI at level i. The main dynamic for punting down is some way of breaking the problem into simpler pieces (security amplification requires you to take out-of-distribution data and, after enough steps, to reduce it to in-distribution subtasks), rather than punting to a weaker but more robust agent.
> The problem is that, while training the anomaly detector at level i, we cannot sample from the distribution D_{i+N}, because we simply don’t know it yet. As we run amplification, we extend both the capability of answering questions and the range of questions that come up.
I do agree with the basic point here though: as you do amplification the distribution shifts, and you need to be able to get a guarantee on a distribution that you can’t sample from. I talk about this problem in this post. It’s clearly pretty hard, but it does look significantly easier than the full problem to me.
I think the “false positives” we care about are a special kind of really bad failure: it’s OK if the agent guesses wrong about what I want, as long as it continues to correctly treat its guess as provisional and doesn’t do anything that would be irreversibly bad if the guess is wrong. I’m optimistic that (a) a smarter agent could recognize these failures when it sees them, (b) it’s easy enough to learn a model that never makes such mistakes, and (c) we can use some combination of these techniques to actually learn a model that doesn’t make these mistakes. This might well be the diciest part of the scheme.
I don’t like “anomaly detection” as a framing for the problem we care about because that implies some change in some underlying data-generating process, but that’s not necessary to cause a catastrophic failure.
(Sorry if I misunderstood your comment, didn’t read in depth.)