I don’t expect to use this kind of mechanism for fixing things, and am not exactly sure what it should look like.
Instead, when something goes wrong, you add the data to whatever dataset of experiences you are maintaining (or use amplification to decide how to update some small sketch), and then trust the mechanism that makes decisions from that database.
Basically, the goal is to make fewer errors than the RL agent (in the infinite computing limit), rather than making errors and then correcting them in the same way the RL agent would.
(I don’t know if I’ve followed the conversation well enough to respond sensibly.)
Instead, when something goes wrong, you add the data to whatever dataset of experiences you are maintaining (or use amplification to decide how to update some small sketch), and then trust the mechanism that makes decisions from that database.
By “mechanism that makes decisions from that database” are you thinking of some sort of linguistics mechanism, or a mechanism for general scientific research?
The reason I ask is, what if what went wrong was that H is missing some linguistics concept, for example the concept of implicature? Since we can’t guarantee that H knows all useful linguistics concepts (the field of linguistics may not be complete), it seems that in order to “make fewer errors than the RL agent (in the infinite computing limit)” IDA has to be able to invent linguistics concepts that H doesn’t know, and if IDA can do that then presumably IDA can do science in general?
If the latter (mechanism for general scientific research) is what you have in mind, we can’t really show that meta-execution is hopeless by pointing to some object-level task that it doesn’t seem able to do, because if we run into any difficulties we can always say “we don’t know how to do X with meta-execution, but if IDA can learn to do general scientific research, then it will invent whatever tools are needed to do X”.

Does this match your current thinking?
There is some mechanism the RL agent uses, which doesn’t rest on scientific research. IDA should use the same mechanism.
This may sometimes involve “heuristic X works well empirically, but has no detectable internal structure.” In those cases IDA needs to be able to come up with a safe version of that procedure (i.e. a version that wouldn’t leave us at a disadvantage relative to people who just want to maximize complexity or whatever). I think the main obstacle to safety is if heuristic X itself involves consequentialism. But in that case there seems to necessarily be some internal structure. (This is the kind of thing that I have been mostly thinking about recently.)
There is some mechanism the RL agent uses, which doesn’t rest on scientific research. IDA should use the same mechanism.
How does IDA find such a mechanism, if not by scientific research? RL does it by searching for weights that do well empirically, and William and I were wondering if that idea could be adapted to IDA but you said “Searching for trees that do well empirically is scary business, since now you have all the normal problems with ML.” (I had interpreted you to mean that we should avoid doing that. Did you actually mean that we should try to figure out a safe way to do it?)
I think you need to do some trial and error, and was saying we should be scared of it ( / be careful about it / minimize it, though it’s subtle why minimization might help).
For example, suppose that I put a random 20 gate circuit in a black box and let you observe input-output behavior. At some point you don’t have any options other than guess and check, and no amount of cleverness about alignment could possibly avoid the need to sometimes use brute force.
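As a toy illustration of why brute force is unavoidable here, the sketch below (all names hypothetical, and a smaller circuit than the 20-gate example) hides a random circuit behind a query-only interface and recovers a consistent circuit by guess and check — sampling candidate circuits until one matches every observed input–output pair:

```python
import itertools
import random

def make_random_circuit(n_inputs, n_gates, seed):
    """Build a random circuit as a list of (op, a, b) gates.
    Wire indices < n_inputs refer to inputs; larger indices refer to earlier gate outputs."""
    rng = random.Random(seed)
    ops = ["and", "or", "xor", "nand"]
    gates = []
    for g in range(n_gates):
        n_wires = n_inputs + g
        gates.append((rng.choice(ops), rng.randrange(n_wires), rng.randrange(n_wires)))
    return gates

def run_circuit(gates, bits):
    """Evaluate the circuit on a tuple of 0/1 inputs; the last gate is the output."""
    vals = list(bits)
    for op, a, b in gates:
        x, y = vals[a], vals[b]
        vals.append({"and": x & y, "or": x | y, "xor": x ^ y, "nand": 1 - (x & y)}[op])
    return vals[-1]

# The black box: a hidden random circuit we can only query on inputs.
HIDDEN = make_random_circuit(n_inputs=4, n_gates=6, seed=7)
black_box = lambda bits: run_circuit(HIDDEN, bits)

# Observe input-output behavior on all 16 inputs.
observations = [(bits, black_box(bits)) for bits in itertools.product([0, 1], repeat=4)]

# Guess and check: sample candidate circuits until one fits every observation.
# (The hidden circuit itself is in the search space, so this terminates.)
for seed in range(10_000):
    guess = make_random_circuit(4, 6, seed)
    if all(run_circuit(guess, bits) == out for bits, out in observations):
        break
```

Nothing about the search is clever; once the structure is opaque, matching observed behavior by trial is the only available move.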
I guess what we’re trying to unpack is “the mechanism that makes decisions from that database”, and whether it can be efficient. If you store all experience with no editing, you’re left with the same problem of “what decisions in the tree do you change based on experience (X,y) and how do you change them?” And it seems like the most naive way of doing this is too inefficient (looking at all data at all decision nodes).
If you summarize experience (what I’m interpreting “decide how to update some small sketch” as meaning), then the question is: can you produce that summary efficiently, and use it efficiently at runtime?
The gradient-descent-style solution is meant to gesture at how this could be done, analogous to how we use parametric models rather than non-parametric models in ML for decently sized datasets. In advance, bake in a set of parameters/information that could influence each decision, then do updates by having the overseer look for ways to change the decision parameters. But when we do this in an oversight model, we can make the decisions/parameters interpretable sentences.
Rough example of how I see the overall process going:
1. amplified overseer acts in an environment with a breakable vase, but the overseer does not know that the vase is breakable
2. The first time it acts, some overseer node N looks in the database for any information related to the vase, finds nothing, and proceeds with its decision
3. The vase breaks, and the negative feedback is given to the amplified overseer
4. The amplified overseer runs some process to figure out that it could have made a better decision if the database had contained the knowledge that vases are breakable, and so updates the database. (The thing that Wei Dai and I are trying to unpack is whether this step can be run in a reasonable amount of time, given that each node of the overseer has only a limited understanding of the original decision process.)
5. Second time it acts, the amplified overseer process is the same, up until the node N that does the database lookup is run. This time, the lookup returns that the vase is breakable, so the overseer runs a different decision path and treats the vase with more care.
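The five steps above can be sketched as a loop around an interpretable fact database (`FactDB`, the string facts, and the feedback encoding are all hypothetical stand-ins):

```python
class FactDB:
    """Interpretable experience store the overseer reads (steps 2, 5) and writes (step 4)."""
    def __init__(self):
        self.facts = []

    def lookup(self, topic):
        return [f for f in self.facts if topic in f]

def overseer_act(db, obj):
    # Steps 2 / 5: node N consults the database before deciding.
    if db.lookup(obj):
        return "handle %s with care" % obj
    return "handle %s normally" % obj

def overseer_update(db, obj, feedback):
    # Step 4: amplification distills the experience into a small stored fact,
    # rather than appending the raw experience.
    if feedback < 0:
        db.facts.append("%s is breakable" % obj)

db = FactDB()
first = overseer_act(db, "vase")   # steps 1-2: no facts yet, default decision path
overseer_update(db, "vase", -1)    # steps 3-4: vase breaks, negative feedback stored
second = overseer_act(db, "vase")  # step 5: lookup succeeds, careful decision path
```

The open question in step 4 is exactly the part this sketch glosses over: `overseer_update` here knows which fact to write, whereas the real amplified overseer has to locate the responsible decision with only local understanding of the tree.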
The constraint on the amplification process is that learning the full set of subtasks can’t be that much harder than simply learning the task. There isn’t any constraint on the computation time of the overall tree, which should generally be exorbitant.

Searching for trees that do well empirically is scary business, since now you have all the normal problems with ML.
I propose the following as an example of a task where learning the full set of subtasks is much harder than simply learning the task. Suppose we’re trying to predict quantum mechanical systems, specifically we’re given a molecule and asked to predict some property of it.
How would this work with amplification? If I’m not misunderstanding something, assuming the overseer knows QM, one of the subtasks would be to do a QM simulation (via meta-execution), and that seems much harder for ML to learn than just predicting a specific property. If the overseer does not know QM, one of the subtasks would have to be to do science and invent QM, which seems even harder to learn.
This seems to show that H can’t always produce a transcript for A to do imitation learning or inverse reinforcement learning from, so the only option left for the distillation process is direct supervision?
You don’t have to do QM to make predictions about the particle. The goal is for IDA to find whatever structure allows the RL agent to make a prediction. (The exponential tree will solve the problem easily, but if we interleave distillation steps then many of those subtrees will get stuck because the agent isn’t able to learn to handle them.)
In some cases this will involve opaque structures that happen to make good predictions. In that case, we need to make a safety argument about “heuristic without internal structure that happens to work.”
You don’t have to do QM to make predictions about the particle. The goal is for IDA to find whatever structure allows the RL agent to make a prediction.
My thought here is why try to find this structure inside meta-execution? It seems counterintuitive / inelegant that you have to worry about the safety of learned / opaque structures in meta-execution, and then again in the distillation step. Why don’t we let the overseer directly train some auxiliary ML models at each iteration of IDA, using whatever data the overseer can obtain (in this case empirical measurements of molecule properties) and whatever transparency / robustness methods the overseer wants to use, and then make those auxiliary models available to the overseer at the next iteration?
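A minimal sketch of what that proposal could look like (the linear “property model,” the data, and all names are stand-ins for illustration, not a claim about the actual scheme):

```python
def fit_auxiliary(data):
    """Least-squares fit of property ~ a*feature + b on (feature, property) pairs.
    Stands in for whatever ML model the overseer trains, with whatever
    transparency/robustness methods it chooses to apply."""
    n = len(data)
    sx = sum(x for x, _ in data)
    sy = sum(y for _, y in data)
    sxx = sum(x * x for x, _ in data)
    sxy = sum(x * y for x, y in data)
    a = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    b = (sy - a * sx) / n
    return lambda x: a * x + b

# Iteration k: empirical measurements the overseer obtained (toy data on y = 2x + 1).
measurements = [(1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]
aux_model = fit_auxiliary(measurements)

def amplified_overseer(query, aux_model):
    # Iteration k+1: the overseer delegates the opaque prediction to the
    # vetted auxiliary model instead of rediscovering the theory inside
    # meta-execution.
    return aux_model(query)

prediction = amplified_overseer(4.0, aux_model)
```

The point of the sketch is only the division of labor: the opaque empirical fitting happens once, in a component the overseer trained and vetted directly, and meta-execution just queries it.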
It seems counterintuitive / inelegant that you have to worry about the safety of learned / opaque structures in meta-execution, and then again in the distillation step.
I agree, I think it’s unlikely the final scheme will involve doing this work in two places.
Why don’t we let the overseer directly train some auxiliary ML models at each iteration of IDA, using whatever data the overseer can obtain (in this case empirical measurements of molecule properties) and whatever transparency / robustness methods the overseer wants to use, and then make those auxiliary models available to the overseer at the next iteration?
This is a way that things could end up looking. I think there are more natural ways to do this integration though.
Note that in order for any of this to work, amplification probably needs to be able to replicate/verify all (or most) of the cognitive work the ML model does implicitly, so that we can do informed oversight. There will be opaque heuristics that “just work,” which are discovered either by ML or meta-execution trial-and-error, but then we need to confirm safety for those heuristics.
Ah, right. I guess I was balking at moving from exorbitant to exp(exorbitant). Maybe it’s better to think of this as reducing the size of fully worked initial overseer example problems that can be produced for training, or as increasing the number of amplification rounds that are needed.
So my argument is more an example of what a distilled overseer could learn as an efficient approximation.