Can you give an example of natural language instruction (for humans operating on small inputs) that can’t be turned into a formal algorithm easily?
Any set of natural language instructions for humans operating on small inputs can be turned into a lookup table by executing the human on all possible inputs (multiple times on each input, if you want to capture a stochastic policy).
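To make the lookup-table point concrete, here is a minimal sketch; everything in it (query_human, VOCAB, MAX_LEN, SAMPLES_PER_INPUT) is a hypothetical placeholder rather than anything specified in the discussion:

```python
import itertools
from collections import Counter

def query_human(input_tokens):
    # Placeholder for posing one small input to the human H and recording
    # their answer; in practice this is a human following the instructions.
    ...

VOCAB = ["a", "b", "c"]   # every token H might be shown (illustrative)
MAX_LEN = 3               # "small inputs" = short token sequences
SAMPLES_PER_INPUT = 10    # repeat each query to capture a stochastic policy

def build_lookup_table():
    table = {}
    for length in range(1, MAX_LEN + 1):
        for inp in itertools.product(VOCAB, repeat=length):
            # Store the empirical distribution over H's answers for this input.
            table[inp] = Counter(query_human(inp) for _ in range(SAMPLES_PER_INPUT))
    return table
```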
Present H with the following query: “Consider the sentence [s1] w [s2]”, and have the agent launch queries of the form “Consider the sentence [s1] w [s2], where we take w to have meaning m”. Now, you could easily produce this behaviour algorithmically if you have a dictionary. But in a world without dictionaries, suitably preparing a human to answer this query takes much less effort than producing a dictionary.
Suppose we wanted to have IDA (with security amplification) translate a pair of natural languages at least as well as the current best machine translator (which I believe is based on deep learning, trained on sentence pairs), and suppose the human overseer H can translate this pair of languages at an expert level, better than the machine translator. The only way I know how to accomplish this is to have IDA emulate the deep learning translator at a very low level, with H acting as a “human transistor” or maybe a “human neuron”, and totally ignore what H knows about translation including the meanings of words. Do you know a better way than this, or do you have an argument or intuition that a better way is possible?
I guess the above is addressing a slightly different issue than my original question, so to go back to that, your answer is not very satisfying because we do live in a world where dictionaries exist, and dictionaries aren’t that expensive to produce even if they didn’t exist. Can you think of other examples? Or have other explanations of why you think giving a human a bunch of natural language instructions might be much easier than writing down a formal algorithm (to the extent that it would be worth the downsides of having H be a human instead of a formal algorithm)?
The only way I know how to accomplish this is to have IDA emulate the deep learning translator at a very low level, with H acting as a “human transistor” or maybe a “human neuron”, and totally ignore what H knows about translation including the meanings of words.
The humans definitely don’t need to emulate the deep learning system. They could use a different way of translating that reaches a higher performance than the deep learning system, which will then be copied.
dictionaries aren’t that expensive to produce even if they didn’t exist. Can you think of other examples?
You could do the same thing with grammatical phrases of length ≤ 10.
The humans definitely don’t need to emulate the deep learning system. They could use a different way of translating that reaches a higher performance than the deep learning system, which will then be copied.
Do you have such a way in mind, or just think that IDA will eventually figure out such a way if amplified enough? If the latter, am I correct in thinking that IDA will be generally superhuman at that point, since you and I can’t think of such a way?
You could do the same thing with grammatical phrases of length ≤ 10.
I’m having trouble visualizing what the human is actually doing in this case. Can you or someone give natural language instructions that would let me know what to do as H? (Oh, please know that I didn’t entirely understand what William was saying either, but didn’t think it was important to ask.)
Do you have such a way in mind, or just think that IDA will eventually figure out such a way if amplified enough? If the latter, am I correct in thinking that IDA will be generally superhuman at that point, since you and I can’t think of such a way?
I think that a naive approach would probably work. We discussed what that might look like at the MIRI workshop we were both at. I’m imagining a breakdown into (source text --> meaning) and (meaning --> target text), with richer and richer representations of meaning (including connotation, etc.) as you amplify further. To implement (source text --> meaning) you would ask things like “What are the possible meanings of phrase X?” and try to represent that meaning in terms of the meaning of the constituents. To do that, you might ask questions like “Is X likely to be an idiom? If so, what are the plausible meanings?” or “Can X be produced by a grammatical production rule, and if so how does its meaning relate to the meaning of its constituents?” or so on. To answer one of those questions you might say “What are some sentences in the database where constructions like X occur?”, “What are some possible meanings of X in context Y?” and “Is meaning Z consistent with the usage of X in context Y?” To answer the latter, you’d have to answer subquestions like “What are the most surprising consequences of the assertion X?” And so on.
(Hopefully it’s clear enough how you could use aligned sentences in a similar framework, though then the computation won’t factor as cleanly through meanings.)
Unsurprisingly, this gets really complicated. I think the easiest methodology to explore feasibility (other than actually implementing these decompositions) is to play the iterative game where you suggest a task that seems hard to decompose and I suggest a decomposition (thereby broadening the space of subtasks the system needs to solve). My intuition has been produced by playing this kind of game and it seeming very hard to get stuck. Given that the trees quickly become exponentially large and varied, it seems very difficult to provide a large tree.
Can you or someone give natural language instructions that would let me know what to do as H?
“List the most plausible meanings you can think of for the expression X”, e.g.
Q: List a plausible meaning for the expression “something you design for the present”
A: One candidate meaning is {an expression $2 which refers to a thing $3 and whose use implies that {there is something $1 satisfying {{{the speaker of {$2}} is addressing {$1}} and {{$1} regularly performs the action $4={design {$3} for the purpose {being used at the time {the time when {$4} occurs}}}}}}}, another is...
Where the {}’s represent pointers to submessages. The instructions describe the semantics for representing meaning, plus some guidance about the desiderata for answering questions, and so forth.
A similar example:
“List some facts that relate X, Y, and Z”, e.g.
Q: List some facts that relate humans, age, and the ocean.
A: One fact is {young humans do not know how to swim in the ocean}, another is...
Obviously things are a lot more complicated than this, but hopefully those examples illustrate how a human can be doing useful work while still operating on inputs that are small enough to be safe.
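As one possible reading of the {} notation (my own illustrative sketch, not a specification from the comment above), a meaning could be represented as a small tree of messages whose arguments are pointers to submessages that are never expanded in front of H:

```python
from dataclasses import dataclass, field

@dataclass
class Message:
    """A template plus pointers to submessages, e.g. "{0} do not know how to {1}"."""
    template: str
    pointers: list["Message"] = field(default_factory=list)

    def render_shallow(self) -> str:
        # H only sees the top-level template with opaque pointer slots,
        # never the fully expanded meaning (which may be too large to be safe).
        return self.template.format(*(f"<ptr{i}>" for i in range(len(self.pointers))))

# e.g. {young humans do not know how to swim in the ocean}
fact = Message("{0} do not know how to {1}",
               [Message("young humans"), Message("swim in the ocean")])
```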
To implement (source text --> meaning) you would ask things like “What are the possible meanings of phrase X?” and try to represent that meaning in terms of the meaning of the constituents. To do that, you might ask questions like “Is X likely to be an idiom? If so, what are the plausible meanings?” or “Can X be produced by a grammatical production rule, and if so how does its meaning relate to the meaning of its constituents?” or so on.
This points to another potential problem with capability amplification: in order to reach some target capability via amplification, you may have to go through another capability that is harder for ML to learn. In this case, the target capability is translation, and the intermediate capability is linguistic knowledge and skills. (We currently have ML that can learn to translate, but AFAIK not learn how to apply linguistics to recreate the ability to translate.) If this is true in general (and I don’t see why translation might be an exceptional case) then capability amplification being universal isn’t enough to ensure that IDA will be competitive with unaligned AIs, because in order to be competitive with state of the art AI capabilities (which can barely be learned by ML at a certain point in time) it may have to go through capabilities that are beyond what ML can learn at that time.
in order to reach some target capability via amplification, you may have to go through another capability that is harder for ML to learn
This is a general restriction on iterated amplification. Without this restriction it would be mostly trivial—whatever work we could do to build aligned AI, you could just do inside HCH, then delegate the decision to the resulting aligned AI.
In this case, the target capability is translation, and the intermediate capability is linguistic knowledge and skills. (We currently have ML that can learn to translate, but AFAIK not learn how to apply linguistics to recreate the ability to translate.)
If your AI is able to notice an empirical correlation (e.g. word A cooccurs with word B), and lacks the capability to understand anything at all about the causal structure of that correlation, then you have no option but to act on the basis of the brute association, i.e. to take the action that looks best to you in light of that correlation, without conditioning on other facts about the causal structure of the association, since by hypothesis your system is not capable enough to recognize those other facts.
If we have an empirical association between behavior X (pressing a sequence of buttons related in a certain way to what’s in memory) and our best estimate of utility, we might end up needing to take that action without understanding what’s going on causally. I’m still happy calling this aligned in general: the exact same thing would happen to a perfectly motivated human assistant trying their best to do what you want, who was able to notice an empirical correlation but was not smart enough to notice anything about the underlying mechanism (and sometimes acting on the basis of such correlations will be bad).
In order to argue that our AI leads to good outcomes, we need to make an assumption not only about alignment but about capability. If the system is aligned it will be trying its best to make use of all of the information it has to respond appropriately to the observed correlation, to behave cautiously in light of that uncertainty, etc. But in order to get a good outcome, and even in order to avoid a catastrophic outcome, we need to make some assumptions about “what the AI is able to notice.”
(Ideally IDA could eventually serve as an adequate operationalization of “smart enough to understand X” and similar properties.)
These include assumptions like “if the AI is able to cook up a plan that gets high reward because it kills the human, the AI is likely to be able to notice that the plan involves killing the human” and “the AI is smart enough to understand that killing the human is bad, or sufficiently risky that it is worth behaving cautiously and checking with the human” and “the AI is smart enough that it can understand when the human says ‘X is bad’.” Some of these we can likely verify empirically. Some of them will require more work to even state cleanly. And there will be some situations where these assumptions simply aren’t true, e.g. because there is an unfortunate fact about the world that introduces the linkage (plan X kills humans) --> (plan X looks good on paper) without telling you anything about why.
I’m currently considering these problems out of scope for me because (a) there seems to be no way to have a clever idea about AI that avoids this family of problems without sacrificing competitiveness, (b) they would occur with a well-motivated human assistant, (c) we don’t have much reason to suspect that they are particularly serious problems compared to other kinds of mistakes an AI might make.
(I don’t really care whether we call them “alignment” problems per se, though I’m proposing defining alignment such that they wouldn’t be.)
We discussed what that might look like at the MIRI workshop we were both at.
I guess I didn’t learn/understand it well enough for it to stick in my mind.
(Hopefully it’s clear enough how you could use aligned sentences in a similar framework, though then the computation won’t factor as cleanly through meanings.)
Actually I have no idea what you mean here. What are “aligned sentences”?
I think the easiest methodology to explore feasibility (other than actually implementing these decompositions) is to play the iterative game where you suggest a task that seems hard to decompose and I suggest a decomposition (thereby broadening the space of subtasks the system needs to solve). My intuition has been produced by playing this kind of game and it seeming very hard to get stuck. Given that the trees quickly become exponentially large and varied, it seems very difficult to provide a large tree.
I think before we play this game interactively I need better intuitions about how meta-execution works at a basic level, and what kind of tasks might be hard. Can you start with an example of a whole decomposition of a specific task, but instead of showing the entire tree, just a path from the root to a leaf? (At each node you can pick a sub-branch that seems hard and/or has good pedagogical value.) It would be helpful if you could give the full/exact input, output, and list of subqueries at each node along this path. (This might also be a good project for someone else to do, if they understand meta-execution well enough.)
The top-level task could be (source text --> meaning) for this sentence, which I’m picking for its subtle ambiguity, or let me know if this is not a good example to start with: “Some of the undisciplined children in his class couldn’t sit still for more than a few seconds at a time.”
Another thing I’d like to understand is, how does IDA recover from an error on H’s part? And also, how does it improve itself using external feedback (e.g., the user saying “good job” or “that was wrong”, or a translation customer sending back a bunch of sentences that were translated incorrectly)? In other words what’s the equivalent of gradient descent for meta-execution?
Actually I have no idea what you mean here. What are “aligned sentences”?
A sentence in one language, together with its translation in another.
Can you start with an example of a whole decomposition of a specific task, but instead of showing the entire tree, just a path from the root to a leaf? [...] The top-level task could be (source text --> meaning) for this sentence, which I’m picking for its subtle ambiguity, or let me know if this is not a good example to start with: “Some of the undisciplined children in his class couldn’t sit still for more than a few seconds at a time.”
Here is a quick version that hopefully gives the idea:
Given the question: “What is the meaning of the sentence with list of words {X}.”
I loop over ways of dividing the sentence into two (a rough code sketch of this loop is given just after the list below). For each division a, b I ask:
1. What are the most plausible meanings of the phrase with list of words {a}, and how plausible are they?
2. What are the most plausible meanings of the phrase with list of words {b}, and how plausible are they?
(L is the resulting list of pairs, each pair with one meaning from a and one from b)
3. For all pairs of possible meanings in the list of pairs {L}, what are the possible meanings of the concatenation of two phrases with those meanings, and how plausible is that concatenation?
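A minimal code sketch of this loop (my own illustration; plausible_meanings_of_phrase and combine_meanings stand in for the subqueries above and would themselves be answered by further amplification, not by hand-written code):

```python
K = 5  # keep only the most plausible candidates (illustrative cutoff)

def plausible_meanings_of_phrase(words):
    # Subqueries 1 and 2: answered recursively on shorter phrases, bottoming
    # out in word-level meanings; placeholder here.
    return []

def combine_meanings(meaning_a, meaning_b):
    # Subquery 3: possible meanings of the concatenation of two phrases with
    # these meanings, each with a plausibility for that concatenation.
    return []

def plausible_meanings_of_sentence(words):
    """Return (meaning, plausibility) candidates for the word list `words`."""
    candidates = []
    for split in range(1, len(words)):
        a, b = words[:split], words[split:]
        for meaning_a, p_a in plausible_meanings_of_phrase(a):
            for meaning_b, p_b in plausible_meanings_of_phrase(b):
                for meaning_ab, p_ab in combine_meanings(meaning_a, meaning_b):
                    candidates.append((meaning_ab, p_a * p_b * p_ab))
    return sorted(candidates, key=lambda c: -c[1])[:K]
```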
One of the pairs is a=”Some of the undisciplined children in his class” and b=”couldn’t sit still for more than a few seconds at a time.”
For that pair we get a list of pairs of meanings. I’m not going to write any of them out in full, unless you think that would be particularly useful. An example is roughly ({a noun phrase whose use implies {x} and that refers to {y}}, {a verb phrase whose use implies {z} and which implies that the noun $1 it modifies satisfies {w}}). The most plausible combination of those meanings is {a phrase whose use implies {{z} and {x} and {the referent of {y} satisfies {w}}}}. We can then ask about the plausibility of that meaning (which involves e.g. evaluating its consequences and how plausible they are, or what alternative expressions would have had the same meaning, and prior probabilities that someone would want to express this idea, etc.) compared to the other meanings we are considering. For deeper trees you’d also do more subtle things like analyzing large databases to see how common certain constructions are.
You’d have to go a lot deeper in order to get the other meaning you were considering, that the undisciplined children tended to not be able to sit still. I’m not sure you could do it without having done a very large database search and found this alternative idiomatic usage, or by performing an explicit search over plausible nearby meanings that might have been unintentionally confused with that one (which would be at a plausibility disadvantage but might be promoted up by pragmatics or priors). But there might be an alternative grammatical reading I haven’t seen (since I haven’t done the extensive work of parsing it—doing the whole tree is exponentially slow) or there might be some other way to get to that meaning.
Error recovery could be supported by having a parent agent running multiple versions of a query in parallel with different approaches (or different random seeds).
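A minimal sketch of that idea (illustrative only; run_query stands in for dispatching the query to a differently-seeded copy of the agent):

```python
from collections import Counter

def run_query(query, seed):
    # Dispatch the query to a copy of the agent initialized with this random
    # seed, so that independent copies are unlikely to make identical mistakes.
    # Placeholder: in IDA this is another agent call, not hand-written code.
    ...

def robust_query(query, n_copies=5):
    answers = [run_query(query, seed) for seed in range(n_copies)]
    # The parent agent reconciles the answers, e.g. by majority vote
    # (or by a further subquery that compares and adjudicates them).
    return Counter(answers).most_common(1)[0][0]
```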
And also, how does it improve itself using external feedback
I think this could be implemented as: part of the input for a task is a set of information on background knowledge relevant to the task (ie. model of what the user wants, background information about translating the language). The agent can have a task “Update [background knowledge] after receiving [feedback] after providing [output] for task [input]”, which outputs a modified version of [background knowledge], based on the feedback.
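A minimal sketch of that task interface (illustrative names only; in IDA the body would be produced by the amplified overseer rather than written by hand):

```python
def update_background_knowledge(background, task_input, output, feedback):
    """The task "Update [background knowledge] after receiving [feedback]
    after providing [output] for task [input]": returns a modified copy of
    the background knowledge."""
    revised = dict(background)
    log = list(revised.get("feedback_log", []))
    # Record the episode so that future subtasks consulting the background
    # knowledge (model of the user, facts about the language) can use it.
    log.append({"input": task_input, "output": output, "feedback": feedback})
    revised["feedback_log"] = log
    return revised
```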
Error recovery could be supported by having a parent agent running multiple versions of a query in parallel with different approaches (or different random seeds).
This doesn’t seem to help in the case of H misunderstanding the meaning of a word? Are you assuming multiple humans acting as H, and that they don’t all make the same mistake? If so, my concern is that Paul’s description of how IDA would do translation seems to depend on H having a lot of linguistics knowledge and skills. What if the field of linguistics as a whole is wrong about some concept or technique, and as a result all of the humans are wrong about that? It doesn’t seem like using different random seeds would help, and there may not be another approach that can be taken that avoids that concept/technique.
I think this could be implemented as: part of the input for a task is a set of information on background knowledge relevant to the task (ie. model of what the user wants, background information about translating the language). The agent can have a task “Update [background knowledge] after receiving [feedback] after providing [output] for task [input]“, which outputs a modified version of [background knowledge], based on the feedback.
This was my first thought as well, but how does the background knowledge actually get used? Consider the external feedback about badly translated sentences. In the case of deep learning, we can do backprop and it automatically does credit assignment and figures out which parts of itself need to be changed to do better next time. But in IDA, H is fixed and there’s no obvious way to figure out which parts of a large task decomposition tree were responsible for the badly translated sentence and therefore need to be changed for next time.
What if the field of linguistics as a whole is wrong about some concept or technique, and as a result all of the humans are wrong about that? It doesn’t seem like using different random seeds would help, and there may not be another approach that can be taken that avoids that concept/technique.
Yeah, I don’t think simple randomness would recover from this level of failure (only that it would help with some kinds of errors, where we can sample from a distribution that doesn’t make that error sometimes). I don’t know if anything could recover from this error in the middle of a computation without reinventing the entire field of linguistics from scratch, which might be too much to ask. However, I think it could be possible to recover from this error if you get feedback about the final output being wrong.
But in IDA, H is fixed and there’s no obvious way to figure out which parts of a large task decomposition tree was responsible for the badly translated sentence and therefore need to be changed for next time.
I think that the IDA task decomposition tree could be created in such a way that you can reasonably trace back which part was responsible for the misunderstanding/that needs to be changed. The structure you’d need for this is that given a query, you can figure out which of its children would need to be corrected to get the correct result. So if you have a specific word to correct, you can find the subagent that generated that word, then look at its inputs, see which input is correct, trace where that came from, etc. This might need to be deliberately engineered into the task decomposition (in the same way that differently written programs accomplishing the same task could be easier or harder to debug).
Suppose you had to translate a sentence that was ambiguous (with two possible meanings depending on context) and the target language couldn’t express that ambiguity in the same way so you had to choose one meaning. In your task decomposition you might have two large subtrees for “how likely is meaning A for this sentence given this context” and “how likely is meaning B for this sentence given this context”. If it turns out that you picked the wrong meaning, how can you tell which part(s) of these large subtrees was responsible? (If they were neural nets or some other kind of differentiable computation then you could apply gradient descent, but what to do here?)
EDIT: It seems like you’d basically need a bigger task tree to debug these task trees the same way a human would debug a hand written translation software, but unlike the hand written software, these task trees are exponentially sized (and also distilled by ML)… I don’t know how to think about this.
EDIT2: A human debugging a translation software could look at the return value of some high-level function and ask “is this return value sensible” using their own linguistic intuition, and then if the answer is “no”, trace the execution of that function and ask the same question about each of the functions it calls. This kind of debugging does not seem available to meta-execution trying to debug itself, so I just don’t see any way this kind of learning / error correction could work.
Huh, I hadn’t thought of this as trying to be a direct analogue of gradient descent, but now that I think about your comment that seems like an interesting way to approach it.
A human debugging a translation software could look at the return value of some high-level function and ask “is this return value sensible” using their own linguistic intuition, and then if the answer is “no”, trace the execution of that function and ask the same question about each of the functions it calls. This kind of debugging does not seem available to meta-execution trying to debug itself, so I just don’t see any way this kind of learning / error correction could work.
I think instead of asking “is this return value sensible”, the debugging overseer process could start with some computation node where it knows what the return value should be (the final answer), and look at each of the subqueries of that node and ask for each subquery “how can I modify this subquery’s answer to make the parent query’s answer more correct”, then recurse into the subquery. This seems pretty analogous to gradient descent, with the potential advantage that the overseer’s understanding of the function at each node could be better than naively taking the gradient (understanding the operation could yield something that takes into account higher-order terms in the operation).
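A sketch of that backward pass (my own illustration, assuming each node keeps pointers to its subqueries and their answers; the correction-proposing step would itself be an overseer query):

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Node:
    query: str
    answer: Optional[str] = None
    correction: Optional[str] = None
    subqueries: list["Node"] = field(default_factory=list)

def overseer_propose_correction(parent, sub, desired_answer):
    # Overseer query: "How should this subquery's answer change to make the
    # parent's answer closer to desired_answer?"  Answered using the
    # overseer's understanding of the parent's operation, so it can exploit
    # higher-order structure rather than a first-order gradient.  Placeholder.
    return None

def backward_pass(node, desired_answer):
    """Backprop analogue for a meta-execution tree: start from a node whose
    correct answer is known and push corrections down into its subqueries.
    Each node is visited at most once, so one pass is linear in tree size."""
    node.correction = desired_answer
    for sub in node.subqueries:
        sub_desired = overseer_propose_correction(node, sub, desired_answer)
        if sub_desired is not None and sub_desired != sub.answer:
            backward_pass(sub, sub_desired)
```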
I’m curious now whether you could run a more efficient version of gradient descent if you replace the gradient at each step with an overseer human who can harness some intuition to try to do better than the gradient.
It’s an interesting idea, but it seems like there are lots of difficulties.
What if the current node is responsible for the error instead of one of the subqueries, how do you figure that out? When you do backprop, you propagate the error signal through all the nodes, not just through a single path that is “most responsible” for the error, right? If you did this with meta-execution, wouldn’t it take an exponential amount of time? And what about nodes that are purely symbolic, where there are multiple ways the subnodes (or the current node) could have caused the error, so you couldn’t use the right answer for the current node to figure out what the right answer is from each subnode? (Can you in general structure the task tree to avoid this?)
I wonder if we’re on the right track at all, or if Paul has an entirely different idea about this. Like maybe don’t try to fix or improve the system at a given level of amplification, but just keep amplifying it, and eventually it re-derives a better version of rationality from first principles (i.e. from metaphilosophy) and re-learns everything it can’t derive using the rationality it invents, including re-inventing linguistics, and then it can translate using the better version of linguistics it invents instead of the linguistics we taught it?
What if the current node is responsible for the error instead of one of the subqueries, how do you figure that out?
I think you’d need to form the decomposition in such a way that you could fix any problem through perturbing something in the world representation (an extreme version: the method for performing every operation is itself contained in the world representation and looked up, so you can adjust it in the future).
When you do backprop, you propagate the error signal through all the nodes, not just through a single path that is “most responsible” for the error, right? If you did this with meta-execution, wouldn’t it take an exponential amount of time?
One step of this method, as in backprop, is the same time complexity as the forward pass (running meta-execution forward, which I wouldn’t call exponential complexity, as I think the relevant baseline is the number of nodes in the meta-execution forward tree). You only need to process each node once (when the backprop signal for its output is ready), and need to do a constant amount of work at each node (figure out all the ways to perturb the node’s input).
The catch is that, as with backprop, maybe you need to run multiple steps to get it to actually work.
And what about nodes that are purely symbolic, where there are multiple ways the subnodes (or the current node) could have caused the error, so you couldn’t use the right answer for the current node to figure out what the right answer is from each subnode? (Can you in general structure the task tree to avoid this?)
The default backprop answer to this is to shrug and adjust all of the inputs (which is what you get from taking the first order gradient). If this causes problems, then you can fix them in the next gradient step. That seems to work in practice for backprop in continuous models. For discrete models like this it might be a bit more difficult—if you start to try out different combinations to see if they work, that’s where you’d get exponential complexity. But we’d get to counter this by potentially having cases where, based on understanding the operation, we could intelligently avoid some branches—I think this could potentially wash out to linear complexity in the number of forward nodes if it all works well.
I wonder if we’re on the right track at all, or if Paul has an entirely different idea about this.
I don’t expect to use this kind of mechanism for fixing things, and am not exactly sure what it should look like.
Instead, when something goes wrong, you add the data to whatever dataset of experiences you are maintaining (or use amplification to decide how to update some small sketch), and then trust the mechanism that makes decisions from that database.
Basically, the goal is to make fewer errors than the RL agent (in the infinite computing limit), rather than making errors and then correcting them in the same way the RL agent would.
(I don’t know if I’ve followed the conversation well enough to respond sensibly.)
Instead, when something goes wrong, you add the data to whatever dataset of experiences you are maintaining (or use amplification to decide how to update some small sketch), and then trust the mechanism that makes decisions from that database.
By “mechanism that makes decisions from that database” are you thinking of some sort of linguistics mechanism, or a mechanism for general scientific research?
The reason I ask is, what if what went wrong was that H is missing some linguistics concept, for example the concept of implicature? Since we can’t guarantee that H knows all useful linguistics concepts (the field of linguistics may not be complete), it seems that in order to “make fewer errors than the RL agent (in the infinite computing limit)” IDA has to be able to invent linguistics concepts that H doesn’t know, and if IDA can do that then presumably IDA can do science in general?
If the latter (mechanism for general scientific research) is what you have in mind, we can’t really show that meta-execution is hopeless by pointing to some object-level task that it doesn’t seem able to do, because if we run into any difficulties we can always say “we don’t know how to do X with meta-execution, but if IDA can learn to do general scientific research, then it will invent whatever tools are needed to do X”.
There is some mechanism the RL agent uses, which doesn’t rest on scientific research. IDA should use the same mechanism.
This may sometimes involve “heuristic X works well empirically, but has no detectable internal structure.” In those cases IDA needs to be able to come up with a safe version of that procedure (i.e. a version that wouldn’t leave us at a disadvantage relative to people who just want to maximize complexity or whatever). I think the main obstacle to safety is if heuristic X itself involves consequentialism. But in that case there seems to necessarily be some internal structure. (This is the kind of thing that I have been mostly thinking about recently.)
There is some mechanism the RL agent uses, which doesn’t rest on scientific research. IDA should use the same mechanism.
How does IDA find such a mechanism, if not by scientific research? RL does it by searching for weights that do well empirically, and William and I were wondering if that idea could be adapted to IDA but you said “Searching for trees that do well empirically is scary business, since now you have all the normal problems with ML.” (I had interpreted you to mean that we should avoid doing that. Did you actually mean that we should try to figure out a safe way to do it?)
I think you need to do some trial and error, and was saying we should be scared of it ( / be careful about it / minimize it, though it’s subtle why minimization might help).
For example, suppose that I put a random 20 gate circuit in a black box and let you observe input-output behavior. At some point you don’t have any options other than guess and check, and no amount of cleverness about alignment could possibly avoid the need to sometimes use brute force.
I guess what we’re trying to unpack is “the mechanism that makes decisions from that database”, and whether it can be efficient. If you store all experience with no editing, you’re left with the same problem of “what decisions in the tree do you change based on experience (X,y) and how do you change them?” And it seems like the most naive way of doing this is too inefficient (looking at all data at all decision nodes).
If you summarize experience (what I’m interpreting “decide how to update some small sketch” as meaning), then the question is can you produce that summary in an efficient way, and efficiently use it at runtime?
The gradient-descent style solution is meant to gesture how this could be done in a way analogous to how we use parametric models rather than non-parametric models in ML for decently sized datasets. In advance, bake in a set of parameters/information that could influence each decision, then do updates by having the overseer look for ways to change the decision parameters. But when we do this in an oversight model, we can make the decisions/parameters interpretable sentences.
Rough example of how I see the overall process going:
1. amplified overseer acts in an environment with a breakable vase, but the overseer does not know that the vase is breakable
2. First time it acts, some overseer node N looks into the database for any information related to the vase, the lookup returns nothing, and it proceeds with its decision
3. The vase breaks, and the negative feedback is given to the amplified overseer
4. Amplified overseer runs some process to figure out that it could have made a better decision if the database contained the knowledge that vases are breakable, and so updates the database. (The thing that Wei Dai and I are trying to unpack is: can this step be run in a reasonable amount of time, given that each node of the overseer has only a limited understanding of the original decision process?)
5. Second time it acts, the amplified overseer process is the same, up until the node N that does the database lookup is run. This time, the lookup returns that the vase is breakable, so the overseer runs a different decision path and treats the vase with more care.
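A minimal sketch of steps 2 and 4–5 above (illustrative only; in IDA the lookup and the update would themselves be decomposed into overseer subqueries rather than written as code):

```python
database = {}  # interpretable facts, keyed by object/topic

def lookup(topic):
    # Step 2/5: an overseer node N consults the database for relevant facts.
    return database.get(topic, [])

def act_on(obj):
    facts = lookup(obj)
    return "handle with care" if "breakable" in facts else "handle normally"

def incorporate_feedback(obj, outcome):
    # Step 4: after negative feedback, decide which fact, had it been in the
    # database, would have changed the decision, and add it.
    if outcome == "vase broke":
        database.setdefault(obj, []).append("breakable")

act_on("vase")                        # first episode: "handle normally"
incorporate_feedback("vase", "vase broke")
act_on("vase")                        # second episode: "handle with care"
```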
The constraint on the amplification process is that learning the full set of subtasks can’t be that much harder than simply learning the task.
I propose the following as an example of a task where learning the full set of subtasks is much harder than simply learning the task. Suppose we’re trying to predict quantum mechanical systems, specifically we’re given a molecule and asked to predict some property of it.
How would this work with amplification? If I’m not misunderstanding something, assuming the overseer knows QM, one of the subtasks would be to do a QM simulation (via meta-execution), and that seems much harder for ML to learn than just predicting a specific property. If the overseer does not know QM, one of the subtasks would have to be to do science and invent QM, which seems even harder to learn.
This seems to show that H can’t always produce a transcript for A to do imitation learning or inverse reinforcement learning from, so the only option left for the distillation process is direct supervision?
You don’t have to do QM to make predictions about the particle. The goal is for IDA to find whatever structure allows the RL agent to make a prediction. (The exponential tree will solve the problem easily, but if we interleave distillation steps then many of those subtrees will get stuck because the agent isn’t able to learn to handle them.)
In some cases this will involve opaque structures that happen to make good predictions. In that case, we need to make a safety argument about “heuristic without internal structure that happens to work.”
You don’t have to do QM to make predictions about the particle. The goal is for IDA to find whatever structure allows the RL agent to make a prediction.
My thought here is why try to find this structure inside meta-execution? It seems counterintuitive / inelegant that you have to worry about the safety of learned / opaque structures in meta-execution, and then again in the distillation step. Why don’t we let the overseer directly train some auxiliary ML models at each iteration of IDA, using whatever data the overseer can obtain (in this case empirical measurements of molecule properties) and whatever transparency / robustness methods the overseer wants to use, and then make those auxiliary models available to the overseer at the next iteration?
It seems counterintuitive / inelegant that you have to worry about the safety of learned / opaque structures in meta-execution, and then again in the distillation step.
I agree, I think it’s unlikely the final scheme will involve doing this work in two places.
Why don’t we let the overseer directly train some auxiliary ML models at each iteration of IDA, using whatever data the overseer can obtain (in this case empirical measurements of molecule properties) and whatever transparency / robustness methods the overseer wants to use, and then make those auxiliary models available to the overseer at the next iteration?
This a way that things could end up looking. I think there are more natural ways to do this integration though.
Note that in order for any of this to work, amplification probably needs to be able to replicate/verify all (or most) of the cognitive work the ML model does implicitly, so that we can do informed oversight. There may be opaque heuristics that “just work,” which are discovered either by ML or meta-execution trial-and-error, but then we need to confirm safety for those heuristics.
Ah, right. I guess I was balking at moving from exorbitant to exp(exorbitant). Maybe it’s better to think of this as reducing the size of fully worked initial overseer example problems that can be produced for training, or as increasing the number of amplification rounds that are needed.
So my argument is more an example of what a distilled overseer could learn as an efficient approximation.
The only way I know how to accomplish this is to have IDA emulate the deep learning translator at a very low level, with H acting as a “human transistor” or maybe a “human neuron”, and totally ignore what H knows about translation including the meanings of words.
The human can understand the meaning of the word they see; they just can’t know the context (the words that they don’t see), and so can’t use their understanding of that context.
They could try to guess possible contexts for the word and leverage their understanding of those contexts (“what are some examples of sentences where the word could be used ambiguously?”), but they aren’t allowed to know if any of their guesses actually apply to the text they are currently working on (and so their answer is independent of the actual text they are currently working on).
Any set of natural language instructions for humans operating on small inputs can be turned into a lookup table by executing the human on all possible inputs (multiple times on each input, if you want to capture a stochastic policy).
The with the following “Consider the sentence [s1] w [s2]”, and have the agent launch queries of the form “Consider the sentence [s1] w [s2], where we take w to have meaning m”. Now, you could easily produce this behaviour algorithmically if you have a dictionary. But in a world without dictionaries, suitably preparing a human to answer this query takes much less effort than producing a dictionary.
Suppose we wanted to have IDA (with security amplification) translate a pair of natural languages at least as well as the current best machine translator (which I believe is based on deep learning, trained on sentence pairs), and suppose the human overseer H can translate this pair of languages at an expert level, better than the machine translator. The only way I know how to accomplish this is to have IDA emulate the deep learning translator at a very low level, with H acting as a “human transistor” or maybe a “human neuron”, and totally ignore what H knows about translation including the meanings of words. Do you know a better way than this, or do you have an argument or intuition that a better way is possible?
I guess the above is addressing a slightly different issue than my original question, so to go back to that, your answer is not very satisfying because we do live in a world where dictionaries exist, and dictionaries aren’t that expensive to produce even if they didn’t exist. Can you think of other examples? Or have other explanations of why you think giving a human a bunch of natural language instructions might be much easier than writing down a formal algorithm (to the extent that it would be worth the downsides of having H be a human instead of a formal algorithm)?
The humans definitely don’t need to emulate the deep learning system. They could use a different way of translating that reaches a higher performance than the deep learning system, which will then be copied.
You could do the same thing with grammatical phrases of length ⇐ 10.
Do you have such a way in mind, or just think that IDA will eventually figure out such a way if amplified enough? If the latter, am I correct in thinking that IDA will be generally superhuman at that point, since you and I can’t think of such a way?
I’m having trouble visualizing what the human is actually doing in this case. Can you or someone give natural language instructions that would let me know what to do as H? (Oh, please know that I didn’t entirely understand what William was saying either, but didn’t think it was important to ask.)
I think that a naive approach would probably work. We discussed what that might look like at the MIRI workshop we were both at. I’m imagining a breakdown into (source text --> meaning) and (meaning --> target text), with richer and richer representations of meaning (including connotation, etc.) as you amplify further. To implement (source text --> meaning) you would ask things like “What are the possible meanings phrase X?” and try to represent that meaning in terms of the meaning of the constituents. To do that, you might ask questions like “Is X likely to be an idiom? If so, what are the plausible meanings?” or “Can X be produced by a grammatical production rule, and if so how does it meaning relate to the meaning of its constituents?” or so on. To answer one of those questions you might say “What are some sentences in the database where constructions like X occur?”, “What are some possible meanings of X in context Y?” and “Is meaning Z consistent with the usage of X in context Y?” To answer the latter, you’d have to answer subquestions like “What are the most surprising consequences of the assertion X?” And so on.
(Hopefully it’s clear enough how you could use aligned sentences in a similar framework, though then the computation won’t factor as cleanly through meanings.)
Unsurprisingly, this gets really complicated. I think the easiest methodology to explore feasibility (other than actually implementing these decompositions) is to play the iterative game where you suggest a task that seems hard to decompose and I suggest a decomposition (thereby broadening the space of subtasks the system needs to solve). My intuition has been produced by playing this kind of game and it seeming very hard to get stuck. Given that the trees quickly become exponentially large and varied, it seems very difficult to provide a large tree.
“List the most plausible meanings you can think of for the expression X”, e.g.
Q: List a plausible meaning for the expression “something you design for the present”
A: One candidate meaning is {an expression $2 which refers to a thing $3 and whose use implies that {there is something $1 satisfying {{{the speaker of {$2}} is addressing {$1}} and {{$1} regularly performs the action $4={design {$3} for the purpose {being used at the time {the time when {$4} occurs}}}}}}}, another is...
Where the {}’s represent pointers to submessages. The instructions describe the semantics for representing meaning, plus some guidance about the desiderata for answering questions, and so forth.
A similar example:
“List some facts that relate X, Y, and Z”, e.g.
Q: List some facts that relate humans, age, and the ocean.
A: One fact is {young humans do not know how to swim in the ocean}, another is...
Obviously things are a lot more complicated than this, but hopefully those examples illustrate how a human can be doing useful work while still operating on inputs that are small enough to be safe.
This points to another potential problem with capability amplification: in order to reach some target capability via amplification, you may have to go through another capability that is harder for ML to learn. In this case, the target capability is translation, and the intermediate capability is linguistic knowledge and skills. (We currently have ML that can learn to translate, but AFAIK not learn how to apply linguistics to recreate the ability to translate.) If this is true in general (and I don’t see why translation might be an exceptional case) then capability amplification being universal isn’t enough to ensure that IDA will be competitive with unaligned AIs, because in order to be competitive with state of the art AI capabilities (which can barely be learned by ML at a certain point in time) it may have to go through capabilities that are beyond what ML can learn at that time.
This is a general restriction on iterated amplification. Without this restriction it would be mostly trivial—whatever work we could do to build aligned AI, you could just do inside HCH, then delegate the decision to the resulting aligned AI.
If your AI is able to notice an empirical correlation (e.g. word A cooccurs with word B), and lacks the capability to understand anything at all about the causal structure of that correlation, then you have no option but to act on the basis of the brute association, i.e. to take the action that looks best to you in light of that correlation, without conditioning on other facts about the causal structure of the association, since by hypothesis your system is not capable enough to recognize those other facts.
If we have an empirical association between behavior X (pressing a sequence of buttons related in a certain way to what’s in memory) and our best estimate of utility, we might end up needing to take that action without understanding what’s going on causally. I’m still happy calling this aligned in general: the exact same thing would happen to a perfectly motivated human assistant trying their best to do what you want, who was able to notice an empirical correlation but was not smart enough to notice anything about the underlying mechanism (and sometimes acting on the basis of such correlations will be bad).
In order to argue that our AI leads to good outcomes, we need to make an assumption not only about alignment but about capability. If the system is aligned it will be trying its best to make use of all of the information it has to respond appropriately to the observed correlation, to behave cautiously in light of that uncertainty, etc.. But in order to get a good outcome, and even in order to avoid a catastrophic outcome, we need to make some assumptions about “what the AI is able to notice.”
(Ideally IDA could eventually serve as an adequate operationalization of “smart enough to understand X” and similar properties.)
These include assumptions like “if the AI is able to cook up a plan that gets high reward because it kills the human, the AI is likely to be able to notice that the plan involves killing the human” and “the AI is smart enough to understand that killing the human is bad, or sufficiently risky that it is worth behaving cautiously and check with the human” and “the AI is smart enough that it can understand when the human says `X is bad’.” Some of these we can likely verify empirically. Some of them will require more work to even state cleanly. And there will be some situations where these assumptions simply aren’t true, e.g. because there is an unfortunate fact about the world that introduces the linkage (plan X kills humans) --> (plan X looks good on paper) without telling you anything about why.
I’m currently considering these problems out of scope for me because (a) there seems to be no way to have a clever idea about AI that avoids this family of problems without sacrificing competitiveness, (b) they would occur with a well-motivated human assistant, (c) we don’t have much reason to suspect that they are particularly serious problems compared to other kinds of mistakes an AI might make.
(I don’t really care whether we call them “alignment” problems per se, though I’m proposing defining alignment such that they wouldn’t be.)
I guess I didn’t learn/understand it well enough for it to stick in my mind.
Actually I have no idea what you mean here. What are “aligned sentences”?
I think before we play this game interactively I need better intuitions about how meta-execution works at a basic level, and what kind of tasks might be hard. Can you start with an example of a whole decomposition of a specific task, but instead of showing the entire tree, just a path from the root to a leaf? (At each node you can pick a sub-branch that seems hard and/or has good pedagogical value.) It would be helpful if you could give the full/exact input, output, and list of subqueries at each node along this path. (This might also be a good project for someone else to do, if they understand meta-execution well enough.)
The top-level task could be (source text --> meaning) for this sentence, which I’m picking for its subtle ambiguity, or let me know if this is not a good example to start with: “Some of the undisciplined children in his class couldn’t sit still for more than a few seconds at a time.”
Another thing I’d like to understand is, how does IDA recover from an error on H’s part? And also, how does it improve itself using external feedback (e.g., the user saying “good job” or “that was wrong”, or a translation customer sending back a bunch of sentences that were translated incorrectly)? In other words what’s the equivalent of gradient descent for meta-execution?
A sentence in one language, together with its translation in another .
Here is a quick version that hopefully gives the idea:
Given the question: “What is the meaning of the sentence with list of words {X}.”
I loop over ways of dividing into the section into two. For each division a, b I ask:
1. What are the most plausible meanings of the phrase with list of words {a}, and how plausible are they?
2. What are the most plausible meanings of the phrase with list of words {b}, and how plausible are they?
(L is the resulting list of pairs, each pair with one meaning from a and one from b)
3. For all pairs of possible meanings in the list of pairs {L}, what are the possible meanings of the concatenation of two phrases with those meanings, and how plausible is that concatenation?
One of the pairs is a=”Some of the undisciplined children in his class” and b=”couldn’t sit still for more than a few seconds at a time.”
For that pair we get a list of pairs of meanings. I’m not going to write any of them out in full, unless you think that would be particularly useful. An example is is roughly ({a noun phrase whose use implies {x} and that refers to {y}} , {a verb phrase whose implies {z} and which implies that implies the noun $1 it modifies satisfies {w}}). The most plausible combinations of those meanings is {{a phrase whose use implies {{z} and {x} and {the referent of {y} satisfies {w}}}}. We can then ask about plausibility of that meaning (which involves e.g. evaluating its consequences and how plausible they are, or what alternative expressions would have had the same meaning, and prior probabilities that someone would want to express this idea, or etc.) compared to the other meanings we are considering. For deeper trees you’d also do more subtle things like analyzing large databases to see how common certain constructions are.
You’d have to go a lot deeper in order to get the other meaning you were considering, that the undiscplined children tended to not be able to sit still. I’m not sure you could do it without having done a very large database search and found this alternative idiomatic usage, or by performing an explicit search over plausible nearby meanings that might have been unintentionally confused with that one (which would be at a plausibility disadvantage but might be promoted up by pragmatics or priors). But there might be an alternative grammatical reading I haven’t seen (since I haven’t done the extensive work of parsing it—doing the whole tree is exponentially slow) or there might be some other way to get to that meaning.
Error recovery could be supported by having a parent agent running multiple versions of a query in parallel with different approaches (or different random seeds).
I think this could be implemented as: part of the input for a task is a set of information on background knowledge relevant to the task (ie. model of what the user wants, background information about translating the language). The agent can have a task “Update [background knowledge] after receiving [feedback] after providing [output] for task [input]”, which outputs a modified version of [background knowledge], based on the feedback.
(This comment is being reposted to be under the right parent.)
This doesn’t seem to help in the case of H misunderstanding the meaning of a word? Are you assuming multiple humans acting as H, and that they don’t all make the same mistake? If so, my concern about that is Paul’s description of how IDA would do translation seems to depend on H having a lot of linguistics knowledge and skills. What if the field of linguistics as a whole is wrong about some concept or technique, and as a result all of the humans are wrong about that? It doesn’t seem like using different random seeds would help, and there may not be another approach that can be taken that avoids that concept/technique.
This was my first thought as well, but how does the background knowledge actually get used? Consider the external feedback about badly translated sentences. In the case of deep learning, we can do backprop and it automatically does credit assignment and figures out which parts of itself needs to be changed to do better next time. But in IDA, H is fixed and there’s no obvious way to figure out which parts of a large task decomposition tree was responsible for the badly translated sentence and therefore need to be changed for next time.
Yeah, I don’t think simple randomness would recover from this level of failure (only that it would help with some kinds of errors, where we can sample from a distribution that doesn’t make that error sometimes). I don’t know if anything could recover from this error in the middle of a computation without reinventing the entire field of linguistics from scratch, which might be too to ask. However, I think it could be possible to recover from this error if you get feedback about the final output being wrong.
I think that the IDA task decomposition tree could be created in such a way that you can reasonably trace back which part was responsible for the misunderstanding/that needs to be changed. The structure you’d need for this is that given a query, you can figure out which of it’s children would need to be corrected to get the correct result. So if you have a specific word to correct, you can find the subagent that generated that word, then look at it’s inputs, see which input is correct, trace where that came from, etc. This might need to be deliberately engineered into the task decomposition (in the same way that differently written programs accomplishing the same task could be easier or harder to debug).
Suppose you had to translate a sentence that was ambiguous (with two possible meanings depending on context) and the target language couldn’t express that ambiguity in the same way so you had to choose one meaning. In your task decomposition you might have two large subtrees for “how likely is meaning A for this sentence given this context” and “how likely is meaning B for this sentence given this context”. If it turns out that you picked the wrong meaning, how can you tell which part(s) of these large subtrees was responsible? (If they were neural nets or some other kind of differentiable computation then you could apply gradient descent, but what to do here?)
EDIT: It seems like you’d basically need a bigger task tree to debug these task trees the same way a human would debug a hand written translation software, but unlike the hand written software, these task trees are exponentially sized (and also distilled by ML)… I don’t know how to think about this.
EDIT2: A human debugging a translation software could look at the return value of some high-level function and ask “is this return value sensible” using their own linguistic intuition, and then if the answer is “no”, trace the execution of that function and ask the same question about each of the function it calls. This kind of debugging does not seem available to meta-execution trying to debug itself, so I just don’t see any way this kind of learning / error correction could work.
Huh, I hadn’t thought of this as trying to be a direct analogue of gradient descent, but now that I think about your comment that seems like an interesting way to approach it.
I think instead of asking “is this return value sensible”, the debugging overseer process could start with some computation node where it knows what the return value should be (the final answer), look at each of the subqueries of that node, ask for each subquery “how would this subquery’s answer need to change to make the node’s answer more correct”, and then recurse into the subquery. This seems pretty analogous to gradient descent, with the potential advantage that the overseer’s understanding of the function at each node could be better than naively taking the gradient (understanding the operation could yield something that takes into account higher-order terms in the operation).
I’m curious now whether you could run a more efficient version of gradient descent if you replace the gradient at each step with an overseer human who can harness some intuition to try to do better than the gradient.
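Here is a minimal sketch of the backprop-analogue described above, in my own toy formalization (the `Node` class and the `overseer` callback are hypothetical, not part of any existing IDA code): starting from a node whose corrected answer is known, the overseer says how each subquery’s answer would have needed to change, and we recurse only into the subqueries it identifies.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional

@dataclass
class Node:
    query: str
    answer: str
    children: List["Node"] = field(default_factory=list)

# `overseer` is a hypothetical stand-in for the amplified overseer: given a node and a
# corrected answer for that node, it returns, for each child, either the corrected
# answer that child would have needed to produce, or None if that child's answer was
# not responsible for the error (letting us prune that branch).
def correct_tree(node: Node, corrected_answer: str,
                 overseer: Callable[[Node, str], List[Optional[str]]]) -> None:
    node.answer = corrected_answer
    for child, correction in zip(node.children, overseer(node, corrected_answer)):
        if correction is not None:
            correct_tree(child, correction, overseer)
```

Each visited node is processed once with a constant amount of overseer work, so a single correction pass is roughly linear in the number of nodes it touches, analogous to one backward pass.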
It’s an interesting idea, but it seems like there are lots of difficulties.
What if the current node is responsible for the error instead of one of the subqueries, how do you figure that out? When you do backprop, you propagate the error signal through all the nodes, not just through a single path that is “most responsible” for the error, right? If you did this with meta-execution, wouldn’t it take an exponential amount of time? And what about nodes that are purely symbolic, where there are multiple ways the subnodes (or the current node) could have caused the error, so you couldn’t use the right answer for the current node to figure out what the right answer is from each subnode? (Can you in general structure the task tree to avoid this?)
I wonder if we’re on the right track at all, or if Paul has an entirely different idea about this. Like maybe don’t try to fix or improve the system at a given level of amplification, but just keep amplifying it, and eventually it re-derives a better version of rationality from first principles (i.e. from metaphilosophy) and re-learns everything it can’t derive using the rationality it invents, including re-inventing linguistics, and then it can translate using the better version of linguistics it invents instead of the linguistics we taught it?
I think you’d need to form the decomposition in such a way that you could fix any problem through perturbing something in the world representation (an extreme version is that you have the method for performing every operation contained in the world representation and looked up, so you can adjust it in the future).
One step of this method, as in backprop, has the same time complexity as the forward pass (running meta-execution forward, which I wouldn’t call exponential complexity, as I think the relevant baseline is the number of nodes in the meta-execution forward tree). You only need to process each node once (when the backprop signal for its output is ready), and you need to do a constant amount of work at each node (figure out all the ways to perturb the node’s inputs).
The catch is that, as with backprop, maybe you need to run multiple steps to get it to actually work.
The default backprop answer to this is to shrug and adjust all of the inputs (which is what you get from taking the first-order gradient). If this causes problems, then you can fix them in the next gradient step. That seems to work in practice for backprop in continuous models. For discrete models like this it might be a bit more difficult: if you start to try out different combinations to see if they work, that’s where you’d get exponential complexity. But we could counter this in cases where, based on understanding the operation, we can intelligently avoid some branches; I think this could wash out to linear complexity in the number of forward nodes if it all works well.
So do I :)
I don’t expect to use this kind of mechanism for fixing things, and am not exactly sure what it should look like.
Instead, when something goes wrong, you add the data to whatever dataset of experiences you are maintaining (or use amplification to decide how to update some small sketch), and then trust the mechanism that makes decisions from that database.
Basically, the goal is to make fewer errors than the RL agent (in the infinite computing limit), rather than making errors and then correcting them in the same way the RL agent would.
(I don’t know if I’ve followed the conversation well enough to respond sensibly.)
By “mechanism that makes decisions from that database” are you thinking of some sort of linguistics mechanism, or a mechanism for general scientific research?
The reason I ask is, what if what went wrong was that H is missing some linguistics concept, for example the concept of implicature? Since we can’t guarantee that H knows all useful linguistics concepts (the field of linguistics may not be complete), it seems that in order to “make fewer errors than the RL agent (in the infinite computing limit)” IDA has to be able to invent linguistics concepts that H doesn’t know, and if IDA can do that then presumably IDA can do science in general?
If the latter (mechanism for general scientific research) is what you have in mind, we can’t really show that meta-execution is hopeless by pointing to some object-level task that it doesn’t seem able to do, because if we run into any difficulties we can always say “we don’t know how to do X with meta-execution, but if IDA can learn to do general scientific research, then it will invent whatever tools are needed to do X”.
Does this match your current thinking?
There is some mechanism the RL agent uses, which doesn’t rest on scientific research. IDA should use the same mechanism.
This may sometimes involve “heuristic X works well empirically, but has no detectable internal structure.” In those cases IDA needs to be able to come up with a safe version of that procedure (i.e. a version that wouldn’t leave us at a disadvantage relative to people who just want to maximize complexity or whatever). I think the main obstacle to safety is if heuristic X itself involves consequentialism. But in that case there seems to necessarily be some internal structure. (This is the kind of thing that I have been mostly thinking about recently.)
How does IDA find such a mechanism, if not by scientific research? RL does it by searching for weights that do well empirically, and William and I were wondering if that idea could be adapted to IDA but you said “Searching for trees that do well empirically is scary business, since now you have all the normal problems with ML.” (I had interpreted you to mean that we should avoid doing that. Did you actually mean that we should try to figure out a safe way to do it?)
I think you need to do some trial and error, and was saying we should be scared of it ( / be careful about it / minimize it, though it’s subtle why minimization might help).
For example, suppose that I put a random 20 gate circuit in a black box and let you observe input-output behavior. At some point you don’t have any options other than guess and check, and no amount of cleverness about alignment could possibly avoid the need to sometimes use brute force.
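A toy illustration of that guess-and-check point (my own example, shrunk from a 20-gate circuit to a single secret gate over 3 input wires so that it actually runs; the real version would have an astronomically larger hypothesis space): with only input-output access, all you can do is enumerate hypotheses and keep the ones consistent with the observations.

```python
# Guess-and-check identification of a black-box boolean function from I/O behavior.
import itertools
import random

N_INPUTS = 3
GATES = {
    "and": lambda a, b: a & b,
    "or":  lambda a, b: a | b,
    "xor": lambda a, b: a ^ b,
}

def make_black_box():
    # secret single-gate "circuit" over two of the input wires
    name = random.choice(list(GATES.keys()))
    i, j = random.sample(range(N_INPUTS), 2)
    return lambda bits: GATES[name](bits[i], bits[j]), (name, i, j)

black_box, secret = make_black_box()
observations = [(bits, black_box(bits))
                for bits in itertools.product([0, 1], repeat=N_INPUTS)]

# brute force: enumerate every hypothesis, keep those consistent with the data
consistent = [
    (name, i, j)
    for name in GATES
    for i, j in itertools.permutations(range(N_INPUTS), 2)
    if all(GATES[name](bits[i], bits[j]) == out for bits, out in observations)
]
print("secret:", secret, "consistent hypotheses:", consistent)
```

Cleverness can shrink or reorder the hypothesis space, but some residue of trial and error remains.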
I guess what we’re trying to unpack is “the mechanism that makes decisions from that database”, and whether it can be efficient. If you store all experience with no editing, you’re left with the same problem of “what decisions in the tree do you change based on experience (X,y) and how do you change them?” And it seems like the most naive way of doing this is too inefficient (looking at all data at all decision nodes).
If you summarize experience (what I’m interpreting “decide how to update some small sketch” as meaning), then the question is can you produce that summary in an efficient way, and efficiently use it at runtime?
The gradient-descent style solution is meant to gesture at how this could be done in a way analogous to how we use parametric models rather than non-parametric models in ML for decently sized datasets. In advance, bake in a set of parameters/information that could influence each decision, then do updates by having the overseer look for ways to change the decision parameters. But when we do this in an oversight model, we can make the decisions/parameters interpretable sentences.
Rough example of how I see the overall process going (see the sketch after this list):
1. Amplified overseer acts in an environment with a breakable vase, but the overseer does not know that the vase is breakable
2. First time it acts, some overseer node N looks into the database for any information related to the vase, finds nothing, and proceeds with its decision
3. The vase breaks, and the negative feedback is given to the amplified overseer
4. Amplified overseer runs some process to figure out that it could have made a better decision if the database had contained the knowledge that vases are breakable, and so updates the database. (The thing that Wei Dai and I are trying to unpack is: can this step be run in a reasonable amount of time, given that each node of the overseer has only a limited understanding of the original decision process?)
5. Second time it acts, the amplified overseer process is the same, up until the node N that does the database lookup is run. This time, the lookup returns that the vase is breakable, so the overseer runs a different decision path and treats the vase with more care.
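Here is a minimal sketch of the five-step loop above (all names hypothetical): a small, interpretable database of facts consulted by the decision node N, updated after negative feedback so that the second run takes a different path.

```python
from typing import List

database: List[str] = []          # the small, interpretable store of experience

def lookup(topic: str) -> List[str]:
    return [fact for fact in database if topic in fact]

def act(obj: str) -> str:
    # node N: consult the database before deciding how to handle the object
    facts = lookup(obj)
    if any("breakable" in fact for fact in facts):
        return f"handle the {obj} with care"
    return f"handle the {obj} normally"

def incorporate_feedback(obj: str, outcome: str) -> None:
    # stand-in for step 4: the amplified overseer figures out what fact, had it been
    # in the database, would have led to a better decision, and writes it down
    if outcome == "broke":
        database.append(f"the {obj} is breakable")

print(act("vase"))                 # first run: "handle the vase normally"
incorporate_feedback("vase", "broke")
print(act("vase"))                 # second run: "handle the vase with care"
```

The open question from step 4 is whether the overseer can locate the right fact to add in a reasonable amount of time when the original decision process is spread across many nodes, each with only a partial view.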
The constraint on the amplification process is that learning the full set of subtasks can’t be that much harder than simply learning the task.
There isn’t any constraint on the computation time of the overall tree, which should generally be exorbitant.
Searching for trees that do well empirically is scary business, since now you have all the normal problems with ML.
I propose the following as an example of a task where learning the full set of subtasks is much harder than simply learning the task. Suppose we’re trying to predict quantum mechanical systems, specifically we’re given a molecule and asked to predict some property of it.
How would this work with amplification? If I’m not misunderstanding something, assuming the overseer knows QM, one of the subtasks would be to do a QM simulation (via meta-execution), and that seems much harder for ML to learn than just predicting a specific property. If the overseer does not know QM, one of the subtasks would have to be to do science and invent QM, which seems even harder to learn.
This seems to show that H can’t always produce a transcript for A to do imitation learning or inverse reinforcement learning from, so the only option left for the distillation process is direct supervision?
You don’t have to do QM to make predictions about the molecule. The goal is for IDA to find whatever structure allows the RL agent to make a prediction. (The exponential tree will solve the problem easily, but if we interleave distillation steps then many of those subtrees will get stuck because the agent isn’t able to learn to handle them.)
In some cases this will involve opaque structures that happen to make good predictions. In that case, we need to make a safety argument about “heuristic without internal structure that happens to work.”
My thought here is why try to find this structure inside meta-execution? It seems counterintuitive / inelegant that you have to worry about the safety of learned / opaque structures in meta-execution, and then again in the distillation step. Why don’t we let the overseer directly train some auxiliary ML models at each iteration of IDA, using whatever data the overseer can obtain (in this case empirical measurements of molecule properties) and whatever transparency / robustness methods the overseer wants to use, and then make those auxiliary models available to the overseer at the next iteration?
I agree, I think it’s unlikely the final scheme will involve doing this work in two places.
This is a way that things could end up looking. I think there are more natural ways to do this integration though.
Note that in order for any of this to work, amplification probably needs to be able to replicate/verify all (or most) of the cognitive work the ML model does implicitly, so that we can do informed oversight. There will be opaque heuristics that “just work,” which are discovered either by ML or by meta-execution trial-and-error, but then we need to confirm safety for those heuristics.
Ah, right. I guess I was balking at moving from exorbitant to exp(exorbitant). Maybe it’s better to think of this as reducing the size of the fully worked initial overseer example problems that can be produced for training, or as increasing the number of amplification rounds that are needed.
So my argument is more an example of what a distilled overseer could learn as an efficient approximation.
The human can understand the meaning of the word they see; they just can’t know the context (the words they don’t see), and so can’t use their understanding of that context.
They could try to guess possible contexts for the word and leverage their understanding of those contexts (“what are some examples of sentences where the word could be used ambiguously?”), but they aren’t allowed to know whether any of their guesses actually apply to the text they are currently working on (and so their answer is independent of that text).