I think the biggest problem is that Fθ can compute the instrumental policy (or a different policy that works well, or a fragment of it). Some possible reasons:
Maybe some people in the world are incidentally thinking about the instrumental policy and Fθ makes predictions about them.
Maybe an adversary computes a policy that performs well in order to attack the learning process (since Eθ′ may just copy the adversary’s policy to be fast if it works well on training, resulting in bad generalization).
Maybe no one computes the instrumental policy itself, but they compute helpful building blocks and Eθ′ can save compute by reusing those.
Maybe Fθ wants to predict question-answering policies, and computing optimal responses to the training objective is a simple way for it to forecast what they will do. (It’s totally unclear if this is actually more efficient than the intended policy.) Then Eθ′ can copy that.
I don’t know if any of those particular failures are too likely. But overall it seems really bad to rely on Fθ never computing something inconvenient, and it definitely doesn’t look like it’s going to work in the worst case.
What are some possible outs, if in fact Fθ computes something adversarial to try to make it easy for Eθ′ to learn something bad?
At the time when Fθ is running it doesn’t know the question Q that Eθ will need to answer. So in order to cause trouble it would need to compute answers for all of these questions. But I’m afraid that this is doable in many cases, since Fθ could translate a big chunk of its internal state into natural language and then Eθ′ can do some simple QA on top of that.
We could try to ensure that the correctness/consistency checks are really complex or challenging to meet, at least for some questions. If the complexity of these checks scales up with the complexity of Fθ itself (e.g. by using amplification where the evaluator consults a model as big as Fθ), then maybe it’s impossible for Fθ to meet them. This idea can be applied without the factorization into F and E, but it’s probably useful to have this kind of factorization in mind as a possible tool when exploring approaches based on tricky-evaluations.
We could try to exploit some further structural facts about the parts of Zθ(X) that are used by Eθ′. For example, it feels like the intended model is going to be leveraging facts that are further “upstream.” Suppose an attacker observes that there is a cat in the room, and so writes out “There is a cat in the room” as part of a natural-language description of what is going on, which it hopes Eθ′ will eventually learn to copy. If Fθ predicts the adversary’s output, it must first predict that there is actually a cat in the room, which then ultimately flows downstream into predictions of the adversary’s behavior. And so we might hope to prefer the “intended” Eθ′ by having it preferentially read from the earlier activations (with shorter computational histories).
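To make that preference concrete, here is a minimal Python sketch of one way to penalize Eθ′ for reading downstream activations. The depth measure, the read weights, and all the names are illustrative assumptions of mine, not anything pinned down by the proposal:

```python
# Hypothetical sketch: charge E_theta' in proportion to how computationally "late" the
# activations it reads are. read_weights, depths, and alpha are all illustrative.

def upstream_read_penalty(read_weights, depths, alpha=0.01):
    """read_weights[i]: how strongly E_theta' reads activation i (e.g. attention mass, >= 0).
    depths[i]: length of the computational history that produced activation i.
    Returns a penalty that grows when E_theta' relies on more downstream activations."""
    total = sum(read_weights) or 1.0
    expected_depth = sum(w * d for w, d in zip(read_weights, depths)) / total
    return alpha * expected_depth

# The "intended" reporter reads mostly upstream facts (e.g. that there is a cat in the room)...
intended = upstream_read_penalty([0.8, 0.15, 0.05], depths=[2, 5, 40])
# ...while a reporter copying the adversary's written-out answer reads deep activations.
copier = upstream_read_penalty([0.05, 0.15, 0.8], depths=[2, 5, 40])
assert intended < copier
```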
Overall this kind of approach feels like it’s probably doomed, but it does capture part of the intuition for why we should “just” be able to learn a simple correspondence rather than getting some crazy instrumental policy. So I’m not quite ready to let it go yet. I’m particularly interested to push a bit on the third of these approaches.
Here’s another approach to “shortest circuit” that is designed to avoid this problem:
Learn a circuit C(X) that outputs an entire set of beliefs. (Or maybe some different architecture, but with ~0 weight sharing so that computational complexity = description complexity.)
Impose a consistency requirement on those beliefs, even in cases where a human can’t tell the right answer.
Require C(X)’s beliefs about Y to match Fθ(X). We hope that this makes C an explication of “Fθ’s beliefs.”
Optimize some combination of (complexity) vs (usefulness), or chart the whole Pareto frontier, or whatever. I’m a bit confused about how this step would work but there are similar difficulties for the other posts in this genre so it’s exciting if this proposal gets to that final step.
The “intended” circuit C just follows along with the computation done by Fθ and then translates its internal state into natural language.
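For concreteness, here is a minimal sketch of the combined objective from the list above. The consistency loss, the usefulness measure, the C.forward/C.num_gates interface, and the weighting coefficients are all stand-ins I’m making up, not part of the proposal:

```python
# Hypothetical sketch of the objective: consistency + agreement with F_theta on Y + a
# complexity penalty, traded off against usefulness. Every interface here is assumed.

def circuit_objective(C, F_theta, consistency_loss, usefulness, xs,
                      lam_consistency=1.0, lam_match=1.0, lam_complexity=1e-3):
    total = 0.0
    for x in xs:
        beliefs = C.forward(x)                 # an entire set of beliefs, including one about Y
        total += lam_consistency * consistency_loss(beliefs)    # even where humans can't check
        total += lam_match * (beliefs["Y"] - F_theta(x)) ** 2   # C's belief about Y must match F_theta(X)
    # with ~0 weight sharing, circuit size stands in for both description and computational complexity
    total += lam_complexity * C.num_gates()
    return total - usefulness(C)               # or sweep the lambdas to chart the Pareto frontier
```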
What about the problem case where Fθ computes some reasonable beliefs (e.g. using the instrumental policy, where the simplicity prior makes us skeptical about their generalization) that C could just read off? I’ll imagine those being written down somewhere on a slip of paper inside of Fθ’s model of the world.
Suppose that the slip of paper is not relevant to predicting Fθ(X), i.e. it’s a spandrel from the weight sharing. Then the simplest circuit C just wants to cut it out. Whatever computation was done to write things down on the slip of paper can be done directly by C, so it seems like we’re in business.
So suppose that the slip of paper is relevant for predicting Fθ(X), e.g. because someone looks at the slip of paper and then takes an action that affects Y. If (the correct) Y is itself depicted on the slip of paper, then we can again cut out the slip of paper itself and just run the same computation (that was done by whoever wrote something on the slip of paper). Otherwise, the answers produced by C still have to contain both the items on the slip of paper and some facts that are causally downstream of the slip of paper (as well as hopefully some facts about the slip of paper itself). At that point it seems like we have a pretty good chance of getting a consistency violation out of C.
Probably nothing like this can work, but I now feel like there are two live proposals for capturing the optimistic minimal circuits intuition—the one in this current comment, and in this other comment. I still feel like the aggressive speed penalization is doing something, and I feel like probably we can either find a working proposal in that space or else come up with some clearer counterexample.
The natural way to implement the third approach above (preferring to read from upstream activations) is to penalize Eθ′ not for the computation it does, but for all the computation needed to compute its output (including within Fθ). The basic problem with this approach is that it incentivizes Eθ′ to do all of the computation of Fθ from scratch in a way optimized for speed rather than complexity. I’d set this approach aside for a while because of this difficulty and the unnaturalness mentioned in the sibling (where we’ve given up on what seems to be an important form of parameter-sharing).
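For concreteness, here is a rough sketch of that penalty, assuming Fθ’s computation is available as a DAG with a per-node compute cost (the names and the DAG representation are my own assumptions):

```python
# Sketch of "penalize E_theta' for all the computation needed to compute its output,
# including the part of F_theta it depends on".

def total_compute_penalty(read_nodes, parents, node_cost, e_own_cost):
    """read_nodes: the nodes of F_theta that E_theta' reads.
    parents[n]: the nodes that n is computed from.
    Charges E_theta' for its own compute plus every F_theta node ancestral to what it reads."""
    needed, stack = set(), list(read_nodes)
    while stack:
        n = stack.pop()
        if n not in needed:
            needed.add(n)
            stack.extend(parents.get(n, []))
    return e_own_cost + sum(node_cost[n] for n in needed)

# Example: reading y pays for x and y (its ancestry) but not for the unrelated node z.
parents = {"z": ["x", "y"], "y": ["x"]}
cost = {"x": 1.0, "y": 2.0, "z": 4.0}
print(total_compute_penalty(["y"], parents, cost, e_own_cost=0.5))  # 3.5
```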
Today I was thinking about some apparently-totally-different angles of attack for the ontology identification problem, and this idea seems to have emerged again, with a potential strategy for fixing the “recompute Fθ problem”. (In the context of ontology identification, the parameter-sharing objection no longer applies.)
Here’s the idea:
Train Fθ as before.
Start with a bunch of facts and probabilistic relationships that the human knows, expressed in their own ontology. These might be facts like “Ice melts at 100 degrees” or “Dogs bark at cats” or whatever.
We are going to try to jointly learn (i) a correspondence c between Fθ and the human’s ontology, (ii) a set of “justifications” showing that c(Fθ) satisfies all of the relationships the human expects. I’m imagining justifications like simulating ice at 100 degrees and observing that it indeed melts, or sampling situations with dogs and cats and verifying that the dogs bark.
The correspondence c is constrained to map “the observations” (a concept in the human’s ontology) to the output of Fθ, but other than that there is no simplicity prior; it can be anything.
Our goal is to make the justifications as computationally simple as possible. (Which indirectly incentivizes us to make c as computationally simple as possible.)
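Here is roughly what that search objective might look like as a sketch. The Justification interface (check, compute_cost), the maps_observations_to constraint check, and the infinite-penalty trick are placeholders of mine for the parts the proposal leaves open:

```python
# Hypothetical sketch of the joint search: pick a correspondence c and one justification per
# human-level fact, and pay only for the compute the justifications use.

def ontology_objective(c, justifications, human_facts, F_theta, penalty=float("inf")):
    # hard constraint: c must map the human's "observations" concept to F_theta's output
    if not c.maps_observations_to(F_theta.output_node):
        return penalty
    total_cost = 0.0
    for fact, justification in zip(human_facts, justifications):
        # e.g. fact = "ice melts at 100 degrees"; the justification simulates ice at 100
        # degrees inside F_theta, reads it off through c, and checks that it melts
        if not justification.check(fact, c, F_theta):
            return penalty                           # every relationship the human expects must hold
        total_cost += justification.compute_cost()   # computational cost only; no simplicity prior on c
    return total_cost
```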
This still feels a bit weird, but you could imagine it handling a bunch of cases in a promising way:
c has no reason to “reuse the human” or to use the human’s inference procedure in order to save bits, because doing either of those things makes the justifications more computationally complex.
If there is a human embedded inside Fθ who performs the correspondence and writes it down in an easy-to-read way, c has no motivation to read it: if c is defined in that way, then justifying facts will require simulating the human (even if that work was already done inside Fθ). If c simply cut out the middle man and applied the correspondence itself, then it could save compute in the typical case (except when talking about facts about that human). This is subtle in a few ways but tentatively looks plausible to me.
c has no reason to ignore Fθ and implement a new more-efficient-but-more-complex simulation F′, because (i) it ultimately needs to relate observations back to the output of Fθ, and many of its concepts are related to observations (e.g. what cats look like), (ii) that forces F′ and Fθ to have the same behavior, (iii) the justification would then need to show that the “observations” in F′ are the same as the observations in Fθ, which is computationally costly.
But right now it’s a pretty vague proposal, because it’s unclear what the nature of these facts or justifications is. If you set that up in a naive way, then the justification effectively just needs to simulate all of Fθ. That’s a problem because it reintroduces the failure mode where you need to simulate the human, and therefore there’s no extra cost to just simulating and then listening to whatever they say.
Overall I think that probably nothing like this works, but I’m still feeling a lot more optimistic than I was last week and want to explore it further. (This is partially for reasons not discussed in this comment, that several other approaches/motivations seem to converge on something similar.)
Here’s a slightly more formal algorithm along these lines:
Assume that both the human’s model WH and the AI’s model WAI are Bayesian networks where you compute the probability distribution over a node v’s value based on the values of its parents pa(v). I’ll write Values(v) for the set of values that a node v can take on (in either model), and Values(S) for the joint values of a set of nodes S.
A correspondence tells you how to compute the value of each node v in the human’s model. This consists of (i) a neighborhood in the AI’s model NAI(v) which suffices to determine v, (ii) a function fv:Values(NAI(v))→Values(v).
Both the AI’s model and the human’s model contain some distinguished observation nodes; f must be the identity on these nodes.
An “explanation” of a correspondence consists of a set of nodes EAI(v) in the AI’s model for each node v in the human’s model. The intuition is that we can run a simulation involving only these nodes in order to reproduce the probability distribution of v given its parents’ values.
In particular, NAI(v)⊆EAI(v), and NAI(u)⊆EAI(v) for all u∈pa(v). In order to check whether EAI(v) reproduces the right distribution, we first sample values at random for all the nodes in EAI(v) that have a parent outside EAI(v). Then we sample values for the remaining nodes from the AI model’s conditionals. We can use f to compute the corresponding values for v and all of its parents, and then compute the conditional distributions for v given each set of values for its parents.
We require that the explanations exactly reproduce the conditional probability over Values(v) given Values(pa(v)).
The “cost” of the explanation of v is the sum of the compute required to sample all the nodes in EAI(v). The “cost” of the correspondence fv is the compute required to evaluate it.
We search for the set of correspondences and explanations for which the total cost is minimized.
(Maybe we also have some requirement where the correspondence fv agrees with some training data about v. I’m not really sure about that.)
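To pin down how these pieces fit together, here is a small concrete rendering in Python, under heavy simplifying assumptions of mine: both models are tiny discrete Bayesian networks, each fv is an arbitrary Python function, “compute cost” is just a per-node number, and the exact-match requirement is checked only empirically up to a tolerance. None of these choices are forced by the proposal.

```python
import random
from collections import Counter, defaultdict

def topo_order(nodes, parents):
    """Order the nodes of an explanation set so that parents inside the set come first."""
    nodes, order, done = set(nodes), [], set()
    def visit(n):
        if n in done or n not in nodes:
            return
        for p in parents.get(n, []):
            visit(p)
        done.add(n)
        order.append(n)
    for n in list(nodes):
        visit(n)
    return order

def check_explanation(v, human_parents, human_cpd, corr, E, ai_model,
                      n_samples=20000, tol=0.02, seed=0):
    """corr[u] = (N_AI(u), f_u); E = E_AI(v); ai_model = (parents, cpt, values).
    Empirically checks that running only the nodes in E reproduces P(v | pa(v))
    from the human's model, up to sampling error."""
    ai_parents, ai_cpt, ai_values = ai_model
    rng = random.Random(seed)
    counts, totals = defaultdict(Counter), Counter()
    for _ in range(n_samples):
        vals = {}
        for n in topo_order(E, ai_parents):
            pa = ai_parents.get(n, [])
            if all(p in vals for p in pa):      # all parents inside E: use the AI model's conditional
                vals[n] = ai_cpt[n](tuple(vals[p] for p in pa), rng)
            else:                               # boundary node: sample its value at random
                vals[n] = rng.choice(ai_values[n])
        # read off v and its human-model parents through the correspondence
        hv = corr[v][1](tuple(vals[n] for n in corr[v][0]))
        hpa = tuple(corr[u][1](tuple(vals[n] for n in corr[u][0])) for u in human_parents[v])
        counts[hpa][hv] += 1
        totals[hpa] += 1
    for hpa, c in counts.items():
        for hv, k in c.items():
            if abs(k / totals[hpa] - human_cpd[v](hv, hpa)) > tol:
                return False
    return True

def total_cost(correspondence, explanations, corr_compute, node_compute):
    """Objective to minimize: compute of every explanation plus compute of every f_v."""
    return (sum(node_compute[n] for E in explanations.values() for n in E)
            + sum(corr_compute[u] for u in correspondence))
```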
Reviewing how this behaves in each of the bad cases from the parent:
It’s very bad to define fv by computing the observation and then using the human’s inference algorithm. The entire motivation for defining it this way was to save on description complexity, but fv is only being penalized for computational complexity. (This also forces every single EAI(v) to include the entire process that generates the observations, which seems like it should be an even bigger deal. But this feels less like the “right” reason and I think it might change for a future version of the scheme.)
If there is a human in the simulation who knows the value of v, it’s extremely bad to define NAI(v) to be that human. This is because the explanation EAI(v) will need to include the entire process by which the human computes that value. This is guaranteed to be more expensive than just performing that compute in the correspondence fv itself.
We could define fv to compute the value of v from scratch using the initial values of the simulation, using some more-efficient simulation. But we can’t do this for the observation node Obs, since it is required to match up with the observation node in the AI’s model. So in order to explain the observation node we will need to have EAI(Obs) include the AI’s entire model, or at least the parts leading up to the observation. This means we are basically paying for two full copies of the computation, and so we’re not really benefiting from the fact that the second copy is more efficient. (We are also potentially overlapping a lot of computation between the different fv’s, but again I’m a bit less sure how robust that is and whether it will hold up in different formalizations.)
There are a lot of problems and missing details in this proposal:
This requires exactly reproducing the conditional probabilities in the human’s model. But that’s totally unrealistic for anything short of a full simulation, and even for the full simulation it wouldn’t reproduce the probabilities since the human model isn’t accurate. So we need some way to choose a good enough explanation, i.e. a way of balancing the computational complexity of the explanation against the quality of the conditional probabilities that come out. (One illustrative way that tradeoff could look is sketched just after this list.)
We’re sampling the inputs to EAI(v) uniformly at random. This seems unlikely to work in general. We could easily sample each node from its marginal, but most of the action is in the correlation. Allowing arbitrary correlations causes problems (since you could just specify the “human is accurate” correlation and then read off the correct answers from there). So I think probably some more flexible system is needed here; there are a lot of options but it’s tricky.
There is something problematic about the overlapping explanations EAI(v). If they overlap you need to pay for all of them, but for the intended mapping there will often be quite significant overlap. This isn’t inherently a problem, but I’m scared that it’s going to introduce a lot of pressure towards some different correspondence that is able to avoid that problem. We need to penalize overlap because of the case where the training data is embedded in the model: the main problem with that model is that you need to separately explain every way in which the human is correct, with highly overlapping explanations. If you didn’t penalize those then you may just end up with the embedded explanations (for which fv is extremely cheap).
There is something tricky about uniformity in the model and in the implementations of fv.
I’m still scared about the “recompute everything from scratch” failure mode. The model does need a single explanation EAI(Obs) that includes the whole model. But (i) it doesn’t have to reproduce work, (ii) it can cut out all the stuff not on the path to the observation. So the obvious reason this one loses is the duplicated work in fv. Hopefully that’s actually robust.
We are making really strong structural assumptions on the models and the correspondence between them. We get some things for free (because humans actually do have extra structure in our beliefs about the world that is properly part of the problem statement, and the AI’s model is constrained by its architecture) but not nearly this much.
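Returning to the first problem above (choosing a “good enough” explanation rather than an exact match), here is one illustrative way the tradeoff could look: score each explanation by its compute cost plus a penalty for how badly it reproduces the human’s conditional probabilities. The KL-based mismatch measure and the weighting beta are arbitrary choices of mine, not part of the proposal:

```python
import math

def soft_explanation_score(compute_cost, human_cond, reproduced_cond, beta=100.0):
    """human_cond / reproduced_cond: dicts mapping values of v to probabilities, for one
    setting of pa(v). Lower is better: cheap explanations that roughly match win."""
    kl = sum(p * math.log(p / max(reproduced_cond.get(x, 0.0), 1e-9))
             for x, p in human_cond.items() if p > 0)
    return compute_cost + beta * kl
```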
Overall I’m becoming significantly more optimistic that something like this will work (though still less likely than not). Trying to step back and see the big picture, it seems like there are three key active ingredients:
Using “speed” instead of “simplicity” as the ~only requirement for these correspondences.
Having separate correspondences for separate properties and not allowing them to share tons of computation with each other (to prevent re-running the whole simulation).
Forcing the model to explain correlations, so that using an “embedded” copy of the answers (like a simulation of the data-generating process) forces you to reproduce the computation that produced that answer.
My next step would probably be looking at cases where these high-level ingredients aren’t sufficient (e.g. are there cases where “generate obs then do inference in the human model” is actually cheaper?). If they look pretty good, then I’ll spend some more time trying to fill in the details in a more plausible way.
We might be able to get similar advantages with a more general proposal like:
Fit a function f to a (Q, A) dataset with lots of questions about latent structure. Minimize the sum of some typical QA objective and the computational cost of verifying that f is consistent.
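As a sketch, the combined objective might look something like the following, where qa_loss, verification_compute, and the weighting lam are all placeholders for pieces this proposal leaves open:

```python
# Hypothetical sketch: fit f on (Q, A) data while also paying for the compute needed to
# verify that f is consistent. All three ingredients below are assumed placeholders.

def general_objective(f, qa_dataset, qa_loss, verification_compute, lam=1e-3):
    data_term = sum(qa_loss(f(q), a) for q, a in qa_dataset)
    return data_term + lam * verification_compute(f)   # or trace the Pareto frontier over lam
```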
Then the idea is that matching the conditional probabilities from the human’s model (or at least being consistent with what the human believes strongly about those conditional probabilities) essentially falls out of a consistency condition.
It’s not clear how to actually formulate that consistency condition, but it seems like an improvement over the prior situation (which was just baking in the obviously-untenable requirement of exactly matching). It’s also not clear what happens if this consistency condition is soft.
It’s not clear what “verify that the consistency conditions are met” means. You can always do the same proposal as in the parent, though it’s not really clear if that’s a convincing verification. But I think that’s a fundamental philosophical problem that both of these proposals need to confront.
It’s not clear how to balance computational cost and the QA objective. But you are able to avoid most of the bad properties just by being on the Pareto frontier, and I don’t think this is worse than the prior proposal.
Overall this approach seems like it could avoid making such strong structural assumptions about the underlying model. It also helps a lot with the overlapping explanations + uniformity problem. And it generally seems to be inching towards feeling plausible.