I’m trying to get a better handle on what benefits come from LTP. Here’s my current picture—are there points where I’ve misunderstood?

_________
The core problem: We have a training distribution (x, y) ~ D and a deployment distribution (x*, y*) ~ D*, where D != D*. We would rather not rely on ML OOD generalization from D to D*; instead, we would have a human label D*, train an ML model on those labels, and rely only on IID generalization. Suppose D is too large for a human to process. If the human knows how to label D* without learning from D, that’s fine. But D* might be very hard for humans. In particular, we need to outperform prosaic ML: the human (before updating on D) needs to outperform an ML model (after updating on D).
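In symbols (this is my own shorthand, including f for the prosaic model; it's not notation from the post):

```latex
% My shorthand for the setup above; f denotes a prosaic ML model trained on D,
% and H denotes the human before updating on D.
\[
\begin{aligned}
&(x, y) \sim D \ \ \text{(training)}, \qquad (x^*, y^*) \sim D^* \ \ \text{(deployment)}, \qquad D \neq D^*.\\
&\text{Prosaic ML: train } f \text{ on } D, \text{ then rely on OOD generalization to predict } y^* \text{ from } x^*.\\
&\text{What we want: } P_H(y^* \mid x^*) \text{ at least as good as } P_f(y^* \mid x^*), \text{ without } H \text{ updating on } D.
\end{aligned}
\]
```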
Insight from LTP: Ideally, we can compress D into something more manageable, a latent variable z*. Then the human can use z* to predict P_H(y* | x*, z*) instead of just P_H(y* | x*), and can now hopefully outperform the prosaic ML model. The benefit is that we can rely on IID generalization while remaining competitive.
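Concretely, the picture I have in mind is something like the following (a rough sketch of my understanding; the exact objective in the post may differ):

```latex
% Rough sketch of the LTP scheme as I understand it.
\[
z^* \;\approx\; \arg\max_{z}\Big[\; \log \mathrm{Prior}(z) \;+\; \sum_{(x,\, y) \in D} \log P_H(y \mid x, z) \;\Big],
\qquad \text{then label } D^* \text{ with } P_H(y^* \mid x^*, z^*).
\]
```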
At first glance, this seems to assume that it is possible to compress the key information in D into a much smaller core z* containing the main insights (the distilled core assumption). For example, if D were movements of planets, z* might be the laws of physics. This post argues this is not necessary: by using amplification or debate, the amplified human can use a very large z*. But since the amplification/debate models are ML models, and we’re running these models to aid human decisions on x*, aren’t we back to relying on ML OOD generalization, and so back where we started?
I think your description is correct.

The distilled core assumption seems right to me because the neural network weights are already a distilled representation of D, and we only need to compete with that representation. For that reason, I expect z* to have roughly the same size as the neural network parameters.
My main reservation is that this seems really hard (and maybe in some sense just a reframing of the original problem). We want z to be a representation of what the neural network learned that a human can manipulate in order to reason about what it implies about D*. But what is that going to look like? If we require competitiveness then it seems like z has to look quite a lot like the weights of a neural network...
In writing the original post I was imagining z* being much bigger than a neural network but distilled by a neural network in some way. I’ve generally moved away from that kind of perspective, partly based on the kinds of considerations in this post.
> But since the amplification/debate models are ML models, and we’re running these models to aid human decisions on x*, aren’t we back to relying on ML OOD generalization, and so back where we started?
I now think we’re going to have to actually have z* reflect something more like the structure of the unaligned neural network, rather than another model (Mz) that outputs all of the unaligned neural network’s knowledge.
That said, I’m not sure we require OOD generalization even if we represent z via a model Mz. E.g. suppose that Mz(i) is the ith word of the intractably-large z. Then the prior calculation can access all of the words i in order to evaluate the plausibility of the string represented by Mz. We then use that same set of words at training time and test time. If some index i is used at test time but not at training time, then the model responsible for evaluating Prior(z) is incentivized to access that index in order to show that z is unnecessarily complex. So every index i should be accessed on the training distribution. (Though they need not be accessed explicitly, just somewhere in the implicit exponentially large tree).
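As a toy illustration of the access pattern I have in mind (the names and numbers below are purely illustrative):

```python
# Toy illustration only: Mz(i) returns the i-th word of an intractably large z,
# and the "prior" is charged for every word it actually inspects.

def make_Mz(words):
    """Stand-in for a model that outputs the i-th word of z on demand."""
    return lambda i: words[i]

def log_prior(Mz, inspected_indices):
    """Toy Prior(z): penalize the description length of every word inspected.
    In the real scheme this inspection happens implicitly, somewhere inside the
    exponentially large amplification/debate tree."""
    return -sum(len(Mz(i)) for i in inspected_indices)

def answer(Mz, x, relevant_indices):
    """Toy amplified human: answers x after consulting the listed words of z."""
    return (x, [Mz(i) for i in relevant_indices])

words = ["planetary", "motion", "follows", "inverse-square", "gravity"]
Mz = make_Mz(words)

# The point of the argument: every index consulted at test time should also be
# touched when Prior(z) is evaluated on the training distribution; otherwise the
# prior evaluator is incentivized to surface that word as unnecessary complexity.
train_indices = {0, 1, 2, 3, 4}   # indices touched while evaluating Prior(z) on D
test_indices = {2, 3}             # indices an answer on D* happens to consult
assert test_indices <= train_indices
print(log_prior(Mz, train_indices))
print(answer(Mz, "why ellipses?", sorted(test_indices)))
```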
Like I said, I’m a bit less optimistic about doing this kind of massive compression. For now, I’m just thinking about the setting where our human has plenty of time to look at z in detail even if it’s the same size as the weights of our neural network. If we can make that work, then I’ll think about how to do it in the case of computationally bounded humans (which I expect to be straightforward).
Thanks. This is helpful. I agree that LTP with the distilled core assumption buys us a lot, both theoretically and probably in practice too.
> The distilled core assumption seems right to me because the neural network weights are already a distilled representation of D, and we only need to compete with that representation… My main reservation is that this seems really hard… If we require competitiveness then it seems like z has to look quite a lot like the weights of a neural network
Great, agreed with all of this.
> In writing the original post I was imagining z* being much bigger than a neural network but distilled by a neural network in some way. I’ve generally moved away from that kind of perspective, partly based on the kinds of considerations in this post
I share the top-line view, but I’m not sure what issues obfuscated arguments present for large z*, other than generally pushing more difficulty onto alignment/debate. (Probably not important to respond to, just wanted to flag in case this matters elsewhere.)
> That said, I’m not sure we require OOD generalization even if we represent z via a model Mz. E.g. suppose that Mz(i) is the ith word of the intractably-large z.
I agree that Mz (= z*) does not require OOD generalization. My claim is that the amplified model using Mz involves an ML model which must generalize OOD. On D, our y-targets are P_{H^A}(y | x, Mz), where H^A is an amplified human. On D*, our y-targets are similarly P_{H^A}(y* | x*, Mz). The key question for me is whether our y-targets on D* are good. If we use the distilled core assumption, they are—they’re exactly the predictions the human makes after updating on D. Without it, our y-targets depend on H^A, which involves an ML model.
In particular, I’m assuming H^A is something like human + policy P_M(y | x, Mz), where P_M was optimized to imitate H on D (with z sampled), but is making predictions on D* now. Maybe the picture is that we instead run IDA from scratch on D*? E.g. for amplification, this involves ignoring the models/policies we already have, starting with the usual unaided human supervision on D* at first, and bootstrapping all the way up. I suppose this works, but then couldn’t we just have run IDA on D* without access to Mz (which can itself still reach superhuman performance)?
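Spelling out where I think the OOD step re-enters, under the assumption above (this is my formulation, not something from the post):

```latex
% My formulation of the concern; P_M is the imitation policy inside H^A.
\[
\begin{aligned}
\text{Targets on } D:\quad & P_{H^A}(y \mid x, M_z), \qquad H^A \approx H + P_M(\cdot \mid x, M_z), \quad P_M \text{ trained to imitate } H \text{ on } D.\\
\text{Targets on } D^*:\quad & P_{H^A}(y^* \mid x^*, M_z), \qquad \text{the same } P_M, \text{ now queried on } x^* \sim D^*, \text{ off its training distribution.}
\end{aligned}
\]
```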
It seems like you should either run separate models for D and D*, or jointly train the model on both D and D*; definitely you shouldn’t train on D then run on D* (and you don’t need to!).
> I suppose this works, but then couldn’t we just have run IDA on D* without access to Mz (which can itself still reach superhuman performance)?
The goal is to be as good as an unaligned ML system though, not just to be better than humans. And the ML system updates on D, so we need to update on D too.
> It seems like you should either run separate models for D and D*, or jointly train the model on both D and D*; definitely you shouldn’t train on D then run on D* (and you don’t need to!).
Sorry, yes, you’re completely right. (I previously didn’t like that there’s a model trained on E_{z∼Z,D}[P_{H^A}(y | x, z)] which only gets used for finding z*, but realized it’s not a big deal.)
> The goal is to be as good as an unaligned ML system though, not just to be better than humans. And the ML system updates on D, so we need to update on D too.
I agree—I mean for the alternative to be running IDA on D*, using D as an auxiliary input (rather than using indirection through Mz). In other words, if we need IDA to access a large context Mz, we could also use IDA to access a large context D? Without something like the distilled core assumption, I’m not sure if there are major advantages one way or the other?
OTOH, with something like the distilled core assumption, it’s clearly better to go through Mz, because Mz is much smaller than D (I think of this as amortizing the cost of distilling D).
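The rough accounting I have in mind (the cost symbols here are purely illustrative):

```latex
% Illustrative only: K = number of D* queries, C_distill = one-time cost of
% distilling D into Mz, c(.) = per-query cost of conditioning on a given context.
\[
\text{via } M_z:\ \ C_{\text{distill}} + K \cdot c(M_z), \qquad
\text{via raw } D:\ \ K \cdot c(D), \qquad
c(M_z) \ll c(D) \ \Rightarrow\ \text{going through } M_z \text{ wins for large } K.
\]
```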
Even if you were taking D as input and ignoring tractability, IDA still has to decide what to do with D, and that needs to be at least as useful as what ML does with D (and needs to not introduce alignment problems in the learned model). In the post I’m kind of vague about that, just wrapping it up into the philosophical assumption that HCH is good, but really we’d want to do work to figure out what to do with D, even if we were just trying to make HCH aligned (and I think competitiveness matters even for HCH, because it’s needed for HCH to be stable/aligned against internal optimization pressure).
Okay, that makes sense (and seems compelling, though not decisive, to me). I’m happy to leave it here—thanks for the answers!