It seems like you should either run separate models for D and D*, or jointly train the model on both D and D*. You definitely shouldn’t train on D and then run on D* (and you don’t need to!).
I suppose this works, but then couldn’t we just have run IDA on D* without access to Mz (which itself can still access superhuman performance)?
The goal is to be as good as an unaligned ML system though, not just to be better than humans. And the ML system updates on D, so we need to update on D too.
It seems like you should either run separate models for D and D*, or jointly train the model on both D and D*. You definitely shouldn’t train on D and then run on D* (and you don’t need to!).
Sorry, yes, you’re completely right. (I previously didn’t like that there’s a model trained on $\mathbb{E}_{z \sim Z, D}[P_{HA}(y \mid x, z)]$ which only gets used for finding z*, but realized it’s not a big deal.)
The goal is to be as good as an unaligned ML system though, not just to be better than humans. And the ML system updates on D, so we need to update on D too.
I agree. The alternative I have in mind is running IDA on D*, using D as an auxiliary input (rather than using indirection through Mz). In other words, if we need IDA to access a large context Mz, couldn’t we also use IDA to access a large context D? Without something like the distilled core assumption, I’m not sure there are major advantages one way or the other.
OTOH, with something like the distilled core assumption, it’s clearly better to go through Mz, because Mz is much smaller than D (I think of this as amortizing the cost of distilling D).
Even if you were taking D as input and ignoring tractability, IDA still has to decide what to do with D, and that needs to be at least as useful as what ML does with D (and needs to not introduce alignment problems in the learned model). In the post I’m kind of vague about that and just wrap it up into the philosophical assumption that HCH is good, but really we’d want to do work to figure out what to do with D, even if we were just trying to make HCH aligned (and I think competitiveness matters even for HCH, because it’s needed for HCH to be stable/aligned against internal optimization pressure).
Okay, that makes sense (and seems compelling, though not decisive, to me). I’m happy to leave it here—thanks for the answers!