IMO the strongest argument in favor of imitation-based solutions is: if there is any way to solve the alignment problem which we can plausibly come up with, then a sufficiently reliable and amplified imitation of us will also find this solution. So, if the imitation is doomed to produce a bad solution or end up in a strange attractor, then our own research is doomed in the same way. Any counter-argument to this would probably have to be based on one of the following:
Maybe the imitation is different from how we do alignment research in important ways. For example, we have more than 2 weeks of memory of working on the solution; but maybe if you spend 1 week learning the relevant context and another week making further progress, that’s not a serious problem. I definitely think factored cognition is a big difference, but also that we don’t need factored cognition (we should use confidence thresholds instead).
Maybe producing reliable imitation is much harder than some different solution that explicitly references the concept of “values”. An imitator doesn’t have any way to know which features of what it’s imitating are important, which makes its job hard. I think we need some rigorous learning-theoretic analysis to confirm or disprove this.
Maybe by the time we launch the imitator, we’ll have such a small remaining window of opportunity that it won’t do as good a job at solving alignment as we would do working on the problem starting from now, especially taking into account malign AI leaking into the imitation. [EDIT: Actually, malign AI leakage is a serious problem, since the malign AI takeover probability rate at the time of IDA deployment is likely to be much higher than it is now, and the leakage is proportional to this rate times the amplification factor.]
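To spell out the claim in the edit a bit more explicitly (this is only a rough restatement; $\lambda$ and $A$ are informal stand-ins, not quantities defined anywhere in this thread):

$$\text{expected leakage} \;\propto\; \lambda(t_{\text{deploy}}) \cdot A$$

where $\lambda(t_{\text{deploy}})$ is the malign AI takeover probability rate at the time IDA is deployed and $A$ is the amplification factor. Even if the rate is low today, it is likely much higher at deployment time, and the amplification factor multiplies it further.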
Yeah, I agree with this. It’s certainly possible to see normal human passage through time as a process with probable attractors. I think the biggest differences are that HCH is a psychological “monoculture,” HCH has tiny bottlenecks through which to pass messages compared to the information I can pass to my future self, and there’s some presumption that the output will be “an answer” whereas I have no such demands on the brain-state I pass to tomorrow.
If we imagine actual human imitations I think all of these problems have fairly obvious solutions, but I think the problems are harder to solve if you want IDA approximations of HCH. I’m not totally sure what you meant by the confidence thresholds link—was it related to this?
The monoculture problem seems like it should increase the size (“size” meaning attraction basin, not measure of the equilibrium set), lifetime, and weirdness of attractors, while the restrictions and expectations on message-passing seem like they might shift the distribution away from “normal” human results.
But yeah, in theory we could use imitation humans to do any research we could do ourselves. I think that gets into the relative difficulty of super-speed imitations of humans doing alignment research versus transformative AI, which I’m not really an expert in.
[EDIT: After thinking about this some more, I realized that malign AI leakage is a bigger problem than I thought when writing the parent comment, because the way I imagined it can be overcome doesn’t work that well.]
> I think the biggest differences are that HCH is a psychological “monoculture,” HCH has tiny bottlenecks through which to pass messages compared to the information I can pass to my future self, and there’s some presumption that the output will be “an answer” whereas I have no such demands on the brain-state I pass to tomorrow.
I don’t think that last one is a real constraint. What counts as “an answer” is entirely a matter of interpretation by the participants in the HCH. For example, initially I can ask the question “what are the most useful thoughts about AI alignment I can come up with during 1,000,000 iterations?”. When I am tasked to answer the question “what are the most useful thoughts about AI alignment I can come up with during N iterations?” then:
If N=1, I will just spend my allotted time thinking about AI alignment and write whatever I came up with in the end.
If N>1, I will ask “what are the most useful thoughts about AI alignment I can come up with during N−1 iterations?”. Then, I will study the answer and use the remaining time to improve on it to the best of my ability.
An iteration of 2 weeks might be too short to learn the previous results, but we can work in longer iterations. Certainly, having to learn the previous results from text carries overhead compared to just remembering myself developing them (and having developed some illegible intuitions in the process), but it is only that: overhead.
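As a minimal sketch of this recursion (my own illustration; `think_from_scratch` and `study_and_improve` are hypothetical stand-ins for what a single participant actually does during one episode, not anything specified here):

```python
def think_from_scratch() -> str:
    # Stand-in for one participant spending a whole episode thinking from scratch.
    return "initial notes on alignment"

def study_and_improve(previous: str) -> str:
    # Stand-in for studying the previous answer and spending the episode improving it.
    return previous + " + one more episode of improvements"

def most_useful_thoughts(n: int) -> str:
    """Toy model of the question: 'what are the most useful thoughts about
    AI alignment I can come up with during n iterations?'"""
    if n == 1:
        # N = 1: just think for the allotted time and write up the result.
        return think_from_scratch()
    # N > 1: first ask the (N - 1)-iteration question, then improve on its answer.
    return study_and_improve(most_useful_thoughts(n - 1))

# The root question above corresponds to n = 1,000,000; a small n keeps this
# toy run within Python's default recursion limit.
print(most_useful_thoughts(3))
```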
As to “monoculture”, we can do HCH with multiple people (either the AI learns to simulate the entire system of multiple people, or we use some rigid interface, e.g. posting on a forum). For example, we can imagine putting the entire AI X-safety community there. But we certainly don’t want to put the entire world in there, since that way malign AI would probably leak into the system.
> I think the problems are harder to solve if you want IDA approximations of HCH. I’m not totally sure what you meant by the confidence thresholds link—was it related to this?
Yes: it shows how to achieve reliable imitation (although for now only in a theoretical model that isn’t feasible to implement), and the same idea should be applicable to an imitation system like IDA (although that calls for its own theoretical analysis). Essentially, the AI queries a real person if and only if it cannot produce a reliable prediction using previous data (because there are several plausible mutually inconsistent hypotheses), and the frequency of queries vanishes over time.
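Here is a toy sketch of that query rule, just to make it concrete (my own illustration of the “query only when plausible hypotheses disagree” idea with a finite hypothesis class; it is not the actual construction in the linked work, and the names below are all made up):

```python
import random

# Toy hypothesis class: each hypothesis maps an input in range(10) to a 0/1 answer.
# The "real person" behaves according to one fixed hypothesis; the imitator
# doesn't know which one.
random.seed(0)
HYPOTHESES = [{x: random.randint(0, 1) for x in range(10)} for _ in range(20)]
TRUE_HYPOTHESIS = HYPOTHESES[7]

def query_person(x: int) -> int:
    # Stand-in for asking the real person.
    return TRUE_HYPOTHESIS[x]

plausible = list(HYPOTHESES)  # hypotheses still consistent with all observations
queries = 0

for step in range(200):
    x = random.randrange(10)
    predictions = {h[x] for h in plausible}
    if len(predictions) > 1:
        # The plausible hypotheses disagree, so no reliable prediction is possible:
        # query the real person and discard the hypotheses that got it wrong.
        answer = query_person(x)
        plausible = [h for h in plausible if h[x] == answer]
        queries += 1
    else:
        # All remaining hypotheses agree: predict without querying.
        answer = predictions.pop()

print(f"queries to the real person: {queries} out of 200 steps")
```

In this toy version each query eliminates at least one hypothesis, so with a finite class the total number of queries is bounded and their frequency goes to zero, which is the property being appealed to.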