[EDIT: After thinking about this some more, I realized that malign AI leakage is a bigger problem than I thought when writing the parent comment, because the way I imagined overcoming it doesn’t work that well.]
I think the biggest differences are that HCH is a psychological “monoculture,” HCH has tiny bottlenecks through which to pass messages compared to the information I can pass to my future self, and there’s some presumption that the output will be “an answer” whereas I have no such demands on the brain-state I pass to tomorrow.
I don’t think that last one is a real constraint. What counts as “an answer” is entirely a matter of interpretation by the participants in the HCH. For example, I can initially ask the question “what are the most useful thoughts about AI alignment I can come up with during 1,000,000 iterations?”. When I am tasked with answering the question “what are the most useful thoughts about AI alignment I can come up with during N iterations?”, I proceed as follows (a code sketch follows the two cases below):
If N=1, I will just spend my allotted time thinking about AI alignment and write down whatever I have come up with at the end.
If N>1, I will ask “what are the most useful thoughts about AI alignment I can come up with during N−1 iterations?”. Then, I will study the answer and use the remaining time to improve on it to the best of my ability.
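To make the recursion concrete, here is a minimal Python sketch. The `spend_allotted_time` function is a hypothetical stand-in for one participant’s work session (it is not part of HCH itself); the sketch only illustrates how the N-iteration question unrolls into N sequential sessions.

```python
def spend_allotted_time(question, previous_answer=None):
    """Hypothetical stand-in for one participant's allotted thinking time:
    study the previous answer (if any) and return an improved write-up."""
    raise NotImplementedError("placeholder for a human work session")


def useful_thoughts(n_iterations, question="most useful thoughts on AI alignment"):
    """Answer 'what can I come up with in N iterations?' by building on the
    (N-1)-iteration answer."""
    if n_iterations == 1:
        # Base case: think for the allotted time and write down the result.
        return spend_allotted_time(question)
    # Recursive case: obtain the (N-1)-iteration answer, study it, and use
    # the remaining time to improve on it.
    previous = useful_thoughts(n_iterations - 1, question)
    return spend_allotted_time(question, previous_answer=previous)
```

For N on the order of 1,000,000 this recursion is of course just a loop of a million sequential sessions; the recursive phrasing only mirrors how each participant poses the N−1 question to the next copy.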
An iteration of two weeks might be too short to absorb the previous results, but we can work in longer iterations. Certainly, having to learn the previous results from text carries some overhead compared to simply remembering having developed them myself (and having built up some illegible intuitions in the process), but that overhead is all it costs.
As to “monoculture”, we can do HCH with multiple people (either the AI learns to simulate the entire system of multiple people, or we use some rigid interface, e.g. posting on a forum). For example, we can imagine putting the entire AI X-safety community there. But we certainly don’t want to put the entire world in there, since that way malign AI would probably leak into the system.
I think the problems are harder to solve if you want IDA approximations of HCH. I’m not totally sure what you meant by the confidence thresholds link—was it related to this?
Yes: it shows how to achieve reliable imitation (although for now only in a theoretical model that isn’t feasible to implement), and the same idea should be applicable to an imitation-based system like IDA (although that calls for its own theoretical analysis). Essentially, the AI queries a real person if and only if it cannot produce a reliable prediction from the previous data (because several plausible but mutually inconsistent hypotheses remain), and the frequency of such queries vanishes over time.
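To make the query rule concrete, here is a toy Python sketch under strong simplifying assumptions (a finite hypothesis class containing the true deterministic behaviour), not the actual construction from the linked post: the imitator answers on its own whenever all hypotheses still consistent with past data agree, and queries the real person exactly when they disagree.

```python
def imitate(hypotheses, inputs, query_human):
    """Predict whenever all surviving hypotheses agree; otherwise query the
    human and discard every hypothesis the observed answer contradicts."""
    surviving = list(hypotheses)
    outputs = []
    for x in inputs:
        predictions = {h(x) for h in surviving}
        if len(predictions) == 1:
            # All plausible hypotheses agree: imitate without querying.
            outputs.append(predictions.pop())
        else:
            # Several mutually inconsistent hypotheses remain: ask the person.
            y = query_human(x)
            surviving = [h for h in surviving if h(x) == y]
            outputs.append(y)
    return outputs
```

Under these toy assumptions each query rules out at least one hypothesis, so there are at most len(hypotheses) − 1 queries in total and their frequency indeed vanishes; the linked result establishes an analogous guarantee in a much more general setting.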