One way to think about the problem of AI alignment is that we only know how to train models on information that is _accessible_ to us, but we want models that leverage _inaccessible_ information.
Information is accessible if it can be checked directly, or if an ML model would successfully transfer to provide the information when trained on some other accessible information. (An example of the latter would be if we trained a system to predict what happens in a day, and it successfully transfers to predicting what happens in a month.) Otherwise, the information is inaccessible: for example, “what Alice is thinking” is (at least currently) inaccessible, while “what Alice will say” is accessible. The post has several other examples.
Note that while an ML model may not directly say exactly what Alice is thinking, if we train it to predict what Alice will say, it will probably have some internal model of what Alice is thinking, since that is useful for predicting what Alice will say. It is nonetheless inaccessible because there’s no obvious way of extracting this information from the model. While we could train the model to also output “what Alice is thinking”, this would have to be training for “a consistent and plausible answer to what Alice is thinking”, since we don’t have the ground truth answer. This could incentivize bad policies that figure out what we would most believe, rather than reporting the truth.
The argument for risk is then as follows: we care about inaccessible information (e.g. we care about what people _actually_ experience, rather than what they _say_ they experience) but can’t easily make AI systems that optimize for it. However, AI systems will be able to infer and use inaccessible information, and would outcompete ones that don’t. AI systems will be able to plan using such inaccessible information for at least some goals. Then, the AI systems that plan using the inaccessible information could eventually control most resources. Key quote: “The key asymmetry working against us is that optimizing flourishing appears to require a particular quantity to be accessible, while danger just requires anything to be accessible.”
The post then goes on to list some possible angles of attack on this problem. Iterated amplification can be thought of as addressing gaps in speed, size, experience, algorithmic sophistication etc. between the agents we train and ourselves, which can limit what inaccessible information our agents can have that we won’t. However, it seems likely that amplification will eventually run up against some inaccessible information that will never be produced. As a result, this could be a “hard core” of alignment.
Planned opinion:
I think the idea of inaccessible information is an important one, but it’s one that feels deceptively hard to reason about. For example, I often think about solving alignment by approximating “what a human would say after thinking for a long time”; this is effectively a claim that human reasoning transfers well when iterated over long periods of time, and “what a human would say” is at least somewhat accessible. Regardless, it seems reasonably likely that AI systems will inherit the same property of transferability that I attribute to human reasoning, in which case the argument for risk applies primarily because the AI system might apply its reasoning towards a different goal than the ones we care about, which leads us back to the <@intent alignment@>(@Clarifying “AI Alignment”@) formulation.
This response views this post as a fairly general argument against black box optimization, where we only look at input-output behavior, as then we can’t use inaccessible information. It suggests that we need to understand how the AI system works, rather than relying on search, to avoid these problems.
Planned summary for the Alignment Newsletter:
Planned opinion: