Well, Paul’s original post presents HCH as the specification of a human’s enlightened judgment:
> For now, I think that HCH is our best way to precisely specify “a human’s enlightened judgment.” It’s got plenty of problems, but for now I don’t know anything better.
And if we follow the links to Paul’s earlier post about this concept, he describes his ideal implementation of considered judgment (what would become HCH) using the intuition of thinking for a decent amount of time:
> To define my considered judgment about a question Q, suppose I am told Q and spend a few days trying to answer it. But in addition to all of the normal tools—reasoning, programming, experimentation, conversation—I also have access to a special oracle. I can give this oracle any question Q’, and the oracle will immediately reply with my considered judgment about Q’. And what is my considered judgment about Q’? Well, it’s whatever I would have output if we had performed exactly the same process, starting with Q’ instead of Q.
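To make the recursive structure of this definition concrete, here is a minimal Python sketch (the names `considered_judgment`, `Human`, and `oracle` are mine, not from Paul’s post): the human deliberates on a question, and the oracle they consult is simply the same process restarted on the sub-question.

```python
from typing import Callable

# A "human" here is any deliberation procedure that, given a question
# and an oracle for sub-questions, eventually returns an answer.
Human = Callable[[str, Callable[[str], str]], str]

def considered_judgment(question: str, human: Human) -> str:
    """Sketch of the quoted definition: the oracle answers a
    sub-question by running this exact same process on it."""
    def oracle(sub_question: str) -> str:
        return considered_judgment(sub_question, human)
    return human(question, oracle)
```

Note that, as in the original definition, nothing here guarantees termination: the recursion bottoms out only if the human eventually answers some sub-questions without consulting the oracle.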
So it looks to me like “HCH captures the judgment of the human after thinking for a long time” is definitely a claim made in the post defining the concept. Whether it actually holds is another (quite interesting) question, to which I don’t know the answer.
A line of thought I explore in Epistemology of HCH is the comparison between HCH and CEV: the former is more operationally concrete (what I call an intermediary alignment scheme), but the latter can directly state the properties it has (like giving the same decision the human would give after thinking for a long time), whereas for HCH we have to argue for them.