Unfortunately, I must admit that I did add that paragraph later to make my thesis clearer. However, the version that Eliezer, Nate, and Rob replied to still had this paragraph, which I think makes essentially the same point (i.e., that I am not referring merely to passive understanding, but to explicit specification):
I’m not arguing that GPT-4 actually cares about maximizing human value. However, I am saying that the system is able to transparently pinpoint to us which outcomes are good and which outcomes are bad, with fidelity approaching an average human, albeit in a text format. Crucially, GPT-4 can do this visibly to us, in a legible way, rather than merely passively knowing right from wrong in some way that we can’t access. This fact is key to what I’m saying because it means that, in the near future, we can literally just query multimodal GPT-N about whether an outcome is bad or good, and use that as an adequate “human value function”. That wouldn’t solve the problem of getting an AI to care about maximizing the human value function, but it would arguably solve the problem of creating an adequate function that we can put into a machine to begin with.
Ok, thanks for clarifying that that paragraph was added later.
(My comments also apply to the paragraph that was in the original.)