A U-maximizer faces a similar set of problems: it cannot understand the exact form of U, but it can still have well-founded beliefs about U, and about what sorts of actions are good according to U. For example, if we suppose that the U-maximizer can carry out any reasoning that we can carry out, then the U-maximizer knows to avoid anything which we suspect would be bad according to U (for example, torturing humans).
Here’s a stronger version of my previous criticism of this argument. Suppose that instead of giving the neuroimaging data to the AI and defining H in terms of a brute-force search for a model that can explain that data, we give it a cryptographic hash of the neuroimaging data (of sufficient length to avoid possible collisions), and modify the definition of H to first perform a brute-force search to recover the neuroimaging data from the hash. In this case, we can still say that torturing is probably bad according to U, but the AI obviously can’t arrive at this conclusion from the formal definition of U alone (assuming it can’t break the cryptographic hash). It seems clear that we can’t safely assume that “the U-maximizer can carry out any reasoning that we can carry out”.
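To make the modified construction concrete, here is a minimal Python sketch (the variable names, the placeholder data, and the choice of SHA-256 are mine, purely for illustration, not part of the original proposal): the definition retains only the hash, and recovering the plaintext data from it is a brute-force search that is well-defined in principle but computationally out of reach for the AI.

```python
import hashlib
import itertools

# Illustrative stand-ins (my own, not from the post): placeholder bytes for
# the raw scan, and SHA-256 as the "sufficiently long" hash.
neuroimaging_data = b"placeholder for the raw neuroimaging data"
commitment = hashlib.sha256(neuroimaging_data).digest()


def brute_force_recover(target_hash, max_len):
    """Enumerate byte strings until one hashes to target_hash.

    This is the extra step the modified definition of H performs: it is
    perfectly well-defined, but for realistic data sizes the search space
    (256**length) makes it computationally infeasible, so a bounded
    U-maximizer cannot actually evaluate U this way.
    """
    for length in range(1, max_len + 1):
        for candidate in itertools.product(range(256), repeat=length):
            candidate_bytes = bytes(candidate)
            if hashlib.sha256(candidate_bytes).digest() == target_hash:
                return candidate_bytes
    return None


# Feasible only for tiny toy inputs, e.g.
# brute_force_recover(hashlib.sha256(b"hi").digest(), 2) returns b"hi".
```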
Even if the U-maximizer cannot carry out this reasoning, as long as it can recognize that humans have powerful predictive models for other humans, it can simply appropriate those models (either by carrying out reasoning inspired by human models, or by simply asking).
In order to “carry out reasoning inspired by human models”, the AI has to first form a usable model of a human. I don’t have a strong argument that the U-maximizer can’t do this from the original definition of U (i.e., from the plaintext neuroimaging data), but intuitively it seems implausible given the amount of computing power the U-maximizer might initially have access to (say, within a couple of orders of magnitude of what is needed to do standard WBE). I don’t see how “simply asking” could work either. What kinds of questions would the U-maximizer ask, and how could we answer them, given that we don’t know how to formalize what “torture” means?