While reading through the report I made a lot of notes about things that weren’t clear to me, so I’m copying here the ones that weren’t resolved by the time I finished. Since they were written while reading, many of these may be either obvious or nitpicky.
Footnote 14, page 15:
Though we do believe that messiness may quantitatively change when problems occur. As a caricature, if we had a method that worked as long as the predictor’s Bayes net had fewer than 10^9 parameters, it might end up working for a realistic messy AI until it had 10^12 parameters, since most of those parameters do not specify a single monolithic model in which inference is performed.
Can we assume that defeating the method allows the AI to get a better loss, since it’s effectively wireheading at that point? If so, then wouldn’t a realistic messy AI learn a Bayes net once it had >= 10^9 parameters? In other words, are there reasons beyond performance that preclude an AI from learning a single monolithic model?
Footnote 33, page 30 (under the heading “Strategy: have AI help humans improve our understanding”):
Most likely this would involve some kind of joint training, where our AI helps humans understand the world better in parallel with using gradient descent to develop its own understanding. To reiterate, we are leaving details vague because we don’t think that our counterexample depends on those details.
I realize this is only a possible example of how we might implement this, but wouldn’t a training procedure that explicitly involves humans be very uncompetitive? The strategy described in the main text sounds like it’s describing an AI assistant that automates science well enough to impart all of the predictor’s knowledge to us, which wouldn’t run into this issue.
Footnote 48 to this paragraph on page 36:
The paradigmatic example of an ontology mismatch is a deep change in our understanding of the physical world. For example, you might imagine humans who think about the world in terms of rigid bodies and Newtonian fluids and “complicated stuff we don’t quite understand,” while an AI thinks of the world in terms of atoms and the void. Or we might imagine humans who think in terms of the standard model of physics, while an AI understands reality as vibrations of strings. We think that this kind of deep physical mismatch is a useful mental picture, and it can be a fruitful source of simplified examples, but we don’t think it’s very likely.
Footnote:
And if it did occur it seems like an unusually good candidate for a case where doing science (and in particular tracking how the new structures implement the old structures) outcompetes gradient descent, and on top of that a case where translation is likely to be relatively easy to pick out with suitable regularization.
I might be reading too much into this, but I don’t understand the basis of this claim. Is it that the correspondence differs only at the low level? If so, I still don’t see how doing science outcompetes gradient descent.
Page 51, under the heading “[ELK] may be sufficient for building a worst-case solution to outer alignment”:
Use imitative generalization combined with amplification to search over some space of instructions we could give an amplified human that would let them make cakes just as delicious as Cakey’s would have been.
I haven’t thoroughly read the article on amplification, so this question may be trivial, but my understanding is that amplified humans are more or less equivalent to humans with AI-trained Bayes nets. If true, then doesn’t this require the assumption that tasks will always have a clean divide between the qualitative (taste of cakes) which we can match with an amplified human, and the quantitative (number of cakes produced per hour) which we can’t? That feels like it’s a safe assumption to make, but I’m not entirely sure.
Page 58, in the list of features suggesting that M(x) knew that A’ was the better answer:
That real world referent Z has observable effects and the human approximately understands those effects (though there may be other things that also affect observations which the human doesn’t understand)
...
The referent Z is also relevant to minimizing the loss function ℒ. That is, there is a coherent sense in which the optimal behavior “depends on” Z, and the relative loss of different outputs would be very different if Z “had been different.”
There is a feature of the computation done by the AI which is robustly correlated with Z, and for which that correlation is causally responsible for M achieving a lower loss.
First, why is the first point necessary to suggest that M(x) knew that A’ was the better answer? Second, how are the last two points different?
Page 69, under “Can your AI model this crazy sequence of delegation?”:
We hope that this reasoning is feasible because it is closely analogous to a problem that the unaligned AI must solve: it needs to reason about acquiring resources that will be used by future copies of itself, who will themselves acquire resources to be used by further future copies and so on.
We need the AI to have a much smaller margin of error when modelling this sequence of delegation than when reasoning about acquiring resources for future copies. In other words, with only a limited amount of computation the AI will still try to reason about acquiring resources for future copies, and could succeed in the absence of other superintelligences because there is no serious opposition; but modelling the delegation with that same limited computation might be dangerous because of the tendency for value drift.
Page 71:
… we want to use a proposal that decouples “the human we are asking to evaluate a world” from “the humans in that world”—this ensures that manipulating the humans to be easily satisfied can’t improve the evaluation of a world.
Is it possible for the AI to manipulate the human in world i to be easily satisfied in order to improve the evaluation of world i+1?
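To make this question concrete, here is a minimal, purely hypothetical sketch of the decoupling as I picture it; none of these names (ReferenceHuman, evaluate, score_trajectory) come from the report, and the worlds are just placeholder dictionaries.

```python
# Purely hypothetical sketch of "decoupled" evaluation: the score of a world
# comes from a fixed reference human outside the world, not from the humans
# living inside it.

class ReferenceHuman:
    def judge(self, world: dict) -> float:
        # Fixed evaluator; never appears inside any of the worlds it scores.
        return 1.0 if world.get("humans_flourishing", False) else 0.0

def evaluate(world: dict, h_ref: ReferenceHuman) -> float:
    # Depends only on h_ref, so manipulating the humans *inside* `world`
    # cannot directly change its score.
    return h_ref.judge(world)

def score_trajectory(worlds: list[dict], h_ref: ReferenceHuman) -> list[float]:
    # My question, restated: actions taken in world i shape world i+1, so can
    # manipulating the humans inside world i still raise evaluate(world i+1)
    # even though h_ref itself is untouched?
    return [evaluate(w, h_ref) for w in worlds]
```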
Page 73:
As I understand this, z_prior is what the model expects to happen when it sees “action” and “before”, z_posterior is what it thinks actually happened once it also sees “after”, and kl is the difference between the two that we’re penalizing it on. What is logprob doing?
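To pin down where my confusion is, here is the structure I’m assuming the page 73 pseudocode has: a minimal sketch of a standard VAE-style (ELBO) loss. The names model.prior, model.posterior, and model.log_prob_of_after are hypothetical stand-ins of mine, not anything from the report, and the role I assign to logprob is my guess.

```python
import torch

def regularized_loss(model, action, before, after):
    # z_prior: what the model expects the latent state to be given only
    # "action" and "before" (parameters of a diagonal Gaussian).
    prior_mu, prior_logvar = model.prior(action, before)

    # z_posterior: what the model thinks actually happened once it also
    # sees "after".
    post_mu, post_logvar = model.posterior(action, before, after)
    z_posterior = post_mu + torch.randn_like(post_mu) * (0.5 * post_logvar).exp()

    # kl: penalty for the posterior deviating from the prior,
    # KL(posterior || prior) between two diagonal Gaussians.
    kl = 0.5 * torch.sum(
        prior_logvar - post_logvar
        + (post_logvar.exp() + (post_mu - prior_mu) ** 2) / prior_logvar.exp()
        - 1.0
    )

    # logprob (my guess): a reconstruction term, i.e. how well "after" can be
    # predicted from the sampled posterior latent.
    logprob = model.log_prob_of_after(after, z_posterior)

    # kl - logprob would then be the usual negative ELBO.
    return kl - logprob
```

If that reading is right, logprob would be what keeps the latent informative about “after”, while kl keeps it close to what was predictable from “action” and “before” alone; but I’m not sure that’s its intended role.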