I think I agree with Eliezer here, but I’m worried I misunderstand something:
Eliezer Yudkowsky: “Pessimal” is a strange word to use for this apt description of humanity’s entire experience with ML to date. Unless by “generalize” you mean “generalize correctly to one new example from the same distribution” rather than “generalize the underlying concept that a human would”.
Ajeya Cotra: I used “pessimal” here in the technical sense that it’s assuming that if there are N generalizations equally valid on the training distribution, the model will pick the one which is worst for humans. Even if there’s a very high probability that the worst one is in fact picked, assuming the worst one will be picked is still “assuming the worst case.”
I was never under the impression that MIRI’s conceptual work assumes that the neural net that pops out of the training process will be the worst possible one for humans. That would mean assuming that e.g. GPT-3 is secretly a deceptively aligned sadistic superintelligence, which is clearly false. (Consider: in the space of all possible generalizations that fit GPT-3’s training data, i.e. the space of all possible neural nets subject to the constraint that they perform well on GPT-3’s training data, is there at least one that is both sadistic and superhumanly intelligent? Yeah, probably. If not, just suppose we made GPT-3 bigger or something.) At any rate, MIRI never seems to assume that the AI is sadistic, merely that it has goals different from ours and is aware of what’s going on.
The conceptual work I was gesturing at here is more Paul’s work, since MIRI’s work (afaik) is not really neural net-focused. It’s true that Paul’s work also doesn’t assume a literal worst case; it’s a very fuzzy concept I’m gesturing at here. It’s more like: Paul’s research process is to (a) come up with some procedure, (b) try to think of any “plausible” set of empirical outcomes that would cause the procedure to fail, and (c) modify the procedure to try to address that case. (The slipperiness comes in at the definition of “plausible” here, but the basic spirit of it is to “solve for every case” in the way theoretical CS typically aims to do in algorithm design, rather than “solve for the case we’ll in fact encounter.”)
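To pin down the “N generalizations equally valid on the training distribution” framing, here is a minimal toy sketch of what the worst-case (“pessimal”) assumption amounts to. Everything in it, the candidate rules and the human_utility scores, is an invented placeholder for illustration, not anything from the exchange above:

```python
# Toy sketch: "pessimal" generalization as a worst-case assumption over
# hypotheses that all fit the training data equally well.
# The candidate rules and human_utility numbers are made-up placeholders.

training_data = [(0, 0), (1, 1), (2, 4)]  # (input, label) pairs

# Three candidate generalizations, all consistent with the training data,
# but behaving very differently off-distribution (inputs > 2).
candidates = {
    "intended rule": lambda x: x ** 2,
    "off-distribution quirk": lambda x: x ** 2 if x <= 2 else 0,
    "adversarial policy": lambda x: x ** 2 if x <= 2 else -(10 ** 6),
}

def fits_training_data(f):
    """A hypothesis is 'equally valid' here if it matches every training label."""
    return all(f(x) == y for x, y in training_data)

# Hypothetical scores for how good each generalization would be for humans.
human_utility = {
    "intended rule": 1.0,
    "off-distribution quirk": 0.3,
    "adversarial policy": -1.0,
}

consistent = [name for name, f in candidates.items() if fits_training_data(f)]

# Worst-case ("pessimal") analysis: assume training produced whichever
# consistent hypothesis is worst for humans, regardless of how likely that is.
pessimal = min(consistent, key=lambda name: human_utility[name])
print(pessimal)  # -> "adversarial policy"
```

The point of the sketch is just the last two lines: “pessimal” in Ajeya’s technical sense means taking the min over the set of training-consistent generalizations, which is a statement about the analysis being performed, not a claim about which generalization training actually produces.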