Why did you write “This post [Inaccessible Information] doesn’t reflect me becoming more pessimistic about iterated amplification or alignment overall.” just one month before publishing “Learning the prior”? (Is it because you were classifying “learning the prior” / imitative generalization under “iterated amplification” and now you consider it a different algorithm?)
I think that post is basically talking about the same kinds of hard cases as in Towards Formalizing Universality 1.5 years earlier (in section IV), so it’s intended to be more about clarification/exposition than changing views.
See the thread with Rohin above for some rough history.
Why doesn’t the analogy with cryptography make you a lot more pessimistic about AI alignment, as it did for me?
I’m not sure. It’s possible I would become more pessimistic if I walked through concrete cases of people’s analyses being wrong in subtle and surprising ways.
My experience with practical systems is that it is usually easy for theorists to describe hypothetical breaks of the security model, and the issue is mostly one of prioritization (since people normally don’t care too much about security). For example, I would strongly expect that, for any of the systems discussed in the article you linked, people had described hypothetical attacks prior to implementation, at least if those systems had ever been subject to formal scrutiny. The failures are just quite far away from the levels of paranoia that I’ve seen people on the theory side exhibit when they are trying to think of attacks.
I would also expect that e.g. if you were to describe almost any existing practical system with purported provable security, it would be straightforward for a layperson with theoretical background (e.g. me) to describe possible attacks that are not precluded by the security proof, and that it wouldn’t even take that long. It sounds like a fun game.
Another possible divergence is that I’m less convinced by the analogy, since alignment seems more about avoiding the introduction of adversarial consequentialists and it’s not clear if that game behaves in the same way. I’m not sure if that’s more or less important than the prior point.
Would you do anything else to make sure it’s safe, before letting it become potentially superintelligent? For example would you want to see “alignment proofs” similar to “security proofs” in cryptography?
I would want to do a lot of work before deploying an algorithm in any context where a failure would be catastrophic (though “before letting it become potentially superintelligent” kind of suggests a development model I’m not on board with).
That would ideally involve theoretical analysis from a lot of angles, e.g. proofs of key properties that are amenable to proof, demonstrations of how the system could plausibly fail if we were wrong about key claims or if we relax assumptions, and so on.
It would also involve good empirical characterization, including things like running on red-team inputs, changing the training procedure in ways that seem as bad as possible while still preserving our alignment arguments, and performing extensive evals under those more pessimistic conditions. It would involve validating key claims individually, and empirically testing other claims that are established by structurally similar arguments. It would involve characterizing scaling behavior where applicable and understanding it as well as we can (along with typical levels of variability and plausible stories about deviations from trend).
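As a minimal sketch of what "characterizing scaling behavior along with typical variability" could look like in practice, the snippet below fits a simple trend to eval results across model scales and reports the residual spread; the power-law form and the numbers are illustrative assumptions, not anything from a real system:

```python
# Minimal sketch (illustrative assumptions): fit a power-law trend to an eval
# metric across model scales and use the residual spread as a rough measure of
# "typical variability"; large deviations at a new scale would call for a story.
import numpy as np
from scipy.optimize import curve_fit

def power_law(n, a, b, c):
    # Assumed parametric form for the scaling trend: loss ~ a * n^(-b) + c
    return a * n ** (-b) + c

# Hypothetical eval losses measured at several model sizes (parameter counts)
sizes = np.array([1e7, 3e7, 1e8, 3e8, 1e9])
losses = np.array([3.1, 2.7, 2.4, 2.15, 1.95])

params, _ = curve_fit(power_law, sizes, losses, p0=(10.0, 0.1, 1.0), maxfev=10000)
residuals = losses - power_law(sizes, *params)

print("fitted (a, b, c):", params)
print("residual std (typical variability):", residuals.std())
```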
What if such things do not seem feasible or you can’t reach very high confidence that the definitions/assumptions/proofs are correct?
I’m not exactly sure what you are asking. It seems like we’ll do what we can on all the fronts and prioritize them as well as we can. Do you mean, what else can we say today about what methodologies we’d use? Or under what conditions would I pivot to spending down my political capital to delay deployment? Or something else?