I’m interested in concrete ways for humans to evaluate and verify complex facts about the world. I’m especially interested in a set of things that might be described as “bootstrapping trust”.
For example:
Say I want to compute some expensive function f on an input x. I have access to a computer C that can compute f; it gives me a result r. But I don’t fully trust C—it might be maliciously programmed to tell me a wrong answer. In some cases, I can require that C produce a proof that f(x) = r that I can easily check. In others, I can’t. Which cases are which?
A partial answer to this question is “the complexity class NP”. But in practice this isn’t really satisfying. I have to make some assumptions about what tools are available that I do trust.
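To make the NP-style answer concrete, here is a minimal sketch (my own illustrative example, not from the original): integer factorization. Finding the factors of n can be expensive, but checking a claimed factorization only requires multiplication and a cheap primality test, so an untrusted computer C can hand over the factor list as an easily checked proof.

```python
# Illustrative sketch: verifying an untrusted computation cheaply.
# Factoring n may be expensive; checking a claimed factorization is cheap.
from math import prod

def untrusted_factor(n):
    """Stands in for the untrusted computer C (here, naive trial division)."""
    factors, d = [], 2
    while d * d <= n:
        while n % d == 0:
            factors.append(d)
            n //= d
        d += 1
    if n > 1:
        factors.append(n)
    return factors

def verify(n, claimed_factors):
    """The cheap check: every factor exceeds 1 and the product equals n.
    (A complete check would also confirm each factor is prime, which
    remains cheap via a standard primality test.)"""
    return all(f > 1 for f in claimed_factors) and prod(claimed_factors) == n

r = untrusted_factor(391)         # 391 = 17 * 23
assert verify(391, r)             # honest answer passes
assert not verify(391, [3, 130])  # a wrong "proof" is rejected
```

The asymmetry is the whole point: the verifier never re-runs the expensive computation, it only checks the certificate. The hard question in the text is which functions f admit such certificates at all.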
Maybe I trust simple mathematical facts (and I think I even trust that serious mathematics and theoretical computer science track truth really well). I also trust my own senses and memory, to a nontrivial extent. Reaching much beyond that is starting to feel iffy. For example, I might not (yet) have a computer of my own that I trust to help me with the verification. What kinds of proof can I accept with the limitations I’ve chosen? And how can I use those trustworthy proofs to bootstrap other trusted tools?
Other problems in this bucket include “How can we have trustworthy evidence—say videos—in a world with nearly perfect generative models?” and a bunch of subquestions of “Does debate scale as an AI alignment strategy?”
This class of questions feels like an interesting lens on some things that are relevant to some sorts of AI alignment work such as debate and interpretability. It’s also obviously related to some parts of information security and cryptography.
“Bootstrapping trust” is basically just a restatement of the whole problem. It’s not exactly that I think this is a good way to decide how to direct AI alignment effort; I just notice that it seems somehow like a “fresh” way of viewing things.
What feels especially promising about this framing is that it seems like the kind of problem that admits straightforward engineering- or cryptography-style solutions.