It seems to me that the amplification scheme could include redundant processing/error correction — i.e., ask subordinates to solve a problem in several different ways, then check whether they disagree. We could take a majority vote, or flag disagreements as a sign that something dangerous is going on. This could deal with this sort of problem.
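To make the idea concrete, here is a toy sketch of the vote-or-flag aggregation step. The function name, threshold, and string answers are all hypothetical choices for illustration, not part of any proposed scheme:

```python
from collections import Counter

def aggregate_subordinate_answers(answers, min_agreement=0.5):
    """Aggregate independent subordinate answers by majority vote.

    Flags the query when no answer clears the agreement threshold --
    disagreement may indicate that a subordinate was corrupted by
    its input, so the query is escalated rather than answered.
    """
    counts = Counter(answers)
    best, votes = counts.most_common(1)[0]
    if votes / len(answers) > min_agreement:
        return best, False   # consensus answer, no flag raised
    return None, True        # disagreement: escalate for review

# Three subordinates solve the same problem in different ways.
answer, flagged = aggregate_subordinate_answers(["42", "42", "41"])
```

In the example above, two of three subordinates agree, so the consensus answer is returned unflagged; if all three had disagreed, the query would be flagged instead of answered.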
That sounds like a good idea. But I still don’t feel like I fully understand what we are getting in return for knowledge isolation. Knowledge isolation is clearly useful sometimes, e.g. for ensemble learning or cross validation. But it feels to me like a framework that allowed us to pool & isolate knowledge in a more fluid way would work better.
Imagine an organization where Alice in Department A needs to tell something to Bob in Department B. Suppose the organization is very straitlaced, and Alice knows that the only way Bob will hear her message is if Alice tells it to her boss, who tells it to the CEO, who tells it to the head of Department B, who tells it to Bob. What is this game of telephone buying us, exactly?
I think (based on reading Paul’s blog posts) that knowledge isolation provides these benefits:
The distribution of training and test examples for the distilled agent is kept as similar as possible (possibly identical, or possibly close enough that you can ask for new training data when you find something too far out of distribution). Suppose we allow for unlimited knowledge sharing. The training data gathered from humans will only include examples of humans processing some limited amount of information, and that information will have been produced in a fairly normal set of circumstances that occur during training. But as the IDA procedure continues, later agents will have to deal with much larger amounts of data generated under potentially weirder circumstances. So distributional shift will become more of a problem.
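The "ask for new training data when something is too far out of distribution" step could look something like the following toy sketch. It assumes each input can be summarized by a single scalar feature; a real system would need a proper density model, and the z-score threshold is an arbitrary illustrative choice:

```python
import statistics

def is_out_of_distribution(x, train_features, z_threshold=3.0):
    """Flag an input whose scalar feature summary lies far outside
    the training distribution, so the overseer can be asked for
    fresh training data instead of trusting the distilled agent."""
    mean = statistics.fmean(train_features)
    stdev = statistics.stdev(train_features)
    return abs(x - mean) > z_threshold * stdev

# Features seen during training cluster around 50.
train = [float(v) for v in range(100)]
```

An input near the training range would pass through normally, while a far-out value would trigger a request for new training data rather than a (possibly unreliable) answer.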
Security amplification. In Universality and Security Amplification, Paul describes limiting the amount of information accessible to each agent to the extent that we can perform red-teaming over the set of possible inputs an agent will have to process, and so have confidence that agents won't be attacked by any input they receive. He acknowledges that this will limit capabilities (so the system wouldn't be able to, say, translate between languages as well as a human translator). But he claims that 1) the system doesn't necessarily need to perform all tasks itself; instead it can just learn how to safely use external humans or systems, and 2) even the information-limited set of queries the system can answer will still be able to include a "simple core of reasoning" sufficient for this task. (I'm still trying to wrap my head around whether I think this kind of system will be able to have sufficient capabilities.)
Re: corrigibility, see this comment.