I took a quick look at your proposal; here are some quick thoughts on why I’m not super excited:
It seems brittle. If there’s miscommunication at any level of the hierarchy, you run the risk of breakage. Fatal miscommunications could happen as information travels either up or down the hierarchy.
It seems like an awkward framework for achieving decisive strategic advantage. I think decisive strategic advantage will be achieved not through getting menial tasks done at a really fast rate so much as through making lots of discoveries, generating lots of ideas, and analyzing lots of possible scenarios. For this, a shared knowledge base seems ideal. In your framework, it looks like new insights get shared with subordinates by retraining them? This seems awkward and slow. And if insights need to travel up and down the hierarchy to get shared, this introduces loads of opportunities for miscommunication (see previous bullet point). To put it another way, this looks like a speed superintelligence at best; I think a quality superintelligence will beat it.
The framework does not appear to have a clear provision for adapting its value learning to the presence/absence of decisive strategic advantage. The ideal FAI will slow down and spend a lot of time asking us what we want once decisive strategic advantage has been achieved. With your thing, it appears as though this would require an awkward retraining process.
In general, your proposal looks rather like a human organization, or the human economy. There are people called “CEOs” who delegate tasks to subordinates, who delegate tasks to subordinates and so on. I expect that if your proposal works better than existing organizational models, companies will have a financial incentive to adopt it regardless of what you do. As AI and machine learning advance, I expect that AI and machine learning will gradually swallow menial jobs in organizations, and the remaining humans will supervise AIs. Replacing human supervisors with AIs will be the logical next step. If you think this is the best way to go, perhaps you could raise venture capital to help accelerate this transition; I assume there will be a lot of money to be made. In any case, you could brainstorm failure modes for your proposal by looking at how organizations fail.
I’ve spent a lot of time thinking about this, and I still don’t understand why a very simple approach to AI alignment (train a well-calibrated model of human preferences, have the AI ask for clarification when necessary) is unworkable. All the objections I’ve seen to this approach seem either confused or solvable with some effort. Building well-calibrated statistical models of complex phenomena that generalize well is a hard problem. But I think an AI would likely need a solution to this problem to take over the world anyway.
In software engineering terms, your proposal appears to couple together value learning, making predictions, making plans, and taking action. I think an FAI will be both safer and more powerful if these concerns are decoupled.
It seems brittle. If there’s miscommunication at any level of the hierarchy, you run the risk of breakage. Fatal miscommunications could happen as information travels either up or down the hierarchy.
It seems to me that the amplification scheme could include redundant processing/error correction—i.e. ask subordinates to solve a problem in several different ways, then look at whether they disagree, and either take a majority vote or flag disagreements as indicating that something dangerous is going on. This could deal with this sort of problem.
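To make that concrete, here is a minimal sketch of the kind of redundancy I have in mind; the function names, the way subordinates are represented, and the disagreement tolerance are all invented for illustration, not part of the actual amplification scheme:

```python
from collections import Counter

def solve_with_redundancy(question, subordinates, max_dissent=0):
    """Ask several subordinates the same question independently, then either
    take a majority vote or flag disagreement as a sign something is wrong.

    `subordinates` is a list of callables mapping a question to an answer;
    everything here is an illustrative stand-in, not a real API.
    """
    answers = [solve(question) for solve in subordinates]
    counts = Counter(answers)
    best_answer, votes = counts.most_common(1)[0]
    if len(answers) - votes > max_dissent:
        # More disagreement than we tolerate: escalate rather than guess.
        return {"status": "flagged", "answers": dict(counts)}
    return {"status": "ok", "answer": best_answer}

# Toy usage: three "subordinates", one of which is faulty.
subs = [lambda q: q.upper(), lambda q: q.upper(), lambda q: q.lower()]
print(solve_with_redundancy("attack detected?", subs, max_dissent=0))  # flagged
print(solve_with_redundancy("attack detected?", subs, max_dissent=1))  # majority answer
```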
The framework does not appear to have a clear provision for adapting its value learning to the presence/absence of decisive strategic advantage. The ideal FAI will slow down and spend a lot of time asking us what we want once decisive strategic advantage has been achieved. With your thing, it appears as though this would require an awkward retraining process.
It seems to me that balancing the risks of acting vs. taking time to ask questions depending on the current situation falls under Paul’s notion of corrigibility, so it would happen appropriately (as long as you maintain the possibility of asking questions as an output of the system, and the input appropriately describes the state of the world relevant to evaluating whether you have decisive strategic advantage).
It seems to me that balancing the risks of acting vs. taking time to ask questions depending on the current situation falls under Paul’s notion of corrigibility
I definitely agree that balancing costs vs. VOI falls under the behavior-to-be-learned, and don’t see why it would require retraining. You train a policy that maps (situation) --> (what to do next). Part of the situation is whether you have a decisive advantage, and generally how much of a hurry you are in. If you had to retrain every time the situation changed, you’d never be able to do anything at all :)
(That said, to the extent that corrigibility is a plausible candidate for a worst-case property, it wouldn’t be guaranteeing any kind of competent balancing of costs and benefits.)
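As a toy illustration of the “policy over situations” point (the feature names, thresholds, and hand-written rules below are invented for illustration; a learned policy would obviously not be a lookup like this):

```python
def policy(situation):
    """Map a description of the current situation to what to do next.

    Whether we hold a decisive advantage, and how much of a hurry we are in,
    are just features of the input. The same policy covers both regimes, so a
    change in the situation changes the input, not the trained model.
    """
    confident_enough = situation["confidence_in_preference_model"] > 0.95
    if situation["decisive_advantage"] and not situation["time_pressure"]:
        # Plenty of slack: asking is cheap, so slow down and ask.
        return "ask_operators_what_they_want"
    if not confident_enough and not situation["time_pressure"]:
        return "ask_operators_what_they_want"
    return "act_on_current_plan"

print(policy({"decisive_advantage": True, "time_pressure": False,
              "confidence_in_preference_model": 0.99}))  # ask
print(policy({"decisive_advantage": False, "time_pressure": True,
              "confidence_in_preference_model": 0.99}))  # act
```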
Figuring out whether to act vs. ask questions feels like a fundamentally epistemic judgement: How confident am I in my knowledge that this is what my operator wants me to do? How important do I believe this aspect of my task to be, and how confident am I in my importance assessment? What is the likely cost of delaying in order to ask my operator a question? Etc. My intuition is that this problem is therefore best viewed within an epistemic framework (trying to have well-calibrated knowledge) rather than a behavioral one (trying to mimic instances of question-asking in the training data). Giving an agent examples of cases where it should ask questions feels like about as much of a solution to the problem of corrigibility as the use of soft labels (probability targets that are neither 0 nor 1) is a solution to the problem of calibration in a supervised learning context. It’s a good start, but I’d prefer a solution with a stronger justification behind it. However, if we did have a solution with a strong justification, FAI would start looking pretty easy to me.
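Concretely, the kind of epistemic decision rule I have in mind looks like a toy value-of-information calculation; all the numbers, and the simplifying assumption that asking fully resolves the uncertainty, are made up for illustration:

```python
def should_ask(p_correct, cost_of_mistake, cost_of_asking):
    """Decide whether to pause and ask the operator, framed epistemically:
    compare the expected loss from acting on a possibly-wrong belief with the
    cost of delaying to ask (assumed, for simplicity, to fully resolve the
    uncertainty).
    """
    expected_loss_if_act = (1 - p_correct) * cost_of_mistake
    return expected_loss_if_act > cost_of_asking

# Confident and the stakes are low: just act.
print(should_ask(p_correct=0.99, cost_of_mistake=1.0, cost_of_asking=0.1))    # False
# Uncertain and the stakes are high: ask, even though asking costs time.
print(should_ask(p_correct=0.7, cost_of_mistake=100.0, cost_of_asking=0.1))   # True
```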
My impression (shaped by this example of amplification) is that the agents in the amplification tree would be considering exactly these sorts of epistemic questions. (There is then the separate question of how faithfully this behaviour is reproduced/generalized during distillation.)
You train a policy that maps (situation) --> (what to do next). Part of the situation is whether you have a decisive advantage, and generally how much of a hurry you are in.
Sure. But you can’t train it on every possible situation—that would take an infinite amount of time. And some situations may be difficult to train for—for example, you aren’t actually going to be in a situation where you have a decisive strategic advantage during training. So then the question is whether your learning algorithms are capable of generalizing well from whatever training data you are able to provide for them.
There’s an analogy to organizations. Nokia used to be worth over $290 billion. Now it’s worth $33 billion. The company was dominant in hardware, and it failed to adapt when software became more important than hardware. In order to adapt successfully, I assume Nokia would have needed to retrain a lot of employees. Managers also would have needed retraining: Running a hardware company and running a software company are different. But managers and employees continued to operate based on old intuitions even after the situation changed, and the outcome was catastrophic.
If you do have learning algorithms that generalize well on complex problems, then AI alignment seems solved anyway: train a model of your values that generalizes well, and use that as your AI’s utility function.
(I’m still not sure I fully understand what you’re trying to do with your proposal, so I guess you could see my comments as an attempt to poke at it :)
I think this decomposes into two questions: 1) Does the amplification process, given humans/trained agents, solve the problem in a generalizable way (i.e. would HCH solve the problem correctly)? 2) Does this generalizability break during the distillation process? (I’m not quite sure which you’re pointing at here.)
For the amplification process, I think it would deal with things in an appropriately generalizable way. You are doing something a bit more like training the agents to form nodes in a decision tree that captures all of the important questions you would need to figure out what to do next, including components that examine the situation in detail. Paul has written up an example of what amplification might look like, which I think helped me understand the level of abstraction that things are working at. The claim then is that expanding the decision tree captures all of the relevant considerations (possibly at some abstract level, i.e. instead of capturing considerations directly it captures the thing that generates those considerations), and so will properly generalize to a new decision.
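Here is a rough sketch of the level of abstraction I take away from that example: an agent answers a question by posing sub-questions to subordinate copies of itself and combining the results, bottoming out when a question is simple enough to answer directly. The base case, decomposition rule, and depth cap below are invented placeholders:

```python
def amplified_answer(question, answer_directly, decompose, depth=0, max_depth=3):
    """HCH-style recursion: try to answer directly; otherwise split the
    question into sub-questions, delegate each to a subordinate copy of the
    same procedure, and hand the sub-answers back up for synthesis.

    answer_directly(question) -> answer, or None if the question is too hard
    decompose(question)       -> list of sub-questions
    Both callables and the depth cap are illustrative stand-ins.
    """
    direct = answer_directly(question)
    if direct is not None or depth >= max_depth:
        return direct
    sub_answers = [amplified_answer(sub, answer_directly, decompose, depth + 1, max_depth)
                   for sub in decompose(question)]
    # A real scheme would synthesize these into one answer; here we just collect them.
    return {"question": question, "sub_answers": sub_answers}

# Toy usage: questions under 20 characters are "simple"; longer ones get split in half.
answer = lambda q: f"answer({q!r})" if len(q) < 20 else None
split = lambda q: [q[: len(q) // 2], q[len(q) // 2:]]
print(amplified_answer("what should the system do next overall?", answer, split))
```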
I’m less sure at this point about how well distillation would work. My understanding is that this might require providing some kind of continual supervision (if the trained agent goes into a sufficiently new input domain, it requests more labels on this new input domain from its overseer), or it might be something Paul expects to fall out of informed oversight + corrigibility?
It seems to me that the amplification scheme could include redundant processing/error correction—i.e. ask subordinates to solve a problem in several different ways, then look at whether they disagree, and either take a majority vote or flag disagreements as indicating that something dangerous is going on. This could deal with this sort of problem.
That sounds like a good idea. But I still don’t feel like I fully understand what we are getting in return for knowledge isolation. Knowledge isolation is clearly useful sometimes, e.g. for ensemble learning or cross validation. But it feels to me like a framework that allowed us to pool & isolate knowledge in a more fluid way would work better.
Imagine an organization where Alice in Department A needs to tell something to Bob in Department B. Suppose the organization is very straitlaced, and Alice knows that the only way Bob will hear her message is if Alice tells it to her boss, who tells it to the CEO, who tells it to the head of Department B, who tells it to Bob. What is this game of telephone buying us, exactly?
Re: corrigibility, see this comment.

I think (based on reading Paul’s blog posts) that knowledge isolation provides these benefits:
The distributions of training and test examples for the distilled agent are as similar as possible (possibly identical, or possibly close enough that you can ask for new training data when you find something too far out of distribution; a toy sketch of such a check appears after this list). Suppose we allow for unlimited knowledge sharing. The training data gathered from humans will only include examples of humans processing some limited amount of information, and that information will have been produced in a fairly normal set of circumstances that occur during training. But as the IDA procedure continues, later agents will have to deal with much larger amounts of data generated during potentially weirder circumstances. So distributional shift will become more of a problem.
Security amplification. In Universality and Security Amplification, Paul describes limiting the amount of information accessible to each agent to the extent that we can perform red-teaming over the set of possible inputs an agent will have to process, and so have confidence that agents won’t be attacked by any input they receive. He acknowledges that this will limit capabilities (so the system wouldn’t be able to, say, translate between languages as well as a human translator). But he claims that 1) the system doesn’t necessarily need to perform all tasks itself, instead it can just learn how to safely use external humans or systems, and 2) even the information-limited set of queries the system can answer will still be able to include a “simple core of reasoning” sufficient for this task. (I’m still trying to wrap my head around whether I think this kind of system will be able to have sufficient capabilities.)
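To make the red-teaming point concrete: if each agent’s input is restricted to short messages drawn from a small fixed vocabulary, then the set of inputs any single agent can ever see is small enough to enumerate, and hence to adversarially test. The vocabulary and per-agent budget below are arbitrary toy choices, not anything from Paul’s post:

```python
from itertools import product

VOCAB = ["yes", "no", "unknown", "escalate"]   # toy per-message alphabet
MAX_TOKENS_PER_AGENT = 3                       # toy information budget per agent

def all_possible_agent_inputs():
    """Enumerate every message a single agent could receive under the budget.
    With |VOCAB| = 4 and at most 3 tokens this is 4 + 16 + 64 = 84 inputs,
    a space small enough to red-team exhaustively."""
    for length in range(1, MAX_TOKENS_PER_AGENT + 1):
        yield from product(VOCAB, repeat=length)

print(sum(1 for _ in all_possible_agent_inputs()))  # 84
```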
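And to make the distributional-shift point (the first benefit above) concrete, here is a placeholder sketch of an out-of-distribution check that falls back to requesting labels from the overseer. The distance measure, the nearest-neighbour comparison, and the threshold are all invented for illustration:

```python
import numpy as np

def act_or_request_labels(x, training_inputs, distilled_policy, threshold=3.0):
    """Act with the distilled policy only if the new input resembles the
    training distribution; otherwise ask the overseer for fresh labels.

    The test compares the distance from x to its nearest training input with
    the typical nearest-neighbour spacing inside the training set. Both the
    measure and the threshold are arbitrary placeholders.
    """
    dists_to_x = np.linalg.norm(training_inputs - x, axis=1)
    pairwise = np.linalg.norm(training_inputs[:, None] - training_inputs[None, :], axis=-1)
    np.fill_diagonal(pairwise, np.inf)
    typical_spacing = np.median(pairwise.min(axis=1))
    if dists_to_x.min() > threshold * typical_spacing:
        return "request_labels_from_overseer"
    return distilled_policy(x)

# Toy usage with 2-D inputs; the "policy" just echoes its input.
rng = np.random.default_rng(0)
train = rng.normal(size=(100, 2))
print(act_or_request_labels(train[0] + 0.01, train, lambda v: f"act({v})"))     # in distribution
print(act_or_request_labels(np.array([50.0, 50.0]), train, lambda v: f"act({v})"))  # requests labels
```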