You train a policy that maps (situation) --> (what to do next). Part of the situation is whether you have a decisive advantage, and generally how much of a hurry you are in.
Sure. But you can’t train it on every possible situation—that would take an infinite amount of time. And some situations may be difficult to train for—for example, you aren’t actually going to be in a situation where you have a decisive strategic advantage during training. So then the question is whether your learning algorithms are capable of generalizing well from whatever training data you are able to provide for them.
There’s an analogy to organizations. Nokia used to be worth over $290 billion. Now it’s worth $33 billion. The company was dominant in hardware, and it failed to adapt when software became more important than hardware. In order to adapt successfully, I assume Nokia would have needed to retrain a lot of employees. Managers also would have needed retraining: Running a hardware company and running a software company are different. But managers and employees continued to operate based on old intuitions even after the situation changed, and the outcome was catastrophic.
If you do have learning algorithms that generalize well on complex problems, then AI alignment seems solved anyway: train a model of your values that generalizes well, and use that as your AI’s utility function.
(I’m still not sure I fully understand what you’re trying to do with your proposal, so I guess you could see my comments as an attempt to poke at it :)
I think this decomposes into two questions: 1) Does the amplification process, given humans/trained agents as components, solve the problem in a generalizable way (i.e. would HCH solve the problem correctly)? 2) Does this generalizability break during the distillation process? (I’m not quite sure which of these you’re pointing at here.)
For the amplification process, I think it would deal with things in an appropriately generalizable way. You are doing something a bit more like training the agents to form nodes in a decision tree that captures all of the important questions you would need to answer to figure out what to do next, including components that examine the situation in detail. Paul has written up an example of what amplification might look like, which helped me understand the level of abstraction things are working at. The claim then is that expanding the decision tree captures all of the relevant considerations (possibly at some abstract level, i.e. instead of capturing the considerations directly it captures the thing that generates those considerations), and so will properly generalize to a new decision.
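To make the "nodes in a decision tree" picture concrete, here is a toy sketch of amplification as recursive question decomposition (my own illustrative framing, not Paul's actual scheme): an agent answers a question either directly or by splitting it into subquestions, consulting copies of itself on those, and combining the subanswers. The `ToyAgent` here can only add two numbers directly, so anything harder must be decomposed.

```python
class ToyAgent:
    """Toy 'human' node: can only sum a list of length <= 2 directly."""

    def can_answer_directly(self, question):
        return len(question) <= 2

    def answer(self, question):
        return sum(question)

    def decompose(self, question):
        # Split the question into two smaller subquestions.
        mid = len(question) // 2
        return [question[:mid], question[mid:]]

    def combine(self, question, subanswers):
        # Aggregate the subanswers back into an answer to the original question.
        return sum(subanswers)


def amplify(agent, question, depth):
    """Answer `question` by recursively consulting copies of `agent`."""
    if depth == 0 or agent.can_answer_directly(question):
        return agent.answer(question)
    subanswers = [amplify(agent, q, depth - 1) for q in agent.decompose(question)]
    return agent.combine(question, subanswers)


# The tree of decompositions, not the base agent, carries the competence:
# no single call ever sums more than two numbers itself.
print(amplify(ToyAgent(), [1, 2, 3, 4, 5], depth=5))  # → 15
```

The hoped-for analogue of "generalizing to a new decision" is that the decomposition strategy, rather than any memorized answer, is what gets reused on novel inputs.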
I’m less sure at this point about how well distillation would work. In my understanding, this might require providing some kind of continual supervision (if the trained agent goes into a sufficiently new input domain, it requests more labels on that domain from its overseer), or it might be something Paul expects to fall out of informed oversight + corrigibility?