It seems to me that balancing the risks of acting vs. taking time to ask questions, depending on the current situation, falls under Paul’s notion of corrigibility.
I definitely agree that balancing costs vs. VOI falls under the behavior-to-be-learned, and don’t see why it would require retraining. You train a policy that maps (situation) --> (what to do next). Part of the situation is whether you have a decisive advantage, and generally how much of a hurry you are in. If you had to retrain every time the situation changed, you’d never be able to do anything at all :)
(That said, to the extent that corrigibility is a plausible candidate for a worst-case property, it wouldn’t be guaranteeing any kind of competent balancing of costs and benefits.)
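To make the “part of the situation” point concrete, here is a minimal sketch of the kind of policy interface being described, where urgency and whether you hold a decisive advantage are just input features rather than reasons to retrain. All names, thresholds, and the toy decision rule are illustrative, not part of any actual proposal:

```python
from dataclasses import dataclass

@dataclass
class Situation:
    """Illustrative features of the current situation."""
    task_confidence: float    # how sure the agent is about what the operator wants
    urgency: float            # how costly delay is right now (0 = no hurry, 1 = extreme hurry)
    decisive_advantage: bool  # e.g. whether the agent is in an unusually high-stakes position

def policy(situation: Situation) -> str:
    """Maps (situation) --> (what to do next).

    Balancing acting vs. asking is part of the learned behavior: the same
    policy handles calm and urgent situations, so a change in circumstances
    changes the input, not the policy.
    """
    # Toy decision rule standing in for whatever the trained policy actually computes.
    ask_threshold = 0.9 if situation.decisive_advantage else 0.6
    if situation.task_confidence < ask_threshold and situation.urgency < 0.8:
        return "ask the operator a clarifying question"
    return "act on current best understanding"

print(policy(Situation(task_confidence=0.4, urgency=0.1, decisive_advantage=False)))
print(policy(Situation(task_confidence=0.7, urgency=0.95, decisive_advantage=True)))
```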
Figuring out whether to act vs. ask questions feels like a fundamentally epistemic judgement: How confident am I in my knowledge that this is what my operator wants me to do? How important do I believe this aspect of my task to be, and how confident am I in my importance assessment? What is the likely cost of delaying in order to ask my operator a question? Etc. My intuition is that this problem is therefore best viewed within an epistemic framework (trying to have well-calibrated knowledge) rather than a behavioral one (trying to mimic instances of question-asking in the training data). Giving an agent examples of cases where it should ask questions feels like about as much of a solution to the problem of corrigibility as the use of soft labels (probability targets that are neither 0 nor 1) is a solution to the problem of calibration in a supervised learning context. It’s a good start, but I’d prefer a solution with a stronger justification behind it. However, if we did have a solution with a strong justification, FAI would start looking pretty easy to me.
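To illustrate the epistemic framing, here is a toy value-of-information calculation for the act-vs-ask decision, with the agent’s calibrated confidence, the stakes, and the cost of delay as explicit quantities. The numbers and the simplifying assumption that asking fully resolves the uncertainty are purely illustrative:

```python
def gain_from_asking(p_correct: float,
                     value_if_right: float,
                     loss_if_wrong: float,
                     delay_cost: float) -> float:
    """Toy value-of-information comparison of 'act now' vs. 'ask the operator first'.

    p_correct:      calibrated probability that acting now does what the operator wants
    value_if_right: payoff if the guess is right
    loss_if_wrong:  loss if the guess is wrong
    delay_cost:     cost of pausing to ask (operator time, missed opportunity, ...)
    Returns the expected gain from asking first (positive => asking is worth it).
    """
    ev_act_now = p_correct * value_if_right - (1 - p_correct) * loss_if_wrong
    # Simplifying assumption: asking resolves the uncertainty, so the agent then acts correctly.
    ev_ask_first = value_if_right - delay_cost
    return ev_ask_first - ev_act_now

# Low confidence, cheap question: asking wins (positive gain).
print(gain_from_asking(p_correct=0.55, value_if_right=10, loss_if_wrong=50, delay_cost=1))
# High confidence, expensive delay: acting wins (negative gain).
print(gain_from_asking(p_correct=0.98, value_if_right=10, loss_if_wrong=50, delay_cost=5))
```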
My impression (shaped by this example of amplification) is that the agents in the amplification tree would be considering exactly these sorts of epistemic questions. (There is then the separate question of how faithfully this behaviour is reproduced/generalized during distillation.)
You train a policy that maps (situation) --> (what to do next). Part of the situation is whether you have a decisive advantage, and generally how much of a hurry you are in.
Sure. But you can’t train it on every possible situation—that would take an infinite amount of time. And some situations may be difficult to train for—for example, you aren’t actually going to be in a situation where you have a decisive strategic advantage during training. So then the question is whether your learning algorithms are capable of generalizing well from whatever training data you are able to provide for them.
There’s an analogy to organizations. Nokia used to be worth over $290 billion. Now it’s worth $33 billion. The company was dominant in hardware, and it failed to adapt when software became more important than hardware. In order to adapt successfully, I assume Nokia would have needed to retrain a lot of employees. Managers also would have needed retraining: Running a hardware company and running a software company are different. But managers and employees continued to operate based on old intuitions even after the situation changed, and the outcome was catastrophic.
If you do have learning algorithms that generalize well on complex problems, then AI alignment seems solved anyway: train a model of your values that generalizes well, and use that as your AI’s utility function.
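For concreteness, the “train a value model and use it as the utility function” idea amounts to something like the skeleton below; all of the difficulty being pointed at is hidden inside learned_value_model generalizing well (names are illustrative):

```python
def choose_action(actions, predict_outcome, learned_value_model):
    """Pick the action whose predicted outcome the learned value model rates highest.

    `learned_value_model` is the model of your values trained from human data;
    `predict_outcome` stands in for whatever world model the agent uses.
    """
    return max(actions, key=lambda a: learned_value_model(predict_outcome(a)))
```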
(I’m still not sure I fully understand what you’re trying to do with your proposal, so I guess you could see my comments as an attempt to poke at it :)
I think this decomposes into two questions: 1) Does the amplification process, given humans/trained agents, solve the problem in a generalizable way (i.e. would HCH solve the problem correctly)? 2) Does this generalizability break during the distillation process? (I’m not quite sure which of these you’re pointing at here.)
For the amplification process, I think it would deal with things in an appropriately generalizable way. You are doing something a bit more like training the agents to form nodes in a decision tree that captures all of the important questions you would need to answer to figure out what to do next, including components that examine the situation in detail. Paul has written up an example of what amplification might look like, which I think helped me understand the level of abstraction things are working at. The claim then is that expanding the decision tree captures all of the relevant considerations (possibly at some abstract level, i.e. instead of capturing considerations directly it captures the thing that generates those considerations), and so will properly generalize to a new decision.
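A rough sketch of the recursive question-decomposition this is describing: the top-level question “what should I do next?” gets expanded into sub-questions (including ones that examine the situation and whether to check in with the operator), and the answers are recombined. The code structure below is my own illustration of the idea, not Paul’s actual scheme:

```python
def amplify(question: str, ask_subagent, depth: int = 2) -> str:
    """Answer `question` by expanding it into a tree of sub-questions.

    `ask_subagent(q)` stands in for a single (human or distilled) assistant
    answering a simpler question directly. The hope is that the tree of
    sub-questions, rather than a memorized answer per situation, carries the
    generalizable reasoning.
    """
    if depth == 0:
        return ask_subagent(question)
    # Ask for a decomposition; a real system would use structured output rather than ';'-splitting.
    subs = ask_subagent(f"List sub-questions needed to answer: {question}")
    sub_answers = [amplify(s.strip(), ask_subagent, depth - 1)
                   for s in subs.split(";") if s.strip()]
    combined = " | ".join(sub_answers)
    return ask_subagent(f"Using these sub-answers [{combined}], answer: {question}")
```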
I’m less sure at this point about how well distillation would work. In my understanding, this might require providing some kind of continual supervision (if the trained agent goes into a sufficiently new input domain, it requests more labels on this new input domain from its overseer), or it might be something Paul expects to fall out of informed oversight + corrigibility?
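One concrete (and entirely hypothetical) way to picture that kind of continual supervision: the distilled agent tracks how far each new input is from the inputs it has already been supervised on, and queries the overseer (the amplified system) instead of extrapolating when the input is sufficiently novel:

```python
import numpy as np

class DistilledAgent:
    """Toy distilled policy that requests overseer labels on sufficiently novel inputs."""

    def __init__(self, overseer, novelty_threshold: float = 3.0):
        self.overseer = overseer            # the (expensive) amplified system, queried sparingly
        self.novelty_threshold = novelty_threshold
        self.seen_inputs = []               # inputs the agent has already been supervised on
        self.labels = {}

    def novelty(self, x: np.ndarray) -> float:
        """Distance to the nearest previously supervised input (a crude novelty score)."""
        if not self.seen_inputs:
            return float("inf")
        return min(float(np.linalg.norm(x - s)) for s in self.seen_inputs)

    def act(self, x: np.ndarray):
        if self.novelty(x) > self.novelty_threshold:
            # New input domain: request a label from the overseer instead of extrapolating.
            label = self.overseer(x)
            self.seen_inputs.append(x)
            self.labels[tuple(x)] = label
            return label
        # Otherwise imitate the nearest supervised behaviour (standing in for the trained policy).
        nearest = min(self.seen_inputs, key=lambda s: float(np.linalg.norm(x - s)))
        return self.labels[tuple(nearest)]

# e.g. agent = DistilledAgent(overseer=lambda x: "ask operator"); agent.act(np.array([0.0, 1.0]))
```

(The nearest-neighbour lookup is just a stand-in for the distilled network; the only point of the sketch is the trigger for requesting more labels.)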