I disagree that adversarial environments are “disanalogous”. If they are indeed disanalogous, we should be able to formally define a natural class of environments that excludes them but is still sufficiently rich, e.g. it should allow other agents of similar complexity. If there is such a class, I would be glad to see the definition. I expect that in most complex polynomial-time environments IRL is able to extract some information about the utility function, but there is a significant fraction of it that it is unable to extract.
I strongly disagree with “computational complexity issues only block people in practice when the setup needs to be adversarial.” My experience is that computational complexity issues block people all the time. If someone invented a black box that solves arbitrary instances of SAT in practical time, it would be a revolution in the technological capacities of civilization. Electronic and mechanical design would be almost fully automated (and they are currently very far from automated; the practical utility of SAT solvers mentioned in Wikipedia is very restricted). Algorithm engineering would become more or less redundant. Mathematical theorem proving would be fully automated. Protein folding would be solved. Training complex neural networks would become trivial. All of the above would work far beyond anything humanity can currently do.
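To illustrate why such a black box would automate so much: a SAT *decision* oracle already gives you full search via the classic self-reduction, fixing one variable at a time with unit clauses. Here is a minimal sketch in Python; `brute_force_sat` is just a stand-in for the hypothetical black box, and the whole point is that a practical black box would make this loop scale to huge instances.

```python
from itertools import product

def brute_force_sat(clauses, n_vars):
    """Stand-in for the hypothetical SAT black box.
    clauses: list of clauses, each a list of nonzero DIMACS-style literals."""
    for bits in product([False, True], repeat=n_vars):
        if all(any(bits[abs(l) - 1] == (l > 0) for l in clause) for clause in clauses):
            return True
    return False

def extract_assignment(clauses, n_vars, sat=brute_force_sat):
    """Classic self-reduction: a SAT decision oracle yields an actual
    satisfying assignment by fixing one variable at a time."""
    if not sat(clauses, n_vars):
        return None
    fixed, assignment = [], []
    for v in range(1, n_vars + 1):
        if sat(clauses + fixed + [[v]], n_vars):  # can variable v be True?
            fixed.append([v]); assignment.append(True)
        else:
            fixed.append([-v]); assignment.append(False)
    return assignment

# (x1 or x2) and (not x1 or x3) and (not x2 or not x3)
print(extract_assignment([[1, 2], [-1, 3], [-2, -3]], 3))  # e.g. [True, False, True]
```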
Also, if you were right, people would be much less excited about quantum computers (that is, the excitement might still be unjustified, either because quantum computers are infeasible or because they are feasible but don’t solve many interesting problems, but at least there’s a chance they will solve many interesting problems).
Obviously there are NP-hard optimization problems with approximate solutions that are good enough for practical applications, but this says little about the existence of good enough IRL algorithms. Indeed, if we accept the fragility-of-value thesis, even small inaccuracies in value learning may have catastrophic consequences. Moreover, in many cases there are likely to be strong theoretical limitations on approximating NP-hard problems (the PCP theorem, the Unique Games Conjecture).
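To make “strong theoretical limitations” concrete, one standard bound of this kind is Håstad’s inapproximability theorem for MAX-3SAT, which builds on the PCP theorem (my illustration, not part of the original argument):

```latex
% Unless P = NP, no polynomial-time algorithm beats the trivial 7/8
% approximation ratio for MAX-3SAT (clauses with exactly three literals):
\forall \varepsilon > 0:\quad
\text{approximating MAX-3SAT within a factor of } \tfrac{7}{8} + \varepsilon
\text{ is NP-hard.}
```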
Finally, Impagliazzo and Levin showed that if cryptography is possible at all, then there are almost uniform distributions over instances of NP-complete problems that make them hard in the average case. Thus, hard instances are not very special, and your problem has to be special in order to avoid them.
“we should be able to formally define a natural class of environment that excludes them but is still sufficiently rich”
I disagree. There are a lot of “no free lunch” theorems out there that in principle show various things (e.g. that no agent can be intelligent across all environments), but that in practice require very specific environments for the bad stuff to hurt the agent (i.e. the environment has to behave, in practice, as if it were a smarter adversarial agent that specifically hated the first agent).
Allow me to recapitulate my point. We want to find algorithms that satisfy provable performance guarantees. A “no free lunch” theorem shows that a specific performance guarantee is impossible. Therefore we should look for other performance guarantees, either by changing the setting, or by changing the class of environments, or by doing something else.
My counterexample shows that certain natural performance guarantees are unsatisfiable. Therefore it is important to look for other natural settings / desiderata. In particular, I suggest one solution which seems natural and feasible, namely allowing the Student to randomize control between itself and the Teacher (a toy sketch follows below). This variant of IRL thereby seems qualitatively more powerful than the original.
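To make the proposal concrete, here is a toy sketch of the kind of protocol I have in mind. All the names here (ToyBanditEnv, Teacher, Student, delegation_episode) are made up for illustration, and the environment is a trivial bandit rather than the actual setting under discussion; the only point is the control-randomization loop.

```python
import random

class ToyBanditEnv:
    """Trivial stand-in environment: each action has a hidden utility."""
    def __init__(self, utilities):
        self.utilities = utilities
    def reward(self, action):
        return self.utilities[action]

class Teacher:
    """Knows the true utility function and acts greedily on it."""
    def __init__(self, utilities):
        self.utilities = utilities
    def act(self):
        return max(self.utilities, key=self.utilities.get)

class Student:
    """Estimates the utility function only from rounds in which the Teacher acts."""
    def __init__(self, actions):
        self.counts = {a: 0 for a in actions}
    def observe_demonstration(self, action):
        self.counts[action] += 1
    def act(self, rng):
        best = max(self.counts.values())
        return rng.choice([a for a, c in self.counts.items() if c == best])

def delegation_episode(horizon=100, p_teacher=0.5, seed=0):
    rng = random.Random(seed)
    env = ToyBanditEnv({"a": 1.0, "b": 0.0})
    teacher = Teacher(env.utilities)
    student = Student(env.utilities.keys())
    total = 0.0
    for _ in range(horizon):
        if rng.random() < p_teacher:
            # Teacher in control: its choice reveals information about the utility function.
            action = teacher.act()
            student.observe_demonstration(action)
        else:
            # Student in control: it acts on its current estimate.
            action = student.act(rng)
        total += env.reward(action)
    return total

print(delegation_episode())
```

The intended gain is that the Student keeps receiving fresh demonstrations on the situations it actually reaches, rather than only passively observing the Teacher’s trajectory.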
Also, obviously the first counterexample constructed is always the most “adversarial” one, since it’s the easiest to prove. This doesn’t mean that there is an algorithm that works well in most other cases. Given that we are in the AI safety business, the burden of proof is on the claim that the “bad” environment is exceptional, not vice versa. Moreover, my intuition is that this counterexample is not exceptionally bad, just maximally bad, i.e. in a typical scenario IRL will only be able to extract some portion (neither 0 nor 1) of the important information about the utility function. If you can prove me wrong, I will be glad to see it!