That is a problem in principle, but I’d guess that the perception of that problem mostly comes from a couple of other phenomena.
First: I think a lot of people don’t realize on a gut level that a solution which isn’t robust is guaranteed to fail in practice. There are always unknown unknowns in a new domain; the presence of unknown unknowns may be the single highest-confidence claim we can make about AGI at this point. A strategy which fails the moment any surprise comes along is going to fail; robustness is necessary. Now, robustness is not the same as “guaranteed to work”, but the two are easy to confuse. A lot of arguments of the form “ah but your strategy fails in case X” look like they’re saying “the strategy is not guaranteed to work”, but the actually-important content is “the strategy is not robust to <broad class of failures>”; the key is to think about how broadly the example-failure generalizes. (I think a common mistake newcomers make is to argue “but that particular failure isn’t very likely”, without thinking about how the failure mode generalizes or what other lack-of-robustness it implies.)
Second: I think it’s very common for people to say “we just don’t know whether X will work” when in fact we have enormous amounts of real-world evidence about close analogues of X. (This thread on the Sandwich Problem post is a central example.) People imagine that we need to run an Official Experiment in order to Know Things, and that’s just not how the world actually works. In general, we have an enormous amount of relevant prior information from the world. But often all that prior information is not as legible as an Official Experiment, so it’s harder to explain the argument. I think people confuse the lack of legibility with a lack of certainty.
The issue is that it’s very difficult to reason correctly in the absence of an “Official Experiment”[1]. I think the alignment community is too quick to dismiss potentially useful ideas, and that the reasons for those dismissals are often wrong. E.g., I still don’t think anyone’s given a clear, mechanistic reason for why rewarding an RL agent for making you smile is bound to fail (as opposed to being a terrible idea that probably fails).
[1] More precisely, it’s very difficult to reason correctly even with many “Official Experiments”, and nearly impossible to do so without any such experiments.
It’s a preparadigmatic field. Nobody is going to prove beyond a shadow of a doubt that X fails, for exactly the same reasons that nobody is going to prove beyond a shadow of a doubt that X works. And that just doesn’t matter very much, for decision-making purposes. If something looks unlikely to work, then the EV-maximizing move is to dismiss it and move on. Maybe one or two people work on the thing-which-is-unlikely-to-work in order to decorrelate their bets with everyone else, but mostly people should ignore things which are unlikely to work, especially if there are already one or two people looking more closely at them.