Anyway, sounds like value-in-the-tail is a central crux here.
Seems somewhat right to me, subject to caveat below.
It’s not a necessary condition—if the remaining 5% of problems are still existentially deadly and likely to come up eventually (but not often enough to be caught in testing), then risk isn’t really decreased.
An important part of my intuition about value-in-the-tail is that if your first solution can knock off 95% of the risk, you can then use the resulting AI system to design a new AI system where you’ve translated better and now you’ve eliminated 99% of the risk, and iterating this process you get to effectively no ongoing risk. There is of course risk during the iteration, but that risk can be reasonably small.
A similar argument applies to economic competitiveness: yes, your first agent is pretty slow relative to what it could be, but you can make it faster and faster over time, so you only lose a lot of value during the initial phases.
(For the economic value part, this is mostly based on industry experience trying to automate things.)
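To put toy numbers on the iteration intuition above, here is a minimal sketch; all figures are made-up assumptions for illustration (a 95% first-pass removal, an 80% removal of whatever remains in each later round, and a small per-round transition risk):

```python
# Toy model of the "iterate until risk is negligible" intuition.
# Assumed numbers (purely illustrative): round 0 removes 95% of the risk,
# each later round removes 80% of whatever remains, and each round of
# redesign itself carries a 0.2% chance of something going wrong.

residual = 1.0            # fraction of the original risk still present
transition_risk = 0.002   # risk incurred per round of iteration
accumulated_transition = 0.0

removal_per_round = [0.95] + [0.80] * 9

for i, removed in enumerate(removal_per_round):
    residual *= 1 - removed
    accumulated_transition += transition_risk
    print(f"round {i}: residual risk fraction = {residual:.2e}, "
          f"cumulative transition risk = {accumulated_transition:.3f}")

# Under these assumptions the residual risk decays geometrically toward zero,
# while the total risk paid during iteration stays capped at about 2%.
```

The load-bearing assumption here is that each round keeps removing a fixed fraction of the remaining risk rather than a shrinking one.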
I have the same intuition, and strongly agree that usually most of the value is in the long tail. The hope is mostly that you can actually keep making progress on the tail as time goes on, especially with the help of your newly built AI systems.
An important part of my intuition about value-in-the-tail is that if your first solution can knock off 95% of the risk, you can then use the resulting AI system to design a new AI system where you’ve translated better and now you’ve eliminated 99% of the risk...
I don’t see how this ever actually gets around the chicken-and-egg problem.
An analogy: we want to translate from English to Korean. We first obtain a translation dictionary which is 95% accurate, then use it to ask our Korean-speaking friend to help out. Problem is, there’s a very important difference between very similar translations of “help me translate things”—e.g. consider the difference between “what would you say if you wanted to convey X?” and “what should I say if I want to convey X?”, when giving instructions to an AI. Both of those would produce very similar results, right up until everything went wrong. (Let me know if this analogy sounds representative of the strategies you imagine.)
If you do manage to get that first translation exactly right, and successfully ask your friend for help, then you’re good—similar to the “translate how-to-translate” strategy from the OP. And with a 95% accurate dictionary, you might even have a decent chance of getting that first translation right. But if that first translation isn’t perfect, then you need some way to find that out safely—and the 95% accurate dictionary doesn’t make that any easier.
Another way to look at it: the chicken-and-egg problem is a ground truth problem. If we have enough data to estimate X to within 5%, then doing clever things with that data is not going to reduce that error any further. We need some other way to get at the ground truth in order to actually reduce the error rate. If we know how to convey what-we-want with 95% accuracy, then we need some other way to get at the ground truth of translation in order to increase that accuracy further.
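A loose statistical analogue of this point, as a sketch with invented numbers rather than a claim about any particular system: given a fixed noisy sample, clever reprocessing does not shrink the estimation error; only more contact with the ground truth does.

```python
# Sketch: reprocessing a fixed noisy sample vs. actually collecting more data.
import random
import statistics

random.seed(0)
true_value = 10.0
noise_sd = 1.0

def sample(n):
    return [random.gauss(true_value, noise_sd) for _ in range(n)]

data = sample(100)

plain_mean = statistics.mean(data)

# A "clever" reuse of the same data: average many bootstrap-resampled means.
boot_means = [statistics.mean(random.choices(data, k=len(data)))
              for _ in range(2000)]
clever_estimate = statistics.mean(boot_means)

# Getting more ground truth instead: a ten-times larger sample.
bigger_mean = statistics.mean(sample(1000))

for name, est in [("plain mean (n=100)", plain_mean),
                  ("bootstrap-averaged (same n=100)", clever_estimate),
                  ("plain mean (n=1000)", bigger_mean)]:
    print(f"{name:32s} error = {abs(est - true_value):.3f}")

# The bootstrap average lands essentially on top of the plain mean: the
# reprocessing cannot recover information the original sample doesn't contain.
```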
Let me know if this analogy sounds representative of the strategies you imagine.
Yeah, it does. I definitely agree that this doesn’t get around the chicken-and-egg problem, and so shouldn’t be expected to succeed on the first try. It’s more like you get to keep trying this strategy over and over again until you eventually succeed, because if everything goes wrong you just unplug the AI system and start over.
the chicken-and-egg problem is a ground truth problem. If we have enough data to estimate X to within 5%, then doing clever things with that data is not going to reduce that error any further.
I think you get “ground truth data” by trying stuff and seeing whether or not the AI system did what you wanted it to do.
(This does suggest that you wouldn’t ever be able to ask your AI system to do something completely novel without having a human along to ensure it’s what we actually meant, which seems wrong to me, but I can’t articulate why.)
I think you get “ground truth data” by trying stuff and seeing whether or not the AI system did what you wanted it to do.
That’s the sort of strategy where illusion of transparency is a big problem, from a translation point of view. The difficult cases are exactly the ones where the translation usually produces the results you expect, but then produces something completely different in some rare cases.
Another way to put it: if we’re gathering data by seeing whether the system did what we wanted, then the long tail problem works against us pretty badly. Those rare tail-cases are exactly the cases we would need to observe in order to notice problems and improve the system. We’re not going to have very many of them to work with. Ability to generalize from small data sets becomes a key capability, but then we need to translate how-to-generalize in order for the AI to generalize in the ways we want (this gets at the can’t-ask-the-AI-to-do-anything-novel problem).
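To put a rough number on how little tail data a test run actually yields (the failure rate and episode count below are invented for illustration):

```python
# Sketch: how many examples of a rare tail failure does testing give you?
p_failure = 1e-4      # assumed chance that a given episode hits the tail case
n_episodes = 10_000   # assumed size of the test run

expected_observations = n_episodes * p_failure
p_zero_observations = (1 - p_failure) ** n_episodes

print(f"expected tail-case observations: {expected_observations:.1f}")
print(f"probability of observing none at all: {p_zero_observations:.2f}")
# With these numbers you expect about one example, and over a third of the
# time you see none at all -- not much to generalize from.
```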
(The other comment is my main response, but there’s a possibly-tangential issue here.)
In a long-tail world, if we manage to eliminate 95% of problems, then we generate maybe 10% of the value. So now we use our 10%-of-value product to refine our solution. But it seems rather optimistic to hope that a product which achieves only 10% of the value gets us all the way to a 99% solution. It seems far more likely that it gets to, say, a 96% solution. That, in turn, generates maybe 15% of the value, which in turn gets us to a 96.5% solution, and...
Point being: in the long-tail world, it’s at least plausible (and I would say more likely than not) that this iterative strategy doesn’t ever converge to a high-value solution. We get fancier and fancier refinements with decreasing marginal returns, which never come close to handling the long tail.
Now, under this argument, it’s still a fine idea to try the iterative strategy. But you wouldn’t want to bet too heavily on its success, especially without a reliable way to check whether it’s working.
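One way to make this worry concrete is a toy model; every number here is an invented assumption, not a claim about real systems. Suppose problem difficulty is heavy-tailed, value is concentrated in the hardest problems, and each round's capability gain scales with the value already captured.

```python
# Toy model of the "decreasing marginal returns" worry. The Pareto shape,
# the feedback strength, and the starting point are all invented, chosen so
# that solving ~95% of problems captures only a small slice of the value.

a = 1.05   # Pareto shape for problem difficulty; a problem's value ~ its difficulty
k = 0.2    # how strongly current value feeds back into next round's capability

def problem_coverage(c):
    """Fraction of problems solvable at capability c (c >= 1)."""
    return 1 - c ** (-a)

def value_coverage(c):
    """Fraction of total value captured at capability c."""
    return 1 - c ** (-(a - 1))

c = 0.05 ** (-1 / a)   # start where 95% of problems are already solved

for n in range(51):
    if n % 10 == 0:
        print(f"round {n:2d}: problems solved = {problem_coverage(c):.4f}, "
              f"value captured = {value_coverage(c):.3f}")
    c *= 1 + k * value_coverage(c)

# Under these assumptions the fraction of problems solved climbs to roughly
# 99%, but the captured value only crawls from roughly 13% to roughly 20%
# over fifty rounds: fancier and fancier refinements, while the tail stays
# out of reach.
```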
Yeah, this could be a way that things are. My intuition is that it wouldn’t be this way, but I don’t have any good arguments for it.