johnswentworth comments on Alignment By Default

johnswentworth 15 Aug 2020 2:41 UTC
LW: 3 AF: 1
> Regarding your first pass check for naturalness being whether humans can understand it: strike me thoroughly puzzled. Isn’t one of the core points of the reductionism sequence that, while “Thor caused the thunder” sounds simpler to a human than Maxwell’s equations (because the words fit naturally into a human psychology), one of them is much “simpler” in an absolute sense than the other (and is in fact true)?
Despite humans giving really dumb verbal explanations (like “Thor caused the thunder”), we tend to be pretty decent at actually predicting things in practice.
The same applies to natural abstractions. If I ask people “is ‘tree’ a natural category?” then they’ll get into some long philosophical debate. But if I show someone five pictures of trees, then show them five other pictures which are not all trees, and ask them which of the second set are similar to the first set, they’ll usually have no trouble at all picking out the trees in the second set.
> I thought the mesa optimisers would definitely arise during the training
If you’re optimizing all the parameters simultaneously at runtime, then there is no training. Whatever parameters were learned during “training” would just be overwritten by the optimal values computed at runtime.
- Ben Pace 15 Aug 2020 7:05 UTC
  LW: 2 AF: 1
  > Despite humans giving really dumb verbal explanations (like “Thor caused the thunder”), we tend to be pretty decent at actually predicting things in practice.
  Mm, quantum mechanics much? I do not think I can reliably tell you which experiments are in the category “real” and which are in the category “made up”, even though it’s a very simple category mathematically. But I don’t expect you’re saying this; I’m just still confused about what you are saying.
  This reminds me of Oli’s question here, which ties into Abram’s “point of view from somewhere” idea. I feel like I expect ML systems to take the point of view of the universe, and not learn our natural categories.
  - johnswentworth 15 Aug 2020 16:35 UTC
    LW: 7 AF: 3
    I’m talking about everyday situations. Like “if I push on this door, it will open” or “by next week my laundry hamper will be full” or “it’s probably going to be colder in January than June”. Even with quantum mechanics, people do figure out the pattern and build some intuition, but they need to see a lot of data on it first, and most people never study it enough to see that much data.
    In places where the humans in question don’t have much first-hand experiential data, or where the data is mostly noise, that’s where human prediction tends to fail. (And those are also the cases where we expect learning systems in general to fail most often, and where we expect the system’s priors to matter most.) Another way to put it: humans’ priors aren’t great, but in most day-to-day prediction problems we have more than enough data to make up for that.
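
The closing point of the thread, that enough data makes up for a mediocre prior, can be illustrated with a toy calculation. The sketch below is not from the discussion itself: it assumes a Bernoulli event with a Beta prior (a standard conjugate setup), and the two priors, the 0.7 rate, and the function names are all hypothetical choices made for illustration. It compares the posterior mean reached from a pessimistic prior with the one reached from an optimistic prior as the number of observations grows.

```python
from fractions import Fraction


def posterior_mean(prior_a, prior_b, successes, trials):
    """Posterior mean of a Bernoulli rate under a Beta(prior_a, prior_b) prior."""
    return Fraction(prior_a + successes, prior_a + prior_b + trials)


def prior_gap(trials, rate=Fraction(7, 10)):
    """Disagreement between two very different priors after seeing the same data.

    Assumption: the data are idealized so that exactly `rate` of the trials
    succeed (rounded down), which keeps the example deterministic.
    """
    successes = int(rate * trials)
    pessimist = posterior_mean(1, 9, successes, trials)  # prior mean 0.1
    optimist = posterior_mean(9, 1, successes, trials)   # prior mean 0.9
    return optimist - pessimist


if __name__ == "__main__":
    for n in (0, 10, 100, 1000):
        print(n, float(prior_gap(n)))
```

With these particular priors the gap works out to 8 / (10 + n), so the two observers start far apart and nearly agree after a thousand observations. With only a handful of observations the prior still dominates, which matches the cases described above where first-hand data is scarce.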