Regarding your first-pass check for naturalness being whether humans can understand it: colour me thoroughly puzzled. Isn’t one of the core points of the reductionism sequence that, while “Thor caused the thunder” sounds simpler to a human than Maxwell’s equations (because the words fit naturally into a human psychology), one of them is much “simpler” in an absolute sense than the other (and is in fact true)?
Regarding your point about human values living in humans while an organism’s fitness lives partly in the environment: nothing immediately comes to mind to say here, but I agree it’s a very interesting question.
The things you say about inner/outer alignment hold together quite sensibly. I am surprised to hear you say that mesa-optimisers can be avoided by just optimizing all the parameters simultaneously at run-time. That doesn’t match my understanding of mesa-optimisation; I thought the mesa-optimisers would definitely arise during training, but if you’re right that it’s trivial-but-expensive to avoid them that way, then I agree it’s intuitively a much easier problem than I had realised.
Regarding your first-pass check for naturalness being whether humans can understand it: colour me thoroughly puzzled. Isn’t one of the core points of the reductionism sequence that, while “Thor caused the thunder” sounds simpler to a human than Maxwell’s equations (because the words fit naturally into a human psychology), one of them is much “simpler” in an absolute sense than the other (and is in fact true)?
Despite humans giving really dumb verbal explanations (like “Thor caused the thunder”), we tend to be pretty decent at actually predicting things in practice.
The same applies to natural abstractions. If I ask people “is ‘tree’ a natural category?” then they’ll get into some long philosophical debate. But if I show someone five pictures of trees, then show them five other pictures which are not all trees, and ask which of the second set are similar to the first set, they’ll usually have no trouble at all picking out the trees in the second set.
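To make the shape of that exercise concrete, here’s a minimal sketch of “pick out the ones similar to the first set” as prototype similarity over feature vectors. Everything in it is made up for illustration: the embed stand-in, the toy vectors, and the numbers; in practice the features would come from some pretrained vision model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for an image-embedding function; in practice this
# would be features from some pretrained vision model.
def embed(images):
    return images  # here the "images" are already toy feature vectors

# Five reference "tree" examples, then five queries of which only three are trees.
trees = rng.normal(loc=1.0, size=(5, 16))
queries = np.vstack([
    rng.normal(loc=1.0, size=(3, 16)),   # tree-like
    rng.normal(loc=-1.0, size=(2, 16)),  # not tree-like
])

# "Which of these are similar to the first set?" as cosine similarity to a prototype.
prototype = embed(trees).mean(axis=0)
q = embed(queries)
scores = q @ prototype / (np.linalg.norm(q, axis=1) * np.linalg.norm(prototype))
print("similarity to the tree prototype:", np.round(scores, 2))
```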
I thought the mesa-optimisers would definitely arise during training
If you’re optimizing all the parameters simultaneously at runtime, then there is no training. Whatever parameters were learned during “training” would just be overwritten by the optimal values computed at runtime.
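A minimal sketch of what I take this to mean, with a toy quadratic loss standing in for the base objective (the specific loss, optimizer, and numbers are purely illustrative): whatever parameters came out of a “training” phase are simply ignored, because the runtime procedure re-derives all of them from the base objective on the spot.

```python
import numpy as np

rng = np.random.default_rng(0)

def base_objective(params, x, y):
    # Toy stand-in for the base/outer objective: squared error of a linear predictor.
    return np.mean((x @ params - y) ** 2)

def optimize_at_runtime(x, y, steps=500, lr=0.1):
    # "Optimize all the parameters simultaneously at runtime": start from scratch
    # and minimize the base objective directly, so nothing learned in a prior
    # training phase persists into the deployed parameters.
    params = np.zeros(x.shape[1])
    for _ in range(steps):
        grad = 2 * x.T @ (x @ params - y) / len(y)  # gradient of the toy objective
        params -= lr * grad
    return params

# Pretend these came out of a training run; they get overwritten, not reused.
params_trained = rng.normal(size=3)

x = rng.normal(size=(100, 3))
y = x @ np.array([1.0, -2.0, 0.5]) + 0.01 * rng.normal(size=100)
params_runtime = optimize_at_runtime(x, y)

print("loss with 'trained' params :", base_objective(params_trained, x, y))
print("loss with runtime params   :", base_objective(params_runtime, x, y))
```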
Despite humans giving really dumb verbal explanations (like “Thor caused the thunder”), we tend to be pretty decent at actually predicting things in practice.
Mm, quantum mechanics much? I do not think I can reliably tell you which experiments are in the category “real” and which are in the category “made up”, even though it’s a very simple category mathematically. But I don’t expect you’re saying this; I’m just still confused about what you are saying.
This reminds me of Oli’s question here, which ties into Abram’s “point of view from somewhere” idea. I feel like I expect ML systems to take the point of view of the universe, and not learn our natural categories.
I’m talking about everyday situations. Like “if I push on this door, it will open” or “by next week my laundry hamper will be full” or “it’s probably going to be colder in January than in June”. Even with quantum mechanics, people do figure out the patterns and build some intuition, but they need to see a lot of data first, and most people never study it enough to see that much data.
In places where the humans in question don’t have much first-hand experiential data, or where the data is mostly noise, that’s where human prediction tends to fail. (And those are also the cases where we expect learning systems in general to fail most often, and where we expect the system’s priors to matter most.) Another way to put it: humans’ priors aren’t great, but in most day-to-day prediction problems we have more than enough data to make up for that.
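A toy illustration of that last point (the setup, the Beta priors, and the door-opening rate are all made up for the example): with a conjugate beta-binomial update, two quite different priors land on nearly the same posterior once there’s enough mundane data.

```python
import numpy as np

# Two quite different priors over "the door opens when I push it", as Beta(a, b).
priors = {"optimist": (8.0, 2.0), "pessimist": (2.0, 8.0)}

# Lots of everyday observations: the door opened 950 times out of 1000 pushes.
observations = np.array([1] * 950 + [0] * 50)
successes = observations.sum()
n = len(observations)

for name, (a, b) in priors.items():
    # Conjugate beta-binomial update: posterior is Beta(a + successes, b + failures).
    prior_mean = a / (a + b)
    post_mean = (a + successes) / (a + b + n)
    print(f"{name}: prior mean {prior_mean:.2f} -> posterior mean {post_mean:.3f}")
```

The two posterior means come out within a percentage point of each other despite the priors starting far apart, which is the sense in which day-to-day data makes up for mediocre priors.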