On 2, I’m surprised if you think that natural selection isn’t a natural abstraction but that eudaemonia is. (If we’re getting an AGI singleton that wants to fully learn our values.)
Secondly, I’ll say that if we do not understand its representation of X or X-prime, and if a small difference will be catastrophic, then that will also lead to doom.
On 1: I think that’s quite plausible? Like, I assign something in the range of 20-60% probability to that. How much does it have to change for you to feel much safer about inner alignment?
(I’m also not that clear it only applies to this situation. Perhaps I’m mistaken, but in my head subsystem alignment and robust delegation both have this property of “build a second optimiser that helps achieve your goals” and in both cases passing on the true utility function seems very hard.)
On 2, I’m surprised if you think that natural selection isn’t a natural abstraction but that eudaemonia is.
Currently, my first-pass check for “is this probably a natural abstraction?” is “can humans usually figure out what I’m talking about from a few examples, without a formal definition?”. For human values, the answer seems like an obvious “yes”. For evolutionary fitness… nonobvious. Humans usually get it wrong without the formal definition.
Also, natural abstractions in general involve summarizing the information from one chunk of the universe which is relevant “far away”. For human values, the relevant chunk of the universe is the human—i.e. the information about human values is all embedded in the physical human. But for evolutionary fitness, that’s not the case—an organism does not contain all the information relevant to calculating its evolutionary fitness. So it seems like there’s some qualitative difference there—like, human values “live” in humans, but fitness doesn’t “live” in organisms in the same way. I still don’t feel like I fully understand this, though.
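To make the “relevant far away” picture concrete, here is a toy numerical sketch (my own made-up diffusion example, with arbitrary grid size and chunk configurations; not the formalism itself): two different arrangements of heat inside a small chunk, with the same total, become essentially indistinguishable to far-away cells.

```python
# Toy illustration: in 1-D heat diffusion with insulated ends, cells far from a
# small chunk end up depending on that chunk only through a low-dimensional
# summary (its total heat), not on how the heat was arranged inside it.
import numpy as np

def diffuse(state, steps, alpha=0.2):
    """Explicit finite-difference diffusion with insulated (no-flux) ends."""
    s = state.astype(float)
    for _ in range(steps):
        left = np.roll(s, 1);   left[0] = s[0]
        right = np.roll(s, -1); right[-1] = s[-1]
        s = s + alpha * (left - 2 * s + right)
    return s

n = 100
a = np.zeros(n); a[:10] = [5, 0, 5, 0, 5, 0, 5, 0, 5, 0]   # chunk arrangement #1
b = np.zeros(n); b[:10] = [0, 5, 0, 5, 0, 5, 0, 5, 0, 5]   # arrangement #2, same total heat

fa, fb = diffuse(a, 50_000), diffuse(b, 50_000)
far = slice(60, 100)   # cells far from the chunk
print("typical far-away value:  ", fa[far].mean())                    # ~0.25
print("max far-away difference: ", np.abs(fa[far] - fb[far]).max())   # orders of magnitude smaller
```

The chunk’s internal arrangement washes out almost immediately; only its total survives to influence distant cells, which is the sense in which that summary is the information “relevant far away”.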
On 1: I think that’s quite plausible? Like, I assign something in the range of 20-60% probability to that.
Sure, inner alignment is a problem which mainly applies to architectures similar to modern ML, and modern ML architectures seem like the most likely route to AGI.
It still feels like outer alignment is a much harder problem, though. The very fact that inner alignment failure is so specific to certain architectures is evidence that it should be tractable. For instance, we can avoid most inner alignment problems by just optimizing all the parameters simultaneously at run-time. That solution would be too expensive in practice, but the point is that inner alignment is hard in a “we need to find more efficient algorithms” sort of way, not a “we’re missing core concepts and don’t even know how to solve this in principle” sort of way. (At least for mesa-optimization; I agree that there are more general subsystem alignment/robust delegation issues which are potentially conceptually harder.)
Outer alignment, on the other hand, we don’t even know how to solve in principle, on any architecture whatsoever, even with arbitrary amounts of compute and data. That’s why I expect it to be a bottleneck.
Currently, my first-pass check for “is this probably a natural abstraction?” is “can humans usually figure out what I’m talking about from a few examples, without a formal definition?”. For human values, the answer seems like an obvious “yes”. For evolutionary fitness… nonobvious. Humans usually get it wrong without the formal definition.
Hmm, presumably you’re not including something like “internal consistency” in the definition of ‘natural abstraction’. That is, humans who aren’t thinking carefully about something will think there’s an imaginable object even if any attempts to actually construct that object will definitely lead to failure. (For example, Arrow’s Impossibility Theorem comes to mind; a voting rule that satisfies all of those desiderata feels like a ‘natural abstraction’ in the relevant sense, even though there aren’t actually any members of that abstraction.)
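For reference, the desiderata being gestured at are the standard conditions in Arrow’s theorem (for three or more candidates):

```latex
% Arrow's Impossibility Theorem (standard statement, three or more candidates):
% no social welfare function $F$, mapping profiles of individual rankings to a
% group ranking, satisfies all four of the following.
\begin{itemize}
  \item \textbf{Unrestricted domain:} $F$ is defined on every profile of individual rankings.
  \item \textbf{Pareto efficiency:} if every voter ranks $a$ above $b$, then $F$ ranks $a$ above $b$.
  \item \textbf{Independence of irrelevant alternatives:} the group ranking of $a$ vs.\ $b$ depends only on how the voters rank $a$ vs.\ $b$.
  \item \textbf{Non-dictatorship:} there is no voter whose ranking alone always determines the group ranking.
\end{itemize}
```

Each condition feels individually natural, and so does the conjunction, yet the conjunction has no members, which is exactly the “imaginable but unconstructible” situation described above.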
Oh this is fascinating. This is basically correct; a high-level model space can include models which do not correspond to any possible low-level model.
One caveat: any high-level data or observations will be consistent with the true low-level model. So while there may be natural abstract objects which can’t exist, and we can talk about those objects, we shouldn’t see data supporting their existence—e.g. we shouldn’t see a real-world voting system behaving like it satisfies all of Arrow’s desiderata.
Regarding your first-pass check for naturalness being whether humans can understand it: colour me thoroughly puzzled. Isn’t one of the core points of the Reductionism sequence that, while “Thor caused the thunder” sounds simpler to a human than Maxwell’s equations (because the words fit naturally into human psychology), one of them is much “simpler” in an absolute sense than the other (and is in fact true)?
Regarding your point about human values living in humans while an organism’s fitness lives partly in the environment: nothing immediately comes to mind to say here, but I agree it’s a very interesting question.
The things you say about inner/outer alignment hold together quite sensibly. I am surprised to hear you say that mesa-optimisers can be avoided by just optimizing all the parameters simultaneously at run-time. That doesn’t match my understanding of mesa-optimisation; I thought mesa-optimisers would definitely arise during training, but if you’re right that it’s trivial-but-expensive to remove them there, then I agree it’s intuitively a much easier problem than I had realised.
Regarding your first-pass check for naturalness being whether humans can understand it: colour me thoroughly puzzled. Isn’t one of the core points of the Reductionism sequence that, while “Thor caused the thunder” sounds simpler to a human than Maxwell’s equations (because the words fit naturally into human psychology), one of them is much “simpler” in an absolute sense than the other (and is in fact true)?
Despite humans giving really dumb verbal explanations (like “Thor caused the thunder”), we tend to be pretty decent at actually predicting things in practice.
The same applies to natural abstractions. If I ask people “is ‘tree’ a natural category?” then they’ll get into some long philosophical debate. But if I show someone five pictures of trees, then show them five other pictures which are not all trees, and ask them which of the second set are similar to the first set, they’ll usually have no trouble at all picking out the trees in the second set.
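For comparison, a hypothetical version of the same test for a learning system (the feature vectors and “tree”/“non-tree” clusters below are entirely made up for illustration; real images would need an actual embedding model):

```python
# Hypothetical few-shot test: given five "tree" examples as feature vectors,
# pick out the tree-like items in a second set by nearest-centroid similarity.
import numpy as np

rng = np.random.default_rng(0)
tree_examples = rng.normal(loc=1.0, scale=0.3, size=(5, 16))    # five "trees"
second_set = np.vstack([
    rng.normal(loc=1.0, scale=0.3, size=(3, 16)),    # three more trees
    rng.normal(loc=-1.0, scale=0.3, size=(2, 16)),   # two non-trees
])

centroid = tree_examples.mean(axis=0)
spread = np.linalg.norm(tree_examples - centroid, axis=1).max()
dists = np.linalg.norm(second_set - centroid, axis=1)
print("similar to the first set:", dists < 2 * spread)   # expected: [True, True, True, False, False]
```

No definition of “tree” ever enters the picture; a handful of examples plus a similarity measure is enough, which is the behaviour the thought experiment is pointing at.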
I thought mesa-optimisers would definitely arise during training
If you’re optimizing all the parameters simultaneously at runtime, then there is no training. Whatever parameters were learned during “training” would just be overwritten by the optimal values computed at runtime.
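A minimal sketch of the (impractically expensive) scheme I have in mind; the model, objective, and optimizer below are placeholders, not a real proposal:

```python
# Instead of reusing parameters learned during training, re-solve for all the
# parameters against the base objective at runtime, for the situation actually
# at hand. There is then no separately-trained inner model whose goals could
# have drifted from the base objective.
import numpy as np
from scipy.optimize import minimize

def base_objective(params, situation):
    """Stand-in for the outer/base objective, evaluated on this situation."""
    prediction = np.tanh(situation @ params)      # toy "model"
    target = np.sin(situation.sum(axis=1))        # toy ground truth
    return np.mean((prediction - target) ** 2)

def act(situation, n_params):
    # All parameters are re-optimized from scratch, right now, against the
    # base objective; nothing carried over from "training" is ever used.
    result = minimize(base_objective, x0=np.zeros(n_params), args=(situation,))
    return np.tanh(situation @ result.x)

situation = np.random.default_rng(0).normal(size=(32, 8))   # whatever we face at runtime
print(act(situation, n_params=8)[:3])
```

The cost of running a full optimization on every call is exactly why this is a “we need more efficient algorithms” problem rather than a conceptual one.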
Despite humans giving really dumb verbal explanations (like “Thor caused the thunder”), we tend to be pretty decent at actually predicting things in practice.
Mm, quantum mechanics much? I do not think I can reliably tell you which experiments are in the category “real” and which are in the category “made up”, even though it’s a very simple category mathematically. But I don’t expect you’re saying this; I’m just still confused about what you are saying.
This reminds me of Oli’s question here, which ties into Abram’s “point of view from somewhere” idea. I feel like I expect ML systems to take the point of view of the universe, and not learn our natural categories.
I’m talking everyday situations. Like “if I push on this door, it will open” or “by next week my laundry hamper will be full” or “it’s probably going to be colder in January than June”. Even with quantum mechanics, people do figure out the pattern and build some intuition, but they need to see a lot of data on it first and most people never study it enough to see that much data.
In places where the humans in question don’t have much first-hand experiential data, or where the data is mostly noise, that’s where human prediction tends to fail. (And those are also the cases where we expect learning systems in general to fail most often, and where we expect the system’s priors to matter most.) Another way to put it: humans’ priors aren’t great, but in most day-to-day prediction problems we have more than enough data to make up for that.
Thank you for being so clear.