A question about alignment via natural abstractions (if you’ve addressed it before, please refer me to where): it seems plausible to me that natural abstractions exist but are not useful for alignment, because alignment is a high-dimensional, all-or-nothing property. Like, the AI will learn about “trees”, but whether it avoids unintentionally killing everyone may depend on whether a palm tree counts as a tree, or on whether a copse counts as full of trees, or on some other question that turns on unnatural details of the natural abstraction.
Do you think that edge cases will just naturally be correctly learned?
Do you think that edge cases just won’t end up mattering for alignment?
Definitions, as we usually use them, are not the correct data structure for word-meaning. Words point to clusters in thing-space; definitions try to carve up those clusters with something like cutting-planes. That’s an unreliable and very lossy way to represent clusters, and can’t handle edge cases well or ambiguous cases at all. The natural abstractions are the clusters (more precisely, the summary parameters of the clusters, e.g. the cluster mean and variance in a gaussian cluster model); they’re not cutting-planes.
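To make the contrast concrete, here’s a toy sketch in Python (my own illustration, with made-up feature vectors; nothing about the specific numbers matters). A “definition” shows up as a hard cutting-plane test, while the cluster summary is just a fitted mean and covariance, under which membership is graded rather than yes/no:

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up examples of things we call "trees": [height_m, leafiness, woodiness].
trees = rng.normal(loc=[10.0, 0.6, 0.9], scale=[4.0, 0.2, 0.05], size=(200, 3))

# Cluster-summary representation: just the fitted mean and covariance.
mu = trees.mean(axis=0)
cov = np.cov(trees, rowvar=False)
cov_inv = np.linalg.inv(cov)

def cluster_score(x):
    """Graded membership: negative squared Mahalanobis distance to the cluster."""
    d = x - mu
    return -float(d @ cov_inv @ d)

def definition_says_tree(x):
    """A 'definition' as a cutting-plane: tall and woody => tree."""
    return x[0] > 5.0 and x[2] > 0.8

palm = np.array([12.0, 0.5, 0.75])  # edge case: tall and leafy, but not very woody
print(definition_says_tree(palm))   # hard yes/no that flips on an arbitrary threshold
print(cluster_score(palm))          # graded score; no sharp boundary unless we impose one
```

The point is just that the cluster summary never has to make a call on the palm tree; any hard call comes from whatever threshold gets imposed downstream.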
I don’t think “definitions” are the crux of my discomfort. Suppose the model learns a cluster. The position, scale, and shape parameters of this cluster summary are not perfectly stable; that is, they vary somewhat with different training data. This is not a problem on its own, because the cluster is still basically the same one. However, the (fuzzy) boundary region of the cluster is large (I have a vague intuition that the curse of dimensionality is relevant here, but nothing solid). That means there are many cutting-planes, induced by actions to be taken downstream of the model, on which training on different data could have yielded a different result. (I’ll sketch a toy version of this below.) My intuition is that most of the risk of misalignment arises at those boundaries:
One reason for my intuition is that in communication between humans, difficulties arise in a similar way (i.e. when two people’s clusters have slightly different shapes).
Another reason is that the boundary cases feel like the kind of stuff you can’t reliably learn from data or effectively test.
Your comment seems to suggest that you think the edge cases won’t matter, but I don’t really understand why the fuzzy nature of concepts makes that true.
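Here’s the toy version of the worry I mentioned above (again my own construction, not anyone’s actual model): fit the same kind of cluster summary on two different training samples drawn from the same underlying distribution, and check whether a downstream hard decision flips for a point near the fuzzy boundary while staying stable for a central, prototypical point:

```python
import numpy as np

rng = np.random.default_rng(1)

def fit_cluster(sample):
    """Cluster summary: fitted mean and covariance."""
    return sample.mean(axis=0), np.cov(sample, rowvar=False)

def in_cluster(x, mu, cov, threshold=3.0):
    """A downstream hard decision: inside the cluster iff Mahalanobis distance < threshold."""
    d = x - mu
    return float(d @ np.linalg.inv(cov) @ d) ** 0.5 < threshold

true_mu = np.array([10.0, 0.6, 0.9])
true_scale = np.array([4.0, 0.2, 0.05])

central = true_mu.copy()                                 # a prototypical member
edge = true_mu + np.array([2.9, 0.0, 0.0]) * true_scale  # sits near the fuzzy boundary

edge_flips = central_flips = 0
for _ in range(200):
    a = fit_cluster(rng.normal(true_mu, true_scale, size=(100, 3)))
    b = fit_cluster(rng.normal(true_mu, true_scale, size=(100, 3)))
    edge_flips += in_cluster(edge, *a) != in_cluster(edge, *b)
    central_flips += in_cluster(central, *a) != in_cluster(central, *b)

print(central_flips, edge_flips)  # central case ~never flips; the edge case flips a lot
```

In this toy setup the central case essentially never flips, while the near-boundary case flips a large fraction of the time; that sensitivity to the training sample is what I mean by the risk concentrating at the boundaries.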
seems like maybe the naturalness of a cluster abstraction is the disagreement across an ensemble of similar-or-shorter-length equivalent models? if your abstraction is natural enough, it always holds; if it’s merely approaching the natural abstraction, it’ll approximate it. current artificial neural networks are probably not strong enough to learn all the relevant natural abstractions for a given context, but they move towards them, some of the way.
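one rough way to operationalize “disagreement across an ensemble” (my own toy construction, not a standard metric): train several small classifiers on bootstrap resamples of the same data, then score each query point by how much the ensemble disagrees about it. points deep inside a cluster should get near-zero disagreement; points between clusters shouldn’t:

```python
import numpy as np

rng = np.random.default_rng(2)

# two made-up 2-d clusters ("trees" vs "not-trees")
X = np.vstack([rng.normal([2.0, 2.0], 1.0, size=(200, 2)),
               rng.normal([-2.0, -2.0], 1.0, size=(200, 2))])
y = np.array([1.0] * 200 + [0.0] * 200)

def fit_logreg(X, y, steps=500, lr=0.1):
    """tiny logistic regression by gradient descent (weights + bias)."""
    Xb = np.hstack([X, np.ones((len(X), 1))])
    w = np.zeros(Xb.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-Xb @ w))
        w -= lr * Xb.T @ (p - y) / len(y)
    return w

def predict(w, x):
    return float(np.append(x, 1.0) @ w) > 0.0

# ensemble over bootstrap resamples of the same data
ensemble = []
for _ in range(20):
    idx = rng.integers(0, len(X), size=len(X))
    ensemble.append(fit_logreg(X[idx], y[idx]))

def disagreement(x):
    """fraction of the ensemble voting against the majority: 0 = unanimous, 0.5 = split."""
    votes = sum(predict(w, x) for w in ensemble)
    return min(votes, len(ensemble) - votes) / len(ensemble)

print(disagreement(np.array([2.0, 2.0])))   # deep inside a cluster: ~0
print(disagreement(np.array([0.0, 0.0])))   # right between the clusters: much higher
```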
is yudkowsky’s monster a claim about the shape of the most natural abstractions? perhaps what we really need is to not assume we know which abstractions have been confirmed to be acceptably natural. i.e., perhaps his body of claims about this boils down to “oh, maybe soft optimization is all one could possibly ask for until all matter is equally superintelligent, so that we don’t break decision theory and all get taken over by a selfish [?genememetic element?]” or something funky along those lines.
[??] I don’t have a term of art that generalizes these things properly; genes/memes/executable shape fragments in generality