You’re preaching to the choir there, but even if we were working with more strongly typed epistemic representations that had been inferred by some unexpected innovation of machine learning, the automatic inference of those representations would leave them uncommented and not well-matched with human compressions of reality, nor would they match exactly against reality.
My impression is that this could be true, but also this post seems to argue (reasonably convincingly, in my view) that the space of possible abstractions (“epistemic representations”) is discrete rather than continuous, such that any representation of reality sufficiently close to “human compressions” would in fact be using those human compressions, rather than an arbitrarily similar set of representations that comes apart in the limit of strong optimization pressure. I’m curious as to whether Eliezer (or anyone who endorses this particular aspect of his view) has a strong counterargument against this, or whether they simply find it unlikely on priors.
I’d also add that having a system that uses abstractions close to humans’ is insufficient for safety, because you’re putting those abstractions under stress by optimizing them.
I do think it’s plausible that any AI modelling humans will model humans as having preferences, but 1) I’d imagine these preference models as calibrated on normal world states, and not extending properly off-distribution (i.e., as soon as the AI starts doing things with nanomachines that humans can’t reason properly about), and 2) “pointing” at the right part of the AI’s world model to yield preferences, instead of a proxy that’s a better model of your human feedback mechanism, is still an unsolved problem. (The latter point is outlined in the post in some detail, I think?) I also think that 3) there’s a possibility that there is no simple, natural core of human values, simpler than “model the biases of people in detail”, for an AI to find.
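To make point 1) concrete, here is a minimal toy sketch (my own illustration, not anything from the post): a flexible preference proxy fit only on a narrow band of “normal” world states can look well-calibrated in-distribution while extrapolating arbitrarily badly outside it. The single feature `x`, the bounded “true” preference, and the polynomial proxy are all hypothetical choices made just for the illustration.

```python
import numpy as np

# Hypothetical setup: world states summarized by one feature x,
# true preference is bounded, but observations only cover x in [0, 1].
rng = np.random.default_rng(0)
x_train = rng.uniform(0.0, 1.0, size=200)            # "normal" world states
true_pref = lambda x: np.tanh(3 * x)                  # bounded true preference
y_train = true_pref(x_train) + rng.normal(0, 0.05, size=200)

# Fit a flexible proxy model on in-distribution data only.
proxy = np.poly1d(np.polyfit(x_train, y_train, deg=5))

# In distribution, proxy and truth agree closely.
print(proxy(0.5), true_pref(0.5))

# Off distribution (the "nanomachines" regime), the polynomial proxy
# extrapolates far away from the bounded truth (~1.0).
print(proxy(10.0), true_pref(10.0))
```

The point isn’t the specific model class; it’s that nothing in the training signal pins down behavior in regimes the calibration data never covered.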
I think there are some pretty knock-down cases where human concepts are continuous. E.g. if you want to cut a rose off of a rosebush, where can you cut to get a rose? (If this example seems insufficiently important, replace it with brain surgery.)
That said, we should be careful about arguments that we need to match human concepts to high precision, because humans don’t have concepts to high precision.
In the abstraction formalism I use, it can be ambiguous whether any particular thing “is a rose”, while still having a roughly-unambiguous concept of roses. It’s exactly like clustering: a cluster can have unambiguous parameters (mean, variance, etc), but it’s still ambiguous whether any particular data point is “in” that cluster.
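A quick sketch of that clustering analogy (illustrative only, not the abstraction formalism itself), assuming a two-component Gaussian mixture as the stand-in for the “rose” concept: the cluster parameters come out essentially unambiguous, while membership of individual boundary points stays genuinely ambiguous.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Two well-separated groups of points standing in for "roses" and "not roses".
rng = np.random.default_rng(0)
roses = rng.normal(loc=[0.0, 0.0], scale=0.5, size=(200, 2))
not_roses = rng.normal(loc=[3.0, 3.0], scale=0.5, size=(200, 2))
X = np.vstack([roses, not_roses])

gm = GaussianMixture(n_components=2, random_state=0).fit(X)

# The cluster parameters (the "concept") are sharply determined by the data.
print(gm.means_)
print(gm.covariances_)

# A point halfway between the clusters: the concept is well-defined,
# but whether *this* point is "in" the rose cluster is ambiguous (~50/50).
print(gm.predict_proba([[1.5, 1.5]]))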
Good point. I was more thinking that not only could it be ambiguous for a single observer, but different observers could systematically decide differently, and that would be okay.
Are there any concepts that don’t merely have continuous parameters, but are actually part of continuous families? Maybe the notion of “1 foot long”?