We can do something like list a bunch of examples, have humans label them, and then find the lowest Kolmogorov complexity concept that agrees with human judgments in, say, 90% of cases. I’m not sure if this is what you mean by “normatively correct”, but it seems like a plausible concept that multiple concept learning algorithms might converge on. I’m still not convinced that we can do this for many value-laden concepts we care about and end up with something matching CEV, partially due to complexity of value. Still, it’s probably worth systematically studying the extent to which this will give the right answers for non-value-laden concepts, and then see what can be done about value-laden concepts.
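As a rough illustration of the “simplest concept that agrees with 90% of the labels” idea, here is a minimal Python sketch. Kolmogorov complexity is uncomputable, so the sketch swaps in a crude proxy: the number of predicates in a conjunction drawn from a tiny hand-picked vocabulary. The predicate names, the integer examples, and the function `simplest_agreeing_concept` are all assumptions made up for illustration.

```python
from itertools import combinations

# Hypothetical vocabulary. A "concept" is a conjunction of these predicates,
# and its length (number of predicates) stands in for Kolmogorov complexity.
PREDICATES = {
    "even": lambda n: n % 2 == 0,
    "positive": lambda n: n > 0,
    "small": lambda n: abs(n) < 10,
}

def concepts_by_length():
    """Yield (names, classifier) pairs, shortest descriptions first."""
    names = sorted(PREDICATES)
    for size in range(len(names) + 1):
        for combo in combinations(names, size):
            yield combo, (lambda n, c=combo: all(PREDICATES[p](n) for p in c))

def simplest_agreeing_concept(labelled, threshold=0.9):
    """Return the shortest concept agreeing with at least `threshold` of labels."""
    for names, concept in concepts_by_length():
        agree = sum(concept(x) == y for x, y in labelled)
        if agree / len(labelled) >= threshold:
            return names
    return None  # nothing in this toy hypothesis space is good enough

# Toy human labels for "even and positive", with one mistake (5 marked True).
LABELLED = [(-4, False), (-3, False), (-2, False), (-1, False), (1, False),
            (2, True), (3, False), (4, True), (5, True), (6, True)]

print(simplest_agreeing_concept(LABELLED))  # ('even', 'positive')
```

In this toy run the selected concept disagrees with the single mislabelled example (5), since the 90% threshold leaves room for that.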
We can do something like list a bunch of examples, have humans label them, and then find the lowest Kolmogorov complexity concept that agrees with human judgments in, say, 90% of cases.
Regularization is already a part of training any good classifier.
I’m not sure if this is what you mean by “normatively correct”, but it seems like a plausible concept that multiple concept learning algorithms might converge on.
Roughly speaking, I mean optimizing for the causal-predictive success of a generative model, given not only a training set but a “level of abstraction” (something like tagging the training features with lower-level concepts, type-checking for feature data) and a “context” (i.e., which assumptions are being conditioned on when learning the model).
Again, roughly speaking, humans tend to make pretty blatant categorization errors (e.g., magical categories, non-natural hypotheses, etc.), but we also are doing causal modelling in the first place, so we accept fully-naturalized causal models as the correct way to handle concepts. However, we also handle reality on multiple levels of abstraction: we can think in chairs and raw materials and chemical treatments and molecular physics, all of which are entirely real. For something like FAI, I want a concept-learning algorithm that will look at the world in this naturalized, causal way (which is what normal modelling shoots for!), and that will model correctly at any level of abstraction or under any available set of features, and will be able to map between these levels as the human mind can.
Basically, I want my “FAI” to be built out of algorithms that can dissolve questions and do other forms of conceptual analysis without turning Straw Vulcan and saying, “Because ‘goodness’ dissolves into these other things when I naturalize it, it can’t be real!”. Because once I get that kind of conceptual understanding, it really does get a lot closer to being a problem of just telling the agent to optimize for “goodness” and trusting its conceptual inference to work out what I mean by that.
Sorry for rambling, but I think I need to do more cog-sci reading to clarify my own thoughts here.
Regularization is already a part of training any good classifier.
A technical point here: we don’t learn a raw classifier, because that would just learn human judgments. In order to allow the system to disagree with a human, we need to use some metric other than “is simple and assigns high probability to human judgments”.
For something like FAI, I want a concept-learning algorithm that will look at the world in this naturalized, causal way (which is what normal modelling shoots for!), and that will model correctly at any level of abstraction or under any available set of features, and will be able to map between these levels as the human mind can.
I totally agree that a good understanding of multi-level models is important for understanding FAI concept spaces. I don’t have a good understanding of multi-level maps; we can definitely see them as useful constructs for bounded reasoners, but it seems difficult to integrate higher levels into the goal system without deciding things about the high-level map a priori so you can define goals relative to this.
I don’t have a good understanding of multi-level maps; we can definitely see them as useful constructs for bounded reasoners
Well, all real reasoners are bounded reasoners. If you just don’t care about computational time bounds, you can run the Optimal Ordered Problem Solver as the initial input program to a Gödel machine, and out pops your AI (in 200 trillion years, of course)!
it seems difficult to integrate higher levels into the goal system without deciding things about the high-level map a priori so you can define goals relative to this.
I would tend to say that you should be training a conceptual map of the world before you install anything like action-taking capability or a goal system of any kind. Of course, I also tend to say that you should just use a debugged (i.e., cured of systematic faults) model of human evaluative processes for your goal system, and then use actual human evaluations to train the free parameters, and then set up learning feedback from the learned concept of “human” to the free-parameter space of the evaluation model.
I would tend to say that you should be training a conceptual map of the world before you install anything like action-taking capability or a goal system of any kind.
This seems like a sane thing to do. If this didn’t work, it would probably be for one of the following reasons:
There is a lack of conceptual convergence and human understandability; this seems somewhat likely and is probably the most important unknown.
Our conceptual representations are only efficient for talking about things we care about because we care about these things; a “neutral” standard such as resource-bounded Solomonoff induction will do a terrible job of learning the things we care about, for “no free lunch” reasons. I find this plausible but not too likely (it seems like it ought to be possible to “bootstrap” an importance metric for deciding where in the concept space to allocate resources).
We need the system to have a goal system in order to self-improve to the point of creating this conceptual map. I find this only slightly likely (this is basically the question of whether we can create something that manages to self-improve without needing goals; it is related to low impact).
Of course, I also tend to say that you should just use a debugged (i.e., cured of systematic faults) model of human evaluative processes for your goal system, and then use actual human evaluations to train the free parameters, and then set up learning feedback from the learned concept of “human” to the free-parameter space of the evaluation model.
I agree that this is a good idea. It seems like the main problem here is that we need some sort of “skeleton” of a normative human model whose parts can be filled in empirically, and which will infer the right goals after enough training.
In order to allow the system to disagree with a human, we need to use some metric other than “is simple and assigns high probability to human judgments”.
Right: and the metric I would propose is “counterfactual-prediction power”. In other words, the power to predict well in a causal fashion: to be able to answer counterfactual questions, or to predict well when we deliberately vary the experimental conditions.
To give a simple example: I train a system to recognize cats, but my training data contains only tabbies. What I want is a way of modelling that, while it may concentrate more probability on a tabby cat-like-thingy being a cat than a non-tabby cat-like-thingy, will still predict appropriately if I actually condition it on “but what if cats weren’t tabby by nature?”.
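To make the tabby example a bit more concrete, here is a minimal Python sketch of a two-feature generative model for “cat”. All of the numbers are invented, standing in for parameters fit to tabby-only training data, and the names `posterior_cat`, `with_counterfactual`, and `LEARNED` are hypothetical. The only point is that when the coat mechanism is a separate, explicit part of the model, we can swap in a counterfactual value for it and ask the same question again.

```python
import copy

# Hypothetical parameters, as if fit to training data whose cats were all
# tabbies. Keys True/False mean "is a cat" / "is not a cat".
LEARNED = {
    "p_cat": 0.5,
    "p_meow": {True: 0.9, False: 0.05},
    "p_tabby": {True: 0.95, False: 0.2},
}

def posterior_cat(meows, tabby, params):
    """P(cat | meows, tabby) under a naive two-feature generative model."""
    def likelihood(is_cat):
        p_meow = params["p_meow"][is_cat]
        p_tabby = params["p_tabby"][is_cat]
        value = p_meow if meows else 1 - p_meow
        value *= p_tabby if tabby else 1 - p_tabby
        return value

    cat_mass = params["p_cat"] * likelihood(True)
    other_mass = (1 - params["p_cat"]) * likelihood(False)
    return cat_mass / (cat_mass + other_mass)

def with_counterfactual(params, p_tabby_given_cat):
    """Copy the model and replace only the coat mechanism for cats."""
    counterfactual = copy.deepcopy(params)
    counterfactual["p_tabby"][True] = p_tabby_given_cat
    return counterfactual

# As trained: a meowing tabby is almost surely a cat, a meowing non-tabby less so.
print(round(posterior_cat(True, True, LEARNED), 3))   # 0.988
print(round(posterior_cat(True, False, LEARNED), 3))  # 0.529

# "But what if cats weren't tabby by nature?" Vary only that assumption.
NOT_TABBY_BY_NATURE = with_counterfactual(LEARNED, 0.2)
print(round(posterior_cat(True, False, NOT_TABBY_BY_NATURE), 3))  # 0.947
```

This is only re-parameterizing a hand-written model, not learning causal structure; a raw classifier fit to the same data would have no such knob to turn.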
I think you said you’re a follower of the probabilistic programming approach, and in terms of being able to condition those models on counterfactual parameterizations and make predictions, I think they’re very much on the right track.