Thanks for your reference; it's good to get down to some more specific examples.
Most AI techniques are model-based by necessity: it is not possible to generalise from samples unless those samples are used to inform the shape of a model, which then determines the properties of other samples. In effect, AI is model fitting. Bayesian techniques are one scheme for updating a model from data. I call them incomplete because they leave a lot of the intelligence in the hands of the user.
For example, in the thesis you reference the author designs a model of transformations on handwritten letters that (thanks to the author's intelligence) is similar to the set of transformations applied to numeric characters. The primary reason the technique is effective is that the author has constructed a good transformation. The only way to determine whether this is true is through experimentation; I doubt the Bayesian updating is contributing significantly to the results, and if another scheme such as an SVM were chosen I would expect it to produce similar recognition results.
The point is that the legitimacy or otherwise of the model-parameter updating scheme is relatively insignificant compared with the difficulty of selecting a good model in the first place. As far as I am aware, since there is a potentially infinite set of models, Bayesian techniques cannot be applied to select between them, leaving the real intelligence to be supplied by the user in the form of the model. In contrast, SVMs are an attempt to construct experimentally useful models from samples, and so are much closer to being intelligent in the sense of producing good results with limited human interaction. However, neither technique addresses the fundamental difficulty of replicating the intelligence the author used in creating the transformation in the first place. Fixating on a particular approach to model updating when model selection is not addressed is to miss the point; it may be meaningful for gambling problems, but for real AI challenges the difference it makes appears to be irrelevant to actual performance.
I would love to discuss what the real challenges of GAI are and explore ways of addressing them, but the posts on LW often seem to focus on seemingly obscure game-theory or gambling-based problems which don't appear to be bringing us closer to a real solution. If the model selection problem can't be addressed, then there is no way to guarantee that, whatever we want an AI to value, it won't create an internal model that finds something similar (like paperclips) and decides to optimise for that instead.
Silently downvoting criticism of Bayesian probability without justification is not helpful either.
Model selection is definitely one of the biggest conceptual problems in GAI right now (I would say that planning once you have a model is of comparable importance / difficulty). I think the way to solve this sort of problem is by having humans carefully pick a really good model (flexible enough to capture even unexpected situations while still structured enough to make useful predictions). Even with SVMs you are implicitly assuming some sort of structure on the data, because you usually transform your inputs into some higher-dimensional space consisting of what you see as useful features in the data.
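To make the point about implicit structure concrete, here is a minimal sketch (my own toy code, not anything from the references in this thread) showing that a polynomial kernel is just an inner product in a hand-picked feature space; choosing that space is where the modelling assumption lives, whether or not you ever write the features out.

    import numpy as np

    def quadratic_kernel(x, z):
        # Kernel evaluation: (x . z + 1)^2, no features written down.
        return (np.dot(x, z) + 1.0) ** 2

    def explicit_features(x):
        # The feature map this kernel implicitly assumes (for 2-D input).
        x1, x2 = x
        return np.array([1.0,
                         np.sqrt(2) * x1, np.sqrt(2) * x2,
                         x1 ** 2, x2 ** 2,
                         np.sqrt(2) * x1 * x2])

    x = np.array([0.3, -1.2])
    z = np.array([2.0, 0.7])
    print(quadratic_kernel(x, z))                               # same number...
    print(np.dot(explicit_features(x), explicit_features(z)))   # ...as this

The kernel trick hides the feature map; it doesn't remove the need to choose one.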
Even though picking the model is the hard part, using Bayes by default seems like a good idea because it is the only general method I know of for combining all of my assumptions without having to make additional arbitrary choices about how everything should fit together. If there are other methods, I would be interested in learning about them.
What would the “really good model” for a GAI look like? Ideally it should capture our intuitive notions of what sorts of things go on in the world without imposing constraints that we don’t want. Examples of these intuitions: superficially similar objects tend to come from the same generative process (so if A and B are similar in ways X and Y, and C is similar to both A and B in way X, then we would expect C to be similar to A and B in way Y, as well); temporal locality and spatial locality underlie many types of causality (so if we are trying to infer an input-output relationship, it should be highly correlated over inputs that are close in space/time); and as a more concrete example, linear momentum tends to persist over short time scales. A lot of work has been done in the past decade on formalizing such intuitions, leading to nonparametric models such as Dirichlet processes and Gaussian processes. See for instance David Blei’s class on Bayesian nonparametrics (http://www.cs.princeton.edu/courses/archive/fall07/cos597C/index.html) or Michael Jordan’s tutorial on Dirichlet processes (http://www.cs.berkeley.edu/~jordan/papers/pearl-festschrift.pdf).
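As a concrete illustration of the "correlated over inputs that are close in space/time" intuition, here is a minimal Gaussian-process regression sketch (standard textbook posterior formulas applied to made-up toy data, not anything from the linked course or tutorial): the kernel is where the locality assumption is encoded, and the predictive uncertainty grows as you move away from the observed inputs.

    import numpy as np

    def rbf_kernel(a, b, length_scale=1.0):
        # Squared-exponential kernel: correlation decays with distance,
        # which is exactly the "nearby inputs behave similarly" assumption.
        d = a[:, None] - b[None, :]
        return np.exp(-0.5 * (d / length_scale) ** 2)

    rng = np.random.default_rng(0)
    x_train = np.array([-2.0, -0.5, 0.3, 1.8])          # toy observations
    y_train = np.sin(x_train) + 0.1 * rng.standard_normal(4)
    x_test = np.linspace(-3.0, 3.0, 7)

    noise_var = 0.1 ** 2
    K = rbf_kernel(x_train, x_train) + noise_var * np.eye(len(x_train))
    K_s = rbf_kernel(x_test, x_train)

    # Standard GP posterior: mean = K_s K^-1 y, cov = K_ss - K_s K^-1 K_s^T
    post_mean = K_s @ np.linalg.solve(K, y_train)
    post_cov = rbf_kernel(x_test, x_test) - K_s @ np.linalg.solve(K, K_s.T)
    post_std = np.sqrt(np.diag(post_cov))

    for xt, m, s in zip(x_test, post_mean, post_std):
        print(f"f({xt:+.1f}) ~ {m:+.2f} +/- {2 * s:.2f}")   # wider away from the data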
I’m beginning to think that a top-level post on how Bayes is actually used in machine learning would be helpful. Perhaps I will make one when I have a bit more time. Also, does anyone happen to know how to collapse URLs in posts (e.g. the equivalent of test in HTML)?
A high-level post on its use would be very interesting.
I think my main criticism of the Bayes approach is that it leads to the kind of work you are suggesting, i.e. have a person construct a model and then have a machine calculate its parameters.
I think that much of what we value in intelligent people is their ability to form the model themselves. By focusing on parameter updating we aren't developing the AI techniques necessary for intelligent behaviour. In addition, because correct updating does not guarantee good performance (since the model's properties dominate), we will always have to judge methods based on experimental results.
Because we always come back to experimental results, whatever general AI strategy we develop, its structure is more likely to be one that searches for new ways to learn (with Bayesian model updating and SVMs as examples) and validates these strategies against experimental data (replicating the behaviour of the AI field as a whole).
I find it useful to think about how people solve problems and to examine the huge gulf between specific learning techniques and these approaches. For example, to replicate a Bayesian AI researcher, an AI needs to take a small amount of data and an incomplete, informal model of the process that generates it (e.g. based on informal metaphors of physical processes the author is familiar with), then find a way of formalising this informal model (so that its behaviour under all conditions can be calculated), possibly doing some theorem proving to investigate properties of the model. It then applies potentially standard techniques to determine the model's parameters and judges its worth based on experiment (potentially repeating the whole process if it doesn't work).
By focusing on Bayesian approaches we aren't developing techniques that can replicate this kind of lateral and creative thinking. Saying there is only one valid form of inference is absurd because it doesn't address these problems.
I feel that trying to force our problems to suit our tools is unlikely to make much progress. For example, unless we can model (and therefore largely solve) all of the problems we want an AI to address, we can't create a “Really Good Model”.
Rather than manually developing formalisations of specific forms of similarity, we need an algorithm that can learn different types of similarity and then construct the formalisation itself (or perhaps not, since I don't think we actually formalise our notions of similarity and yet we can still solve problems).
Automated theorem proving is a good example where the problems are well defined yet unique, so any algorithm that can construct proofs needs to see meta-patterns in other proofs and apply them. This brings home the difficulty of identifying what it means for things to be similar, and it also emphasises the incompleteness of a probabilistic approach: the proof the AI is trying to construct has never been encountered before, so in order to benefit from experience it needs to invent a type of similarity that maps the current problem to past ones.
But even “learning to learn” is done in the context of a model; it's just a higher-level model. There are in fact models that allow experience gained in one area to generalize to other areas (by saying that the same sorts of structures that are helpful for explaining things in one area should be considered in the other area). Talking about what an AI researcher would do is asking much more of an AI than one would ask of a human. If we could get an AI to be even as intelligent as a three-year-old child, we would be more or less done. People don't develop sophisticated problem-solving skills until at least high-school age, so it seems hard to believe that such a problem is fundamental to AGI.
Another reference, this time on learning to learn, although unfortunately it is behind a paywall (Tenenbaum, Goodman, Kemp, “Learning to learn causal models”).
It appears that there is also a book on more general (mostly non-Bayesian) techniques for learning to learn: Sebastian Thrun's book. I found the latter just by googling, so I have no idea what's actually in it beyond skimming the chapter descriptions. It's also not available online.
Click the “Help” link that appears to the right of the “comment” and “Cancel” buttons for directions.
Is model selection really a big problem? I thought there was a conceptually simple way to incorporate it into a model (just add a model index parameter), though it might be computationally tricky sometimes. As JohnDavidBustard points out below, the real difficulty seems to be model creation, though I suppose you can frame this as model selection if you have a prior over a broad enough category of models (say, all Turing machines).
It depends on what you mean by model selection. If you mean e.g. figuring out whether to use quadratics or cubics, then the standard solution that people cite is to use Bayesian Occam’s razor, i.e. compute
p(Cubic | Data)/p(Quadratic | Data) = p(Data | Cubic)/p(Data | Quadratic) * p(Cubic)/p(Quadratic)
where we compute the probabilities on the right-hand side by marginalizing over all cubics and quadratics. But the number you get out of this will depend strongly on how quickly the tails decay on your distributions over cubics and quadratics, so I don't find it particularly satisfying. (I'm not alone in this, although there are people who would disagree with me or propose various methods for choosing the prior distributions appropriately.)
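To make that sensitivity concrete, here is a toy sketch of my own (under the standard Bayesian-linear-regression assumptions: zero-mean Gaussian prior on the polynomial coefficients with a single scale parameter, Gaussian observation noise): the log Bayes factor between the cubic and quadratic families shifts substantially as the prior scale changes, even though the data are held fixed.

    import numpy as np

    def log_evidence(x, y, degree, prior_scale, noise_std=0.1):
        # Marginal likelihood of y under a polynomial model of the given degree,
        # with coefficients ~ N(0, prior_scale^2 I) and Gaussian noise:
        # y ~ N(0, prior_scale^2 * Phi Phi^T + noise_std^2 * I).
        Phi = np.vander(x, degree + 1, increasing=True)       # 1, x, ..., x^degree
        C = prior_scale ** 2 * Phi @ Phi.T + noise_std ** 2 * np.eye(len(x))
        _, logdet = np.linalg.slogdet(C)
        quad = y @ np.linalg.solve(C, y)
        return -0.5 * (len(x) * np.log(2 * np.pi) + logdet + quad)

    rng = np.random.default_rng(0)
    x = np.linspace(-1.0, 1.0, 20)
    y = 1.0 - 2.0 * x + 0.5 * x ** 2 + 0.1 * rng.standard_normal(20)   # data from a quadratic

    for scale in [0.1, 1.0, 10.0, 100.0]:
        log_bf = log_evidence(x, y, 3, scale) - log_evidence(x, y, 2, scale)
        print(f"prior scale {scale:6.1f}: log p(Data|Cubic) - log p(Data|Quadratic) = {log_bf:+.2f}")

Which prior scale is "right" is exactly the choice the formalism does not make for you.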
If you mean something else, like figuring out what specific model to pick out from your entire space (e.g. picking a specific function to fit your data), then you can run into problems like having to compare probability masses to probability densities, or comparing measures with different dimensionality (e.g. densities on the line versus the plane); a more fundamental issue is that picking a specific model potentially ignores other features of your posterior distribution, like how concentrated the probability mass is about that model.
I would say that the most principled way to get a single model out at the end of the day is variational inference, which basically attempts to set parameters so as to minimize the relative entropy between the distribution implied by those parameters and the actual posterior distribution. I don't know a whole lot about this area beyond a couple of papers I've read, but it does seem like a good way to perform inference if you'd like to restrict yourself to considering a single model at a time.
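For concreteness, here is a brute-force one-dimensional sketch of the idea (my own toy example, not from any of those papers): pick the Gaussian q(x; mu, sigma) that minimizes the relative entropy KL(q || p) to a fixed, non-Gaussian "posterior" p, with the KL computed by numerical integration rather than the cleverer machinery real variational methods use.

    import numpy as np

    # A fixed, non-Gaussian "posterior" p (a skewed two-component mixture),
    # standing in for whatever distribution actually came out of a model.
    grid = np.linspace(-6.0, 10.0, 4001)
    dx = grid[1] - grid[0]
    p = 0.7 * np.exp(-0.5 * grid ** 2) + 0.3 * np.exp(-0.5 * ((grid - 4.0) / 1.5) ** 2)
    p /= p.sum() * dx                                   # normalise numerically

    def kl_q_p(mu, sigma):
        # KL(q || p) for a Gaussian q, by numerical integration on the grid.
        q = np.exp(-0.5 * ((grid - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
        mask = q > 1e-12
        return np.sum(q[mask] * np.log(q[mask] / p[mask])) * dx

    # "Variational inference" by brute force: search the variational parameters
    # (mu, sigma) for the Gaussian closest to p in relative entropy.
    best = min((kl_q_p(m, s), m, s)
               for m in np.linspace(-2.0, 6.0, 81)
               for s in np.linspace(0.3, 4.0, 38))
    print(f"best Gaussian: mu = {best[1]:.2f}, sigma = {best[2]:.2f}, KL = {best[0]:.3f}")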
OK, so you're saying that a big problem in model selection is coming up with good prior distributions for different classes of models, specifically those with different tail decays (it sounds like you think it could also be that the standard Bayesian framework is missing something). This is an interesting idea which I had heard about before but didn't understand until now. Thank you for telling me about it.
I would say that when you have a somewhat dispersed posterior, it is simply misleading to pick any specific model and parameters as your fit. The correct thing to do is to average over possible models and parameters.
It's only when you have a relatively narrow posterior, or when the error bars on the estimate you give for some parameter or prediction don't matter, that it's OK to select a single model.
I think I basically agree with you on that; whenever feasible the full posterior (as opposed to the maximum-likelihood model) is what you should be using. So instead of using “Bayesian model selection” to decide whether to pick cubics or quadratics, and then fitting the best cubic or the best quadratic depending on the answer, the “right” thing to do is to just look at the posterior distribution over possible functions f, and use that to get a posterior distribution over f(x) for any given x.
The problem is that this is not always reasonable for the application you have in mind, and I’m not sure if we have good general methods for coming up with the right way to get a good approximation. But certainly an average over the models is what we should be trying to approximate.
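As a small illustration of what averaging over models buys you, here is a sketch under the usual Gaussian-prior, Gaussian-noise assumptions (toy data of my own): the predictive distribution at a point x comes from integrating over the whole posterior on the cubic's coefficients rather than plugging in one best fit, which is why it can report wider error bars outside the range of the data. In principle the same averaging extends across the quadratic and cubic families, weighted by their evidence.

    import numpy as np

    rng = np.random.default_rng(1)
    x = np.linspace(-1.0, 1.0, 15)
    y = np.sin(2.0 * x) + 0.1 * rng.standard_normal(len(x))   # toy data

    prior_scale, noise_std = 2.0, 0.1
    Phi = np.vander(x, 4, increasing=True)                    # cubic basis: 1, x, x^2, x^3

    # Gaussian posterior over the coefficients w (conjugate, closed form).
    Sigma_post = np.linalg.inv(Phi.T @ Phi / noise_std ** 2 + np.eye(4) / prior_scale ** 2)
    mu_post = Sigma_post @ Phi.T @ y / noise_std ** 2

    for x_star in [-0.5, 0.0, 1.5]:                           # 1.5 lies outside the data
        phi = np.array([1.0, x_star, x_star ** 2, x_star ** 3])
        mean = phi @ mu_post                                  # average over all cubics, not one fit
        var = phi @ Sigma_post @ phi + noise_std ** 2         # predictive variance
        print(f"f({x_star:+.1f}) ~ {mean:+.2f} +/- {2 * np.sqrt(var):.2f}")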