This is mostly a gestalt sense from years of interacting with people in the space, so unrolling the full belief-production process into something legible would be a lot of work. But I can try a few sub-queries and give some initial answers.
Zeroth query: let’s try to query my intuition and articulate a little more clearly the kind of models which I think the median ML researcher doesn’t have. I think the core thing here is gears. Like, here’s a simple (not necessarily correct/incorrect) mental model of training of some random net:
We’re doing high dimensional optimization via gradient descent. The high dimensionality will typically make globally-suboptimal local minima rare, but high condition numbers quite common, so the main failure mode of the training process (other than fundamental limitations of the data or architecture) will be very slow convergence to minima along the bottom of long, thin “valleys” in the loss landscape.
That mental model immediately exposes a lot of gears. If that’s my mental model, and my training process is failing somehow, then I can go test that hypothesis via e.g. estimating the local condition number of the Hessian (this can be done in linear time, unlike calculation of the full Hessian), or by trying a type of optimizer suited to poor condition numbers (maybe conjugate gradient), or by looking for a “back-and-forth” pattern in the update steps; the model predicts that all those measurements will have highly correlated results. And if I do such measurements in a few contexts and find that the condition number is generally reasonable, or that it’s uncorrelated with how well training is going, then that would in turn update a bunch of related things, like e.g. which aspects of the NTK model are likely to hold, or how directions in weight-space which embed human-intelligible concepts are likely to correspond to loss basin geometry. So we’ve got a mental model which involves lots of different measurements and related phenomena being tightly coupled epistemically. It makes a bunch of predictions about different things going on inside the black box of the training process and the network itself. That’s gearsiness.
(In analogy to the “dark room” example from the OP: for the person who “models the room as containing walls”, there’s tight coupling between a whole bunch of predictions involving running into something along a particular line where they expect a wall to be. If they reach toward a spot where they expect a wall, and feel nothing, then that’s a big update; maybe the wall ended! That, in turn, updates a bunch of other predictions about where the person will/won’t run into things. The model projects a bunch of internal structure into the literal black box of the room. That’s gearsiness. Contrast to the person who doesn’t model the room as containing walls: they don’t make a bunch of tightly coupled predictions, so they don’t update a bunch of related things when they hit a surprise.)
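For concreteness, here’s a minimal sketch of two of those checks in PyTorch. It assumes the parameters have been flattened into a single 1-D tensor `params` with `requires_grad=True`, and that `loss_fn` maps that tensor to a scalar loss (both are hypothetical stand-ins for whatever setup is at hand); it only estimates the top Hessian eigenvalue, so an actual condition-number estimate would also need the smallest relevant eigenvalue (e.g. via Lanczos), but none of it requires forming the full Hessian.

```python
import torch

def hessian_vector_product(loss_fn, params, v):
    # H @ v via double backprop: roughly one extra backward pass,
    # never materializing the full Hessian.
    loss = loss_fn(params)
    (grad,) = torch.autograd.grad(loss, params, create_graph=True)
    (hvp,) = torch.autograd.grad(grad @ v, params)
    return hvp

def top_hessian_eigenvalue(loss_fn, params, iters=50):
    # Power iteration on Hessian-vector products; converges to the
    # largest-magnitude eigenvalue.
    v = torch.randn_like(params)
    v = v / v.norm()
    eig = 0.0
    for _ in range(iters):
        hv = hessian_vector_product(loss_fn, params, v)
        eig = torch.dot(v, hv).item()  # Rayleigh quotient estimate
        v = hv / (hv.norm() + 1e-12)
    return eig

def update_step_cosines(param_snapshots):
    # The "back-and-forth" check: cosine similarity between consecutive
    # update steps, given a list of flattened parameter snapshots taken
    # during training. Persistently negative values are the oscillation
    # pattern the high-condition-number story predicts.
    steps = [b - a for a, b in zip(param_snapshots, param_snapshots[1:])]
    return [
        torch.nn.functional.cosine_similarity(s1, s2, dim=0).item()
        for s1, s2 in zip(steps, steps[1:])
    ]
```

The point of the sketch is just that the model hands you several cheap, qualitatively different measurements, and predicts that they’ll agree.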
Now contrast the “high condition numbers” mental model to another (not necessarily correct/incorrect) mental model:
We’re doing optimization via gradient descent, so the main failure mode of the training process (other than fundamental limitations of the data or architecture) will be getting stuck in local minima which are not global minima (or close to them in performance).
This mental model exposes fewer gears. It allows basically one way to test the hypothesis: restart from a new random location many times (or otherwise jump to random locations, as in e.g. simulated annealing), and see if training goes better. Based on this mental model in isolation, I don’t have a bunch of qualitatively different tests to run which I expect to yield highly correlated results. I don’t have a bunch of related things which update based on how the tests turn out. I don’t have predictions about what’s going on inside the magic box—there’s nothing analogous to e.g. “check the condition number of the Hessian”. So not much in terms of gears. (This “local minima” model could still be a component of a larger model with more gears in it, but few of those gears are in this model itself.)
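And for contrast, a minimal sketch of the one test this model does suggest (here `make_model`, `train_fn`, and `eval_fn` are hypothetical stand-ins for whatever setup is at hand):

```python
import torch

def random_restart_test(make_model, train_fn, eval_fn, n_restarts=10, seed0=0):
    # Retrain from several random initializations with an otherwise
    # identical procedure, and see how much the outcome varies.
    results = []
    for i in range(n_restarts):
        torch.manual_seed(seed0 + i)
        model = make_model()            # fresh random init
        train_fn(model)                 # same training procedure each run
        results.append(eval_fn(model))  # e.g. final validation loss
    # A large spread suggests the basin you land in matters; a small
    # spread is evidence against "stuck in a bad local minimum" being
    # the main problem. Either way, this is the only dial the model offers.
    return results, max(results) - min(results)
```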
So that’s the sort of thing I’m gesturing at. Again, note that it’s not about whether the model is true or false. It’s also not about how mathematically principled/justified the model is, though that does tend to correlate with gearsiness in practice.
Ok, on to the main question. First query: what are the general types of observations which served as input to my belief? Also maybe some concrete examples...
Taking ML courses back in the day, as well as occasionally looking at syllabi for more recent courses, gives some idea of both what noobs are learning and what more experienced people are trying to teach.
Let’s take this Udacity course as a prototypical example (it was the first one I saw; not super up-to-date or particularly advanced but I expect the points I make about it to generalize). Looks like it walks students through implementing and training some standard net types; pretty typical for a course IIUC. The closest thing to that course which I expect would install the sorts of models the OP talks about would be a project-based course in which students have to make up a new architecture, or a new training method, or some such, and figure out things like e.g. normalization, preprocessing, performance bottlenecks, how to distinguish different failure modes, etc—and the course would provide the background mental models people use for such things. That’s pretty centrally the kind of skill behind the example of Bengio et al’s paper from the OP, and it’s not something I’d expect someone to get from Udacity’s course based on the syllabus.
Reading papers and especially blog posts from people working in ML gives some sense of what mental models are common.
For instance, both the “local minima” and “high condition numbers” examples above are mental models which at least some people use.
Talking to people working on ML projects, e.g. in the Lightcone office or during the MATS program.
Again, I often see people’s mental models in conversation.
Looking at people’s resumes/CVs.
By looking at people’s backgrounds, I can often rule out some common mental models—e.g. someone who doesn’t have much-if-any linear algebra background probably won’t understand-well-enough-to-measure gradient explosion, poorly conditioned basins in the loss landscape, low-rank components in a net, NTK-style approximations, etc (not that all of those are necessarily correct models). That doesn’t mean the person doesn’t have any gearsy mental models of nets, but it sure does rule a lot out, and the models that remain are much more limited.
Second query: any patterns which occasionally come up and can be especially revealing when they do?
If something unexpected happens, does the person typically have a hypothesis for why it happened, with some gears in it? Do they have the kind of hypotheses with sub-claims which can be tested by looking at internals of a system? If some of a model’s hypotheses turn out to be wrong, does that induce confusion about a bunch of other stuff?
Does the person’s model only engage with salient externally-visible knobs/features? A gearsy model typically points to specific internal structures as interesting things to examine (e.g. the Hessian condition number in the example earlier), which are not readily visible “externally”. If a model’s ontology only involves externally-visible behavior, then that usually means that it lacks gears.
Does the model sound like shallow babble? “Only engaging with salient externally-visible knobs/features” is one articulable sign of shallow babble, but probably my/your intuitions pick up on lots of other signs that we don’t yet know how to articulate.
These are all very much the kinds of patterns which come up in conversation and papers/blog posts.
Ok, that’s all the answer I have time for now. Not really a full answer to the question, but hopefully it gave some sense of where the intuition comes from.