Gears’ claim (IIUC) that ~every non-stupid AI researcher who was paying much attention knew in advance that Go was going to fall in ~2015. (For some value of “non-stupid” that’s a pretty large number of people, rather than just “me and two of my friends and maybe David Silver” or whatever.)
Specifically, the ones *working on or keeping up with* Go could *see it coming* well enough to *make solid research bets* about what would do it. If they had read up on Go, their predictive distribution over next things to try contained the thing that would work well enough to be worth scaling seriously, if you wanted to build the thing that worked. What I did, as someone not able to implement it myself at the time, was read enough of the Go research and the general pattern of neural network successes to have a solid hunch about what it looks like to approximate a planning trajectory with a neural network. It looked very much like the people actually doing the work at Facebook were on the same track. What was surprising was mostly that Google funded scaling it so early, which implied they'd found an algorithm that scaled well a bit sooner than I expected. Also, I lost a bet about how strong it would be: after updating on the matches from when it was initially announced, I thought it would win some games but lose overall; instead it won outright.
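To make "approximate a planning trajectory with a neural network" concrete, here's a minimal sketch of the kind of thing I mean: train a fast policy net to imitate the move a slow planner (MCTS or whatever) would pick in each position, so the net learns to jump straight to the search's conclusion. The architecture, sizes, and random stand-in data are all just illustrative choices of mine, not a claim about any particular published system.

```python
# Minimal sketch: distill a slow planner's move choices into a fast policy net.
# Random tensors stand in for real (position, planner-chosen-move) pairs.
import torch
import torch.nn as nn

BOARD = 19

class PolicyNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
            nn.Flatten(),
            nn.Linear(32 * BOARD * BOARD, BOARD * BOARD),  # logits over moves
        )

    def forward(self, boards):  # boards: (batch, 1, 19, 19)
        return self.net(boards)

policy = PolicyNet()
opt = torch.optim.Adam(policy.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

for step in range(100):
    # Stand-ins: in the real thing these would be board positions and the moves
    # a slow search (or a strong human) chose in them.
    boards = torch.randn(32, 1, BOARD, BOARD)
    planner_moves = torch.randint(0, BOARD * BOARD, (32,))
    loss = loss_fn(policy(boards), planner_moves)
    opt.zero_grad()
    loss.backward()
    opt.step()
```

Once trained, the net's forward pass is the "planning trajectory approximation": one evaluation instead of a tree search.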
Gears’ claim that ML has been super predictable and that Gears has predicted it all so far (maybe I don’t understand what Gears is saying and they mean something weaker than what I’m hearing?).
I have hardly predicted all of ML, but I have predicted the overall manifold of which clusters of techniques would work well, and with what success, at what scales and times. Until you challenged me to do it on Manifold, I'd intentionally kept off the record about this, except when trying to explain my intuitive/pretheoretic understanding of the general manifold of ML hunchspace, which I continue to claim is not that hard to build if you keep up with abstracts and let yourself assume it's possible to form a reasonable sense of how each abstract refines the possibility manifold. Sorry to make strong unfalsifiable claims; I'm used to it. But I think you'll hear something similar, if phrased a bit less dubiously, from deep learning researchers experienced at picking which papers to work on in the pretheoretic regime. Approximately: it's obvious to everyone who's paying attention to a particular subset what's next in that subset, but it's not necessarily obvious how much compute it'll take, whether you'll be able to find hyperparameters that work, whether your version of the idea is subtly corrupt, or whether you'll be interrupted in the middle of thinking about it because the boss wants a new vision model for ad ranking.
Gears’ level of confidence in predicting imminent AGI. (Seems possible, but not my current guess.)
Sure, I've been the most research-trajectory-optimistic person in any deep learning room for a long time, and I often wonder if that's because I'm arrogant enough to predict other people's research instead of getting my year-scale optimism burnt out by the pain of the slog of hyperparameter-searching one's own ideas, so I've ended up more calibrated about what other people's clusters can do (and even less calibrated about my own). As a capabilities researcher, you keep getting scooped by someone else who has a bigger hyperparameter search cluster! As a capabilities researcher, you keep being right about the algorithms' overall structure, but now you can't prove you knew it ahead of time in any detail! More effective capabilities researchers have this problem less; I'm certainly not one of them. Also, you can easily exceed my map quality by reading enough to train your intuitions about the manifold of what works: just drastically decrease your confidence in *everything* you've known since 2011 about what's hard and easy on tiny computers, and treat it as a palette of inspiration for what you can build now that computers are big. Roleplay as a 2015 capabilities researcher and try to use your map of the manifold of what algorithms work to predict whether each abstract will contain a paper that lives up to its claims. Just browse arXiv; don't look at the most popular papers, since those have been filtered by what actually worked well.
Btw, call me gta or tgta or something. I'm not gears, I'm a pre-theoretic map of, or reference to, them, or something. ;)
Also, I should mention: Jessicata, Jack Gallagher, and possibly Tsvi BT can tell you some of what I told them circa 2016-2017 about neural networks' trajectory. I don't know if they ever believed me until each thing was confirmed, and I don't know which things they'd remember or exactly which things were confirmed as stated, but I definitely remember arguing in person in the MIRI office on Addison, in the backest back room with beanbags and a whiteboard and, if I remember correctly, a dripping ceiling (though that's plausibly just memory decay confusing references), that neural networks are a form of program inference that works with arbitrarily complicated nonlinear programs given an acceptable network interference pattern prior, just a shitty one that needs a big network to have enough hypotheses to get it done (stated with the benefit of hindsight; it was a much lower-quality claim at the time). I feel like that's been pretty thoroughly demonstrated now, though pseudo-second-order gradient descent (Adam and friends) still has weird biases that make its results less reliable than the proper version of itself. It's so damn efficient, though, that you'd need a huge real-wattage power benefit to justify using something that was less informationally efficient relative to its vm.
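For concreteness on the "pseudo-second-order" point, here's the textbook Adam update written out by hand; a minimal sketch, not any production implementation. The rescaling by the square root of the running second-moment estimate is the crude diagonal-curvature proxy I'm gesturing at, and the moment estimates plus bias correction are where the weird biases live.

```python
# Textbook Adam update, written out by hand for illustration.
import numpy as np

def adam_step(theta, grad, m, v, t, lr=0.1, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * grad        # first moment: running mean of gradients
    v = b2 * v + (1 - b2) * grad**2     # second moment: running mean of squared gradients
    m_hat = m / (1 - b1**t)             # bias correction for early steps
    v_hat = v / (1 - b2**t)
    # Per-parameter rescaling by sqrt(second moment): the "pseudo-second-order" part.
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Toy usage: minimize f(x) = (x - 3)^2, whose gradient is 2(x - 3).
theta, m, v = np.array([0.0]), np.zeros(1), np.zeros(1)
for t in range(1, 501):
    grad = 2 * (theta - 3.0)
    theta, m, v = adam_step(theta, grad, m, v, t)
print(theta)  # settles near 3.0
```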