From what I recall, BioAnchors isn’t quite a simple model that postdicts the past, and thus isn’t really Bayesian, regardless of how detailed/explicit it is in its probability calculations. None of its main sub-model ‘anchors’ well explains/postdicts the progress of DL, the ‘horizon length’ concept seems ill-formed, and it overfocuses on predicting what I consider idiosyncratic specific microtrends (transformer LLM scaling, linear software speedups).
The model here can be considered an upgrade of Moravec’s model, which gives it the advantage that its predecessor already vaguely predicted the current success of DL many decades in advance.
But there are several improvements here:
the use of cumulative optimization power (net training compute) rather than inference compute (see the rough sketch after this list)
the bit capacity sub-model (I didn’t realize that successful BNNs and DNNs follow the same general rule in terms of model capacity vs dataset capacity. That was a significant surprise/update as I gathered the data. I think Schmidhuber made an argument once that AGI should have enough bit capacity to remember its history, but I didn’t expect that to be so true)
I personally find the “end of Moore’s Law bounding brain compute and thus implying near-future brain parity” sub-argument compelling
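To make the first bullet concrete, here is a minimal back-of-the-envelope sketch of the cumulative-compute comparison. Every value below is a loose illustrative assumption of mine rather than a number taken from the article:

```python
# Rough comparison: lifetime synaptic events in a brain vs. the training
# compute of one large model. All values are loose assumptions.
synapses = 1e14               # assumed synapse count (estimates span ~1e14-1e15)
avg_firing_rate_hz = 1.0      # assumed average spike rate through each synapse
seconds_to_adulthood = 1e9    # roughly 30 years

lifetime_synaptic_events = synapses * avg_firing_rate_hz * seconds_to_adulthood
gpt3_training_flops = 3.14e23  # commonly cited GPT-3 training compute estimate

print(f"lifetime synaptic events ~{lifetime_synaptic_events:.0e}")  # ~1e+23
print(f"GPT-3 training FLOPs     ~{gpt3_training_flops:.0e}")       # ~3e+23
```

Under those assumptions the two quantities land within an order of magnitude of each other, which is the kind of comparison the cumulative-optimization-power framing is meant to capture.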
That’s a nice critique of Bio Anchors; I encourage you to write it up! Here’s my off-the-cuff reaction to it; I apologize in advance for any confusion:
1. I don’t yet agree that BioAnchors fails to postdict the past whilst your model does. I think both do about equally well at retrodicting the progress of DL. Also, I don’t see why it’s a problem that it predicts microtrends whilst yours doesn’t.
2. I too have beef with the concept of horizon length, but I have enough respect for it that I’d like to see you write out your argument for why it’s ill-formed.
3. Bio Anchors also considers itself an upgrade to Moravec & shares in its glory. At least, it would if it didn’t have such conservative values for various parameters (e.g. TAI FLOP/s, number of medium-horizon or long-horizon data points required, rate of algorithmic progress) and thus end up with a substantially different estimate than Moravec. If instead you put in the parameters that I think you should put in, you get TAI in the 2020s.
4. Bio Anchors focuses on net training compute too. I’m confused. Oh… are you saying that it draws the comparison to biology at inference FLOPs <> synapse-firings (and/or parameter count <> synapse count), whereas you make the comparison at training compute <> synapse-firings-per-lifetime? Yeah, I can see how that might be a point in your favor, though I could also see it going the other way. I think I weakly agree with you here.
5. I agree that measuring data in bits instead of training cycles / datapoints is novel and interesting & I’m glad you did it. I’m not yet convinced that it’s superior.
6. I am not yet impressed by the “Moore’s Law ending means computers are finally as efficient as the brain in some ways, therefore AGI is nigh” argument. I’d be interested to see it spelled out more. I suspect that it is a weak argument because I don’t think that the people with longer timelines have longer timelines because they think that the brain is more efficient in those ways than computers. Instead they probably think the architecture/algorithms are better.
I don’t yet agree that BioAnchors fails to postdict the past whilst your model does. I think both do about equally well at retrodicting the progress of DL. Also, I don’t see why it’s a problem that it predicts microtrends whilst yours doesn’t.
It’s been a while since I loaded that beast of a report, but according to the main model, it estimates brain inference compute and params (1e15), then applies some weird LLM scaling function to derive training FLOPs from that, scaled by some arbitrary constant H (horizon length) to make the number much bigger, resulting in 10^30, 10^33, and 10^36 FLOPs (from Scott’s summary).
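To make that arithmetic concrete, here is a minimal sketch of the structure as I understand it. The flat one-data-point-per-parameter assumption and the round horizon multipliers are my simplifications, not the report’s exact formulas:

```python
# Simplified reconstruction of the neural-net anchor arithmetic.
# Not the report's exact formulas; the anchor values are round numbers.
inference_flops_per_s = 1e15   # anchor: brain inference compute (FLOP per subjective second)
n_params = 1e15                # anchor: parameter count
data_points = n_params         # simplification: ~1 data point per parameter

for name, horizon_s in [("short", 1e0), ("medium", 1e3), ("long", 1e6)]:
    # each data point costs (inference FLOP/s) * (horizon length in subjective seconds)
    training_flops = inference_flops_per_s * horizon_s * data_points
    print(name, f"horizon: ~{training_flops:.0e} training FLOPs")

# short horizon: ~1e+30 training FLOPs
# medium horizon: ~1e+33 training FLOPs
# long horizon: ~1e+36 training FLOPs
```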
Apply that exact same model to the milestones I’ve listed and it is wildly off base. It postdicts that we would be nowhere near having anything close to human-level, let alone primate-level, vision, etc.
I too have beef with the concept of horizon length,
It’s a new term invented in the report, not an existing concept in ML. If it were a real thing, it would already be discussed in ML papers.
I agree that measuring data in bits instead of training cycles / datapoints is novel and interesting & I’m glad you did it. I’m not yet convinced that it’s superior.
Would you have predicted in advance that successful BNNs/ANNs generally follow the same rule of having model bit capacity on the order of dataset bit capacity? Really? It’s a shockingly good fit; it invalidates a bunch of conventional wisdom about DL’s supposed data inefficiency, and it even makes specific interesting postdictions/predictions, like that the MNIST MLP has excess capacity and could easily be pruned (which is true! MNIST MLPs are notoriously weird in how easily prunable they are). This also invalidates much of the Chinchilla takeaway: GPT-3 was a bit overparameterized perhaps, but that is all.
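To give a sense of the kind of comparison I mean, here is a minimal sketch using GPT-3, where the bits-per-parameter and raw-bits-per-token values are crude assumptions of mine rather than measured capacities:

```python
import math

# Rough model-bit-capacity vs dataset-bit-capacity comparison for GPT-3.
# bits_per_param and the raw-bits-per-token figure are crude assumptions.
n_params = 175e9                     # GPT-3 parameter count
bits_per_param = 16                  # fp16 storage as a loose capacity bound
model_bits = n_params * bits_per_param

n_tokens = 300e9                     # approximate GPT-3 training tokens
bits_per_token = math.log2(50257)    # raw bits per token for a ~50k BPE vocabulary
data_bits = n_tokens * bits_per_token

print(f"model capacity ~{model_bits:.1e} bits")  # ~2.8e+12
print(f"dataset size   ~{data_bits:.1e} bits")   # ~4.7e+12
# Same order of magnitude, which is the pattern the bit-capacity sub-model points at.
```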
I am not yet impressed by the “Moore’s Law ending means computers are finally as efficient as the brain in some ways, therefore AGI is nigh” argument. I’d be interested to see it spelled out more. I suspect that it is a weak argument because I don’t think that the people with longer timelines have longer timelines because they think that the brain is more efficient in those ways than computers. Instead they probably think the architecture/algorithms are better.
There’s a whole contingent who believe the brain has many OOM more compute than current GPUs, and that this explains the lack of AGI. The idea that we are actually near the end of Moore’s Law due to physical limits invalidates that, and then the remaining uncertainty around AGI is the software gap, as you say, which I mostly address in the article (human brain exceptionalism).
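As a minimal sketch of why I don’t think the gap is many OOM, with the synapse count, average firing rate, and GPU throughput all as rough assumptions:

```python
# Back-of-the-envelope: brain synaptic throughput vs. a single modern GPU.
# All values are rough assumptions for illustration only.
synapses = 1e14               # assumed synapse count (estimates span ~1e14-1e15)
avg_firing_rate_hz = 1.0      # assumed average spike rate through each synapse
brain_ops_per_s = synapses * avg_firing_rate_hz   # ~1e14 synaptic events/s

gpu_flops_per_s = 3e14        # roughly an A100-class GPU at reduced precision

print(brain_ops_per_s / gpu_flops_per_s)  # ~0.33: same ballpark, not many OOM apart
```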
But to be clear, I do believe there is a software gap, and that it is part of the primary explanation for why we don’t yet have AGI via some super-project; it’s just more of a gap between current DL systems and brain algorithms in general, rather than a specific human-brain gap. It’s also closing.
Thanks for the point-by-point reply!
Re 1: The scaling function isn’t weird, and the horizon length constant isn’t arbitrary. But I think I see what you are saying now. Something like “We currently have stuff about as impressive as a raven/bee/etc., but if you were to predict when we’d get that using Bio Anchors, you’d predict 2030 or something like that, because you’d be using medium-horizon training and 1 data point per parameter and 10x as many parameters as ravens/bees/etc. have synapses...” What if I don’t agree that we currently have stuff about as impressive as a raven/bee/etc.? You mention primate-level vision; that does seem like a good argument to me because it’s hard to argue that we don’t have good vision these days. But I’d like to see the math worked out. I think you should write a whole post (doesn’t need to be long) on just this point, because if you are right here I think it’s pretty strong evidence for shorter timelines & will be convincing to many people.
Re 2: “if it was a thing it would be in ML papers already” hahaha… I don’t take offense, and I hope you don’t take offense either, but suffice it to say this appeal to authority has no weight with me & I’d appreciate an object-level argument.
Re 5: No no I agree with you here, that’s why I said it was novel & interesting. (Well, I don’t yet agree that the fit is shockingly good. I’d want to think about it more & see a graph & spot check the calculations, and compare the result to the graphs Ajeya cites in support of the horizon length hypothesis.)
Re 6: Ah, OK. Makes sense. I’ll stop trying to defend people who I think are wrong and let them step up and defend themselves. On this point at least.
You mention primate-level vision; that does seem like a good argument to me because it’s hard to argue that we don’t have good vision these days. But I’d like to see the math worked out. I think you should write a whole post (doesn’t need to be long) on just this point, because if you are right here I think it’s pretty strong evidence for shorter timelines & will be convincing to many people.
I’ve updated the article to include a concise summary of a subset of the evidence for parity between modern vision ANNs and primate visual cortex, and between modern LLMs and linguistic cortex. I’ll probably also summarize the discussion of Cat vs VPT, but I do think that VPT > Cat in terms of actual AGI-relevant skills, even though the Cat brain would still be a better architecture for AGI. We haven’t really tried as hard at the sim-Cat task (unless you count driverless cars, but I’d guess those may require raven-like intelligence, and robotics lags due to much harder inference performance constraints). That’s all very compatible with the general thesis that we get hardware parity first, then software catches up a bit later. (At this point I would not be surprised if we have AGI before driverless cars are common.)
Re 1: Yeah, so maybe I need to put more of the comparisons in an appendix or something; I’m just assuming background knowledge here that others may not have. Biological vision has been pretty extensively studied and is fairly well understood. We’ve had detailed functional computational models that can predict activations in IT since 2016 or so; they are DL models. I discussed some of that in my previous brain efficiency post here. More recently the same approach was used to model linguistic cortex using LLMs and was just as effective or more so, as discussed a bit in my simbox post here. So I may just be assuming common background knowledge that BNNs and ANNs converge to learn similar or even equivalent circuits given similar training data.
I guess I just assume as background that readers know:
we have superhuman vision, and not by using very different techniques, but by using techniques functionally equivalent to the brain’s, and P explains performance
vision is typically 10% of the compute of most brains, and since cortex is uniform this implies that language, motor control, navigation, etc. are all similar and can be solved using similar techniques (I did predict this in 2015). Transformer LLMs recently fulfilled this for language.
Comparisons to full brains are more complex because there is, to a first approximation, little funding for serious foundation DL projects trying to replicate cat brains, for example. We only have things like VPT, which I try to compare to cats in another comment thread. But basically I do not think cats are intelligent in the way ravens/primates are. For example: my cat doesn’t really understand what it’s doing when it digs to cover pee in its litter box. It just sort of blindly follows an algorithm (after smelling urine/poo, dig vaguely in several random related directions).
One issue is that there’s a mystery bias: chess was once considered a test of intelligence, etc.
Re: 2. By saying “if horizon length was a thing, it would be a thing in ML papers”, I mean we would be seeing the effect; it would be something discussed and modeled in scaling-law analysis, etc. So BioAnchors has to explain (and provide pretty enormous evidence for, at this point) that horizon length is already a thing in DL, a thing that helps explain/predict training, etc.
The steelman version of ‘horizon length’ is perhaps some abstraction of 1) reward sparsity in RL, but there’s nothing fundamental about that, and neither BNNs nor advanced ANNs are limited by it, because they use denser self-supervised signals; or 2) meta-learning, but if the model is that use of meta-learning causes H > 1, then it just predicts that teams don’t use meta-learning (which is mostly correct): optional use of meta-learning can only speed up progress, and may be used internally in brains or large ANNs anyway, in which case the H model doesn’t really apply.
Re: 5. I don’t see the direct connection between dataset capacity vs. model capacity and the ‘horizon length hypothesis’?
some cats might not understand, but others definitely do: https://youtube.com/c/BilliSpeaks
I’ve seen a few of those before, and it’s hard to evaluate cognition from a quick glance. I doubt Billi really uses/understands even that vocab, but it’s hard to say. My cat clearly understands perhaps a dozen words/phrases, but it’s hard to differentiate that from ‘only cares about a dozen words/phrases’.
The thing is, if you had a VPT-like Minecraft agent with similar vocab/communication skills, few would care or find it impressive.
Thanks, I like that summary.