One aspect you skipped over was how a superintelligence might reason when given data that admits many possible explanatory hypotheses.
You mentioned Occam's razor, and kinda touched on inductive biases, but I think you left out something important.
If you think about it, Occam's razor is part of a process: consider multiple hypotheses, take the minimum-complexity hypothesis, and discard the others.
We can do better than that, trivially. See particle filters. There the algorithm is: consider up to n possible hypotheses and store them in memory in a hypothesis space able to contain them. (So a 2D particle filter lives in a space of 2D coordinates, but it's possible to have an n-dimensional filter over the space of coherent scientific theories, etc.)
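To make that concrete, here's a minimal sketch of the standard predict/update/resample loop in Python. Everything specific (the 2D state, the noise levels, the Gaussian observation model) is a made-up placeholder; only the shape of the algorithm matters here.

```python
import numpy as np

def particle_filter_step(particles, weights, observation, obs_noise=0.5):
    """One predict/update/resample cycle of a toy 2D particle filter.

    particles: (n, 2) array, each row one hypothesis about the state.
    weights:   (n,) array of normalized hypothesis probabilities.
    """
    n = len(particles)

    # Predict: push every hypothesis forward with some process noise.
    particles = particles + np.random.normal(scale=0.1, size=particles.shape)

    # Update: reweight each hypothesis by how well it explains the observation
    # (assumed Gaussian measurement noise, purely for illustration).
    dists = np.linalg.norm(particles - observation, axis=1)
    weights = weights * np.exp(-0.5 * (dists / obs_noise) ** 2)
    weights = weights / weights.sum()

    # Resample: concentrate memory on plausible hypotheses, discard the rest.
    idx = np.random.choice(n, size=n, p=weights)
    return particles[idx], np.full(n, 1.0 / n)
```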
A human intelligence using Occam's razor is just running a particle filter where you carry over only 1 point from step to step. And during famous scientific debates, another "champion" of a second theory held onto a different point.
Since a superintelligence can have an architecture with more memory and more compute, it can hold n points. It could generate millions of hypotheses (or a near-infinite number) from the "3 frames" example, and some would contain correct theories of gravity. It could then reason using all the hypotheses it has in memory, weighted by probability, by a clustering algorithm, or by other methods, as in the sketch below. This means it would be able to act, controlling robotics in the real world or making decisions, without having found a coherent theory of gravity yet, just a large collection of hypotheses biased towards "objects fall". (3 frames is probably not enough information to act effectively.)
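A hedged sketch of what "reason using all the hypotheses it has in memory" could look like: score each candidate action under every surviving hypothesis and weight by probability, so no single settled theory is ever required. `predict_outcome` is an assumed stand-in for whatever forward model the system has; the weighting structure is the point.

```python
import numpy as np

def choose_action(actions, hypotheses, weights, predict_outcome):
    """Pick the action with the best expected value across all hypotheses.

    predict_outcome(action, hypothesis) -> scalar value of taking `action`
    if `hypothesis` about the world is true (assumed to exist).
    """
    expected = []
    for a in actions:
        # Average the action's predicted value, weighted by how much
        # probability mass each hypothesis currently holds.
        value = sum(w * predict_outcome(a, h)
                    for h, w in zip(hypotheses, weights))
        expected.append(value)
    return actions[int(np.argmax(expected))]
```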
I don't think "architecture" is just another inductive bias. "Architecture" is the dimensions of each network, the topology connecting each network, the subcomponents of each network, and the training loss functions used at each stage. What's different is that a model cannot learn its way past architectural limits: a model the scale of GPT-2 cannot approach the performance of GPT-4 no matter the amount of training data.
So: inductive bias = information you started with. It isn't strictly necessary, because a general enough network can still learn the same information given more training data.
Architecture = the technical way the machine is constructed. It puts a ceiling on capabilities even with infinite training data.
Another aspect of this is the particle filter case, where a superintelligence tracks n hypotheses for what it believes during a decision-making process. The compute needed per decision grows at least linearly in n, and in some cases much worse than that. There's probably a way to mathematically formalize this and estimate how a superintelligence's decision-making ability scales with compute, since each additional hypothesis you track has diminishing returns (probably following the same kind of power law as training loss in LLM training).
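One toy way to see the diminishing returns: if the n tracked hypotheses behave like independent Monte Carlo samples, the error of the ensemble's estimates shrinks roughly as 1/sqrt(n) while compute grows at least linearly in n. That 1/sqrt(n) scaling is an assumption here, not a derived result for this setting.

```python
# Toy scaling estimate: linear compute cost vs. ~1/sqrt(n) error,
# assuming hypotheses act like independent Monte Carlo samples.
for n in [1, 10, 100, 1_000, 10_000]:
    cost = n                # O(n) compute per decision (the best case)
    error = n ** -0.5       # assumed Monte Carlo-style error scaling
    print(f"n={n:>6}  relative cost={cost:>6}  relative error={error:.3f}")
```

Going from 1 to 100 hypotheses buys a 10x error reduction for 100x the compute; the next 10x reduction costs another 100x, which is the diminishing-returns shape described above.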
To your point about the particle filter, my whole point is that you can't just assume the superintelligence can generate an infinite number of particles, because that takes infinite processing. At the end of the day, superintelligence isn't magic: those hypotheses have to come from somewhere. They have to be built, and they have to be built sequentially. The only way you get to skip steps is by reusing knowledge that came from somewhere else.
Take a look at the game of Go. The computational limits on the number of games that could be simulated made this "try everything" approach essentially impossible. When Go was finally "solved" (in the loose sense of reaching superhuman play, not an exhaustive solution), it was with an ML algorithm that proposed only a limited number of possible sequences; it was just that the sequences it proposed were better.
But how did it get those better moves? It didn't pull them out of the air; it used abstractions it had accumulated from playing a huge number of games.
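Roughly the shape of that idea, sketched under assumptions: instead of expanding every legal move, a learned policy scores the candidates and the search only ever looks at the top few. `policy_net` is a hypothetical stand-in for whatever learned evaluator distills those accumulated abstractions.

```python
import heapq

def propose_moves(position, legal_moves, policy_net, k=5):
    """Return only the k moves the learned policy considers most promising.

    policy_net(position, move) -> prior probability the move is good,
    distilled from many self-play games (assumed to exist).
    """
    scored = [(policy_net(position, m), m) for m in legal_moves]
    # Expand a handful of promising candidates instead of every legal move.
    return [move for _, move in heapq.nlargest(k, scored, key=lambda t: t[0])]
```

The brute-force search over all continuations never runs; the expensive processing already happened during training, which is where the "knowledge that came from somewhere else" enters.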
_____
I do agree with some of the things you’re saying about architecture, though. Sometimes inductive bias imposes limitations. In terms of hypotheses, it can and does often put hard limits on which hypotheses you can consider, period.
I also admit I was wrong and was careless in saying that inductive bias is just information you started with. But I don't think it's imprecise to say that "information you started with" is just another form of inductive bias, of which "architecture" is another.
But at a certain point, the line between architecture and information is going to blur. As I've pointed out, a transformer without some of the explicit benefits of a CNN's architecture can still structure itself in a way that learns shift invariance. I also don't think any of this affects my key arguments.
Let's assume that, as part of pondering the three webcam frames, the AI thought of the rules of Go (ignoring how likely this is).
In that circumstance, in your framing of the question, would it be allowed to play several million games against itself to see if that helped it explain the arrays of pixels?
I guess so? I’m not sure what point you’re making, so it’s hard for me to address it.
My point is that if you want to build something intelligent, you have to do a lot of processing and there’s no way around it. Playing several million games of Go counts as a lot of processing.