I’m a bit confused about how my comment here relates to your post. If p is the goal (“win at chess”), the simplest search-based agent is just about exactly as complicated as p (“do a minimax search to win at chess”). But RL(p), at least with the usual definition of “RL”, will learn a very complicated computation that contains lots of information about which particular configurations of pieces are advantageous or not, e.g. the ResNet at the core of AlphaZero.
Are you imagining that the policy π is searching or discriminative? If the latter, why are you saying that π is just as simple as p? (Or are you saying that?) “Win at chess” seems a lot simpler than “do the calculations described by the following million-parameter ResNet”, right?
I’m not sure I understand the search vs discriminative distinction. If my hand touches fire and thus immediately moves backwards by reflex, would this be an example of a discriminative policy, because an input signal directly causes an action without being processed in the brain?
About the goal of winning at chess: in the case of minimax search, p generates the complete tree of the game using D and then selects the winning policy; as you said, this is probably the simplest agent (in terms of Kolmogorov complexity, given D) that wins at chess—and actually wins at any game that can be solved using minimax/backward induction. In the case of RL(p), p reads the environmental data D about chess to assign reward 1 to winning states and 0 elsewhere, and RL represents an ideal RL procedure that exploits interaction with the environment to generate the optimal policy π that maximises the reward function created by p. The main feature is that in both cases, when the environment gets bigger and D grows, the description length of the two algorithms given D doesn’t change: you could use minimax or the ideal RL procedure to generate a winning policy even for chess on a larger board, for example. If instead you wanted to use a giant lookup table, you would have to extend your algorithm each time a new state gets added to the environment.
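To make this concrete, here is a minimal sketch of the minimax case in Python. The functions legal_moves, result and winner are hypothetical stand-ins for whatever interface D provides; the point is only that this program text stays the same length however large the game, and hence D, becomes.

```python
# Minimal minimax sketch. legal_moves(state), result(state, move) and
# winner(state) (returning +1, -1, 0, or None if the position is not terminal)
# are hypothetical stand-ins for the interface D provides. The program text
# does not grow when the board, and hence the game tree, grows.

def minimax(state, player, legal_moves, result, winner):
    w = winner(state)
    if w is not None:                      # terminal position: its value is known
        return w, None
    best_value, best_move = None, None
    for move in legal_moves(state):
        value, _ = minimax(result(state, move), -player, legal_moves, result, winner)
        if best_value is None or value * player > best_value * player:
            best_value, best_move = value, move
    return best_value, best_move           # value of the position and a best move
```

Under the same assumption, the reward function p in the RL(p) case is just as short: it returns 1 when winner(state) is the agent’s colour and 0 otherwise.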
I guess the confusion may come from the fact that D is underspecified. I tried to formalise it more precisely by using logic, but there were some problems and it’s still work in progress.
By the way, thanks for the links! I hope I’ll learn something new about how the brain works, I’m definitely not an expert on cognitive science :)
Hmm, maybe we’re talking past each other. Let’s say I have something like AlphaZero, where 50,000 bytes of machine code trains an AlphaZero-type chess-playing agent, whose core is a 1-billion-parameter ConvNet. The ConvNet takes 1 billion bytes to specify. Meanwhile, the reward-calculator p, which calculates whether checkmate has occurred, is 100 bytes of machine code.
Would you say that the complexity of the trained chess-playing agent is 100 bytes or 50,000 bytes or 1 billion bytes?
I guess you’re going to say 50,000, because you’re imagining a Turing machine that spends a year doing the self-play to calculate the billion-parameter ConvNet, and then immediately that same Turing machine starts running the ConvNet it just calculated. From the perspective of Kolmogorov complexity, it doesn’t matter that it spends a year calculating the ConvNet, as long as it does so eventually.
By the same token, you can always turn a search-y agent into an equivalent discriminative-y agent, given infinite processing time and storage, by training the latter on a googol queries to the former. If you’re thinking about Kolmogorov complexity, then you don’t care about the googol queries, as long as it works eventually.
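To make the train-then-run point concrete, here’s a toy sketch (the game and every name in it are made up for illustration): the program text below is small and fixed, even though the table it computes grows with the size of the game, and from the Kolmogorov-complexity perspective only the program text counts.

```python
# Toy version of "the same small program first computes the policy, then runs
# it". The game is single-pile Nim (take 1-3 stones, taking the last stone
# wins), a made-up stand-in for chess. Only the length of this program text
# matters for Kolmogorov complexity, not the size of the table it builds.

def solve(n_stones, max_take=3):
    win = {0: False}            # with 0 stones left, the player to move has lost
    best = {0: None}
    for pile in range(1, n_stones + 1):
        win[pile], best[pile] = False, 1
        for take in range(1, min(max_take, pile) + 1):
            if not win[pile - take]:        # leave the opponent in a losing spot
                win[pile], best[pile] = True, take
                break
    return best                 # the "giant lookup table": grows with the game

def play(pile, table):
    while pile > 0:             # both sides just read their moves off the table
        take = table[pile]
        print(f"pile={pile}: take {take}")
        pile -= take

table = solve(1_000_000)        # the slow "self-play"/solving phase...
play(10, table)                 # ...then the same program immediately acts with it
```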
Therefore, my first comment is not really relevant to what you’re thinking about. Sorry. I was not thinking about algorithms-that-write-arbitrary-code-and-then-immediately-run-it, I was thinking about the complexity of the algorithms that are actually in operation as the agent acts in the world.
“If my hand touches fire and thus immediately moves backwards by reflex, would this be an example of a discriminative policy, because an input signal directly causes an action without being processed in the brain?”
Yes. But the lack-of-processing-in-the-brain is not the important part. A typical ConvNet image classifier does involve many steps of processing, but it is still discriminative, not search-y, because it does not work by trying out different generative models and picking the one that best explains the data. You can build a search-y image classifier that does exactly that, but most people these days don’t.
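Concretely, I mean something like the following sketch, where discriminative_net, generative_models, render and error are all made-up placeholders rather than any real library’s API:

```python
# Schematic contrast between the two styles of image classifier. All of the
# arguments are hypothetical placeholders, not any real library's API.

def classify_discriminative(image, discriminative_net):
    # one feed-forward pass straight from pixels to a label
    return discriminative_net(image)

def classify_by_synthesis(image, generative_models, render, error):
    # "analysis by synthesis": search over candidate generative models and
    # keep the one whose rendered prediction best explains the observed pixels
    best_label, best_err = None, float("inf")
    for label, model in generative_models.items():
        err = error(render(model), image)
        if err < best_err:
            best_label, best_err = label, err
    return best_label
```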
If we’re talking about algorithmic complexity, there’s a really important distinction, I think. In the space of actions, we have:
SEARCH: Search over an action space for an action that best achieves a goal
DISCRIMINATIVE: There is a function that goes directly from sensory inputs to the appropriate action (possibly with a recurrent state etc.; a minimal code sketch of both options appears below)
Likewise, in the space of passive observations (e.g. classifying images), we have:
SEARCH: Search over a space of generative models for the model that best reproduces the observations (a.k.a. “analysis by synthesis”)
DISCRIMINATIVE: There is a function that goes directly from sensory inputs to understanding / classification.
The search methods are generally:
Algorithmically simpler
More sample-efficient
Slower at run-time
(Incidentally, I think the neocortex does the “search” option in both these cases (and that they aren’t really two separate cases in the brain). Other parts of the brain do the “discriminative” option in certain cases.)
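In code, the action-space pair is roughly the following sketch; simulate, goal_score and policy_net are hypothetical placeholders for the agent’s world model, goal, and trained policy:

```python
# Sketch of the two action-selection styles. simulate, goal_score and
# policy_net are hypothetical placeholders, not any particular library's API.

def act_by_search(observation, candidate_actions, simulate, goal_score):
    # SEARCH: evaluate each candidate action in a model of the world and
    # pick the one whose predicted outcome best achieves the goal
    return max(candidate_actions,
               key=lambda a: goal_score(simulate(observation, a)))

def act_discriminative(observation, policy_net):
    # DISCRIMINATIVE: a single learned mapping from sensory input to action
    return policy_net(observation)
```

The search version has to pay for evaluating candidates at run-time, which is part of why it’s slower even though it can be specified so much more compactly.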
No worries, I think your comment still provides good food for thought!