Thank you for writing this, Abram.

The main difference between our ways of thinking seems to be this: you are thinking in terms of what happens inside the algorithm, whereas I am thinking in terms of what desiderata the algorithm satisfies. I trust the latter approach more because, while we can only speculate about how intelligent agents work, we should be able to explain what an intelligent agent is: after all, something made us study intelligent agents in the first place. (Of course we do have some clues about how intelligent agents work from cognitive science and applied ML, but we are far from having the full picture.) In particular, I don’t directly care about the distinction between model-based and model-free.
Specifically, the desideratum I often consider is some type of regret bound. This desideratum doesn’t directly refer to prediction: it only cares about expected utility! Your reasoning would have us conclude that it corresponds to model-free RL. However, the actual algorithms I use to prove regret bounds use posterior sampling: a model-based approach! So not only do predictions enter the picture, they enter it in a way that is clearly justified rather than ad hoc. (That said, the definition of Bayesian regret already refers to a prior, and the coefficients in the regret bounds depend on this prior, so I do introduce a notion of “model” a priori.)
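To make the “model-based algorithm, utility-only criterion” point concrete, here is a minimal sketch of a posterior-sampling loop. Everything about the interfaces (the plan, run_episode and update callables, the episodic structure) is a placeholder of my own for illustration, not the specific construction used in the actual regret-bound proofs.

```python
import numpy as np

def posterior_sampling_rl(models, prior, plan, run_episode, update, num_episodes):
    """Minimal posterior-sampling (PSRL-style) loop -- an illustrative sketch only.

    models       -- list of candidate environment models (the prior's support)
    prior        -- initial probability vector over `models`
    plan         -- callable(model) -> policy that is optimal *if* that model is true
    run_episode  -- callable(policy) -> observed trajectory
    update       -- callable(posterior, trajectory) -> Bayes-updated posterior
    """
    posterior = np.asarray(prior, dtype=float)
    for _ in range(num_episodes):
        # Sample one model from the current posterior...
        i = np.random.choice(len(models), p=posterior)
        # ...and act as if the sampled model were the true environment.
        policy = plan(models[i])
        trajectory = run_episode(policy)
        # Prediction enters only here, through the Bayesian update on data;
        # the criterion being optimized is still expected utility (regret).
        posterior = update(posterior, trajectory)
    return posterior
```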
(It should also be noted that a reward-learning framework presumes we get feedback about utility at all. If we get no feedback about reward, then we’re forced to only judge hypotheses by predictions, and make what inferences about utility we will. A dire situation for learning theory, but a situation where we can still talk about rational agency more generally.)
I do not agree that this is a “dire situation for learning theory”. Quite the contrary, this is exactly the situation I formalized using instrumental reward functions. Unless you think instrumental reward functions are not good enough for some reason? (Of course, in my setting, the agent can get feedback about utility, it just doesn’t get it all the time for free. But, the only alternative seems to be “flying spaghetti monster” utility functions that are completely unobservable, and I am not sure that’s an interesting case to study.)
The notion of agents that “make up math puzzles” is not necessarily a bad one, but we should try to be more precise about what it means in terms of desiderata: we need to specify precisely what advantages “making up math puzzles” is supposed to confer.
However, there’s no special dividing line between decidable and undecidable problems—any particular restriction to a decidable class might rule out some interesting (decidable but non-obviously so) stuff which we could learn from. So we might end up just going with any computations (halting or no).
It might be useful to consider arbitrary programs, but the agent will still not execute any of them for more than some small finite time. You might argue that this is why it needs logic, to prove theorems about computations that it cannot execute. But, I don’t believe this line of thinking leads to anything tractable that is not redundant w.r.t. a simpler logic-less approach.
[EDIT: Renamed “CSRL” to “TRL” 2019-09-19]
One related idea that I came up with recently is what I call “Turing reinforcement learning” (TRL). (It is rather half-baked, but I feel that the relevance is such that I should still bring it up. Also, it was inspired by our discussion, so thank you.) Imagine an RL agent connected to a computer. The agent can ask the computer to execute any program (for some finite amount of time, of course), and its prior already “knows” that such computations produce consistent answers (something about the semantics of programs can also be built into the prior, bringing us closer to logic, but I don’t believe it can be anything nearly as strong as “proving theorems about programs using PA” if we want the algorithm to be even weakly feasible). This allows it to consider incomplete hypotheses that describe some parts of the environment in terms of programs (i.e. some physical phenomenon is hypothesized to behave equivalently to the execution of a particular program). The advantage of this approach: better management of computational resources. In classical RL, the resources are allocated “statically”: any hypothesis is either in the prior or not. In contrast, in TRL the agent can decide for itself which programs are worth running and when. Starting from generic regret bounds for incomplete models, this should lead to TRL satisfying some desiderata superior to classical RL with the same amount of resources, although I haven’t worked out the details yet.
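To give a rough feel for the interface I have in mind (this is only a sketch: the step budget, the caching, and the choose_queries/choose_action subroutines are placeholders rather than part of any worked-out formalism):

```python
class Computer:
    """The external computer the agent can query (sketch)."""
    def __init__(self, step_budget):
        self.step_budget = step_budget

    def run(self, program, inputs):
        # `program` is assumed to support step-bounded execution and to return
        # its output, or None if it has not halted within the budget.
        return program(inputs, max_steps=self.step_budget)


class TRLAgent:
    """Sketch of a TRL agent: classical RL plus the ability to buy computations."""
    def __init__(self, incomplete_hypotheses, computer, choose_queries, choose_action):
        # Each hypothesis may describe only part of the environment, e.g.
        # "this physical subsystem behaves like the execution of program P".
        self.hypotheses = incomplete_hypotheses
        self.computer = computer
        self.choose_queries = choose_queries   # policy for buying computations
        self.choose_action = choose_action     # policy for acting in the world
        self.cache = {}                        # results already paid for

    def step(self, observation):
        # Resource allocation is dynamic: the agent itself decides which
        # programs are worth running and when, instead of the prior fixing
        # this "statically" in advance.
        for program, inputs in self.choose_queries(observation, self.cache):
            if (program, inputs) not in self.cache:
                self.cache[(program, inputs)] = self.computer.run(program, inputs)
        return self.choose_action(observation, self.cache)
```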
This gets back to my earlier intuition about agents having a reasonable chance of getting certain things right on the first try. Learning-theoretic agents don’t get things right on the first try. However, agents who learn from logic have “lots of tries” before their first real try in physical time. If you can successfully determine which logical scenarios are relevantly analogous to your own, you can learn what to do just by thinking.
This seems to me the wrong way to look at it. There is no such thing as “learning-theoretic agents”; there are agents with good learning-theoretic properties. If there is a property that you can prove in learning theory to be impossible, then no agent can have this property. Once again, I am not ruling out this entire line of thought, but I am arguing for looking for desiderata that can formalize it, instead of using it to justify ad hoc algorithms (e.g. algorithms that use logic).
So, getting back to Vanessa’s point in the comment I quoted: can we solve MIRI-style decision problems by considering the iterated problem, rather than the single-shot version? To a large extent, I think so: in logical time, all games are iterated games. However, I don’t want to have to set an agent up with a training sequence in which it encounters those specific problems many times. For example, finding good strategies in chess via self-play should come naturally from the way the agent thinks about the world, rather than being an explicit training regime which the designer has to implement. Once the rules for chess are understood, the bottleneck should be thinking time rather than (physical) training instances.
But, you have to set an agent up with a training sequence, otherwise how does it ever figure out the rules of chess? I definitely agree that in some situations computational complexity is the bottleneck rather than sample complexity. My point was different: if you don’t add the context of a training sequence, you might end up with philosophical questions that are either poor starting points or don’t make sense at all.
The main difference between our ways of thinking seems to be this: you are thinking in terms of what happens inside the algorithm, whereas I am thinking in terms of what desiderata the algorithm satisfies. I trust the latter approach more because, while we can only speculate about how intelligent agents work, we should be able to explain what an intelligent agent is: after all, something made us study intelligent agents in the first place. (Of course we do have some clues about how intelligent agents work from cognitive science and applied ML, but we are far from having the full picture.) In particular, I don’t directly care about the distinction between model-based and model-free.
I talked a lot about desiderata in both part (1) and part (2)! Granted, I think about these things somewhat differently than you. And I did not get very concrete with desiderata.
Part (1a) was something like “here are some things which seem like real phenomena when you look at intelligence, and how they relate to learning-theoretic criteria when we formalize them” (ie, regret bounds on reward vs regret bounds on predictive accuracy). Part (1b) was like “here are some other things which seem to me like real phenomena relating to intelligence, which a formal theory of intelligence should not miss out on, and which seem real for very similar reasons to the more-commonly-discussed stuff from section 1a”.
Really, though, I think in terms of the stuff in (2) much more than the stuff in (1). I’m not usually thinking: “It sure seems like real agents reason using logic. How can we capture that formally in our rationality criteria?” Rather, I’m usually thinking things like: “If agents know a good amount about each other due to access to source code or other such means, then a single-shot game has the character of iterated game theory. How can we capture that in our rationality criteria?”
Perhaps one difference is that I keep talking about modeling something with regret bounds, whereas you are talking about achieving regret bounds. IE, maybe you know more precisely what notion of regret you want to minimize, and are going after algorithms which help you minimize it; and I am more uncertain about what notion of regret is relevant, and am looking for formulations of regret which model what I am interested in.
Specifically, the desideratum I often consider is some type of regret bound. This desideratum doesn’t directly refer to prediction: it only cares about expected utility! Your reasoning would have us conclude that it corresponds to model-free RL. However, the actual algorithms I use to prove regret bounds use posterior sampling: a model-based approach! So not only do predictions enter the picture, they enter it in a way that is clearly justified rather than ad hoc.
I see what you mean about not fitting the model-based/model-free dichotomy the way I set it up. In terms of the argument I try to make in the post, I’m drawing an analogy between prediction and logic. So I’m saying that (if we think of both prediction and logic as “proxies” in the way I describe) it is plausible that formal connections to prediction have analogs in formal connections to logic, because both seem to play a similar role in connection to doing well in terms of action utility.
(That said, the definition of Bayesian regret already refers to a prior, and the coefficients in the regret bounds depend on this prior, so I do introduce a notion of “model” a priori.)
Are you referring to results which require realizability here? (I know it can make sense to get a regret bound in terms of the prior probability w/o assuming that one of the things in the prior has the best asymptotic loss, but it is common to need such an assumption.)
Part of what I’m saying in the second part of the original post is that I find that I have to introduce logic a priori to discuss certain kinds of regret I think about, similarly to how here you are forced to introduce “model” a priori for certain kinds of regret. (I suspect I didn’t explain this very well because I thought my motivations were clear but underestimated the inferential gap.)
I do not agree that this is a “dire situation for learning theory”. Quite the contrary, this is exactly the situation I formalized using instrumental reward functions. Unless you think instrumental reward functions are not good enough for some reason? (Of course, in my setting, the agent can get feedback about utility, it just doesn’t get it all the time for free. But, the only alternative seems to be “flying spaghetti monster” utility functions that are completely unobservable, and I am not sure that’s an interesting case to study.)
I had in mind the setting where no particular assumption is made about the observability of utility at all, like in classic settings such as Savage etc. I’m not claiming it is very relevant to the discussion at hand. It’s just that there is some notion of rationality to be studied in such cases. I also misspoke in “dire situation for learning theory”—I don’t want to prematurely narrow what learning theory can fruitfully discuss. Something more like “dire situation for reinforcement learning” would have been more accurate.
It might be useful to consider arbitrary programs, but the agent will still not execute any of them for more than some small finite time. You might argue that this is why it needs logic, to prove theorems about computations that it cannot execute. But, I don’t believe this line of thinking leads to anything tractable that is not redundant w.r.t. a simpler logic-less approach.
I was definitely not suggesting that the agent would have to use logic in order to get feedback about programs which take too long to execute. (It could do that, and it could be useful, but it isn’t my argument for why the agent uses logic-like reasoning.) Rather, the idea was more that by virtue of testing out ways of reasoning on the programs which can be evaluated, the agent would learn to evaluate more difficult cases which can’t simply be run, which would in turn sometimes prove useful in dealing with the external environment.
One related idea that I came up with recently is what I call “Turing reinforcement learning” (TRL). [...]
Yeah, this would get at some of the idea of learning more about the environment by learning about computations. In addition to the potential for management of computational resources, it would illustrate something about how incomplete models help with logical uncertainty.
I think some of our disagreement has to do with my desire to model things which humans seem to do just fine (eg, heuristically estimate the probability of sentences in PA). Attempting to make models of this easily leads to infeasible things (such as logical induction). My reaction to this is: (1) at least now we have some idea about what it means to do well at the task; (2) but yeah, this means this isn’t how humans are doing it, so there is still something left to understand.
But, you have to set an agent up with a training sequence, otherwise how does it ever figure out the rules of chess? I definitely agree that in some situations computational complexity is the bottleneck rather than sample complexity. My point was different: if you don’t add the context of a training sequence, you might end up with philosophical questions that are either poor starting points or don’t make sense at all.
I’m not saying you don’t have a training sequence at all. I’m saying the training sequence is enough to figure out the rules of chess (perhaps from indirect information, without necessarily ever having played chess), but not enough to learn to play well from experience.
Obviously in order to understand the rules of chess quickly from a human explanation, the system would need to have already learned quite a lot about the world (or have been specialized to this sort of task).
The point is that there is also something to gain from training to see how computations go (perhaps computations which look like counterfactuals, such as the game tree in chess).
Yeah, I’m now feeling more confident that the difference in perspectives is this: you are thinking of a relatively fixed notion of regret, and want things to be relevant to achieving good bounds in terms of that. I am more wanting a notion of regret that successfully captures something I’m trying to understand. EG, logical induction was a big improvement to my understanding of the phenomenon of conjecture in mathematics—how it can be understood within the context of a rationality criterion. It doesn’t directly have to allow me to improve regret bounds for reward learners. It’s an important question whether it helps us do anything we couldn’t before in terms of reward learning. (For me it certainly did, since I was thinking in terms of Bayesians with full models before—so logical induction showed me how to think about partial models and such. Perhaps it didn’t show you how to do anything you couldn’t already, since you were already thinking about things in optimal predictor terms.) But it is a different question from whether it tells us anything about intelligence at all.
I had in mind the setting where no particular assumption is made about the observability of utility at all, like in classic settings such as Savage etc. I’m not claiming it is very relevant to the discussion at hand. It’s just that there is some notion of rationality to be studied in such cases. I also misspoke in “dire situation for learning theory”—I don’t want to prematurely narrow what learning theory can fruitfully discuss. Something more like “dire situation for reinforcement learning” would have been more accurate.
By the way, from the perspective of what I call “subjective” learnability, we can consider this case as well. In subjective learnability, we define regret with respect to an optimal agent that has all of the user’s knowledge (rather than knowing the actual “true” environment). So, it seems not difficult to imagine communication protocols that will allow the AI to learn the user’s knowledge, including the unobservable reward function (we assume the user imagines the environment as a POMDP with rewards that are not directly observable but are defined as a function of the state). (Incidentally, I have recently been leaning towards the position that some kind of generic communication protocol (with natural language “annotation” + quantilization to avoid detrimental messages) is the best approach to achieving alignment.)
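Schematically (the notation here is only suggestive, not an official definition): if ζ is the user’s belief about the environment and U(π, μ) is the expected utility of policy π in environment μ, then subjective regret compares the AI’s policy π_AI against the best policy available to an agent that knows only ζ:

$$\mathrm{Reg}_{\mathrm{subj}} \;=\; \max_{\pi}\,\mathbb{E}_{\mu\sim\zeta}\big[U(\pi,\mu)\big] \;-\; \mathbb{E}_{\mu\sim\zeta}\big[U(\pi_{\mathrm{AI}},\mu)\big]$$

so the benchmark is an optimal agent with the user’s knowledge, not one that knows the true environment.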
I’m not usually thinking: “It sure seems like real agents reason using logic. How can we capture that formally in our rationality criteria?” Rather, I’m usually thinking things like: “If agents know a good amount about each other due to access to source code or other such means, then a single-shot game has the character of iterated game theory. How can we capture that in our rationality criteria?”
I would study this problem by considering a population of learning agents that are sequentially paired up for one-shot games where the source code of each is revealed to the other. This way they can learn how to reason about source code. In contrast, the model where the agents try to formally prove theorems seems poorly motivated to me.
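As a toy sketch of that training setup (the act/learn/source_code interfaces and the payoff function are made up for illustration):

```python
import random

def open_source_population_training(agents, payoffs, num_rounds, rng=random):
    """Repeatedly pair agents for games that are one-shot within each pairing,
    with mutual source-code disclosure (illustrative sketch)."""
    for _ in range(num_rounds):
        a, b = rng.sample(agents, 2)
        # Each agent sees the other's source code before choosing its move.
        move_a = a.act(opponent_source=b.source_code)
        move_b = b.act(opponent_source=a.source_code)
        payoff_a, payoff_b = payoffs(move_a, move_b)
        # Learning happens only across pairings; each individual game stays
        # genuinely one-shot.
        a.learn(b.source_code, move_b, payoff_a)
        b.learn(a.source_code, move_a, payoff_b)
```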
Perhaps one difference is that I keep talking about modeling something with regret bounds, whereas you are talking about achieving regret bounds. IE, maybe you know more precisely what notion of regret you want to minimize, and are going after algorithms which help you minimize it; and I am more uncertain about what notion of regret is relevant, and am looking for formulations of regret which model what I am interested in.
Hmm, no, I don’t think this is it. I think that I do both of those. It is definitely important to keep refining our regret criterion to capture more subtle optimality properties.
So I’m saying that (if we think of both prediction and logic as “proxies” in the way I describe) it is plausible that formal connections to prediction have analogs in formal connections to logic, because both seem to play a similar role in connection to doing well in terms of action utility.
Do they, though? I am not convinced that logic has a meaningful connection to doing well in terms of action utility. I think that for prediction we can provide natural models that show how prediction connects to doing well, whereas I can’t say the same about logic. (But I would be very interested in being proved wrong.)
Are you referring to results which require realizability here?
I usually assume realizability, but for incomplete models “realizability” just means that the environment satisfies some incomplete models (i.e. has some regularities that the agent can learn).
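(Spelling that out in what I hope is faithful notation: an incomplete model is a set Φ of environments, the true environment μ “satisfies” Φ when μ ∈ Φ, and realizability with respect to a prior over incomplete models only requires that μ satisfies at least one model with positive prior weight; the regret guarantee then only holds the agent to what those satisfied models promise.)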
...the idea was more that by virtue of testing out ways of reasoning on the programs which can be evaluated, the agent would learn to evaluate more difficult cases which can’t simply be run, which would in turn sometimes prove useful in dealing with the external environment.
Yes, I think this idea has merit, and I explained how TRL can address it in another comment. I don’t think we need or should study this using formal logic.
...I was thinking in terms of Bayesians with full models before—so logical induction showed me how to think about partial models and such. Perhaps it didn’t show you how to do anything you couldn’t already, since you were already thinking about things in optimal predictor terms.
Hmm, actually I think logical induction influenced me also to move away from Bayesian purism and towards incomplete models. I just don’t think logic is the important part there. I am also not sure about the significance of this forecasting method, since in RL it is more natural to just do maximin instead. But, maybe it is still important somehow.
...”If agents know a good amount about each other due to access to source code or other such means, then a single-shot game has the character of iterated game theory. How can we capture that in our rationality criteria?”
I would study this problem by considering a population of learning agents that are sequentially paired up for one-shot games where the source code of each is revealed to the other. This way they can learn how to reason about source code. In contrast, the model where the agents try to formally prove theorems seems poorly motivated to me.
Actually, here’s another thought about this. Consider the following game: each player submits a program, then the programs are given each other as inputs and executed in anytime mode (i.e. at any given moment each program has to have some answer ready). The execution has some probability ϵ of terminating on each time step, so that the total execution time follows a geometric distribution. Once execution is finished, the outputs of the programs are interpreted as strategies in a normal-form game and the payoffs are assigned accordingly.
This seems quite similar to an iterated game with geometric time discount! In particular, in this version of the Prisoner’s Dilemma, there is the following analogue of the tit-for-tat strategy: start by setting the answer to C, simulate the other agent, and once the other agent starts producing answers, set your answer equal to the other agent’s answer. For sufficiently shallow time discount, this is a Nash equilibrium.
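Here is a toy sketch of this tit-for-tat analogue. To keep it short, the driver hands each program the opponent’s current answer directly, standing in for the “simulate the other agent” step above, and the payoff table is made up for illustration.

```python
import random

# Illustrative Prisoner's Dilemma payoffs.
PAYOFFS = {("C", "C"): (2, 2), ("C", "D"): (0, 3),
           ("D", "C"): (3, 0), ("D", "D"): (1, 1)}

def play_anytime_pd(program_a, program_b, eps, rng=random):
    """Run two 'anytime' programs step by step; terminate with probability eps
    per step, so total execution time is geometrically distributed. Each program
    is a generator that yields its current answer after every step."""
    gen_a, gen_b = program_a(), program_b()
    ans_a, ans_b = next(gen_a), next(gen_b)   # an answer must always be ready
    while rng.random() > eps:
        # Simultaneous step: each program sees the opponent's previous answer.
        ans_a, ans_b = gen_a.send(ans_b), gen_b.send(ans_a)
    return PAYOFFS[(ans_a, ans_b)]

def copycat():
    """Tit-for-tat analogue: answer C until the opponent's answer is visible,
    then keep mirroring the opponent's current answer."""
    opponent = yield "C"
    while True:
        opponent = yield opponent

print(play_anytime_pd(copycat, copycat, eps=0.05))   # -> (2, 2): mutual cooperation
```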
In line with the idea I explained in a previous comment, it seems tempting to look for proper/thermal equilibria in this game with some constraint on the programs. One constraint that seems appealing is: force the program to have O(1) space complexity modulo oracle access to the other program. It is easy to see that a pair of such programs can be executed using O(1) space complexity as well.
Another remark about Turing reinforcement learning (I renamed it because it’s a reinforcement learner “plugged” into a universal Turing machine, and because it is also related to Neural Turing machines). Here is how we can realize Abram’s idea that “we can learn a whole lot by reasoning before we even observe that situation”.
Imagine that the environment satisfies some hypothesis H1 that contains the evaluation of a program P1, and that P1 is relatively computationally expensive but not prohibitively so. If H1 is a simple hypothesis, and in particular P1 is a short program, then we can expect the agent to learn H1 from a relatively small number of samples. However, because of the computational cost of P1, exploiting H1 might be difficult (because there is a large latency between the time the agent knows it needs to evaluate P1 and the time it gets the answer). Now assume that P1 is actually equivalent (as a function) to a different program P2 which is long but relatively cheap. Then we can use P2 to describe a hypothesis H2 which is also true, and which is easily exploitable (i.e. guarantees a higher payoff than H1) but which is complex (because P2 is long). The agent would need a very large number of (physical) samples to learn H2 directly. But, if the agent has plenty of computational resources for use during its “free time”, it can use them to learn the (correct but complex) hypothesis M := “H1 = H2”. Then the conjunction of H1 and M guarantees the same high payoff as H2. Thus, TRL converges to exploiting the environment well by virtue of “reasoning before observing”.
Moreover, we can plausibly translate this story into a rigorous desideratum along the following lines: for any conjunction of a “purely mathematical” hypothesis M and an arbitrary (“physical”) hypothesis H, we can learn (=exploit) it given enough “mathematical” samples relative to the complexity (prior probability) of M and enough “physical” samples relative to the complexity of H.
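Very schematically (the notation is made up, and pinning down the exact form of the bound is precisely the part that still needs work), the kind of guarantee I have in mind looks like

$$\mathbb{E}[\mathrm{Regret}] \;\lesssim\; F\big(\ell(M),\, N_{\mathrm{math}}\big) \;+\; G\big(\ell(H),\, N_{\mathrm{phys}}\big)$$

where ℓ(·) stands for complexity (negative log prior probability), N_math counts the “mathematical” samples spent during free time, N_phys counts the physical samples, and F and G become small once the respective sample budgets are large relative to the corresponding complexities.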