Do Sufficiently Advanced Agents Use Logic?
This is a continuation of a discussion with Vanessa from the MIRIxDiscord group. I’ll make some comments on things Vanessa has said, but those should not be considered a summary of the discussion so far. My comments here are also informed by discussion with Sam.
1: Logic as Proxy
1a: The Role of Prediction
Vanessa has said that predictive accuracy is sufficient; consideration of logic is not needed to judge (partial) models. A hypothesis should ultimately ground out to perceptual information. So why is there any need to consider other sorts of “predictions” it can make? (IE, why should we think of it as possessing internal propositions which have a logic of their own?)
But similarly, why should agents use predictive accuracy to learn? What’s the argument for it? Ultimately, predicting perceptions ahead of time should only be in service of achieving higher reward.
We could instead learn from reward feedback alone. A (partial) “hypothesis” would really be a (partial) strategy, helping us to generate actions. We would judge strategies on (something like) average reward achieved, not even trying to predict precise reward signals. The agent still receives incoming perceptual information, and strategies can use it to update internal states and to inform actions. However, strategies are not asked to produce any predictions. (The framework I’m describing is, of course, model-free RL.)
Intuitively, it seems as if this is missing something. A model-based agent can learn a lot about the world just by watching, taking no actions. However, individual strategies can implement prediction-based learning within themselves. So, it seems difficult to say what benefit model-based RL provides beyond model-free RL, besides a better prior over strategies.
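To make the model-free picture concrete, here is a minimal sketch (my own toy example; the environment and the two strategies are hypothetical stand-ins): strategies are judged purely on the average reward they have achieved, and prediction, if it happens at all, happens only inside individual strategies.

```python
import random

class Strategy:
    """A (partial) strategy: maps observations to actions, possibly keeping
    internal state. It is never asked to produce predictions."""
    def __init__(self, name, policy):
        self.name = name
        self.policy = policy
        self.total_reward = 0.0
        self.trials = 0

    def average_reward(self):
        return self.total_reward / self.trials if self.trials else 0.0

def toy_environment(action):
    """Stand-in environment: a two-armed bandit where action 1 is better on average."""
    return random.gauss(1.0 if action == 1 else 0.0, 1.0)

def model_free_select(strategies, episodes=5000, epsilon=0.1):
    """Judge strategies only by the average reward they have achieved so far."""
    for _ in range(episodes):
        if random.random() < epsilon:
            s = random.choice(strategies)                      # explore
        else:
            s = max(strategies, key=Strategy.average_reward)   # exploit best-so-far
        reward = toy_environment(s.policy(None))               # no predictions requested anywhere
        s.total_reward += reward
        s.trials += 1
    return max(strategies, key=Strategy.average_reward)

best = model_free_select([Strategy("always-0", lambda obs: 0),
                          Strategy("always-1", lambda obs: 1)])
print(best.name)  # typically "always-1"
```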
It might be that we can’t say anything recommending model-based learning over model-free learning in a standard bounded-regret framework. (I actually haven’t thought about it much—but the argument that model-free strategies can implement models internally seems potentially strong. Perhaps you just can’t get much in the AIXI framework because there are no good loss bounds in that framework at all, as Vanessa mentions.) However, if so, this seems like a weakness of standard bounded-regret frameworks. Predicting the world seems to be a significant aspect of intelligence; we should be able to talk about this formally somehow.
Granted, it doesn’t make sense for bounded agents to pursue predictive accuracy above all else. There is a computational trade-off, and you don’t need to predict something which isn’t important. My claim is something like, you should try and predict when you don’t yet have an effective strategy. After you have an effective strategy, you don’t really need to generate predictions. Before that, you need to generate predictions because you’re still grappling with the world, trying to understand what’s basically going on.
If we’re trying to understand intelligence, the idea that model-free learners can internally manage these trade-offs (by choosing strategies which judiciously choose to learn from predictions when it is efficacious to do so) seems less satisfying than a proper theory of learning from prediction. What is fundamental vs non-fundamental to intelligence can get fuzzy, but learning from prediction seems like something we expect any sufficiently intelligent agent to do (whether it was built-in or learned behavior).
On the other hand, judging hypotheses on their predictive accuracy is kind of a weird thing to do if what you ultimately want a hypothesis to do for you is generate actions. It’s like this: You’ve got two tasks, task A and task B. Task A is what you really care about, but it might be quite difficult to tackle on its own. Task B is really very different from task A, but you can get a lot of feedback on task B. So you ask your hypotheses to compete on task B, and judge them on that in addition to task A. Somehow you’re expecting to get a lot of information about task A from performance on task B. And indeed, it seems you do: predictive accuracy of a hypothesis is somehow a useful proxy for efficacy of that hypothesis in guiding action.
(It should also be noted that a reward-learning framework presumes we get feedback about utility at all. If we get no feedback about reward, then we’re forced to only judge hypotheses by predictions, and make what inferences about utility we will. A dire situation for learning theory, but a situation where we can still talk about rational agency more generally.)
1b: The Analogy to Logic
My argument is going to be that if achieving high reward is task A, and predicting perception is task B, logic can be task C. Like task B, it is very different from task A. Like task B, it nonetheless provides useful information. Like task B, it seems to me that a theory of (boundedly) rational agency is missing something without it.
The basic picture is this. Perceptual prediction provides a lot of good feedback about the quality of cognitive algorithms. But if you really want to train up some good cognitive algorithms for yourself, it is helpful to do some imaginative play on the side.
One way to visualize this is an agent making up math puzzles in order to strengthen its reasoning skills. This might suggest a picture where the puzzles are always well-defined (terminating) computations. However, there’s no special dividing line between decidable and undecidable problems—any particular restriction to a decidable class might rule out some interesting (decidable but non-obviously so) stuff which we could learn from. So we might end up just going with any computations (halting or no).
Similarly, we might not restrict ourselves to entirely well-defined propositions. It makes a lot of sense to test cognitive heuristics on scenarios closer to life.
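As a cartoon of the well-defined end of this spectrum (my own illustration, not a proposal): an agent could generate small arithmetic puzzles, guess the answers with a cheap heuristic, then check the guesses by actually running the computation, using the discrepancy as feedback on its reasoning.

```python
import random

def cheap_heuristic(a, b):
    """Deliberately crude estimate of a * b: multiply after rounding to the leading digit."""
    round_to_leading = lambda x: round(x, -(len(str(abs(x))) - 1))
    return round_to_leading(a) * round_to_leading(b)

def make_puzzle():
    """Self-generated, fully decidable puzzle: multiply two random three-digit integers."""
    return random.randint(100, 999), random.randint(100, 999)

def practice(rounds=1000):
    errors = []
    for _ in range(rounds):
        a, b = make_puzzle()
        guess = cheap_heuristic(a, b)   # fast, fallible reasoning
        truth = a * b                   # ground truth obtained by running the computation
        errors.append(abs(guess - truth) / truth)
    return sum(errors) / len(errors)    # feedback signal for improving the heuristic

print(practice())  # average relative error of the heuristic on self-made puzzles
```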
Why do I think sufficiently advanced agents are likely to do this?
Well, just as it seems important that we can learn a whole lot from prediction before we ever take an action in a given type of situation, it seems important that we can learn a whole lot by reasoning before we even observe that situation. I’m not formulating a precise learning-theoretic conjecture, but intuitively, it is related to whether we could reasonably expect the agent to get something right on the first try. Good perceptual prediction alone does not guarantee that we can correctly anticipate the effects of actions we have never tried before, but if I see an agent generate an effective strategy in a situation it has never intervened in before (but has had opportunity to observe), I expect that internally it is learning from perception at some level (even if it is model-free in overall architecture). Similarly, if I see an agent quickly pick up a reasoning-heavy game like chess, then I suspect it of learning from hypothetical simulations at some level.
Again, “on the first try” is not supposed to be a formal learning-theoretic requirement; I realize you can’t exactly expect anything to work on the first try with learning agents. What I’m getting at has something to do with generalization.
2: Learning-Theoretic Criteria
Part of the frame has been learning-theory-vs-logic. One might interpret my closing remarks from the previous section that way; I don’t know how to formulate my intuition learning-theoretically, but I expect that reasoning helps agents in particular situations. It may be that the phenomena of the previous section cannot be understood learning-theoretically, and only amount to a “better prior over strategies” as I mentioned. However, I don’t want it to be a learning-theory-vs-logic argument. I would hope that something learning-theoretic can be said in favor of learning from perception, and in favor of learning from logic. Even if it can’t, learning theory is still an important component here, regardless of the importance of logic.
I’ll try to say something about how I think learning theory should interface with logic.
Vanessa said some relevant things in a comment, which I’ll quote in full:
Heterodox opinion: I think the entire MIRIesque (and academic philosophy) approach to decision theory is confused. The basic assumption seems to be, that we can decouple the problem of learning a model of the world from the problem of taking a decision given such a model. We then ignore the first problem, and assume a particular shape for the model (for example, causal network) which allows us to consider decision theories such as CDT, EDT etc. However, in reality the two problems cannot be decoupled. This is because the type signature of a world model is only meaningful if it comes with an algorithm for how to learn a model of this type.
For example, consider Newcomb’s paradox. The agent makes a decision under the assumption that Omega behaves in a certain way. But, where did the assumption come from? Realistic agents have to learn everything they know. Learning normally requires a time sequence. For example, we can consider the iterated Newcomb’s paradox (INP). In INP, any reinforcement learning (RL) algorithm will converge to one-boxing, simply because one-boxing gives it the money. This is despite RL naively looking like CDT. Why does it happen? Because in the learned model, the “causal” relationships are not physical causality. The agent comes to believe that taking the one box causes the money to appear there.
In Newcomb’s paradox EDT succeeds but CDT fails. Let’s consider an example where CDT succeeds and EDT fails: the XOR blackmail. The iterated version would be IXB. In IXB, classical RL doesn’t guarantee much because the environment is more complex than the agent (it contains Omega). To overcome this, we can use RL with incomplete models. I believe that this indeed solves both INP and IXB.
Then we can consider e.g. counterfactual mugging. In counterfactual mugging, RL with incomplete models doesn’t work. That’s because the assumption that Omega responds in a way that depends on a counterfactual world is not in the space of models at all. Indeed, it’s unclear how any agent can learn such a fact from empirical observations. One way to fix it is by allowing the agent to precommit. Then the assumption about Omega becomes empirically verifiable. But, if we do this, then RL with incomplete models can solve the problem again.
The only class of problems that I’m genuinely unsure how to deal with is game-theoretic superrationality. However, I also don’t see much evidence the MIRIesque approach has succeeded on that front. We probably need to start with just solving the grain of truth problem in the sense of converging to ordinary Nash (or similar) equilibria (which might be possible using incomplete models). Later we can consider agents that observe each other’s source code, and maybe something along the lines of this can apply.
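As a toy check of the INP claim above (my own sketch, not part of Vanessa’s comment; it assumes Omega predicts the agent’s current action perfectly, and uses the usual $1,000 / $1,000,000 payoffs): a simple epsilon-greedy learner does converge to one-boxing, because one-boxing is what actually earns the money.

```python
import random

# Toy iterated Newcomb: Omega predicts the agent's current action perfectly.
# Actions: 0 = one-box, 1 = two-box.
def reward(action, prediction):
    big = 1_000_000 if prediction == 0 else 0  # opaque box filled iff Omega predicts one-boxing
    small = 1_000
    return big if action == 0 else big + small

def run(episodes=10_000, epsilon=0.1):
    totals, counts = [0.0, 0.0], [0, 0]
    for _ in range(episodes):
        # epsilon-greedy over running average reward per action
        if random.random() < epsilon or counts[0] == 0 or counts[1] == 0:
            action = random.randrange(2)
        else:
            action = 0 if totals[0] / counts[0] >= totals[1] / counts[1] else 1
        r = reward(action, prediction=action)  # perfect predictor
        totals[action] += r
        counts[action] += 1
    return [totals[i] / max(counts[i], 1) for i in range(2)]

print(run())  # average reward for one-boxing vastly exceeds two-boxing
```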
Besides the MIRI-vs-learning frame, I agree with a lot of this comment. I wrote a comment elsewhere making some related points about the need for a learning-theoretic approach. Some of the points also relate to my CDT=EDT sequence; I have been arguing that CDT and EDT don’t behave as people broadly imagine (often lacking the bad behavior commonly attributed to them). Some of those arguments were learning-theoretic while others were not, but the conclusions were similar either way.
In any case, I think the following criterion (originally mentioned to me by Jack Gallagher) makes sense:
A decision problem should be conceived as a sequence, but the algorithm deciding what to do on a particular element of the sequence should not know/care what the whole sequence is.
Asymptotic decision theory was the first major proposal to conceive of decision problems as sequences in this way. Decision-problem-as-sequence allows decision theory to be addressed learning-theoretically; we can’t expect a learning agent to necessarily do well in any particular case (because it could have a sufficiently poor prior, and so still be learning in that particular case), but we can expect it to eventually perform well (provided the problem meets some “fairness” conditions which make it learnable).
As for the second part of the criterion, requiring that the agent is ignorant of the overall sequence when deciding what to do on an instance: this captures the idea of learning from logic. Providing the agent with the sequence is cheating, because you’re essentially giving the agent your interpretation of the situation.
Jack mentioned this criterion to me in a discussion of averaging decision theory (AvDT), in order to explain why AvDT was cheating.
AvDT is based on a fairly simple idea: look at the average performance of a strategy so far, rather than its expected performance on this particular problem. Unfortunately, “performance so far” requires things to be defined in terms of a training sequence (counter to the logical-induction philosophy of non-sequential learning).
I created AvDT to try and address some shortcomings of asymptotic decision theory (let’s call it AsDT). Specifically, AsDT does not do well in counterlogical mugging. AvDT is capable of doing well in counterlogical mugging. However, it depends on the training sequence. Counterlogical mugging requires the agent to decide on the “probability” of Omega asking for money vs paying up, to figure out whether participation is worth it overall. AvDT solves this problem by looking at the training sequence to see how often Omega pays up. So, the problem of doing well in decision problems is “reduced” to specifying good training sequences. This (1) doesn’t obviously make things easier, and (2) puts the work on the human trainers.
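A caricature of this (my own, heavily simplified; actual AvDT is built on logical induction, and the $100 / $10,000 amounts are just the usual illustrative ones): with a hand-supplied training sequence of counterlogical-mugging instances, “average performance so far” reduces to counting how often Omega asks versus pays, so the verdict is determined entirely by whichever sequence the trainers happened to choose.

```python
def avdt_verdict(training_sequence, ask_cost=100, payout=10_000):
    """Average performance of the policy "pay when asked" over the supplied sequence.
    'ask' = Omega asks for money; 'pay' = Omega would have paid out."""
    asks = training_sequence.count("ask")
    pays = training_sequence.count("pay")
    value_pay = (payout * pays - ask_cost * asks) / len(training_sequence)
    return "pay" if value_pay > 0 else "refuse"

# The answer depends entirely on the sequence the human trainers supplied:
print(avdt_verdict(["ask"] * 99 + ["pay"]))   # 'pay'    (one payout outweighs 99 asks)
print(avdt_verdict(["ask"] * 200 + ["pay"]))  # 'refuse' (asks dominate)
```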
Jack is saying that the system should be looking through logic on its own to find analogous scenarios to generalize from. When judging whether a system gets counterlogical mugging right, we have to define counterlogical mugging as a sequence to enable learning-theoretic analysis; but the agent has to figure things out on its own.
This is a somewhat subtle point. A realistic agent experiences the world sequentially, and learns by treating its history as a training sequence of sorts. This is physical time. I have no problem with this. What I’m saying is that if an agent is also learning from analogous circumstances within logic, as I suggested sophisticated agents will do in the first part, then Jack’s condition should come into play. We aren’t handed, from on high, a sequence of logically defined scenarios which we can locate ourselves within. We only have regular physical time, plus a bunch of hypothetical scenarios which we can define and whose relevance we have to determine.
This gets back to my earlier intuition about agents having a reasonable chance of getting certain things right on the first try. Learning-theoretic agents don’t get things right on the first try. However, agents who learn from logic have “lots of tries” before their first real try in physical time. If you can successfully determine which logical scenarios are relevantly analogous to your own, you can learn what to do just by thinking. (Of course, you still need a lot of physical-time learning to know enough about your situation to do that!)
So, getting back to Vanessa’s point in the comment I quoted: can we solve MIRI-style decision problems by considering the iterated problem, rather than the single-shot version? To a large extent, I think so: in logical time, all games are iterated games. However, I don’t want to have to set an agent up with a training sequence in which it encounters those specific problems many times. For example, finding good strategies in chess via self-play should come naturally from the way the agent thinks about the world, rather than being an explicit training regime which the designer has to implement. Once the rules for chess are understood, the bottleneck should be thinking time rather than (physical) training instances.
Thank you for writing this, Abram.
The main difference between our ways of thinking seems to be: you are thinking in terms of what happens inside the algorithm, whereas I am thinking in terms of what desiderata the algorithm satisfies. I trust the latter approach more, because, while we can only speculate how intelligent agents work, we should be able to explain what an intelligent agent is: after all, something made us study intelligent agents in the first place. (Of course we do have some clues about how intelligent agents work from cognitive science and applied ML, but we are far from having the full picture.) In particular, I don’t directly care about the distinction between model-based and model-free.
Specifically, the desideratum I often consider is some type of regret bound. This desideratum doesn’t directly refer to prediction: it only cares about expected utility! Your reasoning would have us conclude it corresponds to model-free RL. However, the actual algorithms I use to prove regret bounds use posterior sampling: a model-based approach! So, not only do predictions enter the picture, they enter the picture in a way that is clearly justified rather than ad hoc. (That said, the definition of Bayesian regret already refers to a prior, and the coefficients in the regret bounds depend on this prior, so I do introduce a notion of “model” a priori.)
I do not agree that this is a “dire situation for learning theory”. Quite the contrary, this is exactly the situation I formalized using instrumental reward functions. Unless you think instrumental reward functions are not good enough for some reason? (Of course, in my setting, the agent can get feedback about utility, it just doesn’t get it all the time for free. But, the only alternative seems to be “flying spaghetti monster” utility functions that are completely unobservable, and I am not sure that’s an interesting case to study.)
The notion of agents that “make up math puzzles” is not necessarily a bad one, but we should try to be more precise about what it means in terms of desiderata: we need to specify precisely what advantages “making up math puzzles” is supposed to confer.
It might be useful to consider arbitrary programs, but the agent will still not execute any of them for more than some small finite time. You might argue that this is why it needs logic, to prove theorems about computations that it cannot execute. But, I don’t believe this line of thinking leads to anything tractable that is not redundant w.r.t. a simpler logic-less approach.
[EDIT: Renamed “CSRL” to “TRL” 2019-09-19]
One related idea that I came up with recently is what I call “Turing reinforcement learning” (TRL). (It is rather half-baked, but I feel that the relevance is such that I should still bring it up. Also, it was inspired by our discussion, so thank you.) Imagine an RL agent connected to a computer. The agent can ask the computer to execute any program (for some finite amount of time ofc), and its prior already “knows” that such computations produce consistent answers (something about the semantics of programs can also be built into the prior, bringing us closer to logic, but I don’t believe it can be anything nearly as strong as “proving theorems about programs using PA” if we want the algorithm to be even weakly feasible). This allows it to consider incomplete hypotheses that describe some parts of the environment in terms of programs (i.e. some physical phenomenon is hypothesized to behave equivalently to the execution of a particular program). The advantage of this approach: better management of computational resources. In classical RL, the resources are allocated “statically”: any hypothesis is either in the prior or not. In contrast, in TRL the agent can decide itself which programs are worth running and when. Starting from generic regret bounds for incomplete models, this should lead to TRL satisfying some desiderata superior to classical RL with the same amount of resources, although I haven’t worked out the details yet.
This seems to me the wrong way to look at it. There is no such thing as “learning-theoretic agents”. There are agents with good learning-theoretic properties. If there is a property that you can prove in learning theory to be impossible, then no agent can have this property. Once again, I am not ruling out this entire line of thought, but I am arguing for looking for desiderata that can formalize it, instead of using it to justify ad hoc algorithms (e.g. algorithms that use logic).
But, you have to set an agent up with a training sequence, otherwise how does it ever figure out the rules of chess? I definitely agree that in some situations computational complexity is the bottleneck rather than sample complexity. My point was different: if you don’t add the context of a training sequence, you might end up with philosophical questions that are either poor starting points or don’t make sense at all.
I talked a lot about desiderata in both part (1) and part (2)! Granted, I think about these things somewhat differently than you. And I did not get very concrete with desiderata.
Part (1a) was something like “here are some things which seem like real phenomena when you look at intelligence, and how they relate to learning-theoretic criteria when we formalize them” (ie, regret bounds on reward vs regret bounds on predictive accuracy). Part (1b) was like “here are some other things which seem to me like real phenomena relating to intelligence, which a formal theory of intelligence should not miss out on, and which seem real for very similar reasons to the more-commonly-discussed stuff from section 1a”.
Really, though, I think in terms of the stuff in (2) much more than the stuff in (1). I’m not usually thinking: “It sure seems like real agents reason using logic. How can we capture that formally in our rationality criteria?” Rather, I’m usually thinking things like: “If agents know a good amount about each other due to access to source code or other such means, then a single-shot game has the character of iterated game theory. How can we capture that in our rationality criteria?”
Perhaps one difference is that I keep talking about modeling something with regret bounds, whereas you are talking about achieving regret bounds. IE, maybe you more know what notion of regret you want to minimize, and are going after algorithms which help you minimize it; and I am more uncertain about what notion of regret is relevant, and am looking for formulations of regret which model what I am interested in.
I see what you mean about not fitting the model-based/model-free dichotomy the way I set it up. In terms of the argument I try to make in the post, I’m drawing an analogy between prediction and logic. So I’m saying that (if we think of both prediction and logic as “proxies” in the way I describe) it is plausible that formal connections to prediction have analogs in formal connections to logic, because both seem to play a similar role in connection to doing well in terms of action utility.
Are you referring to results which require realizability here? (I know it can make sense to get a regret bound in terms of the prior probability w/o assuming that one of the things in the prior has the best asymptotic loss, but, it is common to need such an assumption)
Part of what I’m saying in the second part of the original post is that I find that I have to introduce logic a priori to discuss certain kinds of regret I think about, similarly to how here you are forced to introduce “model” a priori for certain kinds of regret. (I suspect I didn’t explain this very well because I thought my motivations were clear but underestimated the inferential gap.)
I had in mind the setting where no particular assumption is made about the observability of utility at all, like in classic settings such as Savage etc. I’m not claiming it is very relevant to the discussion at hand. It’s just that there is some notion of rationality to be studied in such cases. I also misspoke in “dire situation for learning theory”—I don’t want to prematurely narrow what learning theory can fruitfully discuss. Something more like “dire situation for reinforcement learning” would have been more accurate.
I was definitely not suggesting that the agent would have to use logic in order to get feedback about programs which take too long to execute. (It could do that, and it could be useful, but it isn’t my argument for why the agent uses logic-like reasoning.) Rather, the idea was more that by virtue of testing out ways of reasoning on the programs which can be evaluated, the agent would learn to evaluate more difficult cases which can’t simply be run, which would in turn sometimes prove useful in dealing with the external environment.
Yeah, this would get at some of the idea of learning more about the environment by learning about computations. In addition to the potential for management of computational resources, it would illustrate something about how incomplete models help with logical uncertainty.
I think some of our disagreement has to do with my desire to model things which humans seem to do just fine (eg, heuristically estimate the probability of sentences in PA). Attempting to make models of this easily leads to infeasible things (such as logical induction). My reaction to this is: (1) at least now we have some idea about what it means to do well at the task; (2) but yeah, this means this isn’t how humans are doing it, so there is still something left to understand.
I’m not saying you don’t have a training sequence at all. I’m saying the training sequence is enough to figure out the rules of chess (perhaps from indirect information, not necessarily ever playing chess yet), but not enough to learn to play well from experience.
Obviously in order to understand the rules of chess quickly from a human explanation, the system would need to have already learned quite a lot about the world (or have been specialized to this sort of task).
The point is that there is also something to gain from training to see how computations go (perhaps computations which look like counterfactuals, such as the game tree in chess).
Yeah, I’m now feeling more confident that the difference in perspectives is this: you are thinking of a relatively fixed notion of regret, and want things to be relevant to achieving good bounds in terms of that. I am more wanting a notion of regret that successfully captures something I’m trying to understand. EG, logical induction was a big improvement to my understanding of the phenomenon of conjecture in mathematics—how it can be understood within the context of a rationality criterion. It doesn’t directly have to allow me to improve regret bounds for reward learners. It’s an important question whether it helps us do anything we couldn’t before in terms of reward learning. (For me it certainly did, since I was thinking in terms of Bayesians with full models before—so logical induction showed me how to think about partial models and such. Perhaps it didn’t show you how to do anything you couldn’t already, since you were already thinking about things in optimal predictor terms.) But it is a different question from whether it tells us anything about intelligence at all.
By the way, from the perspective of what I call “subjective” learnability, we can consider this case as well. In subjective learnability, we define regret with respect to an optimal agent that has all of the user’s knowledge (rather than knowing the actual “true” environment). So, it seems not difficult to imagine communication protocols that will allow the AI to learn the user’s knowledge, including the unobservable reward function (we assume the user imagines the environment as a POMDP with rewards that are not directly observable but defined as a function of the state). (Incidentally, recently I am leaning towards the position that some kind of generic communication protocol (with natural language “annotation” + quantilization to avoid detrimental messages) is the best approach to achieve alignment.)
I would study this problem by considering a population of learning agents that are sequentially paired up for one-shot games where the source code of each is revealed to the other. This way they can learn how to reason about source codes. In contrast, the model where the agents try to formally prove theorems seems poorly motivated to me.
Hmm, no, I don’t think this is it. I think that I do both of those. It is definitely important to keep refining our regret criterion to capture more subtle optimality properties.
Do they though? I am not convinced that logic has a meaningful connection to doing well in terms of action utility. I think that for prediction we can provide natural models that show how prediction connects to doing well, whereas I can’t say the same about logic. (But I would be very interested in being proved wrong.)
I usually assume realizability, but for incomplete models, “realizability” just means that the environment satisfies some incomplete models (i.e. has some regularities that the agent can learn).
Yes, I think this idea has merit, and I explained how TRL can address it in another comment. I don’t think we need or should study this using formal logic.
Hmm, actually I think logical induction influenced me also to move away from Bayesian purism and towards incomplete models. I just don’t think logic is the important part there. I am also not sure about the significance of this forecasting method, since in RL it is more natural to just do maximin instead. But, maybe it is still important somehow.
Actually, here’s another thought about this. Consider the following game: each player submits a program, then the programs are given each other as inputs, and executed in anytime mode (i.e. at any given moment each program has to have some answer ready). The execution has some probability ϵ to terminate on each time step, so that the total execution time follows the geometric distribution. Once execution is finished, the outputs of the programs are interpreted as strategies in a normal-form game and the payoffs are assigned accordingly.
This seems quite similar to an iterated game with geometric time discount! In particular, in this version of the Prisoner’s Dilemma, there is the following analogue of the tit-for-tat strategy: start by setting the answer to C, simulate the other agent, and once the other agent starts producing answers, set your answer equal to the other agent’s answer. For sufficiently shallow time discount, this is a Nash equilibrium.
In line with the idea I explained in a previous comment, it seems tempting to look for proper/thermal equilibria in this game with some constraint on the programs. One constraint that seems appealing is: force the program to have O(1) space complexity modulo oracle access to the other program. It is easy to see that a pair of such programs can be executed using O(1) space complexity as well.
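Here is a minimal simulation of this game (my own sketch; “simulate the other agent” is simplified to direct oracle access to the opponent’s answers so far, and the programs are written as Python generators that always have an answer ready):

```python
import random

C, D = "C", "D"

def tit_for_tat(other_answers):
    """Anytime program: starts by answering C, then mirrors the opponent's latest answer."""
    answer = C
    while True:
        if other_answers:
            answer = other_answers[-1]
        yield answer

def always_defect(other_answers):
    while True:
        yield D

def play(prog_a, prog_b, eps=0.01):
    """Run both programs step by step; terminate with probability eps per step.
    Only the final answers are interpreted as strategies in the normal-form game."""
    a_answers, b_answers = [], []
    gen_a, gen_b = prog_a(b_answers), prog_b(a_answers)
    while True:
        a_answers.append(next(gen_a))
        b_answers.append(next(gen_b))
        if random.random() < eps:  # geometric execution time
            break
    payoff = {(C, C): (2, 2), (C, D): (0, 3), (D, C): (3, 0), (D, D): (1, 1)}
    return payoff[(a_answers[-1], b_answers[-1])]

print(play(tit_for_tat, tit_for_tat))    # (2, 2): mutual cooperation
print(play(tit_for_tat, always_defect))  # typically (1, 1) once mirroring kicks in
```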
Another remark about Turing reinforcement learning (I renamed it because it’s a reinforcement learner “plugged” into a universal Turing machine, and it is also related to Neural Turing machines). Here is how we can realize Abram’s idea that “we can learn a whole lot by reasoning before we even observe that situation”.
Imagine that the environment satisfies some hypothesis H1 that contains the evaluation of a program P1, and that P1 is relatively computationally expensive but not prohibitively so. If H1 is a simple hypothesis, and in particular P1 is a short program, then we can expect the agent to learn H1 from a relatively small number of samples. However, because of the computational cost of P1, exploiting H1 might be difficult (because there is a large latency between the time the agent knows it needs to evaluate P1 and the time it gets the answer). Now assume that P1 is actually equivalent (as a function) to a different program P2 which is long but relatively cheap. Then we can use P2 to describe a hypothesis H2 which is also true, and which is easily exploitable (i.e. guarantees a higher payoff than H1) but which is complex (because P2 is long). The agent would need a very large number of (physical) samples to learn H2 directly. But, if the agent has plenty of computational resources for use during its “free time”, it can use them to learn the (correct but complex) purely mathematical hypothesis M := “P1 = P2”. Then, the conjunction of H1 and M guarantees the same high payoff as H2. Thus, TRL converges to exploiting the environment well by virtue of “reasoning before observing”.
Moreover, we can plausibly translate this story into a rigorous desideratum along the following lines: for any conjunction of a “purely mathematical” hypothesis M and an arbitrary (“physical”) hypothesis H, we can learn (=exploit) it given enough “mathematical” samples relative to the complexity (prior probability) of M and enough “physical” samples relative to the complexity of H.
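A toy rendering of this story (my own; the programs are arbitrary stand-ins): P1 is slow, P2 is a function-level equivalent that is faster to run but longer to write down, and “free time” is spent checking the equivalence so that the cheap program can be used when a decision actually has to be made.

```python
import time

def p1(n):
    """Expensive program referenced by the simple hypothesis H1 (naive recursive Fibonacci)."""
    return n if n < 2 else p1(n - 1) + p1(n - 2)

def p2(n):
    """Cheap but longer-to-describe program (iterative Fibonacci), equivalent to p1 as a function."""
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a

def free_time_check(samples=range(25)):
    """Offline "mathematical" learning of M := "P1 = P2" by testing on sampled inputs.
    (A real agent would need more than finite testing; this is just the cartoon.)"""
    return all(p1(n) == p2(n) for n in samples)

if free_time_check():
    start = time.time()
    answer = p2(30)  # at decision time, use the cheap equivalent instead of the slow p1
    print(answer, time.time() - start)
```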
Here is my attempt at a summary of (a standalone part of) the reasoning in this post.
- An agent trying to get a lot of reward can get stuck (or at least waste data) when the actions that seem good don’t plug into the parts of the world/data stream that contain information about which actions are in fact good. That is, an agent that restricts its information about the reward+dynamics of the world to only its reward feedback will get less reward.
- One way an agent can try and get additional information is by deductive reasoning from propositions (if they can relate sense data to world models to propositions). Sometimes the deductive reasoning they need to do will only become apparent shortly before the result of the reasoning is required (so the reasoning should be fast).
- The nice thing about logic is that you don’t need fresh data to produce test cases: you can make up puzzles! As an agent will need fast deductive reasoning strategies, they may want to try out the goodness of their reasoning strategies on puzzles they invent (to make sure they’re fast and reliable (if they hadn’t proved reliability)).
- In general, we should model things that we think agents are going to do, because that gives us a handle on reasoning about advanced agents. It is good to be able to establish what we can about the behaviour of advanced boundedly-rational agents so that we can make progress on the alignment problem etc.

To the extent this is a correct summary, I note that it’s not obvious to me that agents would sharpen their reasoning skills via test cases rather than establishing proofs on bounds of performance and so on. Though I suppose either way they are using logic, so it doesn’t affect the claims of the post.
Are “A” and “B” backwards here, or am I not following?
Backwards, thanks!