The Löbian obstacle is about trust in the reasoning performed by a subordinate agent; the promise of subsequent actions taken based on that reasoning are just a pretext for considering that formal problem. So if A constructs B to produce proofs in L, it doesn’t matter what B’s beliefs are, or even if B has any beliefs; B could just be manipulating these as formal expressions. If you insist that B’s beliefs be incorporated into its reasoning, as you seem to want to (more on that below), then I’m saying it doesn’t matter what extension of L the agent B is equipped with; it can even be an inconsistent extension, as long as only valid L-proofs are produced.
I’m not insisting on anything about the agent design; I simply remain puzzled how your proposal differs from the one in the paper, and so, continue to assume it’s like the one in the paper except those differences which I understand. Your statements have been reading almost like you think you’ve refuted formal arguments in the paper (implying you don’t think the formal agent designs do what I think they do), rather than providing a new agent design which does better. This contributes to my feeling that you are really just wrestling with Lob’s theorem itself. Nor do I recall you saying much to dissuade me of this. In your OP, you use the language “Under the belief that the Löbian Obstacle was a genuine problem to be avoided”, as if the original idea was simply mistaken.
So, to be absolutely clear on this: do you accept the mathematical proofs in the original paper (but propose some way around them), or do you think they are actually mistaken? Do you accept the proof of Lob’s theorem itself, or do you think it is mistaken?
So if A constructs B to produce proofs in L, it doesn’t matter what B’s beliefs are, or even if B has any beliefs;
But using proofs in L is precisely what I understood from “belief”; so, if you have something else in mind, I will need clarification.
The word “normative” sticks out to me as potential common ground here, so I’ll use that language. The specified semantic map determines what is “actually” true, but its content is not a priori knowledge. As such, the only way for A’s reasoning in L to have any practical value is if A works under the (normative) assumption that provability implies/will imply truth under S.
If this sounds farfetched, consider how police dramas employ the normative nature of truth for dramatic effect on a regular basis. A rogue detective uses principles of reasoning that their superiors deem invalid, so that the latter do not expect the detective’s deductions to yield results in reality… Or perhaps the practice of constructive mathematics would be a more comfortable example, where the axiom of choice is rejected in favour of weaker deductive principles. A dedicated constructivist could reasonably make the claim that their chosen fragment of logic determines what “ought” to be true.
I think I haven’t understood much of this.
As such, the only way for A’s reasoning in L to have any practical value is if A works under the (normative) assumption that provability implies/will imply truth under S.
I would argue that the only way for A’s reasoning in L to have practical value is for A to act on such reasoning. The only way for A’s reasoning to have practical value for some other agent B is for B to be working under the assumption that provability-in-L implies truth under S. Indeed, this seems in line with your detective drama example. If the superiors were put in the position of the rogue detective, often they would have come to the same conclusion. Yet, looking at the problem from a removed position, they cannot accept the conclusion. But this is precisely the lobstacle: the condition for reasoning to have value to an agent A is different from the condition for A’s reasoning to have value to another agent B.
Let me therefore make a (not totally formal) distinction based on the alief/belief idea:
The aliefs of a logic-based agent are the axioms/theorems which inform action;
The beliefs of a logic-based agent are the axioms/theorems which the agent explicitly endorses.
This is also a version of the distinction between tacit knowledge (eg knowing how to ride a bike) vs explicit knowledge (eg being able to explain scientifically how bike-riding works). To say that an agent “alieves” a logic L is to say that it tacitly knows how to reason in L.
Roughly, I think you have been talking about beliefs, where I have been talking about aliefs.
Now, a “naive logical agent” (ie the simple agent design which falls prey to the Lobstacle, in the paper) has a few important properties:
Goedel’s incompleteness theorems tells us that aliefs cannot equal beliefs; a logic never explicitly endorses itself, but rather, only explicitly endorses weaker logics.
An agent constructed based on the logic L alieves L, but does not believe L (this is a re-statement of the lobstacle).
An agent can only trust another agent if it believes that agent’s aliefs.
An agent therefore cannot belief its own aliefs. (This is another restatement of the lobstacle.)
These statements are not necessarily true of logical agents in general, of course, since the paper gives some counterexamples (agents who can trust the same logic which they base their actions on). But those counterexamples all have some oddities. I presume your goal is to get rid of those oddities while still avoiding the lobstacle.
You seem exasperated that I’m not incorporating the meta-logical beliefs into the formal system, but this is because in practice meta-logical beliefs are entirely implicit, and do not appear in the reasoning system used by agents.
The thing I’m exasperated by is, rather, that I don’t understand what you are doing with the meta-logical beliefs if you aren’t including them in the formal system. It’s fine if you don’t include them in the formal system; just tell me what you are doing instead! This was the point of my “I have the semantic map here in my back pocket” argument: fine, I’m willing to grant that the agent ‘has’ these meta-logical beliefs, I just don’t yet understand how you propose the agent use them.
I mean, I’m particularly exasperated because you explicitly state that the meta-logical beliefs don’t change the agent’s reasoning at all; it seems like you’re boldly asserting that they are just a talisman which the agent keeps in its back pocket, and that this somehow solves everything.
in practice meta-logical beliefs are entirely implicit, and do not appear in the reasoning system used by agents.
In the new terminology I am proposing, I think perhaps what you are saying is that the meta-logical stuff is aliefs and thus not apparent in the beliefs of the system. What do you think of that interpretation?
(Obviously, I don’t think your proposal solves the Lobstacle in that case, but, that is a separate question.)
If I build a chess-playing robot, its program will not explicitly include the assumption that the rules of chess it carries are the correct ones,
Right. To me this overtly reads as “a chess playing robot only alieves the rules of chess. It has no explicit endorsement of those rules.” Which makes perfect sense to me.
, nor will it use such assumptions in determining its moves, because such determinations are made without any reference to a semantic map.
But then, this part of the sentence seems to go completely off the rails for me; a chess-playing robot will be no good at all if the rules of chess don’t bear on its moves. Can you clarify?
This is why I said the agent doesn’t change its reasoning.
I still don’t understand this part at all; could you go into more depth?
The metalogical beliefs thus are only relevant when an agent applies its (completed) reasoning to its environment via the semantic map.
This helps clarify a bit; it sounds like maybe we were conflating different notions of “reasoning”. IE, I interpreted you as saying the semantic map does not enter into the agent’s algorithm anywhere, not in the logical deduction system nor in the reasoning which the agent uses to decide on actions based on those logical deductions; whereas you were saying the semantic map does not enter into the agent’s logical deductions, but rather, has to do with how those deductions are used to make decisions.
But I am still confused by your proposal:
We can formalize that situation via the extended logic L’ if you like, in which case this is probably the answer that you keep demanding:
(It seems worth clarifying that I’m only demanding that the meta-logical reasoning be formalized for us, ie that you give a formal statement of the agent design which I can understand, not formalized for the agents, ie put into their formal system)
Both A and B “reason” in L’ (B could even work in a distinct extension of L), but will only accept proofs in the fragment L. Since the axioms in L’ extending L are identifiable by their reference to the semantic mapping, there is no risk of “belief contamination” in proofs, if that were ever a concern.
Lob’s theorem says that the conclusions X where L can believe its own reasoning to be sound are precisely the cases where L can prove X, that is, □(□X→X)→□X. So normally, L’ can accept proofs in L’ just fine, if they are concretely given. The lobstacle has to do with a logic trusting itself in the abstract, that is, trusting proofs before you see them. (Hence, the lobstacle doesn’t block an agent from trusting its future self if it can concretely simulate what its future self can do, because there will be no specific reasoning step which it will reject; what the lobstacle blocks is for an agent to trust its future self with unforseen chains of reasoning.)
So, one interpretation: you propose to modify this situation, by restricting L’ to only accept proofs given in L.
This interpretation seems to conform to my “back pocket” argument. Both A and B “reason” in L’, but really, only accept L. (This explains why you said ‘”reason”’ in scare quotes.) But, I’m left wondering where/how L’ matters here, and how this proposal differs from just building naive logical agents who use L (and fall prey to the lobstacle).
Thus, I’m probably better off interpreting you as referring to “accepting proofs” in the abstract, rather than concretely given proofs. With this interpretation, it seems you mean to say that A and B alieve L’ (that is, they use L’ internally), but only explicitly endorse proofs in L (that is, they can only accept principles of reasoning which warrant L-proofs, not L’-proofs, when they abstractly consider which proofs are normative).
But then, your bolded statement seems to just be a re-statement of the Lobstacle: logical agents can’t explicitly endorse their own logic L’ which they use to reason, but rather, can only generally accept reasoning in some weaker fragment L.
Without an intended semantics, what the probabilities assigned to formulas can only be interpreted as beliefs/levels of certainty about provability of statements from axioms (which it also believes with more or less certainty). This is great, but as soon as you want your logical inductor to reason about a particular mathematical object, the only way to turn those beliefs about provability into beliefs about truth in the model, you need to extend the inductor (explicitly or implicitly) with meta-logical beliefs about the soundness of that map, since it will update its beliefs based on provability even if its proof methods aren’t sound.
This part doesn’t seem totally relevant any more, but to respond: sure, but the standard model is an infinite object, so we don’t really know how to extend the inductor to (implicitly or explicitly) know the intended semantic map. (Although this is a subject of some interest for me.) It’s plausible that there’s no way. But then, it also becomes plausible that humans don’t do this (ie don’t actually possess a full semantic map for peano arithmetic).
Instead, I’ve come to think about us (like logical inductors) as being uncertain about the model (ie uncertain whether numbers with specific properties exist, rather than possessing any tool which would fully differentiate between standard and non-standard numbers).
As such, I feel you’ve misunderstood me here. I don’t want semantics-independent reasoning at all, if anything the opposite: reasoning that prioritises verifying soundness of a logic wrt specified semantic maps in an empirical way, and which can adapt its reasoning system when soundness is shown to fail.
Your critique of logical induction in the post was “fixed semantics are inflexible”. But now it seems like you are the one who proposes to have a fixed semantics (carrying a specific semantic map, eg, the standard model), and modifying the logic when that semantic map shows the logic to be unsound. In contrast, logical induction is the one who proposes to have no fixed semantic map (instead being uncertain about the model), and modifying those beliefs based on what the logic proves.
My critique of this is only that I don’t know how to “have the standard model”. In general, you propose that the agent has a semantic map which can be used to check the soundness of the logic. But I am skeptical: a semantic map will usually refer to things we cannot check (EG the external world, or uncomputable facts about numbers). Don’t get me wrong: I would love to be able to check those things more properly. It’s just that I don’t see a way (beyond what is already possible with logical reasoning, or more haphazardly but more powerfully, with the probabilistic extension of logical reasoning that logical induction gives us).
(And I also don’t yet see what that part has to do with getting around the Lobstacle.)
I like the alief/belief distinction, this seems to carry the distinction I was after. To make it more formal, I’ll use “belief” to refer to ‘things which an agent can prove in its reasoning engine/language (L)‘, and “alief” to refer to beliefs plus ‘additional assumptions which the agent makes about the bearing of that reasoning on the environment’, which together constitute a larger logic (L’). Does that match the distinction you intended between these terms?
An immediate pedagogical problem with this terminology is that we have to be careful not to conflate this notion of belief with the usual one: an agent will still be able to prove things in L even if it doesn’t believe (in the conventional sense) that the L-proof involved is valid.
There is a more serious formalization issue at play, though, which is the problem of expressing a negative alief. How does one express that an agent “does not alieve that a proof of X in L implies that X is true”? ¬(□X→X) is classically equivalent to □X∧¬X, which in particular is an assertion of both the existence of a proof of X and the falsehood of X, which is clearly far stronger than the intended claim. This is going off on a tangent, so for now I will just assume that it is possible to express disaliefs by introducing some extra operators in L’ and get on with it.
So, to be absolutely clear on this: do you accept the mathematical proofs in the original paper (but propose some way around them), or do you think they are actually mistaken? Do you accept the proof of Lob’s theorem itself, or do you think it is mistaken?
Yes. The mathematical proofs and Löb’s theorem are absolutely fine. What I’m refuting is their relevance; specifically the validity of this claim:
An agent can only trust another agent if it believes that agent’s aliefs.
My position is that *when their beliefs are sound* an agent only ever needs to *alieve* another agents *beliefs* in order to trust them. A definition of trust which fails to be reflexive is clearly a bad definition, and with this modified definition there is no obstacle because “beliefs being strictly weaker than aliefs” is the default (indeed, by Löb’s theorem, the only reasonable possibility), and can be implemented symmetrically between multiple agents; no heirarchy of logics or other such solution is needed.
Note that following the construction in the article, the secondary agent B can only act on the basis of a valid L-proof, so there is no need to distinguish between trusting what B says (the L-proofs B produces) and what B does (their subsequent action upon producing an L-proof). Attaining this guarantee in practice, so as to be able to trust that B will do what they have promised to do, is a separate but important problem. In general, the above notion of trust will only apply to what another agent says, or more precisely to the proofs they produce.
> So if A constructs B to produce proofs in L, it doesn’t matter what B’s beliefs are, or even if B has any beliefs;
But using proofs in L is precisely what I understood from “belief”; so, if you have something else in mind, I will need clarification.
In the language you introduce, and bearing in mind what I said above, this would be restated as “it doesn’t matter what B’s aliefs are”. I can use this to illustrate why I needed to include the condition “when their beliefs are sound” in the above: suppose A and B have have differing/imperfect knowledge about the state of the environment, to the effect that B may disalieve the soundness of L, while A alieves it. The result is that B might alieve they are misleading A by communicating an L-proof which B alieves to be invalid. But if A’s application of L is in fact sound (so B is in the wrong) then A can still trust and successfully apply the L-proof supplied by B.
> …nor will it use such assumptions in determining its moves, because such determinations are made without any reference to a semantic map.
But then, this part of the sentence seems to go completely off the rails for me; a chess-playing robot will be no good at all if the rules of chess don’t bear on its moves. Can you clarify?
You’re right, this part was a mistake. Such a robot will optimize play with respect to what it alieves are the rules, and so will become very good at what it implicitly alieves chess to be.
> This is why I said the agent doesn’t change its reasoning.
I still don’t understand this part at all; could you go into more depth?
I was taking “reasoning” here to mean “applying the logic L” (so manipulating statements of belief), since any assumptions lying strictly in L’ are only applied passively. It feels strange to me to extend “reasoning” to include this implicit stuff, even if we are including it in our formal model of the agent’s behaviour.
But then, your bolded statement seems to just be a re-statement of the Löbstacle: logical agents can’t explicitly endorse their own logic L’ which they use to reason, but rather, can only generally accept reasoning in some weaker fragment L.
It’s certainly a restatement of Löb’s theorem. My assertion is that there is no resultant obstacle.
Re the rest,
(And I also don’t yet see what that part has to do with getting around the Löbstacle.)
It’s not relevant to getting around the Löbstacle; this part of the discussion was the result of me proposing a possible advantage of the perspective shift which (I believe, but have yet to fully convince you) resolves the Löbstacle. I agree that this part is distracting, but it’s also interesting, so please direct message me (via whatever means is available on LW, or by finding me on the MIRIx Discord server or AI alignment Slack) if you have time to discuss it some more.
I like the alief/belief distinction, this seems to carry the distinction I was after. To make it more formal, I’ll use “belief” to refer to ‘things which an agent can prove in its reasoning engine/language (L)‘, and “alief” to refer to beliefs plus ‘additional assumptions which the agent makes about the bearing of that reasoning on the environment’, which together constitute a larger logic (L’). Does that match the distinction you intended between these terms?
Unfortunately, this seems almost opposite to the way I was defining the terminology. I had it that the aliefs are precisely what is proven by the formal system, and the beliefs are what the agent would explicitly endorse if asked. Aliefs are what you feel in your bones. So if the “bones” of the agent are the formal system, that’s the aliefs.
Note also that your definition implies that if an agent alieves something, it must also believe it. In contrast, part of the point for me is that an agent can alieve things without believing them. I would also allow the opposite, for humans and other probabilistic reasoners, though for pure-logic agents this would have to correspond to unsoundness. But pure-logical agents have to have the freedom to alieve without believing, on pain of inconsistency, even if we can’t model belief-without-alief in pure logic.
I find it interesting that you (seemingly) nodded along with my descriptions, but then proposed a definition which was almost opposite mine! I think there’s probably a deep reason for that (having to do with how difficult it is to reliably distinguish alief/belief), but I’m not grasping it for now. It is a symptom of my confusion in this regard that I’m not even sure we’re pointing to different notions of belief/alief even though your definition sounds almost opposite to me. It is well within the realm of possibility that we mean the same thing, and are just choosing very different ways to talk about it.
Specifically, your definition seems fine if L is not the formal language which the agent is hard-wired with, but rather, some logic which the agent explicitly endorses (like the relationship that we have with Peano Arithmetic). Then, yeah, “belief” is about provability in L, while “alief” implies that the agent has some “additional assumptions about the bearing of that reasoning on the environment”. Totally! But then, this suggests that those additional assumptions are somehow represented in some other subsystem of the agent (outside of L). The logic of that other subsystem is what I’m interested in. If that other subsystem uses L’, then it makes sense that the agent explicitly believes L. But now the aliefs of the agent are represented by L’. That is: L’ is the logic within the agent’s bones. So it’s L’ that I’m talking about when I define “aliefs” as the axiomatic system, and “beliefs” as more complicated (opposite to your definition).
Over this discussion, a possible interpretation of what you’re saying that’s been in the back of my mind has been that you think agents should not rely on formal logic in their bones, but rather, should use formal logic only as part of their explicit thinking IE a form of thinking which other, “more basic” reasoning systems use as a tool (a tool which they can choose to endorse or not). For example, you might believe that an RL system decides whether to employ logical reasoning. Or deep learning. Etc. In this case, you might say that there is no L’ to find (no logical system represents the aliefs).
To be clear, I do think this is a kind of legitimate response to the Lobstacle: the response of rejecting logic (as a tool for understanding what’s going on in an agent’s “bones” IE their basic decision architecture). This approach says: “Look what happens when you try to make the agent overly reliant on logic! Don’t do that!”
However, the Lobstacle is specifically about using logic to describe or design decision-making procedures. So, this ‘solution’ will not be very satisfying for people trying to do that. The puzzling nature of the Lobstacle remains: the claim is that RL (or something) has to basically solve the problem; we can’t use logic. But why is this? Is it because agents have to “be irrational” at some level (ie, their basic systems can’t conform to the normative content of logic)?
Anyway, this may or may not resemble your view. You haven’t explicitly come out and said anything like that, although it does seem like you think there should be a level beyond logic in some sense.
An immediate pedagogical problem with this terminology is that we have to be careful not to conflate this notion of belief with the usual one: an agent will still be able to prove things in L even if it doesn’t believe (in the conventional sense) that the L-proof involved is valid.
It seems like you think this property is so important that it’s almost definitional, and so, a notion of belief which doesn’t satisfy it is in conflict with the usual notion of belief. I just don’t have this intuition. My notion of belief-in-contrast-to-alief does contrast with the informal notion, but I would emphasize other differences:
“belief” contrasts with the intuitive notion of belief in that it much less implies that you’ll act on a belief; our intuitive notion more often predicts that you’ll act on something if you really believe it.
“alief” contrasts with the intuitive notion of belief in that it much less implies that you’ll explicitly endorse something, or even be able to articulate it in language at all.
In other words, I see the intuitive notion of belief as really consisting of two parts, belief and alief, which are intuitively assumed to go together but which we are splitting apart here.
The property you mention is derived from this conflation, because in fact we need to alieve a reasoning process in order to believe its outputs; so if we’re conflating alief and belief, then it seems like we need to believe that L-proofs are valid in order to see L-proofs and (as a direct result) come to believe what’s been proved.
But this is precisely what’s at stake, in making the belief/alief distinction: we want to point out that this isn’t a healthy relationship to have with logic (indeed, Godel shows how it leads to inconsistency).
There is a more serious formalization issue at play, though, which is the problem of expressing a negative alief. How does one express that an agent “does not alieve that a proof of X in L implies that X is true”? ¬(□X→X) is classically equivalent to □X∧¬X, which in particular is an assertion of both the existence of a proof of X and the falsehood of X, which is clearly far stronger than the intended claim. This is going off on a tangent, so for now I will just assume that it is possible to express disaliefs by introducing some extra operators in L’ and get on with it.
I like that you pointed this out, but yeah, it doesn’t seem especially important to our discussion. In any case, I would define disalief as something like this:
The agent’s mental architecture lacks a deduction rule accepting proofs in L, or anything tantamount to such a rule.
Note that the agent might also disbelieve in such a rule, IE, might expect some such proofs to have false conclusions. But this is, of course, not necessary. In particular it would not be true in the case of logics which the agent explicitly endorses (and therefore must not alieve).
Yes. The mathematical proofs and Löb’s theorem are absolutely fine. What I’m refuting is their relevance; specifically the validity of this claim:
An agent can only trust another agent if it believes that agent’s aliefs.
My position is that *when their beliefs are sound* an agent only ever needs to *alieve* another agents *beliefs* in order to trust them.
Hrm. I suspect language has failed us, and we need to make some more distinctions (but I am not sure what they are).
In my terms, if A alieves B’s beliefs, then (for example) if B explicitly endorses ZFC, then A must be using ZFC “in its bones”. But if B explicitly endorses ZFC, then the logic which B is using must be more powerful than that. So A might be logically weaker than B (if A only alieves ZFC, and not the stronger system which B used to decide that ZFC is sound). If so, A cannot trust B (A does not alieve that B’s reasoning is sound, that is, A does not believe B).
I have to confess that I’m confusing myself a bit, and am tempted to give yet another (different, incompatible) definition for the alief/belief split. I’ll hold off for now, but I hope it’s very clear that I’m not accusing all confusion about this as coming from you—I’m aiming to minimize confusion, but I still worry that I’ve introduced contradictory ideas in this conversation. (I’m actually tempted to start distinguishing 3 levels rather than 2! Alief/belief are relative terms, and I worry we’re actually conflating important subtleties by using only 2 terms rather than having multple levels...)
A definition of trust which fails to be reflexive is clearly a bad definition, and with this modified definition there is no obstacle
This goes back to the idea that you seem to think “belief in X implies belief in the processes whereby you came to believe in X” is so important as to be definitional, where I think this property has to be a by-product of other things.
In my understanding, the definition of “trust” should not explicitly allow or disallow this, if we’re going to be true to what “trust” means. Rather, for the Lobstacle, “A trusts B” has to be defined as “A willingly relies on B to perform mission-critical tasks”. This definition does indeed fail to be true for naive logical agents. But this should be an argument against naive logical agents, not our notion of trust.
Hence my perception that you do indeed have to question the theorems themselves, in order to dispute their “relevance” to the situation. The definition of trust seems fixed in place to me; indeed, I would instead have to question the relevance of your alternative definition, since what I actually want is the thing studied in the paper (IE being able to delegate critical tasks to another agent).
Note that following the construction in the article, the secondary agent B can only act on the basis of a valid L-proof, so there is no need to distinguish between trusting what B says (the L-proofs B produces) and what B does (their subsequent action upon producing an L-proof).
Ok, but if agent B can only act on valid L-proofs, it seems like agent B has been given a frontal lobotomy (IE this is just the “make sure my future self is dumber” style of solution to the problem).
Or, on the other hand, if the agent A also respects this same restriction, then A cannot delegate tasks to B (because A can’t prove that it’s OK to do so, at least not in L, the logic which it has been restricted to use when it comes to deciding how to act).
Which puts us back in the same lobstacle.
Attaining this guarantee in practice, so as to be able to trust that B will do what they have promised to do, is a separate but important problem. In general, the above notion of trust will only apply to what another agent says, or more precisely to the proofs they produce.
Is this a crux for you? My thinking is that this is going to be a deadly sticking point. It seems like you’re admitting that your approach has this problem, but, you think there’s value in what you’ve done so far because you’ve solved one part of the problem and you think this other part could also work with time. Is that what you’re intending to say? Whereas to me, it looks like this other part is just doomed to fail, so I don’t see what the value in your proposal could be.
For me, solving the Lobstacle means being able to actually decide to delegate.
I was taking “reasoning” here to mean “applying the logic L” (so manipulating statements of belief), since any assumptions lying strictly in L’ are only applied passively. It feels strange to me to extend “reasoning” to include this implicit stuff, even if we are including it in our formal model of the agent’s behaviour.
I think I get what you’re saying here. But if the assumptions lying strictly in L’ are only applied passively, how does it help us? I’m thinking your current answer is (as you’ve already said) “trusting that B will do what they’ve said they’ll do is a separate problem”—IE you aren’t even trying to build the full bridge between B thinking something is a good idea and A trusting B with such tasks.
Both A and B “reason” in L’ (B could even work in a distinct extension of L), but will only accept proofs in the fragment L.
[...]
But then, your bolded statement seems to just be a re-statement of the Löbstacle: logical agents can’t explicitly endorse their own logic L’ which they use to reason, but rather, can only generally accept reasoning in some weaker fragment L.
It’s certainly a restatement of Löb’s theorem. My assertion is that there is no resultant obstacle.
I still really, really don’t get why your language is stuff like “there’s no resultant obstacle” and “what I’m refuting is their relevance”. Your implication is “there was not a problem to begin with” rather than “I have solved the problem”. I asked whether you objected to details of the math in the original paper, and you said no—so apparently you would agree with the result that naive logical agents fail to trust their future self (which is the lobstacle!). Solving the lobstacle would presumably involve providing an improved agent design which would avoid the problem. Yet this seems like it’s not what you want to do—instead you claim something else, along the lines of claiming that there’s not actually a problem?
So, basically, what do you claim to accomplish? I suspect I’m still really misunderstanding that part of it.
Re the rest,
(And I also don’t yet see what that part has to do with getting around the Löbstacle.)
It’s not relevant to getting around the Löbstacle; this part of the discussion was the result of me proposing a possible advantage of the perspective shift which (I believe, but have yet to fully convince you) resolves the Löbstacle. I agree that this part is distracting, but it’s also interesting, so please direct message me (via whatever means is available on LW, or by finding me on the MIRIx Discord server or AI alignment Slack) if you have time to discuss it some more.
Since you’re now identifying it as another part of the perspective shift which you’re trying to communicate (rather than just some technical distraction), it sounds like it might actually be pretty helpful toward me understanding what you’re trying to get at. But, there are already a lot of points floating around in this conversation, so maybe I’ll let it drop.
I’m somewhat curious if you think you’ve communicated your perspective shift to any other person; so far, I’m like “there just doesn’t seem to be anything real here”, but maybe there are other people who get what you’re trying to say?
Note also that your definition implies that if an agent alieves something, it must also believe it.
I find it interesting that you (seemingly) nodded along with my descriptions, but then proposed a definition which was almost opposite mine!
I don’t know how you so misread what I said; I explicitly wrote that aliefs constitute the larger logic, so that beliefs are contained in aliefs (which I’m pretty sure is what you were going for!) and not vice versa. Maybe you got confused because I put beliefs first in this description, or because I described the smaller of the two logics as the “reasoning engine” (for the reason I subsequently provided)? What you said almost convinced me that our definitions actually align, until I reached the point where you said that that beliefs could be “more complicated” than aliefs, which made me unsure.
Anyway, since you keep taking the time to thoroughly reply in good faith, I’ll do my best to clarify and address some of the rest of what you’ve said. However, thanks to the discussion we’ve had so far, a more formal presentation of my ideas is crystallizing in my mind; I prefer to save that for another proper post, since I anticipate it will involve rejigging the terminology again, and I don’t want to muddy the waters further!
Rather, for the Lobstacle, “A trusts B” has to be defined as “A willingly relies on B to perform mission-critical tasks”. This definition does indeed fail to be true for naive logical agents. But this should be an argument against naive logical agents, not our notion of trust.
Hence my perception that you do indeed have to question the theorems themselves, in order to dispute their “relevance” to the situation. The definition of trust seems fixed in place to me; indeed, I would instead have to question the relevance of your alternative definition, since what I actually want is the thing studied in the paper (IE being able to delegate critical tasks to another agent).
Perhaps we have different intuitive notions of trust, since I certainly trust myself (to perform “mission-critical tasks”), at least as far as my own logical reasoning is concerned, and an agent that doesn’t trust itself is going to waste a lot of time second-guessing its own actions. So I don’t think you’ve addressed my argument that the definition of trust that leads to the Löbstacle is faulty because it fails to be reflexive.
Attaining this guarantee in practice, so as to be able to trust that B will do what they have promised to do, is a separate but important problem. In general, the above notion of trust will only apply to what another agent says, or more precisely to the proofs they produce.
Is this a crux for you? My thinking is that this is going to be a deadly sticking point. It seems like you’re admitting that your approach has this problem, but, you think there’s value in what you’ve done so far because you’ve solved one part of the problem and you think this other part could also work with time. Is that what you’re intending to say? Whereas to me, it looks like this other part is just doomed to fail, so I don’t see what the value in your proposal could be. For me, solving the Lobstacle means being able to actually decide to delegate.
There are two separate issues here, and this response makes it apparent that you are conflating them. The fact that the second agent in the original Löbstacle paper is constrained to act only once it has produced a provably effective strategy and is constrained to follow that strategy means that the Löbstacle only applies to questions concerning the (formal) reasoning of a subordinate agent. Whether or not I manage to convince you that the Löbstacle doesn’t exist (because it’s founded on an untenable definition of trust), you have to acknowledge that the argument as presented there doesn’t address the following second problem. Suppose I can guarantee that my subordinate uses reasoning that I believe to be valid. How can I guarantee that it will act on that reasoning in a way I approve of? This is (obviously) a rather general version of the alignment problem. If you’re claiming that Löb’s theorem has a bearing on this, then that would be big news, especially if it vindicates your opinion that it is “doomed to fail”.
The reason I see my post as progress is that currently the Löbstacle is blocking serious research in using simple systems of formal agents to investigate such important problems as the alignment problem.
Your implication is “there was not a problem to begin with” rather than “I have solved the problem”. I asked whether you objected to details of the math in the original paper, and you said no—so apparently you would agree with the result that naive logical agents fail to trust their future self (which is the lobstacle!).
Taking the revised definition of trust I described, that last sentence is no longer the content of any formal mathematical result in that paper, so I do not agree with it, and I stand by what I said.
Indeed, my claim boils down to saying that there is no problem. But I don’t see why that doesn’t constitute a solution to the apparent problem. It’s like the Missing Dollar Riddle; explaining why there’s no problem is the solution.
I’m somewhat curious if you think you’ve communicated your perspective shift to any other person; so far, I’m like “there just doesn’t seem to be anything real here”, but maybe there are other people who get what you’re trying to say?
There’s no real way for me to know. Everyone who I’ve spoken to about this in person has gotten it, but that only amounts to a handful of people. It’s hard to find an audience; I hoped LW would supply one, but so far it seems not. Hopefully a more formal presentation will improve the situation.
Anyway, since you keep taking the time to thoroughly reply in good faith, I’ll do my best to clarify and address some of the rest of what you’ve said. However, thanks to the discussion we’ve had so far, a more formal presentation of my ideas is crystallizing in my mind; I prefer to save that for another proper post, since I anticipate it will involve rejigging the terminology again, and I don’t want to muddy the waters further!
Looks like I forgot about this discussion! Did you post a more formal treatment?
I don’t know how you so misread what I said; I explicitly wrote that aliefs constitute the larger logic, so that beliefs are contained in aliefs (which I’m pretty sure is what you were going for!) and not vice versa. Maybe you got confused because I put beliefs first in this description, or because I described the smaller of the two logics as the “reasoning engine” (for the reason I subsequently provided)? What you said almost convinced me that our definitions actually align, until I reached the point where you said that that beliefs could be “more complicated” than aliefs, which made me unsure.
Sorry for the confusion here! I haven’t re-oriented myself to the whole context, but it sounds like I did invent a big disagreement that didn’t exist. This has to do with my continued confusion about your approach. But in retrospect I do think your early accusation that I was insisting on some rigid assumptions holds water; I needed to go a bit further afield to try and interpret what you were getting at.
Whether or not I manage to convince you that the Löbstacle doesn’t exist (because it’s founded on an untenable definition of trust), you have to acknowledge that the argument as presented there doesn’t address the following second problem.
Again, I haven’t yet understood your approach or even re-read the whole conversation here, but it now seems to me that I was doing something wrong and silly by insisting on a definition of trust that forces the Löbstacle. The original paper is careful to only state that Löb naively seems to present an obstacle, not that it really truly does so. It looks to me like I was repeatedly stubborn on this point in an unproductive way.
I’m not insisting on anything about the agent design; I simply remain puzzled how your proposal differs from the one in the paper, and so, continue to assume it’s like the one in the paper except those differences which I understand. Your statements have been reading almost like you think you’ve refuted formal arguments in the paper (implying you don’t think the formal agent designs do what I think they do), rather than providing a new agent design which does better. This contributes to my feeling that you are really just wrestling with Lob’s theorem itself. Nor do I recall you saying much to dissuade me of this. In your OP, you use the language “Under the belief that the Löbian Obstacle was a genuine problem to be avoided”, as if the original idea was simply mistaken.
So, to be absolutely clear on this: do you accept the mathematical proofs in the original paper (but propose some way around them), or do you think they are actually mistaken? Do you accept the proof of Lob’s theorem itself, or do you think it is mistaken?
But using proofs in L is precisely what I understood from “belief”; so, if you have something else in mind, I will need clarification.
I think I haven’t understood much of this.
I would argue that the only way for A’s reasoning in L to have practical value is for A to act on such reasoning. The only way for A’s reasoning to have practical value for some other agent B is for B to be working under the assumption that provability-in-L implies truth under S. Indeed, this seems in line with your detective drama example. If the superiors were put in the position of the rogue detective, often they would have come to the same conclusion. Yet, looking at the problem from a removed position, they cannot accept the conclusion. But this is precisely the lobstacle: the condition for reasoning to have value to an agent A is different from the condition for A’s reasoning to have value to another agent B.
Let me therefore make a (not totally formal) distinction based on the alief/belief idea:
The aliefs of a logic-based agent are the axioms/theorems which inform action;
The beliefs of a logic-based agent are the axioms/theorems which the agent explicitly endorses.
This is also a version of the distinction between tacit knowledge (eg knowing how to ride a bike) vs explicit knowledge (eg being able to explain scientifically how bike-riding works). To say that an agent “alieves” a logic L is to say that it tacitly knows how to reason in L.
Roughly, I think you have been talking about beliefs, where I have been talking about aliefs.
Now, a “naive logical agent” (ie the simple agent design which falls prey to the Lobstacle, in the paper) has a few important properties:
Goedel’s incompleteness theorems tells us that aliefs cannot equal beliefs; a logic never explicitly endorses itself, but rather, only explicitly endorses weaker logics.
An agent constructed based on the logic L alieves L, but does not believe L (this is a re-statement of the lobstacle).
An agent can only trust another agent if it believes that agent’s aliefs.
An agent therefore cannot belief its own aliefs. (This is another restatement of the lobstacle.)
These statements are not necessarily true of logical agents in general, of course, since the paper gives some counterexamples (agents who can trust the same logic which they base their actions on). But those counterexamples all have some oddities. I presume your goal is to get rid of those oddities while still avoiding the lobstacle.
The thing I’m exasperated by is, rather, that I don’t understand what you are doing with the meta-logical beliefs if you aren’t including them in the formal system. It’s fine if you don’t include them in the formal system; just tell me what you are doing instead! This was the point of my “I have the semantic map here in my back pocket” argument: fine, I’m willing to grant that the agent ‘has’ these meta-logical beliefs, I just don’t yet understand how you propose the agent use them.
I mean, I’m particularly exasperated because you explicitly state that the meta-logical beliefs don’t change the agent’s reasoning at all; it seems like you’re boldly asserting that they are just a talisman which the agent keeps in its back pocket, and that this somehow solves everything.
In the new terminology I am proposing, I think perhaps what you are saying is that the meta-logical stuff is aliefs and thus not apparent in the beliefs of the system. What do you think of that interpretation?
(Obviously, I don’t think your proposal solves the Lobstacle in that case, but, that is a separate question.)
Right. To me this overtly reads as “a chess playing robot only alieves the rules of chess. It has no explicit endorsement of those rules.” Which makes perfect sense to me.
But then, this part of the sentence seems to go completely off the rails for me; a chess-playing robot will be no good at all if the rules of chess don’t bear on its moves. Can you clarify?
I still don’t understand this part at all; could you go into more depth?
This helps clarify a bit; it sounds like maybe we were conflating different notions of “reasoning”. IE, I interpreted you as saying the semantic map does not enter into the agent’s algorithm anywhere, not in the logical deduction system nor in the reasoning which the agent uses to decide on actions based on those logical deductions; whereas you were saying the semantic map does not enter into the agent’s logical deductions, but rather, has to do with how those deductions are used to make decisions.
But I am still confused by your proposal:
(It seems worth clarifying that I’m only demanding that the meta-logical reasoning be formalized for us, ie that you give a formal statement of the agent design which I can understand, not formalized for the agents, ie put into their formal system)
Lob’s theorem says that the conclusions X where L can believe its own reasoning to be sound are precisely the cases where L can prove X, that is, □(□X→X)→□X. So normally, L’ can accept proofs in L’ just fine, if they are concretely given. The lobstacle has to do with a logic trusting itself in the abstract, that is, trusting proofs before you see them. (Hence, the lobstacle doesn’t block an agent from trusting its future self if it can concretely simulate what its future self can do, because there will be no specific reasoning step which it will reject; what the lobstacle blocks is for an agent to trust its future self with unforseen chains of reasoning.)
So, one interpretation: you propose to modify this situation, by restricting L’ to only accept proofs given in L.
This interpretation seems to conform to my “back pocket” argument. Both A and B “reason” in L’, but really, only accept L. (This explains why you said ‘”reason”’ in scare quotes.) But, I’m left wondering where/how L’ matters here, and how this proposal differs from just building naive logical agents who use L (and fall prey to the lobstacle).
Thus, I’m probably better off interpreting you as referring to “accepting proofs” in the abstract, rather than concretely given proofs. With this interpretation, it seems you mean to say that A and B alieve L’ (that is, they use L’ internally), but only explicitly endorse proofs in L (that is, they can only accept principles of reasoning which warrant L-proofs, not L’-proofs, when they abstractly consider which proofs are normative).
But then, your bolded statement seems to just be a re-statement of the Lobstacle: logical agents can’t explicitly endorse their own logic L’ which they use to reason, but rather, can only generally accept reasoning in some weaker fragment L.
This part doesn’t seem totally relevant any more, but to respond: sure, but the standard model is an infinite object, so we don’t really know how to extend the inductor to (implicitly or explicitly) know the intended semantic map. (Although this is a subject of some interest for me.) It’s plausible that there’s no way. But then, it also becomes plausible that humans don’t do this (ie don’t actually possess a full semantic map for peano arithmetic).
Instead, I’ve come to think about us (like logical inductors) as being uncertain about the model (ie uncertain whether numbers with specific properties exist, rather than possessing any tool which would fully differentiate between standard and non-standard numbers).
Your critique of logical induction in the post was “fixed semantics are inflexible”. But now it seems like you are the one who proposes to have a fixed semantics (carrying a specific semantic map, eg, the standard model), and modifying the logic when that semantic map shows the logic to be unsound. In contrast, logical induction is the one who proposes to have no fixed semantic map (instead being uncertain about the model), and modifying those beliefs based on what the logic proves.
My critique of this is only that I don’t know how to “have the standard model”. In general, you propose that the agent has a semantic map which can be used to check the soundness of the logic. But I am skeptical: a semantic map will usually refer to things we cannot check (EG the external world, or uncomputable facts about numbers). Don’t get me wrong: I would love to be able to check those things more properly. It’s just that I don’t see a way (beyond what is already possible with logical reasoning, or more haphazardly but more powerfully, with the probabilistic extension of logical reasoning that logical induction gives us).
(And I also don’t yet see what that part has to do with getting around the Lobstacle.)
I like the alief/belief distinction, this seems to carry the distinction I was after. To make it more formal, I’ll use “belief” to refer to ‘things which an agent can prove in its reasoning engine/language (L)‘, and “alief” to refer to beliefs plus ‘additional assumptions which the agent makes about the bearing of that reasoning on the environment’, which together constitute a larger logic (L’). Does that match the distinction you intended between these terms?
An immediate pedagogical problem with this terminology is that we have to be careful not to conflate this notion of belief with the usual one: an agent will still be able to prove things in L even if it doesn’t believe (in the conventional sense) that the L-proof involved is valid.
There is a more serious formalization issue at play, though, which is the problem of expressing a negative alief. How does one express that an agent “does not alieve that a proof of X in L implies that X is true”? ¬(□X→X) is classically equivalent to □X∧¬X, which in particular is an assertion of both the existence of a proof of X and the falsehood of X, which is clearly far stronger than the intended claim. This is going off on a tangent, so for now I will just assume that it is possible to express disaliefs by introducing some extra operators in L’ and get on with it.
Yes. The mathematical proofs and Löb’s theorem are absolutely fine. What I’m refuting is their relevance; specifically the validity of this claim:
My position is that *when their beliefs are sound* an agent only ever needs to *alieve* another agents *beliefs* in order to trust them. A definition of trust which fails to be reflexive is clearly a bad definition, and with this modified definition there is no obstacle because “beliefs being strictly weaker than aliefs” is the default (indeed, by Löb’s theorem, the only reasonable possibility), and can be implemented symmetrically between multiple agents; no heirarchy of logics or other such solution is needed.
Note that following the construction in the article, the secondary agent B can only act on the basis of a valid L-proof, so there is no need to distinguish between trusting what B says (the L-proofs B produces) and what B does (their subsequent action upon producing an L-proof). Attaining this guarantee in practice, so as to be able to trust that B will do what they have promised to do, is a separate but important problem. In general, the above notion of trust will only apply to what another agent says, or more precisely to the proofs they produce.
In the language you introduce, and bearing in mind what I said above, this would be restated as “it doesn’t matter what B’s aliefs are”. I can use this to illustrate why I needed to include the condition “when their beliefs are sound” in the above: suppose A and B have have differing/imperfect knowledge about the state of the environment, to the effect that B may disalieve the soundness of L, while A alieves it. The result is that B might alieve they are misleading A by communicating an L-proof which B alieves to be invalid. But if A’s application of L is in fact sound (so B is in the wrong) then A can still trust and successfully apply the L-proof supplied by B.
You’re right, this part was a mistake. Such a robot will optimize play with respect to what it alieves are the rules, and so will become very good at what it implicitly alieves chess to be.
I was taking “reasoning” here to mean “applying the logic L” (so manipulating statements of belief), since any assumptions lying strictly in L’ are only applied passively. It feels strange to me to extend “reasoning” to include this implicit stuff, even if we are including it in our formal model of the agent’s behaviour.
It’s certainly a restatement of Löb’s theorem. My assertion is that there is no resultant obstacle.
Re the rest,
It’s not relevant to getting around the Löbstacle; this part of the discussion was the result of me proposing a possible advantage of the perspective shift which (I believe, but have yet to fully convince you) resolves the Löbstacle. I agree that this part is distracting, but it’s also interesting, so please direct message me (via whatever means is available on LW, or by finding me on the MIRIx Discord server or AI alignment Slack) if you have time to discuss it some more.
Unfortunately, this seems almost opposite to the way I was defining the terminology. I had it that the aliefs are precisely what is proven by the formal system, and the beliefs are what the agent would explicitly endorse if asked. Aliefs are what you feel in your bones. So if the “bones” of the agent are the formal system, that’s the aliefs.
Note also that your definition implies that if an agent alieves something, it must also believe it. In contrast, part of the point for me is that an agent can alieve things without believing them. I would also allow the opposite, for humans and other probabilistic reasoners, though for pure-logic agents this would have to correspond to unsoundness. But pure-logical agents have to have the freedom to alieve without believing, on pain of inconsistency, even if we can’t model belief-without-alief in pure logic.
I find it interesting that you (seemingly) nodded along with my descriptions, but then proposed a definition which was almost opposite mine! I think there’s probably a deep reason for that (having to do with how difficult it is to reliably distinguish alief/belief), but I’m not grasping it for now. It is a symptom of my confusion in this regard that I’m not even sure we’re pointing to different notions of belief/alief even though your definition sounds almost opposite to me. It is well within the realm of possibility that we mean the same thing, and are just choosing very different ways to talk about it.
Specifically, your definition seems fine if L is not the formal language which the agent is hard-wired with, but rather, some logic which the agent explicitly endorses (like the relationship that we have with Peano Arithmetic). Then, yeah, “belief” is about provability in L, while “alief” implies that the agent has some “additional assumptions about the bearing of that reasoning on the environment”. Totally! But then, this suggests that those additional assumptions are somehow represented in some other subsystem of the agent (outside of L). The logic of that other subsystem is what I’m interested in. If that other subsystem uses L’, then it makes sense that the agent explicitly believes L. But now the aliefs of the agent are represented by L’. That is: L’ is the logic within the agent’s bones. So it’s L’ that I’m talking about when I define “aliefs” as the axiomatic system, and “beliefs” as more complicated (opposite to your definition).
Over this discussion, a possible interpretation of what you’re saying that’s been in the back of my mind has been that you think agents should not rely on formal logic in their bones, but rather, should use formal logic only as part of their explicit thinking IE a form of thinking which other, “more basic” reasoning systems use as a tool (a tool which they can choose to endorse or not). For example, you might believe that an RL system decides whether to employ logical reasoning. Or deep learning. Etc. In this case, you might say that there is no L’ to find (no logical system represents the aliefs).
To be clear, I do think this is a kind of legitimate response to the Lobstacle: the response of rejecting logic (as a tool for understanding what’s going on in an agent’s “bones” IE their basic decision architecture). This approach says: “Look what happens when you try to make the agent overly reliant on logic! Don’t do that!”
However, the Lobstacle is specifically about using logic to describe or design decision-making procedures. So, this ‘solution’ will not be very satisfying for people trying to do that. The puzzling nature of the Lobstacle remains: the claim is that RL (or something) has to basically solve the problem; we can’t use logic. But why is this? Is it because agents have to “be irrational” at some level (ie, their basic systems can’t conform to the normative content of logic)?
Anyway, this may or may not resemble your view. You haven’t explicitly come out and said anything like that, although it does seem like you think there should be a level beyond logic in some sense.
It seems like you think this property is so important that it’s almost definitional, and so, a notion of belief which doesn’t satisfy it is in conflict with the usual notion of belief. I just don’t have this intuition. My notion of belief-in-contrast-to-alief does contrast with the informal notion, but I would emphasize other differences:
“belief” contrasts with the intuitive notion of belief in that it much less implies that you’ll act on a belief; our intuitive notion more often predicts that you’ll act on something if you really believe it.
“alief” contrasts with the intuitive notion of belief in that it much less implies that you’ll explicitly endorse something, or even be able to articulate it in language at all.
In other words, I see the intuitive notion of belief as really consisting of two parts, belief and alief, which are intuitively assumed to go together but which we are splitting apart here.
The property you mention is derived from this conflation, because in fact we need to alieve a reasoning process in order to believe its outputs; so if we’re conflating alief and belief, then it seems like we need to believe that L-proofs are valid in order to see L-proofs and (as a direct result) come to believe what’s been proved.
But this is precisely what’s at stake, in making the belief/alief distinction: we want to point out that this isn’t a healthy relationship to have with logic (indeed, Godel shows how it leads to inconsistency).
I like that you pointed this out, but yeah, it doesn’t seem especially important to our discussion. In any case, I would define disalief as something like this:
The agent’s mental architecture lacks a deduction rule accepting proofs in L, or anything tantamount to such a rule.
Note that the agent might also disbelieve in such a rule, IE, might expect some such proofs to have false conclusions. But this is, of course, not necessary. In particular it would not be true in the case of logics which the agent explicitly endorses (and therefore must not alieve).
Hrm. I suspect language has failed us, and we need to make some more distinctions (but I am not sure what they are).
In my terms, if A alieves B’s beliefs, then (for example) if B explicitly endorses ZFC, then A must be using ZFC “in its bones”. But if B explicitly endorses ZFC, then the logic which B is using must be more powerful than that. So A might be logically weaker than B (if A only alieves ZFC, and not the stronger system which B used to decide that ZFC is sound). If so, A cannot trust B (A does not alieve that B’s reasoning is sound, that is, A does not believe B).
I have to confess that I’m confusing myself a bit, and am tempted to give yet another (different, incompatible) definition for the alief/belief split. I’ll hold off for now, but I hope it’s very clear that I’m not accusing all confusion about this as coming from you—I’m aiming to minimize confusion, but I still worry that I’ve introduced contradictory ideas in this conversation. (I’m actually tempted to start distinguishing 3 levels rather than 2! Alief/belief are relative terms, and I worry we’re actually conflating important subtleties by using only 2 terms rather than having multple levels...)
This goes back to the idea that you seem to think “belief in X implies belief in the processes whereby you came to believe in X” is so important as to be definitional, where I think this property has to be a by-product of other things.
In my understanding, the definition of “trust” should not explicitly allow or disallow this, if we’re going to be true to what “trust” means. Rather, for the Lobstacle, “A trusts B” has to be defined as “A willingly relies on B to perform mission-critical tasks”. This definition does indeed fail to be true for naive logical agents. But this should be an argument against naive logical agents, not our notion of trust.
Hence my perception that you do indeed have to question the theorems themselves, in order to dispute their “relevance” to the situation. The definition of trust seems fixed in place to me; indeed, I would instead have to question the relevance of your alternative definition, since what I actually want is the thing studied in the paper (IE being able to delegate critical tasks to another agent).
Ok, but if agent B can only act on valid L-proofs, it seems like agent B has been given a frontal lobotomy (IE this is just the “make sure my future self is dumber” style of solution to the problem).
Or, on the other hand, if the agent A also respects this same restriction, then A cannot delegate tasks to B (because A can’t prove that it’s OK to do so, at least not in L, the logic which it has been restricted to use when it comes to deciding how to act).
Which puts us back in the same lobstacle.
Is this a crux for you? My thinking is that this is going to be a deadly sticking point. It seems like you’re admitting that your approach has this problem, but, you think there’s value in what you’ve done so far because you’ve solved one part of the problem and you think this other part could also work with time. Is that what you’re intending to say? Whereas to me, it looks like this other part is just doomed to fail, so I don’t see what the value in your proposal could be.
For me, solving the Lobstacle means being able to actually decide to delegate.
I think I get what you’re saying here. But if the assumptions lying strictly in L’ are only applied passively, how does it help us? I’m thinking your current answer is (as you’ve already said) “trusting that B will do what they’ve said they’ll do is a separate problem”—IE you aren’t even trying to build the full bridge between B thinking something is a good idea and A trusting B with such tasks.
I still really, really don’t get why your language is stuff like “there’s no resultant obstacle” and “what I’m refuting is their relevance”. Your implication is “there was not a problem to begin with” rather than “I have solved the problem”. I asked whether you objected to details of the math in the original paper, and you said no—so apparently you would agree with the result that naive logical agents fail to trust their future self (which is the lobstacle!). Solving the lobstacle would presumably involve providing an improved agent design which would avoid the problem. Yet this seems like it’s not what you want to do—instead you claim something else, along the lines of claiming that there’s not actually a problem?
So, basically, what do you claim to accomplish? I suspect I’m still really misunderstanding that part of it.
Since you’re now identifying it as another part of the perspective shift which you’re trying to communicate (rather than just some technical distraction), it sounds like it might actually be pretty helpful toward me understanding what you’re trying to get at. But, there are already a lot of points floating around in this conversation, so maybe I’ll let it drop.
I’m somewhat curious if you think you’ve communicated your perspective shift to any other person; so far, I’m like “there just doesn’t seem to be anything real here”, but maybe there are other people who get what you’re trying to say?
I don’t know how you so misread what I said; I explicitly wrote that aliefs constitute the larger logic, so that beliefs are contained in aliefs (which I’m pretty sure is what you were going for!) and not vice versa. Maybe you got confused because I put beliefs first in this description, or because I described the smaller of the two logics as the “reasoning engine” (for the reason I subsequently provided)? What you said almost convinced me that our definitions actually align, until I reached the point where you said that that beliefs could be “more complicated” than aliefs, which made me unsure.
Anyway, since you keep taking the time to thoroughly reply in good faith, I’ll do my best to clarify and address some of the rest of what you’ve said. However, thanks to the discussion we’ve had so far, a more formal presentation of my ideas is crystallizing in my mind; I prefer to save that for another proper post, since I anticipate it will involve rejigging the terminology again, and I don’t want to muddy the waters further!
Perhaps we have different intuitive notions of trust, since I certainly trust myself (to perform “mission-critical tasks”), at least as far as my own logical reasoning is concerned, and an agent that doesn’t trust itself is going to waste a lot of time second-guessing its own actions. So I don’t think you’ve addressed my argument that the definition of trust that leads to the Löbstacle is faulty because it fails to be reflexive.
There are two separate issues here, and this response makes it apparent that you are conflating them. The fact that the second agent in the original Löbstacle paper is constrained to act only once it has produced a provably effective strategy and is constrained to follow that strategy means that the Löbstacle only applies to questions concerning the (formal) reasoning of a subordinate agent. Whether or not I manage to convince you that the Löbstacle doesn’t exist (because it’s founded on an untenable definition of trust), you have to acknowledge that the argument as presented there doesn’t address the following second problem. Suppose I can guarantee that my subordinate uses reasoning that I believe to be valid. How can I guarantee that it will act on that reasoning in a way I approve of? This is (obviously) a rather general version of the alignment problem. If you’re claiming that Löb’s theorem has a bearing on this, then that would be big news, especially if it vindicates your opinion that it is “doomed to fail”.
The reason I see my post as progress is that currently the Löbstacle is blocking serious research in using simple systems of formal agents to investigate such important problems as the alignment problem.
Taking the revised definition of trust I described, that last sentence is no longer the content of any formal mathematical result in that paper, so I do not agree with it, and I stand by what I said.
Indeed, my claim boils down to saying that there is no problem. But I don’t see why that doesn’t constitute a solution to the apparent problem. It’s like the Missing Dollar Riddle; explaining why there’s no problem is the solution.
There’s no real way for me to know. Everyone who I’ve spoken to about this in person has gotten it, but that only amounts to a handful of people. It’s hard to find an audience; I hoped LW would supply one, but so far it seems not. Hopefully a more formal presentation will improve the situation.
Looks like I forgot about this discussion! Did you post a more formal treatment?
Sorry for the confusion here! I haven’t re-oriented myself to the whole context, but it sounds like I did invent a big disagreement that didn’t exist. This has to do with my continued confusion about your approach. But in retrospect I do think your early accusation that I was insisting on some rigid assumptions holds water; I needed to go a bit further afield to try and interpret what you were getting at.
Again, I haven’t yet understood your approach or even re-read the whole conversation here, but it now seems to me that I was doing something wrong and silly by insisting on a definition of trust that forces the Löbstacle. The original paper is careful to only state that Löb naively seems to present an obstacle, not that it really truly does so. It looks to me like I was repeatedly stubborn on this point in an unproductive way.