I like the alief/belief distinction; it seems to carry the distinction I was after. To make it more formal, I’ll use “belief” to refer to ‘things which an agent can prove in its reasoning engine/language (L)‘, and “alief” to refer to beliefs plus ‘additional assumptions which the agent makes about the bearing of that reasoning on the environment’, which together constitute a larger logic (L’). Does that match the distinction you intended between these terms?
Unfortunately, this seems almost opposite to the way I was defining the terminology. I had it that the aliefs are precisely what is proven by the formal system, and the beliefs are what the agent would explicitly endorse if asked. Aliefs are what you feel in your bones. So if the “bones” of the agent are the formal system, that’s the aliefs.
Note also that your definition implies that if an agent alieves something, it must also believe it. In contrast, part of the point for me is that an agent can alieve things without believing them. I would also allow the opposite, for humans and other probabilistic reasoners, though for pure-logic agents this would have to correspond to unsoundness. But pure-logic agents must have the freedom to alieve without believing, on pain of inconsistency, even if we can’t model belief-without-alief in pure logic.
I find it interesting that you (seemingly) nodded along with my descriptions, but then proposed a definition which was almost opposite mine! I think there’s probably a deep reason for that (having to do with how difficult it is to reliably distinguish alief/belief), but I’m not grasping it for now. It is a symptom of my confusion in this regard that I’m not even sure we’re pointing to different notions of belief/alief even though your definition sounds almost opposite to me. It is well within the realm of possibility that we mean the same thing, and are just choosing very different ways to talk about it.
Specifically, your definition seems fine if L is not the formal language which the agent is hard-wired with, but rather, some logic which the agent explicitly endorses (like the relationship that we have with Peano Arithmetic). Then, yeah, “belief” is about provability in L, while “alief” implies that the agent has some “additional assumptions about the bearing of that reasoning on the environment”. Totally! But then, this suggests that those additional assumptions are somehow represented in some other subsystem of the agent (outside of L). The logic of that other subsystem is what I’m interested in. If that other subsystem uses L’, then it makes sense that the agent explicitly believes L. But now the aliefs of the agent are represented by L’. That is: L’ is the logic within the agent’s bones. So it’s L’ that I’m talking about when I define “aliefs” as the axiomatic system, and “beliefs” as more complicated (opposite to your definition).
Over this discussion, a possible interpretation of what you’re saying that’s been in the back of my mind has been that you think agents should not rely on formal logic in their bones, but rather, should use formal logic only as part of their explicit thinking IE a form of thinking which other, “more basic” reasoning systems use as a tool (a tool which they can choose to endorse or not). For example, you might believe that an RL system decides whether to employ logical reasoning. Or deep learning. Etc. In this case, you might say that there is no L’ to find (no logical system represents the aliefs).
To be clear, I do think this is a kind of legitimate response to the Lobstacle: the response of rejecting logic (as a tool for understanding what’s going on in an agent’s “bones” IE their basic decision architecture). This approach says: “Look what happens when you try to make the agent overly reliant on logic! Don’t do that!”
However, the Lobstacle is specifically about using logic to describe or design decision-making procedures. So, this ‘solution’ will not be very satisfying for people trying to do that. The puzzling nature of the Lobstacle remains: the claim is that RL (or something) has to basically solve the problem; we can’t use logic. But why is this? Is it because agents have to “be irrational” at some level (ie, their basic systems can’t conform to the normative content of logic)?
Anyway, this may or may not resemble your view. You haven’t explicitly come out and said anything like that, although it does seem like you think there should be a level beyond logic in some sense.
An immediate pedagogical problem with this terminology is that we have to be careful not to conflate this notion of belief with the usual one: an agent will still be able to prove things in L even if it doesn’t believe (in the conventional sense) that the L-proof involved is valid.
It seems like you think this property is so important that it’s almost definitional, and so, a notion of belief which doesn’t satisfy it is in conflict with the usual notion of belief. I just don’t have this intuition. My notion of belief-in-contrast-to-alief does contrast with the informal notion, but I would emphasize other differences:
“belief” contrasts with the intuitive notion of belief in that it implies much less strongly that you’ll act on it; our intuitive notion more often predicts that you’ll act on something if you really believe it.
“alief” contrasts with the intuitive notion of belief in that it implies much less strongly that you’ll explicitly endorse something, or even be able to articulate it in language at all.
In other words, I see the intuitive notion of belief as really consisting of two parts, belief and alief, which are intuitively assumed to go together but which we are splitting apart here.
The property you mention is derived from this conflation, because in fact we need to alieve a reasoning process in order to believe its outputs; so if we’re conflating alief and belief, then it seems like we need to believe that L-proofs are valid in order to see L-proofs and (as a direct result) come to believe what’s been proved.
But this is precisely what’s at stake in making the belief/alief distinction: we want to point out that this isn’t a healthy relationship to have with logic (indeed, Gödel shows how it leads to inconsistency).
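To spell out that parenthetical, here is the standard argument, in provability-logic notation where □ abbreviates provability in the agent’s own system T (assumed to satisfy the usual derivability conditions):

```latex
% Löb's theorem: for any sentence A, if T proves (□A → A), then T proves A.
% So a consistent T cannot adopt its own full soundness schema (□X → X for all X):
% the instance X = ⊥ would give T ⊢ □⊥ → ⊥, and Löb (with A = ⊥) then yields
% T ⊢ ⊥, i.e. inconsistency.
\begin{align*}
  T \vdash \Box A \rightarrow A \;&\Longrightarrow\; T \vdash A
      && \text{(Löb)} \\
  T \vdash \Box \bot \rightarrow \bot \;&\Longrightarrow\; T \vdash \bot
      && \text{(instance } A = \bot \text{)}
\end{align*}
```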
There is a more serious formalization issue at play, though, which is the problem of expressing a negative alief. How does one express that an agent “does not alieve that a proof of X in L implies that X is true”? ¬(□X→X) is classically equivalent to □X∧¬X, which in particular is an assertion of both the existence of a proof of X and the falsehood of X, which is clearly far stronger than the intended claim. This is going off on a tangent, so for now I will just assume that it is possible to express disaliefs by introducing some extra operators in L’ and get on with it.
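For reference, the classical step being invoked is just the propositional fact that a negated implication affirms its antecedent and denies its consequent:

```latex
% Unfolding the implication and applying De Morgan:
\[
  \neg(\Box X \rightarrow X)
    \;\equiv\; \neg(\neg\Box X \lor X)
    \;\equiv\; \Box X \land \neg X
\]
```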
I like that you pointed this out, but yeah, it doesn’t seem especially important to our discussion. In any case, I would define disalief as something like this:
The agent’s mental architecture lacks a deduction rule accepting proofs in L, or anything tantamount to such a rule.
Note that the agent might also disbelieve in such a rule, IE, might expect some such proofs to have false conclusions. But this is, of course, not necessary. In particular it would not be true in the case of logics which the agent explicitly endorses (and therefore must not alieve).
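As a purely illustrative rendering (my notation, not something fixed earlier in this thread): the architecture L’ would alieve L-proofs if it contained something like the reflection rule below, so disalief is the absence of this rule and of anything tantamount to it. Note that this is a rule of inference, which is in general weaker than adopting the soundness schema □_L φ → φ as an axiom.

```latex
% Reflection rule for the fragment L, as a candidate rule of the architecture L'
% (illustrative notation): from an L-proof of φ, conclude φ.
\[
  \frac{\;\vdash_{L} \varphi\;}{\;\vdash_{L'} \varphi\;}
\]
```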
Yes. The mathematical proofs and Löb’s theorem are absolutely fine. What I’m refuting is their relevance; specifically the validity of this claim:
An agent can only trust another agent if it believes that agent’s aliefs.
My position is that *when their beliefs are sound* an agent only ever needs to *alieve* another agent’s *beliefs* in order to trust them.
Hrm. I suspect language has failed us, and we need to make some more distinctions (but I am not sure what they are).
In my terms, if A alieves B’s beliefs, then (for example) if B explicitly endorses ZFC, then A must be using ZFC “in its bones”. But if B explicitly endorses ZFC, then the logic which B is using must be more powerful than that. So A might be logically weaker than B (if A only alieves ZFC, and not the stronger system which B used to decide that ZFC is sound). If so, A cannot trust B (A does not alieve that B’s reasoning is sound, that is, A does not believe B).
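To make that concrete (illustrative notation only: Bel/Alf for the belief and alief systems, and Sound(ZFC) for some formalization of “every ZFC-theorem is true”, which in particular implies Con(ZFC)):

```latex
% B explicitly endorses ZFC, so B's "bones" must at least assert its soundness;
% A alieves only B's beliefs, i.e. ZFC itself.
\begin{align*}
  \mathrm{Bel}_B &= \mathrm{ZFC} \\
  \mathrm{Alf}_B &\supseteq \mathrm{ZFC} + \mathrm{Sound}(\mathrm{ZFC}) \\
  \mathrm{Alf}_A &= \mathrm{ZFC}
\end{align*}
% Since Sound(ZFC) implies Con(ZFC), and ZFC does not prove Con(ZFC)
% (Gödel II, assuming ZFC is consistent), A's aliefs cannot establish the
% soundness of B's aliefs: A does not alieve that B's reasoning is sound.
```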
I have to confess that I’m confusing myself a bit, and am tempted to give yet another (different, incompatible) definition for the alief/belief split. I’ll hold off for now, but I hope it’s very clear that I’m not attributing all the confusion about this to you—I’m aiming to minimize confusion, but I still worry that I’ve introduced contradictory ideas into this conversation. (I’m actually tempted to start distinguishing 3 levels rather than 2! Alief/belief are relative terms, and I worry we’re actually conflating important subtleties by using only 2 terms rather than having multiple levels...)
A definition of trust which fails to be reflexive is clearly a bad definition, and with this modified definition there is no obstacle.
This goes back to the idea that you seem to think “belief in X implies belief in the processes whereby you came to believe in X” is so important as to be definitional, where I think this property has to be a by-product of other things.
In my understanding, the definition of “trust” should not explicitly allow or disallow this, if we’re going to be true to what “trust” means. Rather, for the Lobstacle, “A trusts B” has to be defined as “A willingly relies on B to perform mission-critical tasks”. This definition does indeed fail to be true for naive logical agents. But this should be an argument against naive logical agents, not our notion of trust.
Hence my perception that you do indeed have to question the theorems themselves, in order to dispute their “relevance” to the situation. The definition of trust seems fixed in place to me; indeed, I would instead have to question the relevance of your alternative definition, since what I actually want is the thing studied in the paper (IE being able to delegate critical tasks to another agent).
Note that following the construction in the article, the secondary agent B can only act on the basis of a valid L-proof, so there is no need to distinguish between trusting what B says (the L-proofs B produces) and what B does (their subsequent action upon producing an L-proof).
Ok, but if agent B can only act on valid L-proofs, it seems like agent B has been given a frontal lobotomy (IE this is just the “make sure my future self is dumber” style of solution to the problem).
Or, on the other hand, if the agent A also respects this same restriction, then A cannot delegate tasks to B (because A can’t prove that it’s OK to do so, at least not in L, the logic which it has been restricted to use when it comes to deciding how to act).
Which puts us back in the same lobstacle.
Attaining this guarantee in practice, so as to be able to trust that B will do what they have promised to do, is a separate but important problem. In general, the above notion of trust will only apply to what another agent says, or more precisely to the proofs they produce.
Is this a crux for you? My thinking is that this is going to be a deadly sticking point. It seems like you’re admitting that your approach has this problem, but, you think there’s value in what you’ve done so far because you’ve solved one part of the problem and you think this other part could also work with time. Is that what you’re intending to say? Whereas to me, it looks like this other part is just doomed to fail, so I don’t see what the value in your proposal could be.
For me, solving the Lobstacle means being able to actually decide to delegate.
I was taking “reasoning” here to mean “applying the logic L” (so manipulating statements of belief), since any assumptions lying strictly in L’ are only applied passively. It feels strange to me to extend “reasoning” to include this implicit stuff, even if we are including it in our formal model of the agent’s behaviour.
I think I get what you’re saying here. But if the assumptions lying strictly in L’ are only applied passively, how does it help us? I’m thinking your current answer is (as you’ve already said) “trusting that B will do what they’ve said they’ll do is a separate problem”—IE you aren’t even trying to build the full bridge between B thinking something is a good idea and A trusting B with such tasks.
Both A and B “reason” in L’ (B could even work in a distinct extension of L), but will only accept proofs in the fragment L.
[...]
But then, your bolded statement seems to just be a re-statement of the Löbstacle: logical agents can’t explicitly endorse their own logic L’ which they use to reason, but rather, can only generally accept reasoning in some weaker fragment L.
It’s certainly a restatement of Löb’s theorem. My assertion is that there is no resultant obstacle.
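For concreteness, one standard way to cash out that asymmetry (again assuming the usual derivability conditions; True_ℕ here is the set-theoretic truth definition for arithmetic sentences):

```latex
% By Löb, a consistent L' endorses its own proof predicate only trivially:
%   L' ⊢ Prov_{L'}(φ) → φ   holds only for those φ that L' already proves.
% But L' can prove the full soundness of a strictly weaker fragment L; e.g. ZFC
% proves that every theorem of PA is true, via a truth definition for arithmetic:
\[
  \mathrm{ZFC} \vdash \forall \varphi \,\bigl(\mathrm{Prov}_{\mathrm{PA}}(\varphi)
    \rightarrow \mathrm{True}_{\mathbb{N}}(\varphi)\bigr)
\]
```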
I still really, really don’t get why your language is stuff like “there’s no resultant obstacle” and “what I’m refuting is their relevance”. Your implication is “there was not a problem to begin with” rather than “I have solved the problem”. I asked whether you objected to details of the math in the original paper, and you said no—so apparently you would agree with the result that naive logical agents fail to trust their future self (which is the lobstacle!). Solving the lobstacle would presumably involve providing an improved agent design which would avoid the problem. Yet this seems like it’s not what you want to do—instead you claim something else, along the lines of claiming that there’s not actually a problem?
So, basically, what do you claim to accomplish? I suspect I’m still really misunderstanding that part of it.
Re the rest,
(And I also don’t yet see what that part has to do with getting around the Löbstacle.)
It’s not relevant to getting around the Löbstacle; this part of the discussion was the result of me proposing a possible advantage of the perspective shift which (I believe, but have yet to fully convince you) resolves the Löbstacle. I agree that this part is distracting, but it’s also interesting, so please direct message me (via whatever means is available on LW, or by finding me on the MIRIx Discord server or AI alignment Slack) if you have time to discuss it some more.
Since you’re now identifying it as another part of the perspective shift which you’re trying to communicate (rather than just some technical distraction), it sounds like it might actually be pretty helpful toward me understanding what you’re trying to get at. But, there are already a lot of points floating around in this conversation, so maybe I’ll let it drop.
I’m somewhat curious if you think you’ve communicated your perspective shift to any other person; so far, I’m like “there just doesn’t seem to be anything real here”, but maybe there are other people who get what you’re trying to say?
Note also that your definition implies that if an agent alieves something, it must also believe it.
I find it interesting that you (seemingly) nodded along with my descriptions, but then proposed a definition which was almost opposite mine!
I don’t know how you so misread what I said; I explicitly wrote that aliefs constitute the larger logic, so that beliefs are contained in aliefs (which I’m pretty sure is what you were going for!) and not vice versa. Maybe you got confused because I put beliefs first in this description, or because I described the smaller of the two logics as the “reasoning engine” (for the reason I subsequently provided)? What you said almost convinced me that our definitions actually align, until I reached the point where you said that beliefs could be “more complicated” than aliefs, which made me unsure.
Anyway, since you keep taking the time to thoroughly reply in good faith, I’ll do my best to clarify and address some of the rest of what you’ve said. However, thanks to the discussion we’ve had so far, a more formal presentation of my ideas is crystallizing in my mind; I prefer to save that for another proper post, since I anticipate it will involve rejigging the terminology again, and I don’t want to muddy the waters further!
Rather, for the Lobstacle, “A trusts B” has to be defined as “A willingly relies on B to perform mission-critical tasks”. This definition does indeed fail to be true for naive logical agents. But this should be an argument against naive logical agents, not our notion of trust.
Hence my perception that you do indeed have to question the theorems themselves, in order to dispute their “relevance” to the situation. The definition of trust seems fixed in place to me; indeed, I would instead have to question the relevance of your alternative definition, since what I actually want is the thing studied in the paper (IE being able to delegate critical tasks to another agent).
Perhaps we have different intuitive notions of trust, since I certainly trust myself (to perform “mission-critical tasks”), at least as far as my own logical reasoning is concerned, and an agent that doesn’t trust itself is going to waste a lot of time second-guessing its own actions. So I don’t think you’ve addressed my argument that the definition of trust that leads to the Löbstacle is faulty because it fails to be reflexive.
Attaining this guarantee in practice, so as to be able to trust that B will do what they have promised to do, is a separate but important problem. In general, the above notion of trust will only apply to what another agent says, or more precisely to the proofs they produce.
Is this a crux for you? My thinking is that this is going to be a deadly sticking point. It seems like you’re admitting that your approach has this problem, but, you think there’s value in what you’ve done so far because you’ve solved one part of the problem and you think this other part could also work with time. Is that what you’re intending to say? Whereas to me, it looks like this other part is just doomed to fail, so I don’t see what the value in your proposal could be. For me, solving the Lobstacle means being able to actually decide to delegate.
There are two separate issues here, and this response makes it apparent that you are conflating them. The fact that the second agent in the original Löbstacle paper is constrained to act only once it has produced a provably effective strategy and is constrained to follow that strategy means that the Löbstacle only applies to questions concerning the (formal) reasoning of a subordinate agent. Whether or not I manage to convince you that the Löbstacle doesn’t exist (because it’s founded on an untenable definition of trust), you have to acknowledge that the argument as presented there doesn’t address the following second problem. Suppose I can guarantee that my subordinate uses reasoning that I believe to be valid. How can I guarantee that it will act on that reasoning in a way I approve of? This is (obviously) a rather general version of the alignment problem. If you’re claiming that Löb’s theorem has a bearing on this, then that would be big news, especially if it vindicates your opinion that it is “doomed to fail”.
The reason I see my post as progress is that the Löbstacle is currently blocking serious research into using simple systems of formal agents to investigate important problems such as the alignment problem.
Your implication is “there was not a problem to begin with” rather than “I have solved the problem”. I asked whether you objected to details of the math in the original paper, and you said no—so apparently you would agree with the result that naive logical agents fail to trust their future self (which is the lobstacle!).
Taking the revised definition of trust I described, that last sentence is no longer the content of any formal mathematical result in that paper, so I do not agree with it, and I stand by what I said.
Indeed, my claim boils down to saying that there is no problem. But I don’t see why that doesn’t constitute a solution to the apparent problem. It’s like the Missing Dollar Riddle; explaining why there’s no problem is the solution.
I’m somewhat curious if you think you’ve communicated your perspective shift to any other person; so far, I’m like “there just doesn’t seem to be anything real here”, but maybe there are other people who get what you’re trying to say?
There’s no real way for me to know. Everyone who I’ve spoken to about this in person has gotten it, but that only amounts to a handful of people. It’s hard to find an audience; I hoped LW would supply one, but so far it seems not. Hopefully a more formal presentation will improve the situation.
Anyway, since you keep taking the time to thoroughly reply in good faith, I’ll do my best to clarify and address some of the rest of what you’ve said. However, thanks to the discussion we’ve had so far, a more formal presentation of my ideas is crystallizing in my mind; I prefer to save that for another proper post, since I anticipate it will involve rejigging the terminology again, and I don’t want to muddy the waters further!
Looks like I forgot about this discussion! Did you post a more formal treatment?
I don’t know how you so misread what I said; I explicitly wrote that aliefs constitute the larger logic, so that beliefs are contained in aliefs (which I’m pretty sure is what you were going for!) and not vice versa. Maybe you got confused because I put beliefs first in this description, or because I described the smaller of the two logics as the “reasoning engine” (for the reason I subsequently provided)? What you said almost convinced me that our definitions actually align, until I reached the point where you said that beliefs could be “more complicated” than aliefs, which made me unsure.
Sorry for the confusion here! I haven’t re-oriented myself to the whole context, but it sounds like I did invent a big disagreement that didn’t exist. This has to do with my continued confusion about your approach. But in retrospect I do think your early accusation that I was insisting on some rigid assumptions holds water; I needed to go a bit further afield to try and interpret what you were getting at.
Whether or not I manage to convince you that the Löbstacle doesn’t exist (because it’s founded on an untenable definition of trust), you have to acknowledge that the argument as presented there doesn’t address the following second problem.
Again, I haven’t yet understood your approach or even re-read the whole conversation here, but it now seems to me that I was doing something wrong and silly by insisting on a definition of trust that forces the Löbstacle. The original paper is careful to only state that Löb naively seems to present an obstacle, not that it really truly does so. It looks to me like I was repeatedly stubborn on this point in an unproductive way.