Note also that your definition implies that if an agent alieves something, it must also believe it.
I find it interesting that you (seemingly) nodded along with my descriptions, but then proposed a definition which was almost opposite mine!
I don’t know how you so misread what I said; I explicitly wrote that aliefs constitute the larger logic, so that beliefs are contained in aliefs (which I’m pretty sure is what you were going for!) and not vice versa. Maybe you got confused because I put beliefs first in this description, or because I described the smaller of the two logics as the “reasoning engine” (for the reason I subsequently provided)? What you said almost convinced me that our definitions actually align, until I reached the point where you said that beliefs could be “more complicated” than aliefs, which made me unsure.
Anyway, since you keep taking the time to thoroughly reply in good faith, I’ll do my best to clarify and address some of the rest of what you’ve said. However, thanks to the discussion we’ve had so far, a more formal presentation of my ideas is crystallizing in my mind; I prefer to save that for another proper post, since I anticipate it will involve rejigging the terminology again, and I don’t want to muddy the waters further!
Rather, for the Löbstacle, “A trusts B” has to be defined as “A willingly relies on B to perform mission-critical tasks”. This definition does indeed fail to hold for naive logical agents. But this should be an argument against naive logical agents, not against our notion of trust.
Hence my perception that you do indeed have to question the theorems themselves in order to dispute their “relevance” to the situation. The definition of trust seems fixed in place to me; indeed, I would instead have to question the relevance of your alternative definition, since what I actually want is the thing studied in the paper (i.e. being able to delegate critical tasks to another agent).
Perhaps we have different intuitive notions of trust, since I certainly trust myself (to perform “mission-critical tasks”), at least as far as my own logical reasoning is concerned, and an agent that doesn’t trust itself is going to waste a lot of time second-guessing its own actions. So I don’t think you’ve addressed my argument that the definition of trust that leads to the Löbstacle is faulty because it fails to be reflexive.
Attaining this guarantee in practice, so as to be able to trust that B will do what they have promised to do, is a separate but important problem. In general, the above notion of trust will only apply to what another agent says, or more precisely to the proofs they produce.
Is this a crux for you? My thinking is that this is going to be a deadly sticking point. It seems like you’re admitting that your approach has this problem, but you think there’s value in what you’ve done so far because you’ve solved one part of the problem and you think this other part could also work with time. Is that what you’re intending to say? Whereas to me, it looks like this other part is just doomed to fail, so I don’t see what the value in your proposal could be. For me, solving the Löbstacle means being able to actually decide to delegate.
There are two separate issues here, and this response makes it apparent that you are conflating them. The fact that the second agent in the original Löbstacle paper is constrained to act only once it has produced a provably effective strategy and is constrained to follow that strategy means that the Löbstacle only applies to questions concerning the (formal) reasoning of a subordinate agent. Whether or not I manage to convince you that the Löbstacle doesn’t exist (because it’s founded on an untenable definition of trust), you have to acknowledge that the argument as presented there doesn’t address the following second problem. Suppose I can guarantee that my subordinate uses reasoning that I believe to be valid. How can I guarantee that it will act on that reasoning in a way I approve of? This is (obviously) a rather general version of the alignment problem. If you’re claiming that Löb’s theorem has a bearing on this, then that would be big news, especially if it vindicates your opinion that it is “doomed to fail”.
The reason I see my post as progress is that the Löbstacle is currently blocking serious research into using simple systems of formal agents to investigate important problems such as the alignment problem.
Your implication is “there was not a problem to begin with” rather than “I have solved the problem”. I asked whether you objected to details of the math in the original paper, and you said no—so apparently you would agree with the result that naive logical agents fail to trust their future self (which is the Löbstacle!).
Taking the revised definition of trust I described, that last sentence is no longer the content of any formal mathematical result in that paper, so I do not agree with it, and I stand by what I said.
Indeed, my claim boils down to saying that there is no problem. But I don’t see why that doesn’t constitute a solution to the apparent problem. It’s like the Missing Dollar Riddle; explaining why there’s no problem is the solution.
I’m somewhat curious if you think you’ve communicated your perspective shift to any other person; so far, I’m like “there just doesn’t seem to be anything real here”, but maybe there are other people who get what you’re trying to say?
There’s no real way for me to know. Everyone who I’ve spoken to about this in person has gotten it, but that only amounts to a handful of people. It’s hard to find an audience; I hoped LW would supply one, but so far it seems not. Hopefully a more formal presentation will improve the situation.
Anyway, since you keep taking the time to thoroughly reply in good faith, I’ll do my best to clarify and address some of the rest of what you’ve said. However, thanks to the discussion we’ve had so far, a more formal presentation of my ideas is crystallizing in my mind; I prefer to save that for another proper post, since I anticipate it will involve rejigging the terminology again, and I don’t want to muddy the waters further!
Looks like I forgot about this discussion! Did you post a more formal treatment?
I don’t know how you so misread what I said; I explicitly wrote that aliefs constitute the larger logic, so that beliefs are contained in aliefs (which I’m pretty sure is what you were going for!) and not vice versa. Maybe you got confused because I put beliefs first in this description, or because I described the smaller of the two logics as the “reasoning engine” (for the reason I subsequently provided)? What you said almost convinced me that our definitions actually align, until I reached the point where you said that beliefs could be “more complicated” than aliefs, which made me unsure.
Sorry for the confusion here! I haven’t re-oriented myself to the whole context, but it sounds like I did invent a big disagreement that didn’t exist. This has to do with my continued confusion about your approach. But in retrospect I do think your early accusation that I was insisting on some rigid assumptions holds water; I needed to go a bit further afield to try and interpret what you were getting at.
Whether or not I manage to convince you that the Löbstacle doesn’t exist (because it’s founded on an untenable definition of trust), you have to acknowledge that the argument as presented there doesn’t address the following second problem.
Again, I haven’t yet understood your approach or even re-read the whole conversation here, but it now seems to me that I was doing something wrong and silly by insisting on a definition of trust that forces the Löbstacle. The original paper is careful to only state that Löb naively seems to present an obstacle, not that it really truly does so. It looks to me like I was repeatedly stubborn on this point in an unproductive way.
I have read almost all of this dialog, and my half-serious upshot is:
An agent A can’t prove that another agent B is correct in both its reasoning and its semantics, but that doesn’t matter, because it can’t trust its own reasoning to that degree either.
This glosses over a lot of details in the long and charitable comment thread above. I tried to get an overview of it with ChatGPT. I’m surprised how well that worked:
ChatGPT 5.2 extended thinking summary of the misunderstanding:
1) Shared formal background (what they both accept)
Let:
L be a recursively axiomatized theory strong enough for Löb-style self-reference.
ProvL(⌜φ⌝) be the standard provability predicate; write □Lφ:=ProvL(⌜φ⌝).
U be an “environment/world”.
S be a semantic interpretation mapping formulas of L to propositions about U. Morgan describes this as a semantic map S:Form(L)→C(U).
TrueS(φ) abbreviate “S(φ) holds in U”.
A natural “soundness schema relative to S” is:
Sound(L,S) := ∀φ (□Lφ→TrueS(φ)).
The Löbian obstacle setup (as Morgan summarizes it) is that a designer agent A wants to rely on proofs produced by a subordinate B, and this seems to demand something like a schema □Lφ→φ (or its intended-world analogue) for arbitrary φ, which is blocked by Löb-ish reasoning.
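For reference, a minimal sketch of the blocking step in standard provability-logic terms (nothing beyond the textbook statement of Löb’s theorem, using the □L abbreviation from above):

```latex
% Löb's theorem, for any recursively axiomatized L with enough self-reference:
%   if  L \vdash \Box_L \varphi \to \varphi,  then  L \vdash \varphi.
% So the reflection instance \Box_L \varphi \to \varphi is provable only for
% \varphi that L already proves outright. In particular, taking \varphi := \bot,
% a consistent L cannot prove \Box_L \bot \to \bot, i.e. its own consistency.
\[
  L \vdash \Box_L \varphi \to \varphi
  \;\Longrightarrow\;
  L \vdash \varphi
  \qquad\text{hence}\qquad
  L \nvdash \Box_L \varphi \to \varphi \ \text{ whenever } \ L \nvdash \varphi .
\]
```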
So far: aligned.
2) Where the models diverge: what counts as “solving” the obstacle
Demski’s implicit “success criterion” (formal delegation)
Demski treats “escape the obstacle” as: produce an agent design whose decision procedure can rationally delegate mission-critical tasks under something like the formal constraints studied in the Löbstacle literature.
A clean formalization of what he keeps asking for is:
There is an action rule πA such that A will take some action a after receiving some certificate from B,
and A’s normative warrant for that reliance is itself captured inside the agent’s formal reasoning (or at least specified as an explicit part of the algorithm), not as an unanalyzed extra assumption.
Hence his repeated demand: “what modification of the agent designs do you have in mind?” and later: “I still don’t understand how you propose for the agent to reason/behave differently.”
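To make that demand concrete, here is a toy sketch of the constraint as I read it (the names `Certificate`, `check_L_proof`, and `internal_warrant` are hypothetical illustrations, not anything from the paper or this thread): A acts on B’s output only if the proof checks out and the step from “provable in L” to “safe to act on” is itself licensed by A’s own formal layer, which is the piece Löb obstructs at full strength.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Certificate:
    """What B hands back: a formula and a purported L-proof of it."""
    formula: str
    proof: str

def lobstacle_constrained_policy(
    cert: Certificate,
    check_L_proof: Callable[[str, str], bool],   # does cert.proof really prove cert.formula in L?
    internal_warrant: Callable[[str], bool],     # does A's own theory license acting on this *proved* formula?
) -> Optional[str]:
    """Sketch of the Demski-style success criterion: act only when the
    reliance on B is justified inside A's own formal reasoning, not
    assumed from outside."""
    if not check_L_proof(cert.formula, cert.proof):
        return None   # invalid certificate: no action
    if not internal_warrant(cert.formula):
        return None   # no internally provable reflection step: no action either
    return f"take the action justified by {cert.formula}"
```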
Morgan’s “success criterion” (epistemic/interface diagnosis)
Morgan’s post is primarily: diagnose the source of the apparent paradox as a category error—treating a semantic claim (“my proofs track the world”) as if it were a syntactic theorem inside L.
His move is to relocate Sound(L,S) out of L and into a separate layer of “metalogical belief” that lives at the interface between proofs and world-interpretation. He says the agent need not (and cannot) prove the semantic mapping is sound; rather, the agent assumes/believes it when applying proofs to the world.
This is why he says “the agent doesn’t change its reasoning” (meaning: the proof calculus inside L is unchanged).
3) The first misunderstanding: “metalogical beliefs” as axioms vs as an external bridge
Demski hears:
“Allowing for metalogical beliefs immediately dissolves the obstacle.”
and translates it into the most standard formal move:
Build L′:=L+Sound(L,S)
Then he points out the familiar reflection hierarchy problem: L′ can at best endorse L, but not itself; so you get “trust weaker but not equal-strength” all over again. He explicitly frames this as “you just get a new system L′… which doesn’t believe its own soundness.” and distinguishes “L is sound” from “L plus me is sound.”
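The regress being invoked here is the standard iterated-reflection picture; a sketch in the thread’s own notation (glossing over how TrueS would actually be expressed in the language):

```latex
% Iterated reflection: each level endorses the one below, never itself.
%   L_0     := L
%   L_{n+1} := L_n + \mathrm{Sound}(L_n, S)
\[
  L_{n+1} \vdash \forall \varphi\,\big(\Box_{L_n}\varphi \to \mathrm{True}_S(\varphi)\big)
  \qquad\text{but}\qquad
  L_{n} \nvdash \forall \varphi\,\big(\Box_{L_n}\varphi \to \mathrm{True}_S(\varphi)\big),
\]
% so "trust weaker, never equal-strength" reappears at every level.
```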
Morgan, meanwhile, is not proposing “add Sound(L,S) as an axiom used inside the proof system that generates actions.” He keeps insisting the semantic map is “defined on L, not on L′,” and that for the delegation story you can restrict attention to accepting L-proofs.
So the mismatch is:
Demski’s formalization: metalogical belief ⇒ explicit axiom(s) in an extended theory that the agent uses for decisions.
Morgan’s formalization: metalogical belief ⇒ an extra-theoretic acceptance bridge “when I see an L-proof, I treat its conclusion as world-true under S”—not something asserted/proved within L.
Formally, Morgan is closer to specifying a two-layer agent (sketched in code below):
Object layer: derive □Lφ by proof search.
Bridge layer (not inside L): an acceptance rule Acc(φ) that (defeasibly) upgrades “□Lφ” into “act as if TrueS(φ).”
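A minimal sketch of that two-layer reading, under my own assumptions (the helpers `prove_in_L` and `defeaters` and the string return values are illustrative placeholders, not Morgan’s actual proposal): the point is just that Acc is applied outside the proof calculus and defeasibly, rather than being derived inside L.

```python
from typing import Callable, Optional

# --- Object layer: pure proof search inside L (the calculus is unchanged) ---
def object_layer(goal: str, prove_in_L: Callable[[str], Optional[str]]) -> Optional[str]:
    """Return an L-proof of `goal` if proof search finds one, else None."""
    return prove_in_L(goal)

# --- Bridge layer: the extra-theoretic acceptance rule Acc ---
def acc(goal: str, proof: Optional[str], defeaters: Callable[[str], bool]) -> bool:
    """Acc(φ): defeasibly upgrade '□_L φ' to 'act as if True_S(φ)'.
    Nothing here is proved inside L; it is a stance the agent adopts,
    and it can be withdrawn if a defeater shows up (e.g. evidence that
    the semantic map S is miscalibrated for this domain)."""
    return proof is not None and not defeaters(goal)

def two_layer_agent(goal: str,
                    prove_in_L: Callable[[str], Optional[str]],
                    defeaters: Callable[[str], bool]) -> str:
    proof = object_layer(goal, prove_in_L)
    if acc(goal, proof, defeaters):
        return f"act as if {goal} holds in the world"
    return "withhold action"
```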
Demski’s objection is then immediate: if Acc is where the action happens, you have changed the decision architecture; and unless Acc is itself formally grounded, you haven’t solved the original agent-design problem, only renamed it. This is exactly his “back pocket” complaint: “fine, I grant the agent ‘has’ the semantic map; how does it use it?”
4) The second misunderstanding: what “reasoning” denotes
They overload “reasoning” differently.
Morgan uses “reasoning” ≈ “applying L’s inference rules to formulas” and treats semantic application as a separate stage.
Demski uses “reasoning” ≈ “the whole cognitive pipeline that produces decisions,” so if semantics affects what you do, it’s part of the reasoning/agent design.
This is why Demski keeps hearing a contradiction in Morgan’s stance:
“No, the agent doesn’t change its reasoning… The change is that we don’t require A to prove that their semantic mapping is sound…”
Under Demski’s formal lens, “not requiring proof” is already a behavioral modification, because the original Löbstacle agent was (by assumption) constrained to act only when it could justify that constraint internally.
5) The third misunderstanding: “trust B” = trusting proofs vs trusting delegated action
Morgan tries to narrow “trust” to “trusting L-proofs”:
B can even be inconsistent internally, as long as it outputs valid L-proofs; then A just checks the proof and uses it.
Demski pushes back that the Löbstacle was about mission-critical delegation, not merely proof checking; restricting to a weaker fragment can amount to “make the delegate dumber,” which is the classic non-solution.
Formally, Demski is insisting on a property like:
A delegates ⟹ ∃ nontrivial tasks where B’s full cognition matters, not just L-proof output.
Morgan is (at least in that subthread) defending a weaker property:
A trusts B ⟹ A accepts any syntactically valid L-proof produced by B as action-relevant.
Those are different “trust” relations, and each is natural given a different target.
6) The “is/ought” mini-misfire is the same structural issue
Morgan phrases the bridge as normative:
if φ is provable in L, the interpretation of φ “ought to be true”.
Demski hears “ought” as moral/normative in the usual sense and suggests rewriting it as a prediction (“will probably be true”).
But structurally, this is the same dispute:
Morgan is describing Acc: a norm of acceptance/decision (“treat proofs as truth-tracking”).
Demski wants a descriptive link inside the formal model that would license that acceptance without handwaving.
7) The clean formal diagnosis: they are proving different theorems
If you compress all of the above into two “theorems” they are implicitly trying to establish:
Morgan’s implicit claim (interface theorem)
If an agent’s action rule is allowed to use an extra-logical acceptance bridge Acc that is not required to be provable inside L, then Löb’s prohibition “L can’t prove its own soundness schema” does not block the agent from in fact relying on L-proofs in the world (conditional on S being sound).
This is basically: “Löb blocks internal proof of Sound(L,S), not the practical stance of treating proofs as reliable.”
Demski’s implicit claim (agent-design theorem)
If we keep the Löbstacle-style constraint that delegation must be justified within the agent’s own formal apparatus (or equivalently, we forbid a free external Acc oracle), then Morgan’s move either (i) reappears as L′ and triggers the reflection hierarchy, or (ii) reduces to trusting only weaker fragments, which is not the desired kind of delegation.
So: Morgan is dissolving a philosophical inference (“therefore no agent can trust its reasoning”), while Demski is defending a formal obstacle statement about a particular class of reflective agent designs.
Both can be simultaneously right—because the quantifiers differ.
It gets the core point:
the Löbstacle was about mission-critical delegation, not merely proof checking; restricting to a weaker fragment can amount to “make the delegate dumber,” which is the classic non-solution.
It is a non-solution in the sense that it doesn’t let the subagent B run free on the grounds that it can be fully trusted. But Morgan’s move does seem to enable a safe kind of delegation. So practically, the two approaches come down to:
1. Rogers: Delegating computation to find checkable proofs
2. Demski: Delegating open-ended judgment and autonomous action
Clearly, 1 is weaker than 2. But we can’t get 2 anyway, so getting 1 seems like a win.
And maybe we can extend 1 into a full agent by wrapping B in a verifier. That would also nest for repeated delegation.
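As a sketch of what that wrapping might look like (my own illustration of the idea, with hypothetical names like `make_verified_delegate` and `check_L_proof`; neither party proposed this code): the wrapper only ever passes along conclusions whose proofs the caller’s own checker accepts, and because the wrapped delegate has the same interface as a bare one, the construction composes for chains like A → B → C.

```python
from typing import Callable, Optional, Tuple

Formula = str
Proof = str
# A delegate takes a task (a target formula) and may return (formula, L-proof).
Delegate = Callable[[Formula], Optional[Tuple[Formula, Proof]]]

def make_verified_delegate(
    delegate: Delegate,
    check_L_proof: Callable[[Formula, Proof], bool],
) -> Callable[[Formula], Optional[Formula]]:
    """Wrap an untrusted delegate so callers only ever see conclusions whose
    L-proofs passed the caller's own checker (delegation in sense 1 above)."""
    def verified(task: Formula) -> Optional[Formula]:
        result = delegate(task)
        if result is None:
            return None
        formula, proof = result
        return formula if check_L_proof(formula, proof) else None
    return verified

# Nesting: B can expose C through the same wrapper (using B's checker), and A
# wraps B with A's checker, so every hop is gated by a proof check rather than
# by open-ended trust in the delegate's cognition.
```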