This is a confusing post from my perspective, because I think of LI as being about beliefs and corrigibility being about desires.
If I want my AGI to believe that the sky is green, I guess it’s good if it’s possible to do that. But it’s kinda weird, and not a central example of corrigibility.
Admittedly, one can try to squish beliefs and desires into the same framework. The Active Inference people do that. Does LI do that too? If so, well, I’m generally very skeptical of attempts to do that kind of thing. See here, especially Section 7. In the case of humans, it’s perfectly possible for a plan to seem desirable but not plausible, or for a plan to seem plausible but not desirable. I think there are very good reasons that our brains are set up that way.
Admittedly, one can try to squish beliefs and desires into the same framework. The Active Inference people do that. Does LI do that too?
No. LI defines a notion of logically uncertain variable, which can be used to represent desires. There are also other ways one could build agents out of LI, such as doing the active inference thing.
As I mentioned in the post, I’m agnostic about such things here. We could be building """purely epistemic""" AI out of LI, or we could be deliberately building agents. It doesn’t matter very much, in part because we don’t have a good notion of “purely epistemic”.
Any learning system with a sufficiently rich hypothesis space can potentially learn to behave agentically (whether we want it to or not, until we have anti-inner-optimizer tech), so we should still have corrigibility concerns about such systems.
In my view, beliefs are a type of decision (not because we smoosh beliefs and values together, but rather because beliefs can have impacts on the world if the world looks at them), which means we should have agentic concerns about beliefs.
Also, it is easy for end users to build agentlike things out of belieflike things by making queries about how to accomplish things. Thus, we need to train epistemic systems to be responsible about how such queries are answered (as is already apparent in existing chatbots).
How about “purely epistemic” means “updated by self-supervised learning”, i.e. the updates (gradients, trader bankrolls, whatever) are derived from “things being true vs false” as opposed to “things being good vs bad”. Right?
[I learned the term teleosemantics from you! :) ]
The original LI paper was in that category, IIUC. The updates (to which traders had more vs less money) are derived from mathematical propositions being true vs false.
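(To make the distinction concrete, here’s a toy sketch. This is emphatically not the actual LI construction, just a cartoon in which the only thing that ever moves bankrolls is propositions resolving true vs false; all the names and numbers are made up:)

```python
import random

# Cartoon of "updates derived from true vs false": two traders hold credences
# about a class of propositions; each round a proposition resolves, and the
# only thing that moves bankrolls is whether it resolved true or false.
traders = {"optimist": 1.0, "pessimist": 1.0}    # bankrolls
credence = {"optimist": 0.9, "pessimist": 0.2}   # each trader's P(true)

for step in range(1000):
    truth = random.random() < 0.7                # propositions here resolve true 70% of the time
    for name in traders:
        stake = 0.1 * traders[name]              # bet a fixed fraction of the bankroll
        # Long "true" if credence > 0.5, short otherwise; the payoff depends
        # only on the truth value, never on any notion of "good vs bad".
        traders[name] += stake * (2 * credence[name] - 1) * (1 if truth else -1)

print(traders)   # with these made-up numbers, "optimist" gains and "pessimist" dwindles
```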
LI defines a notion of logically uncertain variable, which can be used to represent desires
I would say that they don’t really represent desires. They represent expectations about what’s going to happen, possibly including expectations about an AI’s own actions.
And then you can put the LI into a larger system that follows the rule: whatever the expectations are about the AI’s own actions, make that actually happen.
The important thing that changes in this situation is that the convergence of the algorithm is underdetermined—you can have multiple fixed points. I can expect to stand up, and then I stand up, and my expectation was validated. No update. I can expect to stay seated, and then I stay seated, and my expectation was validated. No update.
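(A minimal toy version of that underdetermination, with a single number standing in for the LI’s expectation about its own action; the learning-rate comment is just to show the update has nothing to push against:)

```python
# The wrapper enacts whatever the system expects itself to do, so any
# self-consistent expectation gives zero prediction error, and the
# self-supervised update has nothing to push against.
def total_error(expectation, steps=100):
    error = 0.0
    for _ in range(steps):
        action = expectation                  # rule: make the expectation actually happen
        error += abs(action - expectation)    # prediction error is 0 on every step
        # an update like  expectation += lr * (action - expectation)  would change nothing
    return error

print(total_error(1.0))   # "I expect to stand up"    -> 0.0, no update
print(total_error(0.0))   # "I expect to stay seated" -> 0.0, no update
```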
(I don’t think I’m saying anything you don’t already know well.)
Anyway, if you do that, then I guess you could say that the LI’s expectations “can be used” to represent desires … but I maintain that that’s a somewhat confused and unproductive way to think about what’s going on. If I intervene to change the LI variable, it would be analogous to changing habits (what do I expect myself to do ≈ which action plans seem most salient and natural), not analogous to changing desires.
(I think the human brain has a system vaguely like LI, and that it resolves the underdetermination by a separate valence system, which evaluates expectations as being good vs bad, and applies reinforcement learning to systematically seek out the good ones.)
beliefs can have impacts on the world if the world looks at them
…Indeed, what I said above is just a special case. Here’s something more general and elegant. You have the core LI system, plus some watcher system W, which reads off some vector of internal variables V of the core LI system; W then takes actions according to some function A(V).
After a while, the LI system will automatically catch onto what W is doing, and “learn” to interpret V as an expectation that A(V) is going to happen.
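(A one-variable cartoon of that, where a single scalar V stands in for both the internal variable W reads and the expectation the core system keeps updating; the threshold policy and learning rate are arbitrary:)

```python
def A(v):
    return 1.0 if v > 0.5 else 0.0       # the watcher W's policy: act based on V

V = 0.9                                  # internal variable that W peeks at
lr = 0.2
for _ in range(50):
    observation = A(V)                   # W reads V and acts; the world now reflects A(V)
    V += lr * (observation - V)          # true-vs-false style update toward what actually happened

print(V)   # converges to A(V) == 1.0: V ends up as a self-confirming expectation that A(V) happens
# (start V below 0.5 and it converges to 0.0 instead: the same underdetermination as before)
```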
I think the central case is that W is part of the larger AI system, as above, leading to normal agent-like behavior (assuming some sensible system for resolving the underdetermination). But in theory W could also be humans peeking into the LI system and taking actions based on what they see. Fundamentally, these aren’t that different.
So whatever solution we come up with to resolve the underdetermination, whether human-brain-like “valence” or something else, that solution ought to work for the humans-peeking-into-the-LI situation just as it works for the normal W-is-part-of-the-larger-AI situation.
(But maybe weird things would happen before convergence. And also, if you don’t have any system at all to resolve the underdetermination, then probably the results would be weird and hard to reason about.)
Also, it is easy for end users to build agentlike things out of belieflike things by making queries about how to accomplish things. Thus, we need to train epistemic systems to be responsible about how such queries are answered (as is already apparent in existing chatbots).
I’m not sure that this is coming from a coherent threat model (or else I don’t follow).
If Dr. Evil trains his own AGI, then this whole thing is moot, because he wants the AGI to have accurate beliefs about bioweapons.
If Benevolent Bob trains the AGI and gives API access to Dr. Evil, then Bob can design the AGI to (1) have accurate beliefs about bioweapons, and (2) not answer Dr. Evil’s questions about bioweapons. That might ideally look like what we’re used to in the human world: the AGI says things because it wants to say those things, all things considered, and it doesn’t want Dr. Evil to build bioweapons, either directly or because it’s guessing what Bob would want.
How about “purely epistemic” means “updated by self-supervised learning”, i.e. the updates (gradients, trader bankrolls, whatever) are derived from “things being true vs false” as opposed to “things being good vs bad”. Right?
You can define it that way, but then I don’t think it’s highly relevant for this context.
The story I’m telling here is that partial feedback (typically: learning some sort of input-output relation via some sort of latents) always leaves us with undesired hypotheses which we can’t rule out using the restricted feedback mechanism.
Reinforcement Learning cannot rule out the wireheading hypothesis or human-manipulation hypothesis.
Updating on things being true or false cannot rule out agentic hypotheses (the inner optimizer problem).
Any sufficiently rich hypothesis space has agentic policies, which can’t be ruled out by the feedback. “Purely epistemic” in your sense filters for hypotheses which make good predictions, but this doesn’t constrain things to be non-agentic. The system can learn to use predictions as actions in some way.
I think it would be fair to define a teleosemantic notion of “purely epistemic” as something like “there is no optimization (anywhere in the system—‘inner’ or ‘outer’) except optimization for epistemic accuracy”.
The obvious application of my main point is that some form of “complete feedback” is a necessary (but insufficient) condition for this.
“Epistemic accuracy” here has to be defined in such a way as to capture the one-way “direction-of-fit” optimization of the map to fit the territory, but never the territory to fit the map. IE the optimization algorithm has to ignore the causal impact of its predictions.
However, I don’t particularly endorse this as the correct design choice—although a system with this property would be relatively safe in the sense of eliminating inner-alignment concerns and (in a sense) outer-alignment concerns, it is doing so by ignoring its impact on the world, which creates its own set of dangers. If such a system were widely deployed and became highly trusted for its predictions, it could stumble into bad self-fulfilling prophecies.
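(A made-up bank-run-style example of that worry: the world reacts to the published forecast, so the “panic” forecast is every bit as accurate as the “calm” one, and accuracy alone gives the system no reason to prefer the benign fixed point:)

```python
def actual_failure_prob(published_forecast):
    # Hypothetical feedback: a high published forecast triggers a run, a low one does not.
    return 0.95 if published_forecast > 0.5 else 0.05

for forecast in (0.05, 0.95):
    print(forecast, actual_failure_prob(forecast) == forecast)
    # both 0.05 and 0.95 are self-fulfilling, hence both perfectly "accurate"
```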
So, in my view, “epistemic” systems should be as transparent as possible with human users about possible multiple-fixed-point issues, and should try to keep humans in the loop and give the important decisions to humans; but ultimately, we need to view even “purely epistemic” systems as making some important (instrumental) decisions, and have them take some responsibility for making those decisions well instead of poorly.
The original LI paper was in that category, IIUC. The updates (to which traders had more vs less money) are derived from mathematical propositions being true vs false.
I was going to remind you that the paper didn’t say how the fixed-point selection works, and we can do that part in an agentic way, but then you go on to say basically the same thing (with the caveat that where you say “put the LI into a larger system that follows the rule: whatever the expectations are about the AI’s own actions, make that actually happen” I would say the more general “put the LI into an environment which somehow reacts to its predictions”):
LI defines a notion of logically uncertain variable, which can be used to represent desires
I would say that they don’t really represent desires. They represent expectations about what’s going to happen, possibly including expectations about an AI’s own actions.
And then you can put the LI into a larger system that follows the rule: whatever the expectations are about the AI’s own actions, make that actually happen.
The important thing that changes in this situation is that the convergence of the algorithm is underdetermined—you can have multiple fixed points. I can expect to stand up, and then I stand up, and my expectation was validated. No update. I can expect to stay seated, and then I stay seated, and my expectation was validated. No update.
(I don’t think I’m saying anything you don’t already know well.)
Anyway, if you do that, then I guess you could say that the LI’s expectations “can be used” to represent desires … but I maintain that that’s a somewhat confused and unproductive way to think about what’s going on. If I intervene to change the LI variable, it would be analogous to changing habits (what do I expect myself to do ≈ which action plans seem most salient and natural), not analogous to changing desires.
(I think the human brain has a system vaguely like LI, and that it resolves the underdetermination by a separate valence system, which evaluates expectations as being good vs bad, and applies reinforcement learning to systematically seek out the good ones.)
I don’t understand what you’re trying to accomplish in these paragraphs. To me you sound sorta like Bob in the following:
Alice: Here’s my computer model of an agent.
Bob: Uh oh, that sounds sort of like active inference. How did you represent values? Did you confuse them with beliefs?
Alice: I used floating-point numbers to represent the expected value of a state. Here, look at my code. It’s a Q-learning algorithm.
Bob: You realize that “expected value” is a statistics thing, right? That makes it epistemic, not really value-laden in the sense that makes something agentic. It’s a prediction of what a number will be. Indeed, we can justify expected values as min-quadratic-loss estimates. That makes them epistemic!
Alice: Well, I agree that expected values aren’t automatically “values” in the agentic sense, but look, my code can solve mazes and stuff—like a rat learning to get cheese.
Bob: Of course I agree that it can be used in an instrumental way, but that’s a really misleading way to describe it overall, right? If you changed one Q-value estimated by the system, that would be analogous to changing habits, not desires, right?
Alice: um??? If we agree that it can be used in an instrumental way, then what are you saying is misleading?
Bob: I mean, sure, the human brain does something like this.
Alice: Ok??
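(For concreteness, the kind of code I imagine Alice pointing at is something like the minimal tabular Q-learning sketch below; the corridor “maze”, the reward, and all the constants are hypothetical, just for illustration:)

```python
import random

# Tiny corridor "maze": states 0..4, start at 0, cheese at state 4.
N_STATES, ACTIONS = 5, (+1, -1)                 # move right / move left
Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
alpha, gamma, eps = 0.5, 0.9, 0.1

for episode in range(500):
    s = 0
    while s != N_STATES - 1:
        # epsilon-greedy action selection over the Q-value estimates
        a = random.choice(ACTIONS) if random.random() < eps else max(ACTIONS, key=lambda x: Q[(s, x)])
        s_next = min(max(s + a, 0), N_STATES - 1)
        r = 1.0 if s_next == N_STATES - 1 else 0.0      # cheese!
        # Each Q(s,a) is an *expectation* (a prediction of discounted reward),
        # yet the overall system behaves instrumentally.
        Q[(s, a)] += alpha * (r + gamma * max(Q[(s_next, b)] for b in ACTIONS) - Q[(s, a)])
        s = s_next

print([max(ACTIONS, key=lambda x: Q[(s, x)]) for s in range(N_STATES - 1)])   # learned policy: go right everywhere
```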
It seems possible that you think we have some disagreement that we don’t have?
Also, it is easy for end users to build agentlike things out of belieflike things by making queries about how to accomplish things. Thus, we need to train epistemic systems to be responsible about how such queries are answered (as is already apparent in existing chatbots).
I’m not sure that this is coming from a coherent threat model (or else I don’t follow).
If Dr. Evil trains his own AGI, then this whole thing is moot, because he wants the AGI to have accurate beliefs about bioweapons.
If Benevolent Bob trains the AGI and gives API access to Dr. Evil, then Bob can design the AGI to (1) have accurate beliefs about bioweapons, and (2) not answer Dr. Evil’s questions about bioweapons. That might ideally look like what we’re used to in the human world: the AGI says things because it wants to say those things, all things considered, and it doesn’t want Dr. Evil to build bioweapons, either directly or because it’s guessing what Bob would want.
I’m not clear on what you’re trying to disagree with here. It sounds like we both agree that if Benevolent Bob builds a powerful “purely epistemic system” (by whatever definition), without limiting its knowledge, then Dr. Evil can misuse it; and we both agree that as a consequence of this, it makes sense to instead build some agency into the system, so that the system can decide not to give users dangerous information.
Possibly you disagree with the claim “it is easy to build agentlike things out of belieflike things”? What I have in mind is a powerful epistemic oracle. As a simple example, let’s say it can give highly accurate guesses to mathematically-posed problems. Then Dr. Evil can implement AIXI by feeding in AIXI’s mathematical definition, for example. This is the sort of thing I had in mind, but generalized to the nonmathematical case. (EG, “conditional on my owning a super-powerful death ray soon, what actions do I take now”)
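(In code, the wrapper Dr. Evil needs is trivial. `ToyOracle` below is a made-up stand-in for whatever powerful conditional predictor he has API access to; the point is that the oracle part never does anything but answer queries, and agency appears anyway:)

```python
class ToyOracle:
    """Stand-in for a powerful conditional predictor; the real concern is an
    oracle vastly more capable than this hard-coded toy."""
    def probability(self, outcome, situation, action):
        guesses = {"acquire materials": 0.3, "do nothing": 0.01}   # made-up numbers
        return guesses.get(action, 0.05)

def agent_from_oracle(oracle, goal, candidate_actions):
    # Pure queries in, instrumental behavior out.
    def act(situation):
        scores = {a: oracle.probability(goal, situation, a) for a in candidate_actions}
        return max(scores, key=scores.get)
    return act

act = agent_from_oracle(ToyOracle(), goal="death ray built",
                        candidate_actions=["acquire materials", "do nothing"])
print(act("current situation"))   # -> "acquire materials"
```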
Hmm, I think the point I’m trying to make is: it’s dicey to have a system S that’s being continually modified to systematically reduce some loss L, but then we intervene to edit S in a way that increases L. We’re kinda fighting against the loss-reducing mechanism (be it gradient descent or bankroll-changes or whatever), hoping that the loss-reducing mechanism won’t find a “repair” that works around our interventions.
In that context, my presumption is that an AI will have some epistemic part S that’s continually modified to produce correct objective understanding of the world, including correct anticipation of the likely consequences of actions. The loss L for that part would probably be self-supervised learning, but could also include self-consistency or whatever.
And then I’m interpreting you (maybe not correctly?) as proposing that we should consider things like making the AI have objectively incorrect beliefs about (say) bioweapons, and I feel like that’s fighting against this L in that dicey way.
Whereas your Q-learning example doesn’t have any problem with fighting against a loss function, because Q(S,A) is consistently updated by the reward, and by nothing else.
The above is inapplicable to LLMs, I think. (And this seems tied IMO to the fact that LLMs can’t do great novel science yet etc.) But it does apply to FixDT.
Specifically, for things like FixDT, if there are multiple fixed points (e.g. I expect to stand up, and then I stand up, and thus the prediction was correct), then whatever process you use to privilege one fixed point over another, you’re not fighting against the above L (i.e., the “epistemic” loss L based on self-supervised learning and/or self-consistency or whatever). L is applying no force either way. It’s a wide-open degree of freedom.
(If your response is “L incentivizes fixed-points that make the world easier to predict”, then I don’t think that’s a correct description of what such a learning algorithm would do.)
So if your feedback proposal exclusively involves a mechanism for privileging one fixed point over another, then I have no complaints, and would describe it as choosing a utility function (preferences not beliefs) within the FixDT framework.
Btw I think we’re in agreement that there should be some mechanism privileging one fixed point over another, instead of ignoring it and just letting the underdetermined system do whatever it does.
Updating on things being true or false cannot rule out agentic hypotheses (the inner optimizer problem). … Any sufficiently rich hypothesis space has agentic policies, which can’t be ruled out by the feedback.
Oh, I want to set that problem aside because I don’t think you need an arbitrarily rich hypothesis space to get ASI. The agency comes from the whole AI system, not just the “epistemic” part, so the “epistemic” part can be selected from a limited model class, as opposed to running arbitrary computations etc. For example, the world model can be “just” a Bayes net, or whatever. We’ve talked about this before.
Reinforcement Learning cannot rule out the wireheading hypothesis or human-manipulation hypothesis.
I also learned the term observation-utility agents from you :) You don’t think that can solve those problems (in principle)?
I’m probably misunderstanding you here and elsewhere, but enjoying the chat, thanks :)