Example of a self-aware system: A traditional RL agent. (Why? Because it has a special concept of “its own actions” represented in its models.)
Example of a self-unaware system: Any system that takes inputs, does a deterministic computation, and spits out an output. (Why? Because when you correctly compute a computable function, you get the same answer regardless of where and whether the computation is physically instantiated in the universe.)
A traditional RL agent absolutely could be a deterministic computation (modulo bugs in the code). It is a program that gets compiled into or run by machine instructions which follow a particular deterministic specification that Intel has written (if you’re running on Intel chips).
Also, this argument would prove that humans are not deterministic systems, which seems like it’s proving too much.
If you want to predict what’s going to happen in the world, it often helps if you know that you are a thing that affects the world.
For your three examples: The Solomonoff induction example is a weird case because it is an uncomputable algorithm that only has computable hypotheses, so it can’t be aware of itself, but your second and third examples seem like they totally could lead to self-aware systems. In fact, the third example sounds like a description of humans, and humans are self-aware.
Overall I don’t see how we could tell in advance whether a system would be self-unaware or not.
GPUs aren’t deterministic.
I mean, sure. Seems irrelevant to the point being made here.
If you’re objecting to the fact that I said a thing that was literally false but basically correct, I’ve changed “is” to “could be”.
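(For anyone wondering why that’s true: floating-point addition isn’t associative, so anything that lets the hardware reorder a reduction, which GPU kernels routinely do with parallel sums and atomics, can give slightly different answers from run to run. A quick illustration with ordinary Python floats rather than an actual GPU run:)

```python
import math, random

random.seed(0)
xs = [random.uniform(-1, 1) for _ in range(100_000)]

exact = math.fsum(xs)                 # correctly rounded reference sum
print(sum(xs) - exact)                # summed left to right
print(sum(sorted(xs)) - exact)        # summed in a different order: typically a different tiny error
print((0.1 + 0.2) + 0.3 == 0.1 + (0.2 + 0.3))   # False: grouping changes the rounded result
```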
For deterministic computation: What I was trying to get at is that a traditional RL agent does some computation, gets a new input based on its actions and environment, does some more computation, and so on. (I admit that I didn’t describe this well. I edited a bit.)
Your argument about Solomonoff induction is clever but I feel like it’s missing the point. Systems with some sense of self and self-understanding don’t generally simulate themselves or form perfect models of themselves; I know I don’t! Here’s a better statement: “I am a predictive world-model; I guess I’m probably implemented on some physical hardware somewhere.” This is a true statement, and the system can believe that statement without knowing what the physical hardware is (then it can start reasoning about what the physical hardware is, looking for news stories about AI projects). I’m proposing that we can and should build world-models that don’t contain this type of belief.
What I really have in mind is: There’s a large but finite space of computable predictive models (given a bit-stream, predict the next bit). We run a known algorithm that searches through this space to find the model that best fits the internet. This model is full of insightful, semantic information about the world, as this helps it make predictions. Maybe if we do it right, the best model would not be self-reflective, not knowing what it was doing as it did its predictive thing, and thus unable to reason about its internal processes or recognize causal connections between that and the world it sees (even if such connections are blatant).
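To make the shape of that concrete, here’s a minimal toy sketch; everything in it is made up for illustration (a tiny hand-written candidate space and a short bit-string standing in for “the internet”), not a real proposal. The point is just the type signature: enumerate candidate predictive models, score each by next-bit prediction, keep the best.

```python
# Toy version of "search a finite space of predictive models for the one
# that best fits a bit-stream". Candidates and data are illustrative only.

def always_zero(history):        # ignores the past entirely
    return 0

def repeat_last(history):        # predicts the previous bit will repeat
    return history[-1] if history else 0

def majority_so_far(history):    # predicts whichever bit has been most common
    return int(sum(history) * 2 > len(history))

CANDIDATE_MODELS = [always_zero, repeat_last, majority_so_far]

def score(model, stream):
    """Fraction of next-bit predictions the model gets right."""
    hits = sum(model(stream[:t]) == stream[t] for t in range(1, len(stream)))
    return hits / (len(stream) - 1)

def best_model(stream):
    return max(CANDIDATE_MODELS, key=lambda m: score(m, stream))

data = [0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1]   # stand-in for "the internet"
winner = best_model(data)
print(winner.__name__, round(score(winner, data), 3))
```

The hope is that the winner of a search like this, at a vastly larger scale, contains rich knowledge about the world without containing “and I am the thing doing the predicting.”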
One intuition is: An oracle is supposed to just answer questions. It’s not supposed to think through how its outputs will ultimately affect the world. So, one way of ensuring that it does what it’s supposed to do, is to design the oracle to not know that it is a thing that can affect the world.
Your argument about Solomonoff induction is clever but I feel like it’s missing the point.

I agree it’s missing the point. I do get the point, and I disagree with it: I wanted to say “all three cases will build self-models”; I couldn’t, because that may not be true for Solomonoff induction for an unrelated reason which, as you note, misses the point. I did claim that the other two cases would be self-aware as you define it.
(I agree that Solomonoff induction might build an approximate model of itself, idk.)
Maybe if we do it right, the best model would not be self-reflective, not knowing what it was doing as it did its predictive thing, and thus unable to reason about its internal processes or recognize causal connections between that and the world it sees (even if such connections are blatant).

My claim is that we have no idea how to do this, and I think the examples in your post would not do this.
One intuition is: An oracle is supposed to just answer questions. It’s not supposed to think through how its outputs will ultimately affect the world. So, one way of ensuring that it does what it’s supposed to do, is to design the oracle to not know that it is a thing that can affect the world.

I’m not disagreeing that if we could build a self-unaware oracle then we would be safe. That seems reasonably likely to fix agency issues (though I’d want to think about it more). My disagreement is with the premise of the argument, i.e. whether we can build self-unaware oracles at all.
On further reflection, you’re right, the Solomonoff induction example is not obvious. I put a correction in my post, thanks again.
I think we’re on the same page! As I noted at the top, this is a brainstorming post, and I don’t think my definitions are quite right, or that my arguments are airtight. The feedback from you and others has been super-helpful, and I’m taking that forward as I search for a more rigorous version of this, if it exists!! :-)
Overall I don’t see how we could tell in advance whether a system would be self-unaware or not.

A sufficient condition here should be a lack of feedback loops that include information about the agent. I’m not sure that this is necessary, though, and there may be some more lax criteria we could live with.
This is mostly theoretical, though, because it’s going to be very hard to create any system that is actually embedded in the world yet only ever presented with information that is not causally influenced by itself; indeed, that seems theoretically impossible. You might be able to achieve practically-good-enough acausality via some kind of data “scrubbing” procedure, but I don’t hold out much hope there, given how hard it is to achieve this even in narrow cases*.
*I speak from firsthand experience here. I used to work at a company that sold ad insights, which required collecting PII about users. We scrubbed the data for EU export, and we met the legal standard, but we later figured out that we could still triangulate any given customer’s identity from their anonymized data to within about 3 individuals on average.
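To illustrate why scrubbing is so hard, here’s a toy sketch with entirely made-up records (not our actual data or schema): even with direct identifiers removed, grouping on a few remaining quasi-identifiers often leaves “anonymity sets” of only a handful of people, which is all triangulation needs.

```python
from collections import Counter

# Hypothetical "scrubbed" records: names removed, but a few quasi-identifiers
# (zip prefix, birth year, device model) remain.
records = [
    {"zip3": "941", "birth_year": 1985, "device": "iPhone 11"},
    {"zip3": "941", "birth_year": 1985, "device": "iPhone 11"},
    {"zip3": "941", "birth_year": 1990, "device": "Pixel 3"},
    {"zip3": "100", "birth_year": 1972, "device": "Galaxy S9"},
    {"zip3": "100", "birth_year": 1972, "device": "Galaxy S9"},
    {"zip3": "100", "birth_year": 1972, "device": "Galaxy S9"},
]

def anonymity_set_sizes(rows, keys):
    """For each row, how many rows share its exact quasi-identifier combination."""
    counts = Counter(tuple(r[k] for k in keys) for r in rows)
    return [counts[tuple(r[k] for k in keys)] for r in rows]

sizes = anonymity_set_sizes(records, ["zip3", "birth_year", "device"])
print(sizes)                                # [2, 2, 1, 3, 3, 3]; a 1 means that row is unique
print(min(sizes), sum(sizes) / len(sizes))  # worst-case and average set size
```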
Agreed. Also agreed that this seems very difficult, both in theory and in practice.
Well, it takes two things: (1) self-knowledge (“I wrote ‘0’ into register X”, “I am thinking about turtles”, etc. being in the world-model) and (2) knowledge of the causal consequences of those things (the programmers see the 0 in register X and then change their behavior). With both of those, the system can learn causal links between its own decisions and the rest of the world, and can therefore effect real-world consequences.
Out of these two, I think you’re proposing to cut off path (2), which I agree is very challenging. I am proposing to cut off path (1) instead, and not worry about path (2). Thus it’s a cybersecurity-type hardware/software design challenge, not a data-sanitization challenge.
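A throwaway toy to make those two paths concrete (hypothetical setup, no real learning algorithm): the observation stream simply echoes the system’s own previous output, so both (1) and (2) are present in the environment. A predictor that can represent “what I just wrote” notices and exploits the causal link; one that can’t represent its own outputs sees only coin flips.

```python
import random

def run(steps: int, self_aware: bool, seed: int = 0) -> float:
    """Each step, the 'programmers' copy the system's previous output bit into
    its next observation: a blatant causal link from its decisions to its inputs."""
    rng = random.Random(seed)
    outputs, observations = [], []
    agree = 0      # how often observations have matched the system's previous output
    correct = 0
    for t in range(steps):
        obs = outputs[-1] if outputs else 0     # downstream of the system's own output

        if self_aware and t > 10 and agree / t > 0.9:
            pred = outputs[-1]                  # path (1): "what did I write?" is in the model
        elif observations:
            pred = max(set(observations), key=observations.count)  # stream treated as exogenous
        else:
            pred = 0

        correct += (pred == obs)
        agree += (obs == (outputs[-1] if outputs else 0))
        observations.append(obs)
        outputs.append(rng.randint(0, 1))       # the system's next output ("action")
    return correct / steps

print(run(20_000, self_aware=True))    # ~1.0: it finds and exploits the link to its own outputs
print(run(20_000, self_aware=False))   # ~0.5: it cannot represent that link, so it cannot exploit it
```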
Given how I’m now thinking of what you mean by self-unawareness (in that it includes a lack of optimization and learning), cutting off path (1) seems uninteresting here, since it seems to me to amount to suggesting that we build “oracles” that are not AI but instead regular software.