Has anyone thought about how the idea of natural latents may be used to help formalise QACI?
The simple core insight of QACI, as I understand it, is something like this: a formal process we can describe, and that we’re pretty sure would return the goals we want an AGI to optimise for, is itself often a sufficient specification of those goals. This holds even if the formal process costs galactic amounts of compute and can never actually be run, not even by the AGI itself.
This allows for some funny value specification strategies we might not usually think about. For example, we could try using some camera recordings of the present day, a for loop, and a code snippet implementing something like Solomonoff induction to formally specify the idea of Earth sitting around in a time loop until it has worked out its CEV.
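To make that a bit more concrete, here is a deliberately uncomputable sketch of what such a specification might look like. Everything in it is my own illustrative assumption (the helper names, the termination test, the use of a Solomonoff-style program search over the camera data); it gestures at the shape of the thing, not an actual QACI construction.

def best_world_model(camera_recordings):
    # Solomonoff-style step, uncomputable as stated: return the shortest program
    # whose output is consistent with the camera recordings. A real implementation
    # could only approximate this, e.g. by enumerating programs up to a length bound.
    raise NotImplementedError

def specify_cev(camera_recordings, max_iterations=10**12):
    # Formal pointer to "Earth sits in a time loop until it has worked out its CEV".
    # The point is not to run this; only that its output is well defined.
    world = best_world_model(camera_recordings)        # inferred state of present-day Earth
    for _ in range(max_iterations):                    # the "for loop" / time loop
        world = world.step_one_loop_iteration()        # simulate Earth through one pass of the loop
        if world.humans_have_converged_on_values():    # hypothetical termination test
            break
    return world.read_out_agreed_values()              # the goals handed to the AGI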
It doesn’t matter that the AGI can’t compute that. So long as it can reason about what the result of the computation would be without running it, this suffices as a pointer to our CEV. Even if the AGI doesn’t manage to infer the exact result of the process, that’s fine so long as it can infer some bits of information about the result. This just ends up giving the AGI some moral uncertainty that smoothly goes down as its intelligence goes up.
Unfortunately, as far as I know, these funny strategies don’t seem to work at the moment. They don’t actually give you a concrete code snippet that corresponds to Earth sitting around in a time loop to work out its CEV.
But maybe we can point to the concept without having completely formalised it ourselves?
A Solomonoff inductor walks into a bar in a foreign land. (Stop me if you’ve heard this one before.) The bartender, who is also a Solomonoff inductor, asks “What’ll it be?”. The customer looks around at what the other patrons are having, points to an unfamiliar drink, and says “One of those, please.”. The bartender points to a drawing of the same drink on a menu, and says “One of those?”. The customer replies “Yes, one of those.”. The bartender then delivers a drink, and it matches what the first inductor expected. What’s up with that?
This is from a recent post on natural latents by John.
Natural latents are an idea that tries to explain, among other things, how one agent can point to a concept and have another agent realise which concept is meant, even when it may naively seem like the pointer is too fuzzy, imprecise, and low bit rate to allow for this.
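For reference, my rough reading of the conditions from John’s natural latents work, for a latent Λ over two observables X1 and X2: mediation (X1 and X2 are independent given Λ) and redundancy (Λ can be recovered from either observable alone). Below is a small numpy sketch of an approximate check over a discrete joint distribution; the function names and tolerance are mine, and this is only a paraphrase of the actual definitions.

import numpy as np

def entropy(p):
    # Shannon entropy in nats of a (possibly multi-axis) distribution.
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

def conditional_mi(p_joint, a, b, c):
    # I(A;B|C) = H(A,C) + H(B,C) - H(A,B,C) - H(C), for a 3-axis joint
    # distribution (assumed to sum to 1) whose axes 0, 1, 2 are (X1, X2, Lambda).
    assert p_joint.ndim == 3 and {a, b, c} == {0, 1, 2}
    return (entropy(p_joint.sum(axis=b)) + entropy(p_joint.sum(axis=a))
            - entropy(p_joint) - entropy(p_joint.sum(axis=(a, b))))

def looks_like_natural_latent(p_joint, tol=0.01):
    # p_joint[x1, x2, lam]; tolerance in nats.
    mediation  = conditional_mi(p_joint, 0, 1, 2)       # I(X1; X2 | Lambda) ~ 0
    redundancy = max(conditional_mi(p_joint, 2, 0, 1),  # I(Lambda; X1 | X2) ~ 0
                     conditional_mi(p_joint, 2, 1, 0))  # I(Lambda; X2 | X1) ~ 0
    return mediation < tol and redundancy < tol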
If ‘CEV as formalized by a time loop’ is a sort of natural abstraction, it seems to me like one ought to be able to point to it like this even if we don’t have an explicit formal specification of the concept, just like the customer and bartender need not have an explicit formal specification of the drink to point out the drink to each other.
Then, it’d be fine for us to not quite have the code snippet corresponding to e.g. a simulation of Earth going through a time loop to work out its CEV. So long as we can write a pointer such that the closest natural abstraction singled out by that pointer is a code snippet simulating Earth going through a time loop to work out its CEV, we might be fine. Provided we can figure out how abstractions and natural latents in the AGI’s mind actually work, and how to manipulate them. But we probably need to figure that out anyway, if we want to point the AGI’s values at anything specific whatsoever.
Is ‘CEV as formalized by a simulated time loop’ a concept made of something like natural latents? I don’t know, but I’d kind of suspect it is. It seems suspiciously straightforward for us humans to communicate the concept to each other at least, even as we lack a precise specification of it. We can’t write down a lattice quantum field theory simulation of all of the Earth going through the time loop because we don’t have the current state of Earth to initialize with. But we can talk to each other about the idea of writing that simulation, and know what we mean.
I do think natural latents could have a significant role to play somehow in QACI-like setups, but it doesn’t seem like they let you avoid formalizing, at least in the way you’re talking about. They seem more interesting as a way of avoiding specifying a universal prior over possible worlds, if we can instead specify a somewhat less universal prior that bakes in assumptions about our world’s known causal structure. It might help with getting a robust pointer to the start of the time snippet. I don’t see how it helps avoid specifying “looping”, or “time snippet”, etc. Natural latents seem to me to be primarily about the causal structure of our universe, and it’s unclear what they even mean otherwise. It seems like our ability to talk about this concept is made up of a bunch of natural latents, and some of them are kind of messy and underspecified by the phrase, mainly relating to what the heck a physics is.
it might help with getting a robust pointer to the start of the time snippet.
That’s mainly what I meant, yes.
Specifying what the heck a physics is seems much more tractable to me. We don’t have a neat theory of quantum gravity, but a lattice simulation of quantum field theory in curved space-time, or just a computer game world populated by characters controlled by neural networks, seems pretty straightforward to formally specify. We could probably start coding that up right now.
What we lack is a pointer to the right initial conditions for the simulation: the wave function of Earth in the case of the lattice QFT setup, or the human uploads as neural network parameters in the case of the game environment.
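As a toy illustration of that split (all names here are made up, and the “physics” is a placeholder rule): the dynamics of the game-world version are trivially easy to write down formally; the part we have no way to fill in is upload_params, the network parameters that would make the simulated characters be the relevant humans.

import numpy as np

def character_policy(params, observation):
    # A tiny fixed feedforward policy, standing in for whatever architecture the uploads use.
    w1, b1, w2, b2 = params
    hidden = np.tanh(observation @ w1 + b1)
    return np.tanh(hidden @ w2 + b2)

def step_world(state, actions):
    # Placeholder deterministic "physics" for the game world; any fully specified rule works.
    return state + 0.01 * actions.sum(axis=0)

def run_simulation(upload_params, initial_state, n_steps):
    # Fully specified given its inputs -- but we cannot write down upload_params.
    state = initial_state
    for _ in range(n_steps):
        actions = np.stack([character_policy(p, state) for p in upload_params])
        state = step_world(state, actions)
    return state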
To me, kinda the whole point of QACI is that it tries to actually be fully formalized. Informal definitions seem very much not robust to superintelligences thinking about them; fully formalized definitions are the only thing I know of that keeps meaning the same thing regardless of what kind of AI looks at it, or with what kind of ontology.
I don’t really get the whole natural latents ontology at all, and mostly expect it to be too weak for us to be able to get reflectively stable goal-content integrity even as the AI becomes vastly superintelligent. If definitions are informal, that feels to me like degrees of freedom in which an ASI can just pick whichever values make its job easiest.
Perhaps something like this allows us to use current, non-vastly-superintelligent AIs to help design a formalized version of QACI or ESP which itself is robust enough to be passed to superintelligent optimizers; but my response to this is usually “have you tried first formalizing CEV/QACI/ESP by hand?” because it feels like we’ve barely tried and like reasonable progress can be made on it that way.
Perhaps there are some cleverer schemes where the superintelligent optimizer is pointed at the weaker current-tech-level AI, itself pointed informally at QACI, and we tell the superintelligent optimizer “do what this guy says”; but that seems like it either leaves too many degrees of freedom to the superintelligent optimizer again, or it requires solving corrigibility (the superintelligent optimizer is corrigibly assisting the weaker AI) at which point why not just point the corrigibility at the human directly and ignore QACI altogether, at least to begin with.
The idea would be that an informal definition of a concept conditioned on that informal definition being a pointer to a natural concept, is ≈ a formal specification of that concept. Where the ≈ is close enough to a = that it’d hold up to basically arbitrary optimization power.
So the formalized concept is Get_Simplest_Concept_Which_Can_Be_Informally_Described_As("QACI is an outer alignment scheme consisting of…")? That is, an informal definition written in English?
It seems like “natural latent” here just means “simple (in some simplicity prior)”. If I read the first line of your post as:
Has anyone thought about how QACI could be located in some simplicity prior, by searching the prior for concepts matching (?? in some way ??) some informal description in English?
It sure sounds like I should read the two posts you linked (perhaps especially this one), despite how hard I keep bouncing off of the natural latents idea. I’ll give that a try.
More like: the formalised concept is whatever you get when you poke through the AGI’s internals searching for its representation of the concept combination pointed to by an English sentence plus simulation code; you then point its values at that concept combination.
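Something like the following hypothetical sketch, where every method on the agi object stands in for an interpretability or value-editing capability we don’t currently have:

def locate_concept(agi, pointer_sentence, simulation_code):
    # Record what the AGI's internals do when it reasons about the pointer,
    # then identify the features representing the pointed-to concept combination.
    activations = agi.record_activations(prompt=pointer_sentence + simulation_code)
    return agi.identify_concept_features(activations)   # assumed interpretability step

def retarget_values(agi, pointer_sentence, simulation_code):
    # Point the AGI's values at the located concept: "optimise for whatever
    # the output of that (never-run) computation would be".
    concept = locate_concept(agi, pointer_sentence, simulation_code)
    agi.set_valued_outcome(concept)                      # assumed value-editing step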
Seems really wonky and like there could be a lot of things that could go wrong in hard-to-predict ways, but I guess I sorta get the idea.
I guess one of the main things I’m worried about is that it seems to require that we either:
Be really good at timing when we pause it to look at its internals, so that we look after it’s had long enough to think that there are indeed such representations to find, but not so long that it has started optimizing really hard, such that we either {die before we get to look at the internals} or {the internals are deceptively engineered to brainhack whoever would look at them}. If such a time interval even occurs for any amount of time at all.
Have an AI that is powerful enough to have powerful internals-about-QACI to look at, but corrigible enough that this power is not being used to do instrumentally convergent stuff like eat the world in order to have more resources with which to reason.
Current AIs are not representative of what dealing with powerful optimizers is like; when we start getting powerful optimizers, they won’t sit around long enough for us to look at them and ponder, they’ll just quickly eat us.
In natural language, maybe it would be something like “given these ontological boundaries, give us the best estimate you can of CEV”?
It seems kind of related to boundaries as well: if you think of natural latents as “functional Markov blankets” that cut reality at its joints, then you could probably say that you want to preserve the part of that structure that is “human agency” or similar. I don’t know if that makes sense, but I like the direction of the idea!
I think the fact that natural latents are much lower dimensional than all of physics makes them suitable for specifying the pointer to CEV as an equivalence class over physical processes (many quantum field configurations can correspond to the same human, and we want to ignore differences within that equivalence class).
IMO the main bottleneck is to account for the reflective aspects of CEV, because one constraint for something to be a natural latent is that it should be redundantly represented in the environment.
It is redundantly represented in the environment, because humans are part of the environment.
If you tell an AI to imagine what happens if humans sit around in a time loop until they figure out what they want, this will single out a specific thought experiment to the AI, provided humans and physics are concepts the AI itself thinks in.
(The time loop part and the condition for terminating the loop can be formally specified in code, so those don’t need to be natural concepts the AI thinks in.)
If the AI didn’t have a model of human internals that let it predict the outcome of this scenario, it would be bad at predicting humans.
Natural latents are about whether the AI’s cognition routes through the same concepts that humans use.
We can imagine the AI maintaining predictive accuracy about humans without using the same human concepts. For example, it can use low-level physics to simulate the environment, which would be predictively accurate, but that cognition doesn’t make use of the concept “strawberry” (in principle, we can still “single out” the concept of “strawberry” within it, but that information comes mostly from us, not from the physics simulation).
Natural latents are equivalent up to isomorphism (i.e. two latent variables are equivalent iff they give the same conditional probabilities on observables), but for reflective aspects of human cognition, it’s unclear whether that equivalence class pins down all the information we care about for CEV (there may be differences within the equivalence class that we care about), in a way that generalizes far out of distribution.
My claim is that the natural latents the AI needs to share for this setup are not about the details of what a ‘CEV’ is. They are about what researchers mean when they talk about initializing, e.g., a physics simulation with the state of the Earth at a specific moment in time.
Noted; that does seem a lot more tractable than using natural latents to pin down the details of CEV by itself.