Indeed, when cryptographers are trying to ensure that certain agents cannot do certain things, and other agents can, they often use trapdoor functions that are computationally impracticable for general agents to invert, but can be easily inverted by agents in possession of a specific secret.
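To make that asymmetry concrete, here’s a minimal sketch using textbook RSA with deliberately tiny, completely insecure parameters (my illustration, not anything a cryptographer would deploy): the forward direction is cheap for everyone, while inversion is cheap only for the agent holding the secret.

```python
# Toy trapdoor permutation (textbook RSA) with tiny parameters chosen only to
# make the asymmetry visible; real systems use ~2048-bit moduli.

p, q = 61, 53                        # secret primes: the trapdoor
n = p * q                            # public modulus (3233)
e = 17                               # public exponent, coprime to (p-1)*(q-1)
d = pow(e, -1, (p - 1) * (q - 1))    # private exponent, derivable only from p and q

def forward(m: int) -> int:
    """Easy for any agent: modular exponentiation with the public key."""
    return pow(m, e, n)

def invert_with_trapdoor(c: int) -> int:
    """Easy only for the holder of d, i.e. of the factorization of n."""
    return pow(c, d, n)

c = forward(42)
assert invert_with_trapdoor(c) == 42
# Without d, inversion means factoring n, which is what makes this a
# candidate trapdoor one-way function at realistic key sizes.
```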
I don’t think there’s a great deal that cryptography can teach agent fundamentals, but I do think there’s some overlap: it should be possible to interface a valid agent fundamentals theory neatly to the basics of cryptography.
I’m fairly optimistic about arriving at a robust solution to alignment via agent-foundations research in a timely manner. (My semi-arbitrary deadline is 2030, and I expect to arrive at intermediate solid results by EOY 2025.)
I understand that rigorously re-expressing philosophy in mathematics is non-trivial, but (as I’m sure you’re aware) given currently plausible timelines, ~2030 seems pretty late for getting this figured out: we may well need to solve some rather specific and urgent practicalities by somewhere around then.
Can you tell me what is the hard part in formalizing the following:
Agent A (an AI) is less computationally limited than a set of agents H1 through HN (humans). It models and can affect the world, itself, and the humans, using an efficient approximately Bayesian approach, and also models its own current remaining uncertainty due to insufficient knowledge (including due to not having access to the Universal prior, since it is computationally bounded). It can plan both how to optimize the world for a specific goal while pessimizing with appropriate caution over its current uncertainty, and also how to prioritize using the scientific method to reduce its uncertainty. It understands (with some current uncertainty) what preference ordering the humans each have on future states of the world. It synthesizes all of these into a fairly good compromise (a problem extensively studied in economics and the theory of things like voting), then uses its superior computational capacity to optimize the world for this (with suitable pessimizing caution over its remaining uncertainty) and also to reduce its uncertainty so it can optimize better.
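To illustrate just the “prioritize the scientific method versus optimize now” part of this, here’s a toy value-of-information calculation; every name and number below is an illustrative placeholder, not part of the proposal itself.

```python
# Toy expected-value-of-information (EVOI) calculation: when does reducing
# uncertainty beat acting on current beliefs? Purely illustrative numbers.

priors = {"benign": 0.6, "fragile": 0.4}           # current credences over two world-hypotheses

utility = {                                        # utility of each action under each hypothesis
    ("intervene", "benign"): 10.0, ("intervene", "fragile"): -20.0,
    ("hold_off", "benign"):   1.0, ("hold_off", "fragile"):   1.0,
}
actions = ["intervene", "hold_off"]

def expected_utility(action, beliefs):
    return sum(p * utility[(action, h)] for h, p in beliefs.items())

# Best we can do acting immediately on current beliefs.
act_now = max(expected_utility(a, priors) for a in actions)

# Best we can do after a (here perfectly informative) experiment that reveals
# which hypothesis is true, letting us pick the best action in each branch.
act_after = sum(p * max(utility[(a, h)] for a in actions) for h, p in priors.items())

evoi = act_after - act_now
print(f"act now: {act_now}, after experiment: {act_after}, EVOI: {evoi}")
# If the EVOI exceeds the cost of the experiment, doing the science first wins.
```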
Idealized Agents Are Approximate Causal Mirrors…
The literal formulation also runs into all sorts of infinite recursion paradoxes. What if the agent wants to model itself? What if the environment contains other agents? What if some of them are modeling this agent? And so on.
I recall reading a description by an early 20th century Asian-influenced European mystic of the image of a universe full of people being like an array of mirror-surfaced balls, each reflecting within it in miniature the entire rest of the array, including the reflections inside each of the other mirrored balls, recursively. (Though this image omits the agent modelling itself, it’s not hard to extend it, say by adding some fuzz to the outside of each ball, and a reflection of that inside it.)
I don’t think there’s a great deal that cryptography can teach agent fundamentals, but I do think there’s some overlap
Yup! Cryptography actually was the main thing I was thinking about there. And there’s indeed some relation. For example, it appears that NP≠P because our universe’s baseline “forward-pass functions” are just poorly suited for being composed into functions solving certain problems. The environment doesn’t calculate those; everything it does calculate is in P.
However, the inversion of the universe’s forward passes can be NP-complete functions. Hence a lot of difficulties.
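As a toy illustration of that asymmetry (the specific example is arbitrary): a “forward pass” like totalling a chosen subset is linear-time, while inverting it, i.e. recovering a choice that hits a given total, is the NP-complete subset-sum problem.

```python
# Forward pass easy, inversion hard: subset-sum as a miniature example.
from itertools import combinations

weights = [3, 9, 14, 20, 27, 41]

def forward(chosen: tuple) -> int:
    """Easy direction: given a choice of items, the total is a linear-time sum."""
    return sum(weights[i] for i in chosen)

def invert(target: int):
    """Hard direction in general: finding a choice that hits the target is
    subset-sum (NP-complete); here we simply enumerate all 2^n subsets."""
    for r in range(len(weights) + 1):
        for chosen in combinations(range(len(weights)), r):
            if forward(chosen) == target:
                return chosen
    return None

print(forward((0, 2, 4)))   # 3 + 14 + 27 = 44
print(invert(44))           # finds some subset summing to 44 (here (0, 5): 3 + 41)
```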
~2030 seems pretty late for getting this figured out: we may well need to solve some rather specific and urgent practicalities by somewhere around then
2030 is the target for having completed the “hire a horde of mathematicians and engineers and blow the problem wide open” step, to be clear. I don’t expect the theoretical difficulties to take quite so long.
Can you tell me what is the hard part in formalizing the following:
Usually, the hard part is finding a way to connect abstract agency frameworks to reality. As in: here you have your framework, here’s the Pile, now write some code to make them interface with each other.
Specifically in this case, the problems are:
an efficient approximately Bayesian approach
What approach specifically? The agent would need to take in the Pile, and regurgitate some efficient well-formatted hierarchical world-model over which it can do search. What’s the algorithm for this?
It understands (with some current uncertainty) what preference ordering the humans each have
How do you make it not just understand that, but care about that? How do you interface with the world-model it learned, and point at what the humans care about?
However, the inversion of the universe’s forward passes can be NP-complete functions. Hence a lot of difficulties.
If we’re talking about cryptography specifically, we don’t believe that the inversion of the universe’s forward passes used for cryptography is NP-complete, and if that were proved, it would collapse the polynomial hierarchy to the first level. The general view is that the polynomial hierarchy is likely to have an infinite number of levels, à la Hilbert’s hotel.
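Spelled out, the standard argument is that the decision version of factoring (which inverting an RSA-style forward pass reduces to) lies in NP ∩ coNP, and:

```latex
L \in \mathrm{NP} \cap \mathrm{coNP} \ \text{and}\ L\ \text{NP-complete}
\;\Longrightarrow\; \mathrm{NP} \subseteq \mathrm{coNP}
\;\Longrightarrow\; \mathrm{NP} = \mathrm{coNP}
\;\Longrightarrow\; \mathrm{PH} = \Sigma_1^p ,
```

i.e. the hierarchy collapses to its first level, which is widely conjectured not to happen.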
Yup! Cryptography actually was the main thing I was thinking about there. And there’s indeed some relation. For example, it appears that NP≠P is because our universe’s baseline “forward-pass functions” are just poorly suited for being composed into functions solving certain problems. The environment doesn’t calculate those; all of those are in P.
A different story is that the following constraints potentially prevent us from solving NP-complete problems efficiently:
The first law of thermodynamics (energy conservation), which follows from the time-translation symmetry of the universe’s physical laws.
The speed of light being finite, meaning there’s only a finite amount of universe available to build your computer from (see the quick arithmetic after this list).
Limits on memory and computational speed not letting us scale exponentially forever.
(Possibly) time travel and quantum gravity being inconsistent, or time travel/CTCs being impossible.
Edit: OTCs (open timelike curves), where you can’t travel in time but nevertheless have a wormhole, might also be impossible, meaning wormholes might be impossible.
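For the “finite amount of universe” point, a rough back-of-the-envelope comparison (the 10^120 figure is a commonly cited order-of-magnitude estimate of the total number of elementary operations the observable universe could have performed; treat it as a gesture at scale, not a precise bound):

```latex
2^{500} \approx 3 \times 10^{150} \;\gg\; 10^{120} \;\gtrsim\; N_{\mathrm{ops}}\ \text{(all elementary operations ever performed by the observable universe)},
```

so exhaustively searching even a few-hundred-bit space is physically off the table regardless of how P vs NP turns out.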
However, the inversion of the universe’s forward passes can be NP-complete functions. Hence a lot of difficulties.
Like a cryptographer, I’m not very concerned about worst-case complexity, only average-case complexity. We don’t even generally need an exact inverse, normally just an approximation to some useful degree of accuracy. If I’m in a position to monitor and repeatedly apply corrections as I approach my goal, even fairly coarse approximations with some bounded error rate may well be enough. Some portions of the universe are pretty approximately-invertible in the average case using much lower computational resources than simulating the field-theoretical wave function of every fundamental particle. Others (for example non-linear systems after many Lyapunov times, carefully designed cryptosystems, or most chaotic cellular automata), less so. Animals including humans seem to be able to survive in the presence of a mixed situation where they can invert/steer some things but not others, basically by attempting to avoid situations where they need to do the impossible. AIs are going to face the same situation.
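A toy version of the “coarse approximate inverse plus repeated corrections” point; the nonlinear forward map and the crude linear inverse below are arbitrary stand-ins.

```python
# Steering with a wrong-but-bounded-error inverse model: monitor the error and
# re-correct each step, and the residual still shrinks geometrically.
import math

def forward(u: float) -> float:
    """The 'true' forward pass (its details are unknown to the controller)."""
    return 1.3 * u + 0.2 * math.sin(u)

def approx_inverse_step(error: float) -> float:
    """Crude inverse model: pretend the map is just y = 1.3 * u."""
    return error / 1.3

target, u = 5.0, 0.0
for step in range(8):
    u += approx_inverse_step(target - forward(u))
    print(f"step {step}: u = {u:.4f}, residual = {target - forward(u):.6f}")
# Each step only has to reduce the error, not eliminate it, so a coarse
# approximation with bounded error is enough to converge on the goal.
```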
an efficient approximately Bayesian approach
What approach specifically? The agent would need to take in the Pile, and regurgitate some efficient well-formatted hierarchical world-model over which it can do search. What’s the algorithm for this?
Basically every functional form of machine learning we know, including both SGD and in-context learning in sufficiently large LLMs, implements an approximate version of Bayesianism. I agree we need to engineer a specific implementation to build my proposal, but for mathematical analysis just the fact that it’s a computationally-bounded approximation to Bayesianism takes us quite some way, until we need to analyze its limitations and inaccuracies.
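One standard (and much weaker than full Bayes) way to cash part of this out: loss minimization with weight decay is exactly posterior-mode (MAP) estimation under a Gaussian prior, so SGD is at least tracking a summary statistic of a Bayesian posterior:

```latex
\hat{\theta}
= \arg\min_{\theta}\Big[-\sum_i \log p(y_i \mid x_i,\theta) + \tfrac{\lambda}{2}\lVert\theta\rVert^2\Big]
= \arg\max_{\theta}\; p(\theta)\prod_i p(y_i \mid x_i,\theta),
\qquad p(\theta)=\mathcal{N}(0,\lambda^{-1} I).
```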
It understands (with some current uncertainty) what preference ordering the humans each have
How do you make it not just understand that, but care about that? How do you interface with the world-model it learned, and point at what the humans care about?
I’m assuming a structure similar to a computationally-bounded version of AIXI, upgraded to do value learning rather than having a hard-coded utility function. It maintains and performs approximate Bayesian updates on an ensemble of theories about a) mappings from current world state + actions to distributions of future world states, and b) mappings from world states to something utility-function-like for individual humans, plus an aggregate/compromise of these across all humans. It can apply the scientific method to reducing uncertainty on both of these ensembles of theories, in a prioritized way, and its final goal is to meanwhile attempt to optimize the utility of the aggregate/compromise across all humans, in a suitably cautious/pessimizing way over uncertainties in a) and b). So like AIXI, it has an explicit final goal slot by construction, and that goal slot has been pointed at value learning. You don’t need to point at what humans care about in detail: that’s part b) of its world-model ensemble. You probably do need to point at a definition of what a human is, plus the fact that humans, as sentient biological organisms, are computationally bounded agents who have preferences/goals (which your agent fundamentals program clearly could be helpful for, if biology alone wasn’t enough of a pointer).
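Here’s a deliberately toy, runnable sketch of just that structure: each hypothesis pairs a) a dynamics model with b) per-human utility models, the goal slot evaluates a compromise across humans, and the choice pessimizes over the whole ensemble. Every name and number, and the mean-as-compromise rule, are illustrative placeholders rather than a design.

```python
# Value-learning goal slot, in miniature: what gets optimized is a cautious
# estimate of an aggregate of *learned* human utilities, not a hard-coded
# utility over world states. All specifics here are placeholders.
from dataclasses import dataclass
from typing import Callable, List

State, Action = float, str   # toy stand-ins for real world-state/action types

@dataclass
class Hypothesis:
    weight: float                                  # current posterior weight
    dynamics: Callable[[State, Action], State]     # a) world model
    human_utils: List[Callable[[State], float]]    # b) one learned utility per human

    def aggregate_utility(self, s: State) -> float:
        # Compromise across humans: a plain mean of (assumed range-normalized)
        # utilities; any better-studied social-choice rule could go here.
        return sum(u(s) for u in self.human_utils) / len(self.human_utils)

def cautious_value(hyps: List[Hypothesis], s: State, a: Action) -> float:
    """Score an action by the worst non-negligible hypothesis, pessimizing
    jointly over dynamics (a) and value-learning (b) uncertainty."""
    return min(h.aggregate_utility(h.dynamics(s, a)) for h in hyps if h.weight > 0.05)

hyps = [
    Hypothesis(0.8, dynamics=lambda s, a: s + (1.0 if a == "build" else 0.0),
               human_utils=[lambda s: s, lambda s: -abs(s - 2.0)]),
    Hypothesis(0.2, dynamics=lambda s, a: s - (3.0 if a == "build" else 0.0),
               human_utils=[lambda s: s, lambda s: -abs(s - 2.0)]),
]
# A plain expectation over the ensemble would favour "build" (0.2 vs 0.0);
# the pessimizing rule picks "wait".
print(max(["build", "wait"], key=lambda a: cautious_value(hyps, 1.0, a)))
```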
Given access to an LLM, I don’t believe finding a basically-unique best-fit mapping between the human linguistic world model encoded in the LLM and the AI’s Bayesian ensemble of world models is a hard problem, so I don’t consider something as basic as pointing at the biological species Homo sapiens to be very hard. I’m actually very puzzled why (post GPT-3) anyone still considers the pointers problem to be a challenge: given two very large, very complex and easily queryable world models, there is clearly almost always (apart from statistically unlikely corner cases) going to be a functionally-unique solution to finding a closest fit between the two that makes as much as possible of one an approximate subset of the other. (And in those cases where there are a small number of plausible alternative fits, either globally or at least for small portions of the world-model networks, there should be a clear experimental way to distinguish between the alternative hypotheses, often just by asking some humans some questions.)

This is basically just a problem in optimal approximate subset-isomorphism of labelled graphs (with an unknown label mapping), something that has excellent heuristic methods that work in the average case. (I expect the worst case is NP-complete, but we’re not going to hit it.) Doing this between different generations of human scientific paradigms for the same subject matter is basically always trivial, other than for paradigms so primitive and mistaken as to have almost no valid content (even the Ancient Greek Earth-Air-Fire-Water model maps onto solid, gas, plasma, liquid: the four most common states of matter). There may of course be parts that don’t fit together well due to mistakes on one side or the other, but the concepts “the species Homo sapiens” and “humans are evolved sentient animals, and thus computationally-bounded agents with preferences/goals” both seem to me to be rather unlikely to be among them, given how genetically similar to each other we all are.
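For the flavour of the heuristics I have in mind, here’s a toy similarity-flooding-style matcher between two tiny labelled graphs with different vocabularies: node similarity starts from label overlap and is then reinforced by neighbourhood structure. It’s a sketch of the class of method only; real world models are vastly larger and messier.

```python
# Approximate matching of two labelled graphs with an unknown label mapping,
# via a crude similarity-flooding heuristic. Illustrative only.

def token_overlap(a: str, b: str) -> float:
    """Crude label similarity: Jaccard overlap of lower-cased, underscore-split tokens."""
    ta, tb = set(a.lower().split("_")), set(b.lower().split("_"))
    return len(ta & tb) / len(ta | tb)

def match(graph_a: dict, graph_b: dict, rounds: int = 10, alpha: float = 0.5):
    """graph_x maps a node label to the list of its neighbours' labels."""
    sim = {(a, b): token_overlap(a, b) for a in graph_a for b in graph_b}
    for _ in range(rounds):
        new_sim = {}
        for a in graph_a:
            for b in graph_b:
                # For each neighbour of a, how similar is its best counterpart among b's neighbours?
                neigh = [max(sim[(na, nb)] for nb in graph_b[b]) if graph_b[b] else 0.0
                         for na in graph_a[a]]
                flood = sum(neigh) / len(neigh) if neigh else 0.0
                new_sim[(a, b)] = alpha * token_overlap(a, b) + (1 - alpha) * flood
        sim = new_sim
    # Greedy one-to-one assignment by descending similarity.
    pairs, used_a, used_b = [], set(), set()
    for (a, b), s in sorted(sim.items(), key=lambda kv: -kv[1]):
        if a not in used_a and b not in used_b:
            pairs.append((a, b, round(s, 3)))
            used_a.add(a)
            used_b.add(b)
    return pairs

# Two tiny "world models" with different vocabularies but matching structure.
human_model = {"human": ["primate", "agent"], "primate": ["human"], "agent": ["human"]}
ai_model = {"homo_sapiens": ["ape", "bounded_agent"], "ape": ["homo_sapiens"],
            "bounded_agent": ["homo_sapiens"]}
print(match(human_model, ai_model))
# Recovers human<->homo_sapiens, primate<->ape, agent<->bounded_agent despite
# only one pair of labels sharing any tokens.
```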