Master’s student in applied mathematics, funded by the Center on Long-Term Risk to investigate the cheating problem in safe Pareto improvements. Agent foundations fellow with @Alex_Altair.
Some other areas I’m interested in:
Investigate properties of general purpose search so that we can handcraft it & simply retarget the search
Investigate the type signature of world models to find properties that remain invariant under ontology shifts
Natural latents
How to characterize natural latents in settings like PDEs?
Equivalence of natural latents under transformation of variables
Formalizing automated design
Information theoretic impact measures
Scalable blockchain consensus mechanisms
Programming language for concurrency
Quantifying optimization power without assuming a particular utility function
What mathematical axioms would emerge in a Solomonoff inductor?
How things like Riemannian metrics & differential equations might emerge from discrete systems
Morphogenesis
I think one pattern that needs to hold in the environment for subgoal corrigibility to make sense is that the world is modular, but its modularity structure can be broken or changed
For one, modularity is the main thing that enables general purpose search: if we can optimize for a goal by optimizing for just a few instrumental subgoals while ignoring the influence of pretty much everything else, then that reflects some degree of modularity in the problem space
Secondly, if the modularity structure of the environment stayed constant no matter what (e.g., if we could represent it as a fixed causal DAG), then there would be no need to “respect modularity”: any action we take would preserve the modularity of the environment by default. We only need to worry about side effects if there is at least a possibility for those side effects to break or change the modularity of the problem space, and that means the modularity structure of the problem space must be a thing that can be broken or changed
Example of the modularity structure of the environment changing: most objects in the world only directly influence other objects nearby, and we can break or change that modularity structure by moving objects to different positions. In particular, the positions are the variables which determine the modularity of “which objects influence which other objects”, and the way we “break” the modularity structure between objects is by intervening on those variables.
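To make the position example concrete, here is a toy sketch of an influence graph whose edge structure is determined by position variables, so that intervening on a position changes which objects can directly influence which others. Everything here (the object names, the influence radius, the `influence_graph` function) is invented for illustration:

```python
from itertools import combinations

# Hypothetical assumption: objects can directly influence each other
# only when they are within this distance of one another.
INFLUENCE_RADIUS = 1.5

def influence_graph(positions):
    """Return the set of object pairs that can directly influence each other,
    as determined by the 'second-order' position variables."""
    return {
        frozenset((a, b))
        for (a, pa), (b, pb) in combinations(positions.items(), 2)
        if abs(pa - pb) < INFLUENCE_RADIUS
    }

# Initial layout: objects on a line, so only neighbours interact.
positions = {"kettle": 0.0, "mixer": 1.0, "oven": 2.0}
before = influence_graph(positions)  # kettle–oven are too far apart to interact

# Intervening on a position variable changes the modularity structure:
positions["kettle"] = 1.9  # move the kettle next to the oven
after = influence_graph(positions)   # now kettle–oven can directly interact
```

The point of the sketch is that the edges of the influence graph are not primitive: they are computed from the position variables, so an action that changes a position thereby changes the modularity structure.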
So “subgoal corrigibility” requires the environment to be modular, but with a modularity structure that can be broken or changed. If this is true, then the modularity structure of the environment can be tracked by a set of “second-order” variables, such as position, which tell us “what things influence what other things”. (These second-order variables might themselves satisfy some modularity structure that can be changed, in which case we may have third-order variables that track the modularity structure of the second-order variables.) The way we “respect the modularity” of other instrumental subgoals is by preserving the second-order variables that track the modularity structure of the problem space.
For instance, we get to break down the goal of baking a cake into instrumental subgoals such as acquiring cocoa powder (while ignoring most other things) if and only if a particular modularity structure of the problem space holds (e.g. the other equipment is all in the right condition & position), and there is a set of variables that track that modularity structure (the conditions & positions of the equipment). The way we preserve that modularity structure is by preserving those variables.
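One minimal way to sketch what “preserving those variables” could mean for a planner: treat each action as a set of variables it changes, and accept an action only if it leaves the modularity-tracking variables untouched. All names here (`respects_modularity`, the variable names) are made up for this example, not a proposed implementation:

```python
def respects_modularity(action_effects, modularity_vars):
    """True iff the action changes none of the 'second-order' variables
    that track which things influence which other things.

    action_effects: set of variable names the action would change
    modularity_vars: set of variable names tracking the modularity structure
    """
    return modularity_vars.isdisjoint(action_effects)

# Hypothetical variables tracking the modularity structure of the
# cake-baking problem space:
modularity_vars = {"oven_position", "mixer_position", "oven_condition"}

ok = respects_modularity({"cocoa_stock"}, modularity_vars)        # acquiring cocoa
bad = respects_modularity({"oven_position"}, modularity_vars)     # moving the oven
```

Under this toy criterion, acquiring cocoa powder is fine (it only touches `cocoa_stock`), while moving the oven is flagged, because it intervenes on a variable that the subgoal decomposition depends on.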
Given this, we might want to model the world in a way that explicitly represents variables which track the modularity of other variables, so that we can preserve influence over those variables (and therefore the modularity structure that GPS relies on)