Question for my fellow alignment researchers out there: do you have a list of unsolved problems in AI alignment? I’m thinking of creating an “alignment mosaic” of the questions we need to resolve and slowly filling it in with insights from papers/posts.
I have my own version of this, but I would love to combine it with others’ alignment backcasting game-trees. I want to collect the kinds of questions people are keeping in mind when reading papers/posts, thinking about alignment or running experiments. I’m working with others to make this into a collaborative effort.
Ultimately, what I’m looking for are important questions and sub-questions we need to be thinking about and updating on when we read papers and posts as well as when we decide what to read.
Here’s my Twitter thread posing this question: https://twitter.com/jacquesthibs/status/1633146464640663552?s=46&t=YyfxSdhuFYbTafD4D1cE9A.
Here’s a sub-thread breaking down the alignment problem in various forms: https://twitter.com/jacquesthibs/status/1633165299770880001?s=46&t=YyfxSdhuFYbTafD4D1cE9A.
I’m going to answer a different question: what’s my list of open problems in understanding agents? I claim that, once you dig past the early surface-level questions about alignment, basically the whole cluster of “how do agents work?”-style questions and subquestions forms the main barrier to useful alignment progress. So with that in mind, here are some of my open questions about understanding agents (and the even deeper problems one runs into when trying to understand agents), going roughly from “low-level” to “high-level”.
How does abstraction work?
How can we efficiently compute natural abstractions...
… in simulations or toy models (i.e. from an explicit low-level model)?
… from only data or other interactions with the environment?
Inverse problem of abstraction: humans often learn higher-level abstract models before lower-level models (e.g. Newtonian physics before quantum). How does that work?
How can we efficiently represent or reason about the class of low-level models compatible with a given abstract model?
What are the natural data structures for representing natural abstractions?
What constraints or selection pressures does massive parallelism place on factorizations?
What representations are convergent for realistic/typical cognitive systems which arise from selection pressure?
What are the key properties of the training/evolutionary environment and initial conditions which make them convergent?
Are there relevant mathematical senses of “naturality” of representations other than convergence?
What factors in the environment make certain representations/factorizations “natural” in senses other than convergence?
How is indexical uncertainty handled?
What are the natural “atoms” from which abstraction-representations are built (e.g. circuits, do-operators, ???)
How can we generalize principles from thermodynamics/stat mech to talk about agents more generally?
What conditions allow us to make thermo-style arguments (like e.g. the Generalized Heat Engine) without relying on reversibility (or, better yet, without thinking in terms of dynamic systems at all)?
What’s the right language for relativistic thermo, or thermo on causal models more generally?
What’s the right language to talk about chaos-like phenomena (i.e. “loss” of macroscopic information due to sensitivity) over space-like separation rather than time-like separation?
Narrower problem: what’s the right language to talk about chaos-like phenomena over time, but without time-symmetric “laws of physics”?
How far do Maxwell’s Demon-style arguments generalize for talking about embedded agents?
What can the existing quantitative theory of phase transitions generalize to tell us about bits-of-optimization required to change the value of a naturally-abstract variable (like e.g. temperature of some object)?
What’s the right language in which to talk about that, once we’re no longer relying on reversibility or thinking in terms of dynamic systems?
How far do maxentropic distributions generalize to distributions of low-level state given natural abstract summaries?
What’s the form of the constraints in the relevant maximum entropy problems, and why that form?
What are the quantitative conditions for spontaneous self-amplification of natural abstractions, again once we’re no longer relying on reversibility or thinking in terms of dynamic systems?
What’s up with boundaries and modularity?
To what extent do boundaries/modules typically exist “by default” in complex systems, vs require optimization pressure (e.g. training/selection) to appear?
Why are biological systems so modular? To what extent will that generalize to agents beyond biology?
How modular are trained neural nets? Why, and to what extent will it generalize?
What is the right mathematical language in which to talk about modularity, boundaries, etc?
How do modules/boundaries interact with thermodynamics—e.g. can we quantify the negentropy/bits-of-optimization requirements to create new boundaries/modules, or maintain old ones?
Can we characterize the selection pressures on transboundary transport/information channels in a general way?
To what extent do agents in general form internal submodules? Why?
To what extent do various phenomena of biology generalize to other kinds of agenty systems?
What’s the right language in which to talk about self-reproducing patterns (“tilers”) in general (e.g. bacteria, but also memes, transposons, hypothetical transposon-like phenomena in neural nets, etc)?
How do we talk about this in an embedded way, e.g. how can we identify a self-reproducing pattern embedded in an environment in general?
To what extent do agenty systems and/or tilers in general rely on combinatorial constructions from standardized parts (e.g. DNA/RNA/proteins in biology)? Why?
To what extent will agents and/or tilers have a component specialized in representing a specification of the agent (e.g. DNA in biological systems or the string used in a typical quine)?
Why don’t sessile organisms (i.e. organisms with no controlled movement) cephalize (i.e. evolve nervous systems)? How general is this property, in the mathematical space of agent-like things?
Precisely what selection pressures induce internal information channels, information processing capacity, feedback, memory, etc? Precisely what internal structures do the external pressures select for (e.g. what specific circuits etc)?
To what extent is an internal active immune system a general property of agency? What is the right mathematical language in which to talk about that question?
What’s the right way to generalize the level of abstraction which biologists call “morphology” to radically different systems, like e.g. deep learning systems? What qualitative phenomena of morphology generalize?
What’s up with mesa-optimization?
When and to what extent do mesa-optimizers show up?
What’s the right mathematical language to talk about both inner and outer optimization processes (e.g. are the type signatures of the two the same)?
How will the inner objective relate to the outer objective?
What kind of inner search-process will typically show up? What are its parts and their type signatures?
How can we detect inner optimizers embedded in a system?
How can we detect outer optimization processes embedded in a system?
What’s up with “shards” and factorization?
What are natural (e.g. convergent) ways of factoring the problems which an agenty system faces? What’s the right mathematical language in which to talk about that question?
How do natural factorizations of problems correspond to natural epistemic factorizations (i.e. natural abstraction)?
To what extent are internal components/modules of an agent selected to “correspond to” natural factors of the problems faced by the agent?
What is that correspondence?
How do “shards” corresponding to different problem-factors interact with each other? Is there convergent structure to cross-shard interfaces?
What’s up with consistency pressure?
When, and to what extent, will different parts of an agenty system be selected to make similar trade-offs?
What’s the right data structure to represent parts of an agent making similar trade-offs? E.g. utility function, expected utility, market, ???
Insofar as there’s a “shared goal” across the parts, what’s the input type, i.e. what kinds-of-things does the system “want”?
In particular, how well do the things-the-system-wants map to natural abstractions?
More generally, how do inputs-to-the-”goals” correspond to things-in-the-environment (in other words, how do we model and/or solve the pointers problem)?
To what extent is general-purpose search required in order for “goals” to make sense at all, vs “goals” arising entirely via consistency pressure?
How can we measure consistency pressure both in an environment, and by looking at the internals of a trained/evolved system?
What decision theory is convergent?
What are the necessary conditions for a measuring stick of utility, or some generalization thereof? How can we detect one embedded in an environment?
Is there a universal measuring stick of utility (maybe e.g. negentropy)?
What’s up with general-purpose search?
When, and to what extent, does general-purpose search show up in agenty systems?
What general-purpose search algorithm(s) is/are convergent?
What are the type signatures of goals, knowledge, and other intermediate data structures passed around within the convergent general-purpose search algorithm(s)?
To what extent is symbolic representation convergent, as a way of passing information around within general-purpose search processes?
What’s the right mathematical language to talk about correspondence between internal “symbols” in a general-purpose search process and the things-in-the-environment to which those symbols (approximately) correspond?
How can we detect general-purpose search embedded in a system?
How can we extract the (maybe implicit and/or lazily represented) goal, knowledge, and other internal data structures from a general-purpose search process embedded in a system?
How do we practically map those structures to things-in-the-environment which they presumably represent/talk about?
To what extent is general-purpose world-modeling a convergent and separate component of general-purpose search?
How is the world model represented?
How does the world model interact with everything else?
How does the convergent general-purpose search/modeling algorithm handle embeddedness and self-modeling?
What’s the right mathematical language in which to talk about embedded agents, especially with self-models?
What’s up with language?
When, and to what extent, does language show up in agenty systems?
When, and to what extent, will systems be selected to bind linguistic symbols (e.g. spoken/written words) to the same “symbols” used internally by a general-purpose search or, more generally, the same “symbols” used internally for cross-module communication?
Based on convergent data structures for internal representations, what’s the natural mathematical language to represent the semantics of natural language?
What’s up with trade, markets, and firm-level selection pressures?
What’s a resource?
Mathematically, what does it mean to “own” or “control” something, in a way which plays well with embeddedness (i.e. ownership can’t be ontologically fundamental)?
Quantitatively, what selection pressures are produced by markets? When and to what extent do they reproduce all the phenomena covered by earlier questions, but for firms rather than bacteria/neural nets/etc?
What factors determine convergent firm size and structure (i.e. what things are done by employees vs outsourced, who talks to who about what, management style, etc)?
To what extent is firm size/structure convergent under market selection pressures vs determined by other things?
What parts of all this will generalize to “firms” of agents very different from humans?
To what extent can we model the information-carrying function of prices separately from the incentive/bargaining role of prices?
To what extent are market-like internal structures convergent within agenty systems even without bargaining, e.g. within bacteria?
What’s up with Schelling problems?
To what extent do Schelling-style problems convergently induce phenomena similar to politics (i.e. fighting over control of Schelling points like laws or norms), governments (i.e. monopoly control over certain Schelling points like laws, capability-of-violence and property ownership), wars (i.e. fights between groups/governments, usually over control of Schelling points like borders or laws and notably usually NOT to extinction), etc, amongst agents very different from humans?
What are natural Schelling points among minds very different from humans?
Something something natural abstractions? Natural boundaries?
To what extent are there convergent game-theoretic norms/standards for interaction among non-human minds?
What selection pressures act on/within governments and government-like structures?
What selection pressures determine the convergent type of government, e.g. feudal vs democratic, socialist vs capitalist, etc? How will this generalize to government-like structures amongst non-human agents?
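For reference on the maximum-entropy questions above ("What’s the form of the constraints in the relevant maximum entropy problems, and why that form?"), here is the textbook case those questions start from. The notation is an assumption introduced for illustration: $x$ is the low-level state, the $f_i$ are summary functions, and the $c_i$ are their required expected values. Whether natural abstract summaries actually take this expectation-constraint form is exactly what the question leaves open.

```latex
\max_{p} \; -\sum_{x} p(x) \log p(x)
\quad \text{subject to} \quad
\sum_{x} p(x)\, f_i(x) = c_i \;\; (i = 1, \dots, k),
\qquad \sum_{x} p(x) = 1

\Longrightarrow \quad
p^{*}(x) = \frac{1}{Z(\lambda)} \exp\!\Big( -\sum_{i=1}^{k} \lambda_i f_i(x) \Big),
\qquad
Z(\lambda) = \sum_{x} \exp\!\Big( -\sum_{i=1}^{k} \lambda_i f_i(x) \Big)
```

The Lagrange multipliers $\lambda_i$ are fixed by the constraint values $c_i$; with a single constraint on expected energy this reduces to the Boltzmann distribution.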
Dalton Sakthivadivel showed here that boundaries (i.e., sparse couplings) do exist and are “ubiquitous” in high-dimensional (i.e., complex) systems.
See section 3. “Optimization and Scale Separation in Evolving Systems” in “Toward a theory of evolution as multilevel learning” (Vanchurin et al., 2022).
Also, see Michael Levin’s work on “multiscale competency architectures”. Fields, Levin, et al. apply this framework to ANNs in “The free energy principle induces neuromorphic development” (2022), see sections 2 and 4 in particular. This paper also addresses the question “How do modules/boundaries interact with thermodynamics—e.g. can we quantify the negentropy/bits-of-optimization requirements to create new boundaries/modules, or maintain old ones?”
I think this is an ill-posed question. Boundaries and modularity could be discussed in the context of different mathematical languages/frameworks: quantum mechanics, random dynamical systems formalism, neural network formalism, whatever. All these mathematical languages permit talking about information exchange, modularity, and boundaries. Cf. this comment.
Even if we reformulate the question as “Which mathematical language permits identifying boundaries [of a particular physical system, because asking this question in the abstract for any system also doesn’t make sense] most accurately?”, then the answer probably depends on the meta-theoretical (epistemological) framework that the scientist who asks this question applies to themselves.
(this answer is cross-posted on my blog)
here is a list of problems which i seek to either resolve or get around, in order to implement my formal alignment plans, especially QACI:
formal inner alignment: in the formal alignment paradigm, “inner alignment” refers to the problem of building an AI which, when run, actually maximizes the formal goal we give it (in tractable time) rather than doing something else such as getting hijacked by an unaligned internal component of itself. because its goal is formal and fully general, it feels like building something that maximizes it should be much easier than the regular kind of inner alignment, and we could have a lot more confidence in the resulting system. (progress on this problem could be capability-exfohazardous, however!)
continuous alignment: given a utility function which is theoretically eventually aligned such that there exists a level of capabilities at which it has good outcomes for any level above it, how do we bridge the gap from where we are to that level? will a system “accidentally” destroy all values before realizing it shouldn’t have done that?
blob location: for QACI, how do we robustly locate pieces of data stored on computers encoded on top of bottom-level-physics turing-machine solomonoff hypotheses for the world? see 1, 2, 3 for details.
physics embedding: related to the previous problem, how precisely does the prior we’re using need to capture our world, for the intended instance of the blobs to be locatable? can we just find the blobs in the universal program — or, if P≠BQP, some universal quantum program? do we need to demand worlds to contain, say, a dump of wikipedia to count as ours? can we use the location of such a dump as a prior for the location of the blobs?
infrastructure design: what formal-math language will the formal goal be expressed in? what kind of properties should it have? should it include some kind of proving system, and in what logic? in QACI, will this also be the language for the user’s answer? what kind of checksums should accompany the question and answer blobs? these questions are at this stage premature, but they will need some figuring out at some point if formal alignment is, as i currently believe, the way to go.
Open Problems in AI X-Risk:
https://www.alignmentforum.org/s/FaEBwhhe3otzYKGQt/p/5HtDzRAk7ePWsiL2L
Here’s Quintin Pope’s answer from the Twitter thread I posted (https://twitter.com/quintinpope5/status/1633148039622959104?s=46&t=YyfxSdhuFYbTafD4D1cE9A):
How much convergence is there really between AI and human internal representations?
1.1 How do we make there be more convergence?
How do we minimize semantic drift in LMs when we train them to do other stuff? (If you RL them to program good, how to make sure their English continues to describe their programs well?)
How well do alignment techniques generalize across capabilities advances? If AIs start doing AI research and make 20 capabilities advances like the Chinchilla scaling laws, will RLHF/whatever still work on the resulting systems?
Where do the inductive biases of very good SGD point? Are they “secretly evil”, in the sense that powerful models convergently end up as deceptive / explicit reward optimizers / some other bad thing?
4.1 If so, how do we stop that?
How should we even start thinking about data curation feedback loops? If we train an LM, then have the LM curate / write higher quality training data for its successor, and repeat this process many times, what even happens? What types of attractors can arise here?
5.1 How do we safely shape such a process? We want the process to enter stable attractors along certain dimensions (like “in favour of humanity”), but not along others (like “I should produce lots of text that agents similar to me would approve of”).
What are the limits of efficient generalization? Can plausible early TAI generalize from “all the biological data humans gathered” to “design protein sequences to build nanofactory precursors”?
Given a dataset that can be solved in multiple different ways, how can we best influence the specific mechanism the AI uses to solve that dataset?
7.1 like this? arxiv.org/abs/2211.08422
7.2 or this? https://openreview.net/forum?id=mNtmhaDkAr
7.3 or how about this? https://www.lesswrong.com/posts/rgh4tdNrQyJYXyNs8/qapr-3-interpretability-guided-training-of-neural-nets
How to best extract unspoken beliefs from LM internal states? Basically ELK for LMs. See: https://github.com/EleutherAI/elk
What mathematical framework best quantifies the geometric structure of model embedding space? E.g., using cosine similarity between embeddings is bad because it’s dominated by outlier dims and doesn’t reflect distance along the embedding manifold. We want math that more meaningfully reflects the learned geometry. Such a framework would help a lot with questions like “what does this layer do?” and “how similar are the internal representations of these two models?”
How do we best establish safe, high bandwidth, information-dense communication between human brains and models? This is the big bottleneck on approaches like cyborgism, and includes all forms of BCI research / “cortical prosthesis” / “merging with AI”. But it also includes things like “write a very good visualiser of LM internal representations”, which might allow researchers a higher-bandwidth view of what’s going on in LMs beyond just “read the tokens sampled from those hidden representations”.
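A minimal sketch of the cosine-similarity point from the embedding-geometry question above. The vectors are made-up toy values and the `cosine` helper is an illustrative assumption, not taken from any real model. It only shows the "dominated by outlier dims" half of the claim, not the part about distance along the embedding manifold.

```python
import math


def cosine(u, v):
    """Plain cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical 8-dim "embeddings": the first 7 dims point in exactly
# opposite directions, while the last dim is a shared large-magnitude
# outlier dimension.
a = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0, 1.0, 50.0]
b = [-1.0, 1.0, -1.0, 1.0, -1.0, 1.0, -1.0, 50.0]

full = cosine(a, b)                      # all 8 dims, outlier included
without_outlier = cosine(a[:-1], b[:-1])  # first 7 dims only

print(f"with outlier dim:    {full:.3f}")             # 0.994
print(f"without outlier dim: {without_outlier:.3f}")  # -1.000
```

The two toy vectors disagree on every ordinary dimension, yet the single shared outlier dimension makes them look almost identical under cosine similarity.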
What is the correct “object of study” for alignment researchers in understanding the mechanics of a world immediately before and during takeoff? A good step in this direction is the work of Alex Flint and Shimi’s UAO.
What form does the correct alignment goal take? Is it a utility function over a region of space, a set of conditions to be satisfied or something else?
Mechanistically, how do systems trained primarily on token frequencies appear to be capable of higher level reasoning?
How likely is the emergence of deceptively aligned systems?
Some braindumping, took me a while, many passes of editing in loom to see if I’d missed something—I rejected almost every loom branch though, this is still almost all my writing, sometimes the only thing I get from loom is knowing what I don’t intend to say:
What properties are easy to prove through large physical systems without knowing their internals? Are any of those properties selection theorems? Can I make an AI that segments real space in a way that allows me to prove that a natural abstraction is maintained through it?
Can we structure ai architectures so we have a guaranteed margin of natural abstraction? How much of existing physics knowledge can I hardcode safely, given that eventually the AI must do physics research and generalize correctly?
How can we become able to trade with ants?
What’s the deal with the game theory between GAN generator and GAN discriminator, and how does it compare to the reason why diffusion beats GANs? Is there anything relevant to how to encode a utility function about the fact that diffusion is built out of noise-resistance, same as bio life has to be?
Can we build models of any of this in a less nonlinear simulator than quantum that adds properties not found in classical cellular automata? eg, I’m excited about particle lenia—what would it look like to build a test case for the game theory of thermodynamic coprotection in a lenia world? perhaps it needs more refinement?
What does a deep learning version of the discovering agents (causal discovery of systems being moved by reasons) algorithm look like? How do I actually run discovering agents on a language model, right now?
How do I have to add conditions that limit generality of formal statements in order to build a connected manifold of conditional statements that fully cover the behavior manifold of co-protective behavior? Can I make statements of co-protection knowing only what an agent is, not what a person is, and yet trust that the diffusion agency of the self-healing process will be maintained?
Am I correct that humanity has a moral obligation to become more efficient per watt in order to make room for more beings? What is the fair tradeoff of how much smarter per watt different beings are allowed to be before it’s moral to start a war about it? Seems like it’s probably a pretty wide window, but maybe there’s some ratio where one ai is obligated to attack another stronger one on behalf of a weaker one or something? I hope this does not occur and am interested in analyzing it to ensure we can build defenses against it.
How does having infinite statements in your game-tree reasoning process (instead of a strictly finite game-tree) affect a self-modifying diffusion player with both symbolic neural models in ensemble? what is the myopic behavior of a diffusion model? [the most loom contribution to this one, and it shows, I find it less crisp than the others, which are themselves not the most crisp]
My current sense is I will be the one to answer exactly none of these. But who knows! anyway, here’s some. I think I have more knocking around somewhere in my head and/or my previous comments.