johnswentworth answers What’s in your list of unsolved problems in AI alignment?

johnswentworth 9 Mar 2023 0:39 UTC
27 points
4
I’m going to answer a different question: what’s my list of open problems in understanding agents? I claim that, once you dig past the early surface-level questions about alignment, basically the whole cluster of “how do agents work?”-style questions and subquestions forms the main barrier to useful alignment progress. So with that in mind, here are some of my open questions about understanding agents (and the even deeper problems one runs into when trying to understand agents), going roughly from “low-level” to “high-level”.
- How does abstraction work?
  - How can we efficiently compute natural abstractions...
    … in simulations or toy models (i.e. from an explicit low-level model)?
    … from only data or other interactions with the environment?
  - Inverse problem of abstraction: humans often learn higher-level abstract models before lower-level models (e.g. Newtonian physics before quantum). How does that work?
    How can we efficiently represent or reason about the class of low-level models compatible with a given abstract model?
  - What are the natural data structures for representing natural abstractions?
    What constraints or selection pressures does massive parallelism place on factorizations?
    What representations are convergent for realistic/typical cognitive systems which arise from selection pressure?
    What are the key properties of the training/evolutionary environment and initial conditions which make them convergent?
    Are there relevant mathematical senses of “naturality” of representations other than convergence?
    What factors in the environment make certain representations/factorizations “natural” in senses other than convergence?
    How is indexical uncertainty handled?
    What are the natural “atoms” from which abstraction-representations are built (e.g. circuits, do-operators, ???)
- How can we generalize principles from thermodynamics/stat mech to talk about agents more generally?
  - What conditions allow us to make thermo-style arguments (like e.g. the Generalized Heat Engine) without relying on reversibility (or, better yet, without thinking in terms of dynamic systems at all)?
    What’s the right language for relativistic thermo, or thermo on causal models more generally?
  - What’s the right language to talk about chaos-like phenomena (i.e. “loss” of macroscopic information due to sensitivity) over space-like separation rather than time-like separation?
    Narrower problem: what’s the right language to talk about chaos-like phenomena over time, but without time-symmetric “laws of physics”?
  - How far do Maxwell’s Demon-style arguments generalize for talking about embedded agents?
  - What can the existing quantitative theory of phase transitions generalize to tell us about bits-of-optimization required to change the value of a naturally-abstract variable (like e.g. temperature of some object)?
    What’s the right language in which to talk about that, once we’re no longer relying on reversibility or thinking in terms of dynamic systems?
  - How far do maxentropic distributions generalize to distributions of low-level state given natural abstract summaries?
    What’s the form of the constraints in the relevant maximum entropy problems, and why that form?
  - What are the quantitative conditions for spontaneous self-amplification of natural abstractions, again once we’re no longer relying on reversibility or thinking in terms of dynamic systems?
- What’s up with boundaries and modularity?
  - To what extent do boundaries/modules typically exist “by default” in complex systems, vs require optimization pressure (e.g. training/selection) to appear?
  - Why are biological systems so modular? To what extent will that generalize to agents beyond biology?
  - How modular are trained neural nets? Why, and to what extent will it generalize?
  - What is the right mathematical language in which to talk about modularity, boundaries, etc?
  - How do modules/boundaries interact with thermodynamics—e.g. can we quantify the negentropy/bits-of-optimization requirements to create new boundaries/modules, or maintain old ones?
  - Can we characterize the selection pressures on transboundary transport/information channels in a general way?
  - To what extent do agents in general form internal submodules? Why?
- To what extent do various phenomena of biology generalize to other kinds of agenty systems?
  - What’s the right language in which to talk about self-reproducing patterns (“tilers”) in general (e.g. bacteria, but also memes, transposons, hypothetical transposon-like phenomena in neural nets, etc)?
    How do we talk about this in an embedded way, e.g. how can we identify a self-reproducing pattern embedded in an environment in general?
  - To what extent do agenty systems and/or tilers in general rely on combinatorial constructions from standardized parts (e.g. DNA/RNA/proteins in biology)? Why?
  - To what extent will agents and/or tilers have a component specialized in representing a specification of the agent (e.g. DNA in biological systems or the string used in a typical quine)?
  - Why don’t sessile organisms (i.e. organisms with no controlled movement) cephalize (i.e. evolve nervous systems)? How general is this property, in the mathematical space of agent-like things?
  - Precisely what selection pressures induce internal information channels, information processing capacity, feedback, memory, etc? Precisely what internal structures do the external pressures select for (e.g. what specific circuits etc)?
  - To what extent is an internal active immune system a general property of agency? What is the right mathematical language in which to talk about that question?
  - What’s the right way to generalize the level of abstraction which biologists call “morphology” to radically different systems, like e.g. deep learning systems? What qualitative phenomena of morphology generalize?
- What’s up with mesa-optimization?
  - When and to what extent do mesa-optimizers show up?
  - What’s the right mathematical language to talk about both inner and outer optimization processes (e.g. are the type signatures of the two the same)?
  - How will the inner objective relate to the outer objective?
  - What kind of inner search-process will typically show up? What are its parts and their type signatures?
  - How can we detect inner optimizers embedded in a system?
  - How can we detect outer optimization processes embedded in a system?
- What’s up with “shards” and factorization?
  - What are natural (e.g. convergent) ways of factoring the problems which an agenty system faces? What’s the right mathematical language in which to talk about that question?
  - How do natural factorizations of problems correspond to natural epistemic factorizations (i.e. natural abstraction)?
  - To what extent are internal components/modules of an agent selected to “correspond to” natural factors of the problems faced by the agent?
  - What is that correspondence?
  - How do “shards” corresponding to different problem-factors interact with each other? Is there convergent structure to cross-shard interfaces?
- What’s up with consistency pressure?
  - When, and to what extent, will different parts of an agenty system be selected to make similar trade-offs?
  - What’s the right data structure to represent parts of an agent making similar trade-offs? E.g. utility function, expected utility, market, ???
    Insofar as there’s a “shared goal” across the parts, what’s the input type, i.e. what kinds-of-things does the system “want”?
    In particular, how well do the things-the-system-wants map to natural abstractions?
    More generally, how do inputs-to-the-“goals” correspond to things-in-the-environment (in other words, how do we model and/or solve the pointers problem)?
  - To what extent is general-purpose search required in order for “goals” to make sense at all, vs “goals” arising entirely via consistency pressure?
  - How can we measure consistency pressure both in an environment, and by looking at the internals of a trained/evolved system?
  - What decision theory is convergent?
  - What are the necessary conditions for a measuring stick of utility, or some generalization thereof? How can we detect one embedded in an environment?
  - Is there a universal measuring stick of utility (maybe e.g. negentropy)?
- What’s up with general-purpose search?
  - When, and to what extent, does general-purpose search show up in agenty systems?
  - What general-purpose search algorithm(s) is/are convergent?
  - What are the type signatures of goals, knowledge, and other intermediate data structures passed around within the convergent general-purpose search algorithm(s)?
    To what extent is symbolic representation convergent, as a way of passing information around within general-purpose search processes?
    What’s the right mathematical language to talk about correspondence between internal “symbols” in a general-purpose search process and the things-in-the-environment to which those symbols (approximately) correspond?
  - How can we detect general-purpose search embedded in a system?
  - How can we extract the (maybe implicit and/or lazily represented) goal, knowledge, and other internal data structures from a general-purpose search process embedded in a system?
    How do we practically map those structures to things-in-the-environment which they presumably represent/talk about?
  - To what extent is general-purpose world-modeling a convergent and separate component of general-purpose search?
    How is the world model represented?
    How does the world model interact with everything else?
  - How does the convergent general-purpose search/modeling algorithm handle embeddedness and self-modeling?
    What’s the right mathematical language in which to talk about embedded agents, especially with self-models?
- What’s up with language?
  - When, and to what extent, does language show up in agenty systems?
  - When, and to what extent, will systems be selected to bind linguistic symbols (e.g. spoken/written words) to the same “symbols” used internally by a general-purpose search or, more generally, the same “symbols” used internally for cross-module communication?
  - Based on convergent data structures for internal representations, what’s the natural mathematical language to represent the semantics of natural language?
- What’s up with trade, markets, and firm-level selection pressures?
  - What’s a resource?
  - Mathematically, what does it mean to “own” or “control” something, in a way which plays well with embeddedness (i.e. ownership can’t be ontologically fundamental)?
  - Quantitatively, what selection pressures are produced by markets? When and to what extent do they reproduce all the phenomena covered by earlier questions, but for firms rather than bacteria/neural nets/etc?
  - What factors determine convergent firm size and structure (i.e. what things are done by employees vs outsourced, who talks to who about what, management style, etc)?
    To what extent is firm size/structure convergent under market selection pressures vs determined by other things?
    What parts of all this will generalize to “firms” of agents very different from humans?
  - To what extent can we model the information-carrying function of prices separately from the incentive/bargaining role of prices?
    To what extent are market-like internal structures convergent within agenty systems even without bargaining, e.g. within bacteria?
- What’s up with Schelling problems?
  - To what extent do Schelling-style problems convergently induce phenomena similar to politics (i.e. fighting over control of Schelling points like laws or norms), governments (i.e. monopoly control over certain Schelling points like laws, capability-of-violence and property ownership), wars (i.e. fights between groups/governments, usually over control of Schelling points like borders or laws and notably usually NOT to extinction), etc, amongst agents very different from humans?
  - What are natural Schelling points among minds very different from humans?
    Something something natural abstractions? Natural boundaries?
    To what extent are there convergent game-theoretic norms/standards for interaction among non-human minds?
  - What selection pressures act on/within governments and government-like structures?
    What selection pressures determine the convergent type of government, e.g. feudal vs democratic, socialist vs capitalist, etc? How will this generalize to government-like structures amongst non-human agents?
What links here?
- Shallow review of technical AI safety, 2024 by technicalities (29 Dec 2024 12:01 UTC; 180 points)
- «Boundaries/Membranes» and AI safety compilation by Chipmonk (3 May 2023 21:41 UTC; 57 points)
- Roman Leventov 12 Jun 2023 12:11 UTC
  3 points
  0
  To what extent do boundaries/modules typically exist “by default” in complex systems, vs require optimization pressure (e.g. training/selection) to appear?
  Dalton Sakthivadivel showed here that boundaries (i.e., sparse couplings) do exist and are “ubiquitous” in high-dimensional (i.e., complex) systems.
- Roman Leventov 12 Jun 2023 12:46 UTC
  3 points
  0
  Why are biological systems so modular? To what extent will that generalize to agents beyond biology?
  See section 3 (“Optimization and Scale Separation in Evolving Systems”) of “Toward a theory of evolution as multilevel learning” (Vanchurin et al., 2022).
  Also, see Michael Levin’s work on “multiscale competency architectures”. Fields, Levin, et al. apply this framework to ANNs in “The free energy principle induces neuromorphic development” (2022); see sections 2 and 4 in particular. This paper also addresses the question “How do modules/boundaries interact with thermodynamics—e.g. can we quantify the negentropy/bits-of-optimization requirements to create new boundaries/modules, or maintain old ones?”
- Roman Leventov 12 Jun 2023 13:05 UTC
  3 points
  1
  What is the right mathematical language in which to talk about modularity, boundaries, etc?
  I think this is an ill-posed question. Boundaries and modularity could be discussed in the context of different mathematical languages/frameworks: quantum mechanics, random dynamical systems formalism, neural network formalism, whatever. All these mathematical languages permit talking about information exchange, modularity, and boundaries. Cf. this comment.
  Even if we reformulate the question as “Which mathematical language permits identifying boundaries [of a particular physical system, because asking this question in the abstract for any system also doesn’t make sense] most accurately?”, the answer probably depends on the meta-theoretical (epistemological) framework that the scientist asking the question applies to themselves.