Apparently, MIRI has given up on its current mainline approach to understanding agency and is trying to figure out what to do next. It seems like it might be worthwhile to collect some alternative approaches to the problem. After all, intelligence and agency feature in pretty much all areas of human thought and action, so the space of possible ways to make progress should be pretty vast. By no means is it exhausted by the mathematical analysis of thought experiments! What are people’s best ideas?
(By ‘understanding agency’ I mean research that is attempting to establish a better understanding of how agency works, not alignment research in general. So IDA would not be considered agent foundations, since it takes ML capabilities as a black box.)
ETA: I originally wrote ‘agent foundations’ in place of ‘understanding agency’ in the above, which was ambiguous between a broad sense of the term (any research aimed at obtaining a foundational understanding of agency) and a narrow sense (the set of research directions outlined in the agent foundations agenda document). See this comment by Rob re: MIRI’s ongoing work on agent foundations (narrow sense).
Memetics. Maybe agents are to memes as ships are to seafarers.
--Maybe memes will be prevalent in highly capable minds; at least, maybe some of the first powerful AIs we build will be meme-vessels like we are.
--Memetics is an entirely new way of understanding minds and agency. It is severely understudied as far as I can tell; the only academic work on the topic that I know of is Blackmore’s The Meme Machine from twenty years ago. This should give us some reason to hope that significant progress can be made quickly.
--A better understanding of memetics might also have significant benefits for things other than AI alignment, such as helping us steer the public discourse around AI risk, helping us be more rational and truth-seeking in general, helping us predict a potential decay of public epistemology, etc.
This has long been my suspicion. Combine meme theory with a multi-agent model of mind and it starts to look like the occult concept of “egregores” is right after all: distributed agents composed of many separate programs running on many separate human minds, coordinating via communication and ritualistic behaviors, control most of the world. Corporations and gods are two obvious examples.
Reverse-engineer neural networks.
Develop a theory of how neural networks function, then apply that theory either to align neural networks directly or to develop an alternative approach to creating AGI that is easier to align. This seems more promising than trying to develop new foundations from scratch, since we already know that neural networks do intelligence-like optimization; the challenge is just figuring out why.
It seems a bit arrogant to just say “what I’ve been working on,” but on the other hand, the things I’ve been working on have obviously often been my best ideas!
Right now I’m still thinking about how to allow for value specification in hierarchical models. There are two flanks to this problem: the problem of alien concepts and the problem of human underdetermination.
The problem of alien concepts is relatively well-understood: we want the AI to generalize in a human-like way, which runs into trouble if there are “alien concepts” that predict the training data well but are unsafe to try to maximize. Solving this problem looks like skillful modeling of an environment that includes humans, progress in interpretability, and better learning from human feedback.
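To gesture at that last ingredient, here is a minimal sketch of learning a reward model from pairwise human comparisons with a Bradley-Terry-style logistic model. Everything in it is a toy assumption (a linear reward over made-up feature vectors, a simulated human labeller); it is meant only to show the shape of the technique, not anyone’s actual setup.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup (all of it assumed for illustration): trajectories are summarised
# by feature vectors, and the reward model is linear in those features.
n_features, n_pairs = 5, 2000
true_w = rng.normal(size=n_features)   # stands in for the human's reward; only used to simulate labels
a, b = rng.normal(size=(2, n_pairs, n_features))

# Simulated human labeller: prefers whichever option has higher true reward.
swap = (b @ true_w) > (a @ true_w)
a[swap], b[swap] = b[swap].copy(), a[swap].copy()   # make `a` the preferred option in every pair

# Bradley-Terry / logistic preference model: P(a preferred over b) = sigmoid(r(a) - r(b)).
w = np.zeros(n_features)
lr = 0.1
for step in range(500):
    diff = (a - b) @ w
    p = 1.0 / (1.0 + np.exp(-diff))                      # predicted preference probabilities
    grad = ((1.0 - p)[:, None] * (a - b)).mean(axis=0)   # gradient of the mean log-likelihood
    w += lr * grad                                       # gradient ascent

# The learned reward weights should correlate strongly with the "true" ones.
print(np.corrcoef(true_w, w)[0, 1])
```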
The problem of human underdetermination is a bit less appreciated: human behavior underdetermines a utility function, in the sense that you could fit many utility functions to human behavior, all equally well. But there’s simultaneously a problem of human inconsistency with intuitive desiderata. Solving this problem looks like finding ways to model humans that strike a decent balance between our incompatible desiderata, or ways to encode and insert our desiderata to avoid “no free lunch” problems in general models of environments that contain humans. Whereas a lot of good progress has been made on the problem of alien concepts using fairly normal ML methods, I think the problem of human underdetermination requires a combination of philosophy, mathematical foundations, and empirical ML research.
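To make the underdetermination point concrete with a toy example (the numbers and the Boltzmann-rationality assumption are mine, purely for illustration): rescaling and shifting a utility function, while adjusting the rationality parameter, leaves a Boltzmann-rational agent’s predicted choice frequencies exactly unchanged, so behavioral data alone cannot tell the two utility functions apart.

```python
import numpy as np

def boltzmann_policy(utilities, beta):
    """Choice probabilities of a Boltzmann-rational agent: P(a) proportional to exp(beta * U(a))."""
    logits = beta * np.asarray(utilities, dtype=float)
    logits -= logits.max()            # subtract max for numerical stability (doesn't change the result)
    probs = np.exp(logits)
    return probs / probs.sum()

u1 = np.array([1.0, 2.0, 4.0])        # one candidate utility function over three options
u2 = 2.0 * u1 + 10.0                  # a different utility function: rescaled and shifted

p1 = boltzmann_policy(u1, beta=1.0)
p2 = boltzmann_policy(u2, beta=0.5)   # with a suitably adjusted rationality parameter...

print(p1, p2)
assert np.allclose(p1, p2)            # ...both predict exactly the same choice frequencies
```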
Try to solve metaphilosophy
This is largely Wei Dai’s idea. We have (computationally unbounded) models like AIXI that can be argued to capture many key aspects of human reasoning and action. But one thing they don’t capture is philosophy: our inclination to ask questions such as ‘what is reasoning, really?’ in the first place. We could try to come up with a model that could be argued to ‘do philosophy’ in addition to things like planning and reasoning. This seems promising since philosophical ability is a really glaring gap in our current models; in particular, they can’t currently account for our desire to come up with good models!
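For reference, here is (roughly, following Hutter’s formulation) the standard AIXI action rule, which picks actions by planning in a simplicity-weighted Bayes mixture over all computable environments:

$$a_k \;=\; \arg\max_{a_k} \sum_{o_k r_k} \cdots \max_{a_m} \sum_{o_m r_m} \big(r_k + \cdots + r_m\big) \sum_{q \,:\, U(q,\, a_1 \ldots a_m) \,=\, o_1 r_1 \ldots o_m r_m} 2^{-\ell(q)}$$

Here $U$ is a universal monotone Turing machine, $q$ ranges over environment programs, $\ell(q)$ is the length of $q$, and $m$ is the horizon. Nothing in this expression corresponds to the agent asking whether expected-reward maximization is the right thing to be doing in the first place; that is the gap a model of ‘doing philosophy’ would have to fill.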
“Bottom-up” agent foundations
To date, agent foundations work has taken a “top-down” approach: start with highly idealized agents and gradually try to make them more realistic. Instead, you could proceed in the opposite direction: start by making models of very simple optimizing systems we actually see in the world, then try to extrapolate that theory to very powerful agents. At first, this might look like some sort of ‘theoretical biology’, attempting to create models of viruses, then cells, then multicellular organisms. This might be followed up by a ‘theoretical sociology’ of some kind. Of course, much of the theory wouldn’t generalize to AI, but it’s plausible that some of it would: for example, anti-cancer mechanisms can be seen as solving a subagent alignment problem. I think this is promising because it engages what we care about (optimization pressure in the real world) more directly than approaches based on idealized agents, and there is lots of data to test ideas on.
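As a gesture at what the very first rung of that ‘theoretical biology’ might look like, here is a toy sketch (the modelling choice and all numbers are purely illustrative): replicator dynamics, in which selection among competing variants already acts as a minimal optimization process.

```python
import numpy as np

# Replicator dynamics for competing variants: dx_i/dt = x_i * (f_i - mean fitness),
# integrated with a crude Euler step. Fitness values are made up for illustration.
fitness = np.array([1.0, 1.3, 0.9])   # per-variant fitness
x = np.array([0.8, 0.1, 0.1])         # initial population shares

dt = 0.01
for _ in range(2000):
    mean_fitness = x @ fitness
    x = x + dt * x * (fitness - mean_fitness)
    x = x / x.sum()                   # renormalise to keep shares summing to one

print(x)   # the fitter variant comes to dominate: selection as a simple optimizer
```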
Agent Meta-Foundations
I argue for this approach on the basis that Agent Foundations is really hard to do right, and that the difficulties we have run into suggest a need to go right back to the philosophical foundations to confirm they are on solid ground.