“Bottom-up” agent foundations
To date, agent foundations work has taken a “top-down” approach: start with highly idealized agents and gradually try to make them more realistic. Instead, you could proceed in the opposite direction: start by building models of the very simple optimizing systems we actually see in the world, then try to extrapolate that theory to very powerful agents. At first, this might look like a kind of ‘theoretical biology’, building models of viruses, then cells, then multicellular organisms, perhaps followed by a ‘theoretical sociology’ of some kind. Of course, much of the theory wouldn’t generalize to AI, but it’s plausible that some of it would: for example, anti-cancer mechanisms can be seen as solving a subagent alignment problem. I think this direction is promising because it engages what we care about (optimization pressure in the real world) more directly than approaches based on idealized agents, and because it offers plenty of real-world data against which to test ideas.
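To give a flavor of what a model of a “very simple optimizing system” might look like, here is a minimal sketch (purely illustrative, not from the original text) of discrete replicator dynamics: a population drifts toward higher-fitness variants without any component explicitly representing a goal. The fitness values and iteration count below are arbitrary assumptions chosen for the example.

```python
import numpy as np

# Toy replicator dynamics: one of the simplest optimizing processes observed
# in nature. Selection concentrates the population on the fittest variant
# even though nothing in the system represents that outcome as a goal.
# (Fitness values and step count are illustrative assumptions.)

fitness = np.array([1.0, 1.2, 0.8])   # fitness of each variant
x = np.array([0.6, 0.2, 0.2])         # initial population frequencies (sum to 1)

for _ in range(50):
    mean_fitness = x @ fitness
    # Discrete replicator update: a variant's frequency grows in proportion
    # to its fitness relative to the population average.
    x = x * fitness / mean_fitness

print(x)  # mass concentrates on the highest-fitness variant
```

Even a toy like this exhibits the feature we ultimately care about, optimization pressure arising in the world, and the question is which of its properties survive as the systems get more capable.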