I think I’m probably missing the point here somehow and/or that this will be perceived as not helpful. Like, my conceptions of what you mean, and what the purpose of the theorem would be, are both vague.
But I’ll note down some thoughts.
Next, the world model. As with the search process, it should be a subsystem which interacts with the rest of the system/environment only via a specific API, although it’s less clear what that API should be. Conceptually, it should be a data structure representing the world.
(...)
The search process should be able to run queries on the world model
Problems can often be converted into other problems. This can happen both for the top-level problem and recursively for sub-problems. One example of this is how, by definition, any problem in NP can be converted into any NP-complete problem in polynomial time:
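To make the "converting problems into other problems" point concrete, here is a minimal sketch of a classic polynomial-time conversion (not from the original comment, just an illustration): a graph has an independent set of size k exactly when it has a vertex cover of size n − k, so a question about one problem can be answered by solving the other.

```python
# Toy illustration of converting one problem into another:
# an n-vertex graph has an independent set of size k iff it has
# a vertex cover of size n - k (the complement vertex set).

from itertools import combinations

def has_independent_set(edges, n, k):
    """Brute-force: does the n-vertex graph have an independent set of size k?"""
    return any(
        all(not (u in s and v in s) for u, v in edges)
        for s in map(set, combinations(range(n), k))
    )

def has_vertex_cover(edges, n, k):
    """Brute-force: does the graph have a vertex cover of size k?"""
    return any(
        all(u in s or v in s for u, v in edges)
        for s in map(set, combinations(range(n), k))
    )

# A 4-cycle: 0-1-2-3-0.
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
n = 4

# The "conversion": asking about an independent set of size k is
# the same question as asking about a vertex cover of size n - k.
for k in range(n + 1):
    assert has_independent_set(edges, n, k) == has_vertex_cover(edges, n, n - k)
```

The conversion step itself (complementing the vertex set) is trivial and polynomial-time, even though solving either problem is hard in general, which is the sense in which work on one problem can be "moved" to another.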
And as humans we are fairly limited in terms of finding and leveraging abstractions like these. What we are able to do (in terms of converting tasks/problems into more “abstracted” tasks/problems) ≠ what’s possible to do.
So then it’s not strictly necessary to be able to do search in a world model? Very powerful optimizers may be able to get by while being restricted to searching within models that aren’t world models (after having converted whatever it is they want to maximize into something more “abstract”, or into a problem that corresponds to a different world/ontology, be that wholesale or in “chunks”).
I was browsing your posts just now, partly to see if I could get a better idea of what you mean by the terms you use in this post. And I came across What’s General-Purpose Search, And Why Might We Expect To See It In Trained ML Systems?, which seems to describe either the same phenomenon as what I’m trying to point to, or at least something similar/overlapping. And it’s a good post too (it explains various things better than I can remember hearing/reading elsewhere). So that increases my already high odds that I’m missing the point somehow.
But, depending on what the theorem would be used for, the distinction I’m pointing to could maybe make an important difference:
For example, we may want to verify that certain capabilities are “aligned”. Maybe we have AIs compete to make functions that do some specialized task as effectively/optimally as possible, as measured by various metrics.
Some specialized tasks may be tasks where we can test performance safely/robustly, while for other tasks we may only be able to do that for some subset of all possible outputs/predictions/proposals/etc. But we could for example have AIs compete to implement both of these functions with code that overlaps as little as possible[1].
For example, we may want functions that predict human output (e.g. how humans would answer various questionnaires, based on info about those humans). But we may not be able/willing to test the full range of predictions that such functions make (e.g., we may want to avoid exposing real humans to AGI-generated content). However, possible ways to implement such functions may have a lot of overlap with functions where we are able/willing to test the full range of predictions. And we may want to enforce restrictions/optimization-criteria such that it becomes hard to make functions that (1) get maximum score and (2) return wrong output outside of the range where we are able/willing to test/score output and (3) don’t return wrong/suboptimal output inside of the range where we are able/willing to test/score output.
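One toy way the “code that overlaps as little as possible” criterion above could be made concrete (all names and representations here are hypothetical, and this is only one possible reading of the scheme): represent each candidate implementation as a prediction function plus a set of “component” identifiers, score it only on inputs we can safely test, and penalize overlap with the sibling (fully testable) task’s implementation.

```python
# Purely illustrative sketch (all names/representations hypothetical):
# treat each candidate implementation as (predict_fn, set of component ids),
# score it on inputs we can safely test, and penalize shared machinery
# with the implementation of the fully-testable sibling task.

def jaccard(a: set, b: set) -> float:
    """Fraction of shared components between two implementations."""
    return len(a & b) / len(a | b) if (a | b) else 0.0

def score(candidate, testable_inputs, labels, sibling_components, penalty=0.5):
    predict, components = candidate
    accuracy = sum(
        predict(x) == y for x, y in zip(testable_inputs, labels)
    ) / len(testable_inputs)
    # Reward accuracy where we *can* check; discourage shared machinery
    # that could hide coordinated wrong answers outside the testable range.
    return accuracy - penalty * jaccard(components, sibling_components)

# Toy usage: two candidates with identical accuracy on the testable range
# but different overlap with the sibling task's implementation.
sibling = {"parser", "embedder", "decoder"}
cand_a = (lambda x: x % 2, {"parser", "embedder", "lookup"})   # high overlap
cand_b = (lambda x: x % 2, {"regex", "table", "lookup"})       # low overlap
xs, ys = [1, 2, 3, 4], [1, 0, 1, 0]
assert score(cand_b, xs, ys, sibling) > score(cand_a, xs, ys, sibling)
```

Real implementations wouldn’t decompose neatly into a set of named components, so this only gestures at the shape of the optimization criterion, not at how overlap would actually be measured.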
To be clear, I wouldn’t expect world models to always/typically be abstracted/converted before search is done if what we select for simply is systems that do “the final task we are interested in” as effectively/optimally as possible, and we try to score/optimize/select for that in the most straightforward way we can (when “training” AIs / searching for AI-designs). But there may sometimes be merit to actively trying to obtain/optimize systems that “abstract”/convert the model before search is done.
As well as optimizing for other optimization-criteria that incentivize the task being “abstracted”/converted (and most of the work being done on models that have been “abstracted”/converted).