I see this proposal as reducing the level of deep understanding of neural networks that would be required to have an “agentometer”.
If we had a way of iterating over every “computation” in the world model, then in principle, we could use the definition of intelligence above to measure the intelligence of each computation, and filter out all the low-intelligence ones. I think this covers most of the work required to identify the operator.
Working out how to iterate over every computation in the world model is the difficult part. We could try iterating over subnetworks of the world model, but it's not clear this would work. Maybe iterate over pairs of regions in activation space? Of course these are not practical spaces to search over, but once we know the right type signature to search for, we can probably speed up the search by developing heuristic-guided methods.
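To make the shape of that search concrete, here is a minimal sketch in Python, with `regions` and `g` both standing in for things we don't yet know how to build. A heuristic-guided version would replace the exhaustive loop with something smarter.

```python
from itertools import permutations

def agent_like_computations(regions, g, threshold):
    """Brute-force version of the search described above.

    `regions` stands in for whatever representation of activation-space
    regions we end up with; a candidate "computation" is taken to be an
    ordered pair (src, dst) of regions, one possible type signature to
    search over. `g` is the hypothetical intelligence measure: it scores
    a candidate by its competence at achieving goals, not by its
    structure.
    """
    candidates = []
    for src, dst in permutations(regions, 2):
        score = g(src, dst)
        if score > threshold:  # filter out the low-intelligence ones
            candidates.append((src, dst, score))
    # Return the surviving candidates, highest-scoring first.
    return sorted(candidates, key=lambda c: c[2], reverse=True)
```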
An agentometer is presumably a thing that finds stuff that looks like (some mathematically precise operationalisation of) bundles-of-preferences-that-are-being-optimised-on. If you have that, can't you just look through the set of things like this in the AI's world model that are active when it's, say, talking to the operator, or looking at footage of the operator on camera, or doing anything else that probably requires thinking about the operator in some fashion, and point at the bundle of preferences that gets lit up by that?
Yeah this is approximately how I think the “operator identification” would work.
Is the fear here that the AI may eventually stop thinking of the operator as a bundle-of-preferences-that-are-optimised-on, i.e. an “agent” at all, in favour of some galaxy-brained superior representation only a superintelligence would come up with?
Yeah this is one of the fears. The point of the intelligence-measuring equation for g is that it is supposed to work even on galaxy-brained world-model ontologies. It works only by measuring a computation's competence at achieving goals, not by looking at the structure of the computation for “agentyness”.
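For flavour, one existing measure with roughly this shape (not necessarily the exact g referenced above) is Legg and Hutter's universal intelligence:

$$g(\pi) \;=\; \sum_{\mu \in E} 2^{-K(\mu)} \, V^{\pi}_{\mu}$$

where $E$ is a class of environments, $K(\mu)$ is the complexity of environment $\mu$, and $V^{\pi}_{\mu}$ is the expected return of policy $\pi$ in $\mu$. Nothing in the expression inspects the internals of $\pi$; it scores only how well $\pi$ does, which is the property being relied on here.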
it can even spit out complete agent policies for you to run and modify
These can be computations that aren't very agenty, or don't match an agent at all, or only match part of an agent, so the part that spits out potential policies doesn't have to be very good. The g computation is used to find, among these, the ones that best match an agent.
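A minimal sketch of how those two stages might fit together, with `propose_policies` and `g` both placeholders for components we don't actually have:

```python
def best_agent_matches(world_model, propose_policies, g, k=10):
    """Two-stage search: a cheap, possibly sloppy proposal step,
    followed by selection using the g computation.

    `propose_policies(world_model)` yields candidate policies pulled out
    of the world model; these are allowed to be barely agenty, fragments
    of an agent, or not agents at all. `g(policy)` does the real
    filtering work of finding the ones that best match an agent.
    """
    scored = [(g(policy), policy) for policy in propose_policies(world_model)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]  # the k candidates that look most like agents
```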