Someone more versed in this line of research clue me in please: Conditional on us having developed the kind of deep understanding of neural networks and their training implicit in having “agentometers” and “operator recognition programs” and being able to point to specific representations of stuff in the AGIs’ “world model” at all, why would we expect picking out the part of the model that corresponds to human preferences specifically to be hard and in need of precise mathematical treatment like this?
An agentometer is presumably a thing that finds stuff that looks like (some mathematically precise operationalisation of) bundles-of-preferences-that-are-being-optimised-on. If you have that, can’t you just look through the set of things like this in the AI’s world model that are active when it’s, say, talking to the operator, or looking at footage of the operator on camera, or doing anything else that would probably require thinking about the operator in some fashion, and point at the bundle of preferences that gets lit up by that?
Is the fear here that the AI may eventually stop thinking of the operator as a bundle-of-preferences-that-are-optimised-on, i.e. an “agent” at all, in favour of some galaxy brained superior representation only a superintelligence would come up with? Then I’d imagine your agentometer would stop working too, since it’d no longer recognise that representation as belonging to something agentic. So the formula for finding the operator utility function, which relies on the operator being in the set of stuff with high g your agentometer found, wouldn’t work anymore either.
It kind of seems to me like all the secret sauce is in the agentometer part here. If that part works at all, to the point where it can even spit out complete agent policies for you to run and modify, like your formula seems to demand, it’s hard for me to see why it wouldn’t just be able to point you to the agent’s preferences directly as well. The Great Eldritch Powers required for that seem, if anything, lesser to me.
I see this proposal as reducing the level of deep understanding of neural networks that would be required to have an “agentometer”.
If we had a way of iterating over every “computation” in the world model, then in principle, we could use the definition of intelligence above to measure the intelligence of each computation, and filter out all the low-intelligence ones. I think this covers most of the work required to identify the operator.
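To make the shape of that step concrete, here is a minimal sketch, assuming we already had some estimate of g and some way to enumerate candidate computations (both `estimate_g` and `candidates` are hypothetical placeholders, not anything from the post):

```python
# Toy sketch of the "filter by intelligence" step. `candidates` and `estimate_g`
# stand in for machinery the proposal assumes we'd have: some way to enumerate
# candidate computations pulled out of the world model, and the intelligence
# measure g applied to each of them.

def filter_agent_like(candidates, estimate_g, g_threshold):
    """Keep only the candidate computations whose estimated g clears the threshold."""
    scored = [(candidate, estimate_g(candidate)) for candidate in candidates]
    return [(candidate, g) for candidate, g in scored if g >= g_threshold]

if __name__ == "__main__":
    # Dummy candidates and a dummy g estimator, purely to show the shape of the step.
    dummy_g = {"thermostat-like loop": 0.1, "operator model": 0.9, "texture detector": 0.0}
    print(filter_agent_like(dummy_g.keys(), dummy_g.get, g_threshold=0.5))
```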
Working out how to iterate over every computation in the world model is the difficult part. We could try iterating over subnetworks of the world model, but it’s not clear this would work. Maybe iterate over pairs of regions in activation space? Of course these are not practical spaces to search over, but once we know the right type signature to search for, we can probably speed up the search by developing heuristic-guided methods.
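For what the naive version of that search might look like, here is a toy enumeration that treats “computations” as small subsets of model components; the component names and the size cap are made up for illustration, and a realistic world model is exactly why the heuristic-guided version would be needed:

```python
from itertools import combinations

# Naive enumeration of candidate "computations" as subnetworks, i.e. subsets of a
# (tiny, made-up) list of model components. Brute force like this is hopeless at
# real scale; it is only meant to show the shape of the search space that a
# heuristic-guided method would have to cover once the right type signature is known.

COMPONENTS = ["layer3.attn_head2", "layer5.mlp", "layer7.attn_head0", "layer9.mlp"]

def candidate_subnetworks(components, max_size=2):
    """Yield every subset of the components up to max_size as a candidate computation."""
    for size in range(1, max_size + 1):
        yield from combinations(components, size)

if __name__ == "__main__":
    for candidate in candidate_subnetworks(COMPONENTS):
        print(candidate)
```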
An agentometer is presumably a thing that finds stuff that looks like (some mathematically precise operationalisation of) bundles-of-preferences-that-are-being-optimised-on. If you have that, can’t you just look through the set of things like this in the AI’s world model that are active when it’s, say, talking to the operator, or looking at footage of the operator on camera, or doing anything else that would probably require thinking about the operator in some fashion, and point at the bundle of preferences that gets lit up by that?
Yeah this is approximately how I think the “operator identification” would work.
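A rough sketch of that intersection step (all names hypothetical): take the high-g candidates the agentometer surfaced, record which of them are active in contexts that plausibly involve the operator, and point at whichever one lights up most consistently:

```python
# Toy version of "point at the bundle of preferences that gets lit up" during
# operator-involving contexts. `high_g_candidates` and the per-context activity
# sets are placeholders for the outputs of a (hypothetical) agentometer and of
# activation logging while the system talks to or observes the operator.

def identify_operator(high_g_candidates, active_during):
    """Return the high-g candidate active in the largest fraction of operator contexts."""
    def hit_rate(candidate):
        return sum(candidate in active for active in active_during.values()) / len(active_during)
    return max(high_g_candidates, key=hit_rate)

if __name__ == "__main__":
    candidates = ["agent_model_A", "agent_model_B"]
    contexts = {
        "chatting_with_operator": {"agent_model_A", "agent_model_B"},
        "watching_operator_on_camera": {"agent_model_A"},
        "reading_operator_email": {"agent_model_A"},
    }
    print(identify_operator(candidates, contexts))  # -> agent_model_A
```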
Is the fear here that the AI may eventually stop thinking of the operator as a bundle-of-preferences-that-are-optimised-on, i.e. an “agent” at all, in favour of some galaxy brained superior representation only a superintelligence would come up with?
Yeah this is one of the fears. The point of the intelligence-measuring equation for g is that it is supposed to work even on galaxy brained world model ontologies. It works purely by measuring the competence of a computation at achieving goals, not by looking at the structure of the computation for “agentyness”.
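The equation itself isn’t reproduced in this thread, so the following is only a generic behavioural stand-in in that spirit, not the post’s actual definition of g: score a candidate computation by running it as a policy on a batch of sampled tasks and averaging how well it achieves the goals, with no reference to its internal structure. Every name here is hypothetical:

```python
import random

# Generic behavioural stand-in for an intelligence score: run the candidate
# computation as a policy on sampled tasks and average its goal achievement.
# This is NOT the post's actual equation for g; it only illustrates "measure
# competence at achieving goals, ignore the structure of the computation".

def behavioural_g(policy, sample_task, evaluate, n_tasks=200, seed=0):
    """Average goal-achievement of `policy` over tasks drawn via `sample_task`."""
    rng = random.Random(seed)
    scores = [evaluate(policy, sample_task(rng)) for _ in range(n_tasks)]
    return sum(scores) / n_tasks

if __name__ == "__main__":
    # Dummy task family: hit a hidden target number in [0, 9]; score 1.0 on a hit.
    sample_task = lambda rng: rng.randrange(10)
    evaluate = lambda policy, target: 1.0 if policy(target) == target else 0.0
    competent = lambda target: target   # acts on the goal, always hits it
    incompetent = lambda target: 0      # ignores the goal
    print(behavioural_g(competent, sample_task, evaluate))    # 1.0
    print(behavioural_g(incompetent, sample_task, evaluate))  # roughly 0.1
```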
it can even spit out complete agent policies for you to run and modify
These can be computations that aren’t very agenty, or don’t match an agent at all, or only match part of an agent, so the part that spits out potential policies doesn’t have to be very good. The g computation is then used to find, among these, the ones that best match an agent.