I can understand why it would seem excessively abstract, but when we speak of agency we are in fact talking about patterns in the activations of the GPU's circuit elements. Specifically, we'd be talking about patterns of numerical feedback where the program forms a causal predictive model of some variable and then, based on the output of that predictive model, does some form of model-predictive control, e.g. emitting bytes (floats, probably) that encode an action the action-conditional predictive model evaluates as likely to influence the variable.
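To make that less abstract, here's roughly what I mean in toy form (a sketch I'm making up on the spot, so every name and the fake dynamics here are mine, not anything from a real system):

```python
def predict_next(state, action):
    # Hypothetical action-conditional predictive model: given the current
    # value of the tracked variable and a candidate action, predict the
    # next value. A real system would learn this mapping from data.
    return state + action  # stand-in dynamics

def select_action(state, target, candidate_actions):
    # Model-predictive control in its smallest form: score each candidate
    # action by how close the model predicts it gets us to the target,
    # and emit the best-scoring one.
    return min(candidate_actions,
               key=lambda a: abs(predict_next(state, a) - target))

# Toy control loop driving the variable toward the target.
state, target = 0.0, 10.0
for _ in range(20):
    action = select_action(state, target, [-1.0, 0.0, 1.0])
    state = predict_next(state, action)  # pretend the world matches the model
print(state)  # reaches 10.0 and stays there
```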
Merely minimizing loss is insufficient to end up with this outcome in many cases, but on some datasets and some problem formulations that we expect to come up (motor control of a robot walking across a room, for a trivial example, or selecting videos to maximize the probability that a user stays on the website), we can expect that a predictive model more precise about the future than a human's would let the GPU code select actions (motor commands or video choices) that reach the target outcome (cross the room, keep the user on the site) more reliably, as evaluated by the control loop through that same predictive model. The worry is that an agent general enough in purpose to form its own subgoals and evaluate them in the predictive model could end up doing multi-step plan chaining through this general world-simulator subalgorithm and realize it can attack its creators in one of a great many possible ways.
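The multi-step plan chaining part is just the same loop run over action sequences instead of single actions. Again a made-up sketch, with brute-force search standing in for whatever a real system would actually do:

```python
from itertools import product

def simulate(state, actions):
    # Hypothetical world-simulator subalgorithm: roll the predictive
    # model forward over an entire sequence of actions.
    for a in actions:
        state = state + a  # stand-in dynamics again
    return state

def plan(state, target, candidate_actions, horizon=3):
    # Multi-step plan chaining by brute force: evaluate every action
    # sequence up to the horizon in the simulator and return the one
    # predicted to land closest to the target variable.
    return min(product(candidate_actions, repeat=horizon),
               key=lambda seq: abs(simulate(state, seq) - target))

print(plan(0.0, 10.0, [-1.0, 0.0, 1.0, 5.0]))  # a 3-step sequence summing to 10.0
```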
Ngl I did not fully understand this, but to be clear I don't think understanding alignment through the lens of agency is "excessively abstract." In fact I'd agree with the implicit default view that it's probably the single most productive lens to look through. My objection to the status quo is that the scale/ontology/lens/whatever I was describing seems to be getting 0% of the research attention, whereas perhaps it should be getting 10 or 20%.
Not sure this analogy works, but if the NIH were spending $10B on cancer research, I would (prima facie, as a layperson) want >$0 but probably <$2B spent on studying cancer as an atomic-scale phenomenon, and maybe some amount at an even lower scale.
Yeah, I was probably too abstract in my reply. To rephrase: a thermostat (or any other extremely small control system) is a perfectly valid example of agency; it's just not dangerously strong agency or anything like that. My point is really that you're on the right track here: looking at the micro-scale versions of these things is very promising.
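For what it's worth, part of why I like the thermostat example is that you can write the whole "agent" in a dozen lines. A toy sketch of my own, not how any particular thermostat is actually implemented:

```python
def thermostat_step(measured_temp, setpoint, heater_on):
    # The entire control policy: one comparison with a little hysteresis
    # so the heater doesn't chatter around the setpoint.
    if measured_temp < setpoint - 0.5:
        return True    # too cold: turn heater on
    if measured_temp > setpoint + 0.5:
        return False   # too warm: turn heater off
    return heater_on   # inside the deadband: leave it alone

# There's a goal (the setpoint), a perception (the measured temperature),
# and an action that pushes the world toward the goal. That's all the
# agency there is here.
temp, heater = 15.0, False
for _ in range(30):
    heater = thermostat_step(temp, setpoint=20.0, heater_on=heater)
    temp += 0.3 if heater else -0.1  # crude room model
print(round(temp, 1))  # hovers around the 20.0 setpoint
```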