A useful perspective when thinking about mechanistic pictures of agent/value development is to take the “perspective” of the different optimizers involved, consider their relative “power,” and ask how they interact with each other.
E.g., early on SGD is the dominant optimizer; it is greedy and has direct access to feedback from the base objective U. Later, an early proto-GPS (general-purpose search) forms, which is less greedy but can still largely be swayed by SGD (e.g., having its problem-specification input tweaked, having the overall GPS implementation modified, etc.). Much later, GPS becomes the dominant optimizing force “at run-time,” which shortens the relevant timescale so that we can ignore SGD’s effect. This becomes much more pronounced after reflectivity + gradient hacking, when the GPS’s optimization target becomes fixed.
(very much inspired by reading Thane Ruthenis’s value formation post)
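A toy numerical sketch of this picture (purely illustrative; `U_OPTIMUM`, `target`, `behavior`, the step counts, and the gradient-hacking threshold are made-up stand-ins, not a model of real training): a greedy, slow outer optimizer drags the GPS’s target around, a fast, non-greedy inner search equilibrates to that target within each outer step, and past a “gradient hacking” point the target stops moving.

```python
import numpy as np

# Toy two-timescale sketch (hypothetical illustration only):
#  - the "SGD" role: a greedy, slow outer optimizer that nudges the GPS's target
#    ("problem-specification input") toward the base objective U's optimum;
#  - the "GPS" role: a non-greedy, fast inner optimizer that, within each outer
#    step, runs many search iterations toward its *current* target;
#  - after a "reflectivity / gradient hacking" threshold, outer updates stop
#    affecting the target, so the GPS's optimization target is fixed from then on.

U_OPTIMUM = np.array([1.0, 1.0])   # what the base objective U rewards (made up)
OUTER_LR = 0.05                    # SGD's greedy step size on the GPS's target
INNER_STEPS = 200                  # GPS steps per SGD step -> much faster timescale
HACK_AT = 30                       # outer step after which the target is frozen

target = np.array([0.0, 0.0])      # the GPS's target, shaped by SGD early on
behavior = np.array([0.0, 0.0])    # what the agent actually does at run-time

for outer_step in range(60):
    # Fast timescale: GPS drives run-time behavior to (near) its current target.
    for _ in range(INNER_STEPS):
        behavior = behavior + 0.1 * (target - behavior)

    # Slow timescale: SGD greedily drags the target toward U's optimum,
    # unless gradient hacking has already fixed it.
    if outer_step < HACK_AT:
        target = target + OUTER_LR * (U_OPTIMUM - target)

    if outer_step % 10 == 0:
        print(f"step {outer_step:2d}  target={target.round(3)}  behavior={behavior.round(3)}")
```

In this toy, behavior tracks the target essentially instantly on the outer timescale, which is exactly the regime where you can drop the slow optimizer from the analysis.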
Ignoring SGD is a very useful approximation at the late stage, when the GPS self-modifies the agent in pursuit of its objective! Rather than meticulously tracking local SGD gradient incentives and the like, we can model the (non-greedy) GPS directly as doing whatever is obviously rational from a bird’s-eye perspective.
(kinda similar to, e.g., separation of timescales when analyzing dynamical systems)
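For reference, the analogy being gestured at is the standard fast–slow (singular perturbation) setup; a minimal sketch, where the assignment of roles to SGD/GPS is my gloss rather than anything precise:

```latex
% Fast-slow system: x is the fast variable (run-time GPS behavior),
% y the slow one (SGD-shaped parameters / the GPS's target), 0 < \epsilon \ll 1.
\[
  \dot{x} = f(x, y), \qquad \dot{y} = \epsilon\, g(x, y).
\]
% On the fast timescale, y is approximately frozen, so one analyzes
% \dot{x} = f(x, y_0) with y held fixed -- the analogue of "ignore SGD's effect
% and model the GPS directly" once the GPS dominates at run-time.
```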