What is the correct “object of study” for alignment researchers seeking to understand the mechanics of a world immediately before and during takeoff? A good step in this direction is the work of Alex Flint and Shimi’s UAO.
What form does the correct alignment goal take? Is it a utility function over a region of space, a set of conditions to be satisfied, or something else?
Mechanistically, how do systems trained primarily on token frequencies appear to be capable of higher-level reasoning?
How likely is the emergence of deceptively aligned systems?