Quick thoughts on my plans:
I want to focus on building a better mechanistic picture of agent value formation & distinguishing between existing hypotheses (e.g., shard theory, Thane Ruthenis’s value-compilation hypothesis, etc.) and forming my own.
I think I have a specific but very-high-uncertainty baseline model of what to expect from agent value formation under greedy search optimization. It’s probably time to allocate more resources to reducing that uncertainty by touching reality, i.e., running experiments.
(and also think about related theoretical arguments like Selection Theorems)
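To gesture at what “touching reality” could look like at the smallest possible scale, here’s a toy sketch: a linear “agent” updated by greedy hill-climbing in an environment where reward only tracks one of two observation features. Everything here (the reward function, the features, the hyperparameters) is made up for illustration, not a claim about real training dynamics.

```python
import numpy as np

rng = np.random.default_rng(0)

def reward(w, x):
    # The agent "acts" with strength w @ x; reward only tracks feature 0,
    # so only behavior correlated with feature 0 gets reinforced.
    return (w @ x) * x[0]

w = np.zeros(2)  # the agent's "proto-values": weights on the two features
for _ in range(2000):
    x = rng.normal(size=2)                    # random observation
    candidate = w + 0.1 * rng.normal(size=2)  # local perturbation
    if reward(candidate, x) > reward(w, x):   # greedy acceptance
        w = candidate

print("final weights:", w)
# Typically w[0] drifts far above |w[1]|: the feature the reward
# reinforces is the one that ends up looking like a "value".
```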
So I’ll probably allocate my research time to:
- Studying math (more linear algebra / dynamical systems / causal inference / statistical mechanics)
- Sketching a better picture of agent development, assigning confidence, proposing high-bit experiments (which might have the side effect of distinguishing between conflicting pictures; see the information-gain sketch after this list), formalizing, etc.
- Reading relevant literature (e.g., on theoretical DL and inductive biases)
- Upskilling in mechanistic interpretability to actually start running quick experiments (see the sketch after this list)
- Unguided research brainstorming (e.g., going through various alignment exercises, writing up random related ideas, etc.)
- Possibly participating in programs like MATS? The biggest benefits to me would probably be (1) a commitment mechanism / additional motivation and (2) high-value conversations with other researchers.
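On “high-bit experiments”: one way to make that concrete is to score a candidate experiment by its expected information gain, i.e., the mutual information between the competing pictures and the experiment’s outcome. A minimal sketch below, with completely made-up priors and likelihoods, just to show the computation:

```python
import math

prior = {"H1": 0.5, "H2": 0.5}      # prior over competing pictures (assumed)
p_success = {"H1": 0.9, "H2": 0.3}  # P(outcome = success | picture) (assumed)

def expected_bits(prior, p_success):
    """Mutual information I(hypothesis; outcome) in bits, binary outcome."""
    bits = 0.0
    for outcome in ("success", "failure"):
        likelihood = {h: p_success[h] if outcome == "success" else 1 - p_success[h]
                      for h in prior}
        p_outcome = sum(prior[h] * likelihood[h] for h in prior)  # marginal
        for h in prior:
            p_joint = prior[h] * likelihood[h]
            if p_joint > 0:
                bits += p_joint * math.log2(p_joint / (prior[h] * p_outcome))
    return bits

print(f"expected information gain: {expected_bits(prior, p_success):.3f} bits")
# With these made-up numbers: ~0.30 bits toward distinguishing H1 from H2.
```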
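And on “quick experiments”: a sketch of the scale I’m aiming for, assuming the TransformerLens library (the model, prompt, and head choice are arbitrary illustrations, not a specific result):

```python
# Quick mech-interp experiment sketch (assumes `pip install transformer_lens`):
# cache all activations on a prompt and inspect one head's attention pattern.
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2-small")

prompt = "When Mary and John went to the store, John gave a drink to"
logits, cache = model.run_with_cache(prompt)

# Attention pattern of layer 9, head 9 (picked arbitrarily for illustration);
# cache["pattern", 9] has shape [batch, n_heads, query_pos, key_pos].
pattern = cache["pattern", 9][0, 9]
tokens = model.to_str_tokens(prompt)
final_attn = pattern[-1]  # where the final token's query attends

# Top 5 tokens the last position attends to, with attention weights.
for tok, w in sorted(zip(tokens, final_attn.tolist()), key=lambda t: -t[1])[:5]:
    print(f"{tok!r:>12}  {w:.3f}")
```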
Dunno, sounds pretty reasonable!