[Question] What Heuristics Do You Use to Think About Alignment Topics?
Or, what kinds of hammers do you find yourself whacking alignment topics (“nails”) with? Alignment topics include problems (e.g. instrumental convergence, mesa-optimizers, Goodhart’s law), proposed solutions (e.g. quantilizers, debate, IRL), and whatever else gets brought up in this forum.
For example, I imagine Steven Byrnes would see an alignment problem and think about what algorithm or structure in the brain might solve it (please correct me if I’m wrong!).
I imagine Rohin Shah would see a proposed solution and ask how it intervenes on a threat model (also correct me if I’m wrong!).
A more general heuristic would be asking for specific examples of the topic.
Related: including specific examples where you’ve used your heuristic would be appreciated. The same goes for a full, gears-level model of when your heuristic is useful.