Alignment is hard in part because the subject of alignment will optimize, and optimization drives toward corner cases.
“Solve Thinking Physics problems” or “grind LeetCode” are great exercises, but they lack hard optimization pressure, so they will be missing some of this edge-case “spice.”
Alignment is “one-shot, design a system that performs under ~superhuman optimization pressure.” There are a couple of professional problems in this category with fast feedback loops:
Design a JavaScript engine with no security holes
Design an MTG set without format-breaking combos
Design a tax code
Write a cryptocurrency
However, these all use a live source of superhuman optimization, and so would be prohibitively expensive to practice against.
The sort-of dual to the above category is “exert superhuman optimization pressure on a system.” This dual can be made fast-feedback-able more cheaply: “(optionally one-shot) design a solution that is competitive with preexisting optimized solutions.”
Design a gravity-powered machine that can launch an 8 lb pumpkin as far as possible, with a budget of $5k (WCPC rules)
Design an entry in a CodinGame AI contest (I recommend Coders Strike Back) that will place in the top 10 of the Legend league (see the starter sketch after this list)
Design a fast global illumination program
Exploit a tax code / cryptocurrency / JavaScript engine / MTG format
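For concreteness, here is a minimal Python starter for Coders Strike Back, assuming the early-league single-pod I/O protocol (per turn: your pod’s position, the next checkpoint’s position, distance, and angle, plus the opponent’s position). A top-10 Legend entry needs simulation and search on top of this, so treat it as scaffolding for the exercise rather than a solution.

```python
# Minimal Coders Strike Back starter bot (sketch).
# Assumes the early-league single-pod protocol: each turn we read our pod's
# position, the next checkpoint's position/distance/angle, and the opponent's
# position, then answer with a target point and a thrust value.
while True:
    x, y, cp_x, cp_y, cp_dist, cp_angle = map(int, input().split())
    opp_x, opp_y = map(int, input().split())

    # Cut thrust when the checkpoint is far off our heading, and ease off
    # when close so we don't overshoot and have to loop back.
    if abs(cp_angle) > 90:
        thrust = 0
    elif cp_dist < 1500:
        thrust = 40
    else:
        thrust = 100

    print(f"{cp_x} {cp_y} {thrust}")
```

The point of the exercise is everything this sketch leaves out: modeling the movement physics, searching over future turns, and tuning against the leaderboard until the entry is competitive with already-optimized opponents.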
If fast feedback gets a team generally good at these, then they can at least red-team harder.
Yeah, I like this train of thought.

I don’t think your first five examples work exactly as “exercises” (they require a pretty long spin-up before you can even work on them, and I don’t know that I agree the feedback loops are even that good, i.e. you probably only get to design one tax-code iteration per year?)
But I think looking for places with adversarial optimization pressure and figuring out how to make them more feedbackloop-able is a good place to go.
This also updated me toward the idea that one place I might want to seek out alignment researchers is among people with a background in at least two domains that involve this sort of adversarial pressure, since they’ll have an easier time triangulating “how does optimization apply in the domain of alignment?”
There’s a tension between what I know from the literature (i.e. that transfer learning is basically impossible) and my lived experience: I, and a handful of people I know in real life whom I have examined in depth, can quickly apply e.g. thermodynamics concepts to designing software systems, and consuming political fiction has increased my capacity to model equilibrium strategies in social situations. Hell, this entire website was built on the back of HPMoR, which is an explicit attempt to teach rationality by reading about it.
The point other people have made about alignment research being highly nebulous is important but irrelevant. You simply cannot advance the frontiers of a field without mastery of some technique or skill (or a combination thereof) that puts you in a spot where you can do things that were impossible before, like how Rosalind Franklin needed some mastery of X-ray crystallography to be able to image DNA.
Research also seems to be a skill that’s trainable, or at least has trainable parts. If, for example, the bottleneck is sheer research output, I can imagine that a game where you output as many shitty papers as possible in a bounded period of time would let people write more papers, ceteris paribus, afterwards. Or even at the level of paragraphs: one could play a game of “Here are 10 random papers outside your field with the titles, authors, and publication year removed. Guess how many citations they got.” to develop one’s nose for what makes a paper impactful, or “Write the abstract of this paper.” to get better at distillation.
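As a sketch of how the citation-guessing game could be run, here is a minimal Python version. The paper list is hypothetical placeholder data, and scoring by orders of magnitude of error is just one reasonable choice.

```python
import math
import random

# Hypothetical placeholder entries: in practice you'd pull real papers from
# outside your field and strip titles, authors, and publication year.
PAPERS = [
    {"abstract": "We study the asymptotics of ...", "citations": 12},
    {"abstract": "A novel method for estimating ...", "citations": 480},
    {"abstract": "We survey recent progress in ...", "citations": 2300},
]

def play(papers, n=3):
    """Show up to n de-identified abstracts, collect citation guesses,
    and score each guess by how many orders of magnitude it is off."""
    sample = random.sample(papers, min(n, len(papers)))
    total_error = 0.0
    for paper in sample:
        print(paper["abstract"])
        guess = float(input("How many citations did this paper get? "))
        error = abs(math.log10(guess + 1) - math.log10(paper["citations"] + 1))
        total_error += error
        print(f"Actual: {paper['citations']} (off by {error:.2f} orders of magnitude)\n")
    print(f"Average error: {total_error / len(sample):.2f} orders of magnitude")

if __name__ == "__main__":
    play(PAPERS)
```

The same harness extends to the distillation variant: show the paper body instead of the abstract, have the player write an abstract, and compare it against the real one.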