So I agree with all the knobs-on-the-equation you and Adam are bringing up. I’ve spent a lot of time pushing for LessWrong to be a place where people feel more free to explore early stage ideas without having to justify them at every step.
I stand by my claim, although a) I want to clarify some details about what I’m actually claiming, and b) after clarifying, I expect we’ll still disagree, albeit for somewhat vague aesthetic-sense reasons. But I think my disagreement is important.
Main Clarifications:
This post is primarily talking about training, rather than executing (although I’ll say more nuanced things in a bit). I.e. yes, you definitely don’t want to Goodhart your actual research process on “what makes for a good feedback loop?”. The topic of this post is “how can we develop better training methods for fuzzy, early-stage science?”
Reminder that a central claim here is that Thinking Physics-style problems are inadequate. If, in 6 months, feedbackloop-rationality hasn’t resulted in a much better set of exercises/feedback loops than Thinking Physics puzzles, I’d regard it as failing and would probably give up. (I wouldn’t give up at 6 months if I thought we had more time, but it would at least have become clear that this isn’t obviously easier than just directly working on the alignment problem itself rather than going meta on it.)
It sounds like you’re worried that the impact of this will be “people who might have curiously, open-endedly approached alignment instead Goodhart on something concrete and feedbackloop-able”.
But a major motivation of mine here is that I think that failure mode is already happening by default – IMO, people are gravitating towards “do stuff in ML with clearer feedback loops, because it’s easier to demonstrate you’re doing something at least plausibly ‘real’ there”, while failing to engage with the harder problems that actually need solving. And meanwhile, they may be contributing to capabilities advances that are net-negative.
So one of my goals here is to help provide traction on how to think in more open-ended domains, such that it’s possible to do something other than either “gravitate towards high-feedback approaches” or “pick a direction to curiously explore for months/years and… hope it turns out you have good research taste / won the bus-ticket lottery”.
If those were the only two approaches, I think “have a whole bunch of people take the second approach and hope some of them win the research-taste lottery” would be among my strategies, but it seems like something we should be pretty sad about.
I agree that if you limit yourself to “what has good feedback loops”, you get a Goodharty outcome, but the central claim here is “actually, it’s just really important to learn how to invent better feedback loops.” And that includes figuring out how to take fuzzy things and operationalize them without losing what was actually important about them. And yeah, that’s hard, but it seems at least no harder than solving alignment in the first place. (And IMO it just seems pretty tractable? It seems “relatively straightforward” to design exercises for; it’s just that it’d take a while to design enough exercises to make a full-fledged training program + test set.)
(Put another way: I would be extremely surprised if you and Adam spent a day thinking about “okay, what sort of feedback loops would actually be good, given what we believe about how early-stage science works?” and didn’t come up with anything that seemed worth trying, by both your lights and mine.)