Meta: I have some gripes about the feedback loop focus in rationality culture, and I think this comment unfairly mixes a bunch of my thoughts about this topic in general with my thoughts in response to this post in particular—sorry in advance for that. I wish I were better at delineating between them, but that turned out to be kind of hard, and I have limited time and so on…
It is quite hard to argue against feedback loops in their broadest scope because it’s like arguing against updating on reality at all and that’s, as some might say, the core thing we’re about here. E.g., reflecting on your thought processes and updating them seems broadly good to me.
The thing that I feel more gripe-y about is something in the vicinity of these two claims: 1) Feedback loops work especially well in some domains (e.g., engineering) and poorly in others (e.g., early science). 2) Alignment, to the extent that it is a science, is early-stage, and using a feedback-loop-first mentality here seems actively harmful to me.
Where do feedback loops work well? Feedback loops (in particular, negative feedback loops), as they were originally construed, consist of a “goal state,” a way of checking whether your system is in line with the goal state or not, and a way of changing the current state (so as to eventually align it with the goal state). This setup is very back-chain focused. It assumes that you know what the target is and it assumes that you can progressively home in on it (i.e., converge on a particular state).
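To make that structure concrete, here is a minimal sketch of a negative feedback loop in the control sense: a goal state, a way of measuring the current state against it, and an adjustment that nudges the system toward the goal. The thermostat-style toy setup, the function names, and the 0.3 gain are illustrative choices of mine, not anything from the post.

```python
def negative_feedback_loop(goal, state, measure, adjust, max_steps=100, tolerance=0.01):
    """Repeatedly compare the measured state to the goal state and apply a correction."""
    for _ in range(max_steps):
        error = goal - measure(state)   # how far off the goal state are we?
        if abs(error) < tolerance:      # close enough: the loop has converged
            break
        state = adjust(state, error)    # nudge the current state toward the goal
    return state

# Toy "thermostat": start at 15 degrees, home in on a 21-degree target.
final_temp = negative_feedback_loop(
    goal=21.0,
    state=15.0,
    measure=lambda temp: temp,                  # the state is directly observable here
    adjust=lambda temp, err: temp + 0.3 * err,  # proportional correction; gain picked arbitrarily
)
print(round(final_temp, 2))  # converges to roughly 21.0
```

Note that the whole loop presupposes that `goal` and `measure` are already well-defined, which is exactly the assumption that the rest of this comment argues breaks down in early-stage science.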
This works especially well in, e.g., engineering applications, where you have an end product in mind and you are trying out different strategies to get there. But one of the main difficulties with early-stage science is that you don’t know what you’re aiming at, and this process seems (to me) to consist more of expanding the possibility space through exploration (i.e., hypothesis generation is about creating, not cleaving) than of winnowing it.
For instance, it’s hard for me to imagine how the feedback-loop-first approach would have made Darwin much faster at noticing that species “gradually become modified.” This wasn’t even in his hypothesis space when he started his voyage on the Beagle (he assumed, like almost all other naturalists, that species were independently created and permanent). Like, it’s true that Darwin was employing feedback loops in other ways (e.g., trying to predict what rock formations would be like before he arrived there), and I buy that this sort of scientific eye may have helped him notice subtle differences that other people missed.
But what sort of feedback should he have used to arrive at the novel thought that species changed, when that wasn’t even on his radar to begin with? And what sort of training would make someone better at this? It doesn’t seem to me like practicing thinking via things like Thinking Physics questions is really the thing here, where, e.g., the right question has already been formulated. The whole deal with early-stage science, imo, is figuring out how to ask the right questions in the first place, without access to what the correct variables and relationships are beforehand. (I’m not saying there is no way to improve at this skill, or to practice it; I just have my doubts that a feedback-loop-first approach is the right one here.)
Where (and why) feedback loops are actively harmful. Basically, I think a feedback-loop-first approach overemphasizes legibility, which incentivizes either a) pretending that things are legible where they aren’t and/or b) filtering out domains with high illegibility. As you can probably guess, I think early science is high on the axis of illegibility, and I worry that focusing too hard on feedback loops either a) causes people to dismiss the activity or b) causes people to prematurely formalize their work.
I think that one of the main things that sets early-stage scientific work apart from other things, and what makes it especially difficult, is that it often requires holding onto confusion for a very long time (on the order of years). And usually that confusion is not well-formed, since if it were, the path forward would be much more obvious. Which means that the confusion is often hard to communicate to other people, i.e., it’s illegible.
This is a pretty tricky situation for a human to be in. It means that a) barely anyone, and sometimes no one, has any idea what you’re doing, and to the extent they do, they think that it’s probably pointless or doomed; b) this makes getting money a bunch harder; and c) it is psychologically taxing for most people to be in a state of confusion—in general, people like feeling like they understand what’s going on. In other words, the overwhelming incentive is just to do the easily communicable thing, and it takes something quite abnormal for a human to spend years on a project that doesn’t have a specific end goal and has little to no outside-view-legible progress.
I think that the things which usually support this kind of sustained isolation are an intense curiosity and obsession with the subject (e.g., Paul Graham’s bus ticket theory), and an inside-view sense that your leads are promising. These are the qualities (aside from g) that I suspect strongly contribute to early-stage scientific progress, and I don’t think they’re ones that you train via feedback loops, at least not as the direct focus, so much as via playful thinking, boggling, and so on.
More than that, though, I suspect that a feedback-loop-first focus is actively harmful here. Feedback loops ask people to make their objectives clear-cut. But sort of the whole point of early science is that we don’t know how to talk about the concepts correctly yet (nor how to formalize the right questions or objectives). So the incentive here is to cut off confusion too early, e.g., by rounding it off to the closest formalized concept and moving on. This sucks! Prematurely formalizing is harmful when the main difficulty of early science is in holding onto confusion, and only articulating it when it’s clear that it is carving the world correctly.
To make a very bold and under-defended claim: I think this is a large part of the reason why a lot of science sucks now—people began mistaking the outcome (crisp, formalized principles) for the process, and now research isn’t “real” unless it has math in it. But most of the field-founding books (e.g., Darwin, Carnot) have zero or close to zero math! It is, in my opinion, a big mistake to throw formalizations at things before you know what the things are, much like it is a mistake to pick legible benchmarks before you know what you want a benchmark for.
Alignment is early-stage science. I feel like this claim is obvious enough to not need defending, but, e.g., we don’t know what any of the concepts are in any remotely precise (and agreed-upon) sense: intelligence, optimization, agency, situational awareness, deception, and so on… This is distinct from saying that we need to solve alignment through science; e.g., it could be that alignment is super easy, or that engineering efforts are enough. But to the extent that we are trying to tackle alignment as a natural science, I think it’s safe to say it is in its infancy.
I don’t want feedback-loop-first culture to become the norm for this sort of work, for the reasons I outlined above (it’s also the sort of work I personally feel most excited about for making progress on the problem). So, the main point of this comment is like “yes, this seems good in certain contexts, but please let’s not overdo it here, nor have our expectations set that it ought to be the norm of what happens in the early stages of science (of which alignment is a member).”
So I agree with all the knobs-on-the-equation you and Adam are bringing up. I’ve spent a lot of time pushing for LessWrong to be a place where people feel more free to explore early stage ideas without having to justify them at every step.
I stand by my claim, although a) I want to clarify some details about what I’m actually claiming, and b) after clarifying, I expect we’ll still disagree, albeit for somewhat vague aesthetic-sense reasons; but I think my disagreement is important.
Main Clarifications:
This post is primarily talking about training, rather than executing (although I’ll say more nuanced things in a bit). I.e., yes, you definitely don’t want to Goodhart your actual research process on “what makes for a good feedback loop?” The topic of this post is “how can we develop better training methods for fuzzy, early-stage science?”
Reminder that a central claim here is that Thinking Physics-style problems are inadequate. If, in 6 months, feedbackloop-rationality hasn’t resulted in a much better set of exercises/feedbackloops than Thinking Physics puzzles, I’d regard it as failing and I’d probably give up. (I wouldn’t give up at 6 months if I thought we had more time, but it’d at least have become clear that this isn’t obviously easier than just directly working on the alignment problem itself rather than going meta on it.)
It sounds like you’re worried about the impact of this being “people who might have curiously, openendedly approached alignment instead Goodhart on something concrete and feedbackloop-able”.
But a major motivation of mine here is that I think that failure mode is already happening by default – IMO, people are gravitating towards “do stuff in ML with clearer feedback-loops because it’s easier to demonstrate you’re doing something at least plausibly ‘real’ there”, while failing to engage with the harder problems that actually need solving, and meanwhile maybe contributing to capabilities advances that are net-negative.
So one of my goals here is to help provide traction on how to think in more openended domains, such that it’s possible to do anything other than either “gravitate towards high-feedback approaches” or “pick a direction to curiously explore for months/years and… hope it turns out you have good research taste / you won-the-bus-ticket-lottery?”
If those were the only two approaches, I think “have a whole bunch of people do Option B and hope some of them win the research-taste-lottery” would be among my strategies, but it seems like something we should be pretty sad about.
I agree that if you’re limiting yourself to “what has good feedbackloops”, you get a Goodharty outcome, but the central claim here is “actually, it’s just real important to learn how to invent better feedback loops.” And that includes figuring out how to take fuzzy things and operationalize them without losing what was actually important about them. And yeah, that’s hard, but it seems at least not harder than solving alignment in the first place (and IMO it just seems pretty tractable? It seems “relatively straightforward” to design exercises for; it’s just that it’d take a while to design enough exercises to make a full-fledged training program + test set).
(Put another way: I would be extremely surprised if you and Adam spent a day thinking about “okay, what sort of feedbackloops would actually be good, given what we believe about how early-stage science works?” and didn’t come up with anything that seemed worth trying, by both your lights and mine.)
Yeah, my impression is similarly that focus on feedback loops is closer to “the core thing that’s gone wrong so far with alignment research” than to “the core thing that’s been missing.” I wouldn’t normally put it this way, since I think many types of feedback loops are great, and since obviously in the end alignment research is useless unless it helps us better engineer AI systems in the actual territory, etc.
(And also because some examples of focus on tight feedback loops, like Faraday’s research, strike me as exceedingly excellent, although I haven’t really figured out yet why his work seems so much closer to the spirit we need than e.g. thinking physics problems).
Like, all else equal, it clearly seems better to have better empirical feedback; I think my objection is mostly that, in practice, focus on this seems to lead people to formalize prematurely, or to otherwise constrain their lines of inquiry to those whose steps are easy to explain/justify along the way.
Another way to put this: most examples I’ve seen of people trying to practice attending to tight feedback have involved them focusing on trivial problems, like simple video games or toy already-solved science problems, and I think this isn’t a coincidence. So while I share your sense, Raemon, that transfer learning seems possible here, my guess is that this sort of practice mostly transfers within the domain of other trivial problems, where solutions (or at least methods for locating solutions) are already known, and hence where it’s easy to verify you’re making progress along the way.
Another way to put this: most examples I’ve seen of people trying to practice attending to tight feedback have involved them focusing on trivial problems, like simple video games or toy already-solved science problems
One thing is I just… haven’t actually seen instances of feedbackloops on already-solved-science-problems being used? Maybe they are used and I haven’t run into them, but I’ve barely heard of anyone tackling exercises with the frame “get 95% accuracy on Thinking-Physics-esque problems, taking as long as you want to think, where the primary thing you’re grading yourself on is ‘did you invent better ways of thinking?’” So it seemed like the obvious place to start.
(And also because some examples of focus on tight feedback loops, like Faraday’s research, strike me as exceedingly excellent, although I haven’t really figured out yet why his work seems so much closer to the spirit we need than e.g. thinking physics problems).
Can you say more about what you mean here?
I just meant that Faraday’s research strikes me as counterevidence for the claim I was making—he had excellent feedback loops, yet also seems to me to have had excellent pre-paradigmatic research taste/next-question-generating skill of the sort my prior suggests generally trades off against strong focus on quickly-checkable claims. So maybe my prior is missing something!