you actively punish attempts that had a high expectation of rabbits but didn’t produce rabbits. This just straightforwardly punishes high variance strategies. You need at least some people doing low variance strategies; otherwise, in a week where nobody brings home any rabbits, everyone dies. But if you punish high variance strategies whenever they are employed, you’re going to end up with a lot fewer rabbits.
You should neither reward nor punish strategies or attempts at all, but results. If I am executing a high-variance strategy, and you punish poor results, and reward good results, in accordance with how poor/good they are, then (if I am right about my strategy having a positive expectation) I will—in expectation—be rewarded. This will incentivize me to execute said strategy (assuming I am not risk-averse—but if I am, then I’m not going to be the one trying the high-variance strategy anyway).
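The expected-value arithmetic behind the two positions above can be sketched with made-up numbers. Everything here is a hypothetical assumption. One strategy is low-variance and always brings home 2 rabbits. The other is high-variance: it brings home 30 rabbits one time in ten and nothing otherwise. The penalty for "a high-expectation attempt that produced no rabbits" is an arbitrary 5. The agent is assumed to be risk-neutral.

```python
from fractions import Fraction

# Hypothetical strategies: each is a list of (probability, rabbits) outcomes.
LOW_VARIANCE = [(Fraction(1), 2)]
HIGH_VARIANCE = [(Fraction(9, 10), 0), (Fraction(1, 10), 30)]

# Assumed size of the penalty for a "high expectation, no rabbits" attempt.
FAILURE_PENALTY = 5


def reward_results(rabbits, high_expectation):
    """Reward in proportion to the result; the strategy used is ignored."""
    return rabbits


def punish_failed_bets(rabbits, high_expectation):
    """Same reward, minus a penalty when a high-expectation attempt comes back empty."""
    if high_expectation and rabbits == 0:
        return rabbits - FAILURE_PENALTY
    return rabbits


def expected_reward(strategy, high_expectation, policy):
    """Expected reward for a risk-neutral agent following `strategy` under `policy`."""
    return sum(p * policy(rabbits, high_expectation) for p, rabbits in strategy)


for name, policy in [("reward results", reward_results),
                     ("punish failed bets", punish_failed_bets)]:
    low = expected_reward(LOW_VARIANCE, False, policy)
    high = expected_reward(HIGH_VARIANCE, True, policy)
    print(f"{name}: low-variance {low}, high-variance {high}")
```

Under these assumptions, rewarding results gives expected rewards of 2 (low-variance) and 3 (high-variance), so the high-variance hunter is incentivized. Adding the failure penalty gives 2 and -3/2, which flips the ranking.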
you systematically incentivize legibly producing countable rabbits, in a world where it turned out a lot of value wasn’t just about the rabbits, or that some rabbits were actually harder to notice. I think one of the major problems with Goodhart in the 20th century comes from the expectation of legible results.
I was talking about rabbits (or things very similar to rabbits). I made, and make, no guarantees that the analysis applies when analogized to anything very different. (It seems clear that the analysis does apply in some very different situations, and does not apply in others.) Reasoning by analogy is dangerous; if we propose to attempt it, we need to be very clear about what the assumptions of the model are, and how the situations we are analogizing differ, and what that does to our assumptions.
You should neither reward nor punish strategies or attempts at all, but results.
This statement is presented in a way that suggests the reader ought to find it obvious, but in fact I don’t see why it’s obvious at all. If we take the quoted statement at face value, it appears to be suggesting that we apply our rewards and punishments (whatever they may be) to something which is causally distant from the agent whose behavior we are trying to influence—namely, “results”—and, moreover, that this approach is superior to the approach of applying those same rewards/punishments to something which is causally immediate—namely, “strategies”.
I see no reason this should be the case, however! Indeed, it seems to me that the opposite is true: if the rewards and punishments for a given agent are applied based on a causal node which is separated from the agent by multiple causal links, then there is a greater number of ancestor nodes that said rewards/punishments must propagate through before reaching the agent itself. The consequences of this are twofold: firstly, the impact of the reward/punishment is diluted, since it must be divided among a greater number of potential ancestor nodes. And secondly, because the agent has no way to identify which of these ancestor nodes we “meant” to reward or punish, our rewards/punishments may end up impacting aspects of the agent’s behavior we did not intend to influence, sometimes in ways that go against what we would prefer. (Moreover, the probability of such a thing occurring increases drastically as the thing we reward/punish becomes further separated from the agent itself.)
The takeaway from this, of course, is that strategically rewarding and punishing things grows less effective as the proxy on which said rewards and punishments are based grows further from the thing we are trying to influence—a result which sometimes goes by a more well-known name. This then suggests that punishing results over strategies, far from being a superior approach, is actually inferior: it has lower chances of influencing behavior we would like to influence, and higher chances of influencing behavior we would not like to influence.
(There are, of course, benefits as well as costs to rewarding and punishing results (rather than strategies). The most obvious benefit is that it is far easier for the party doing the rewarding and punishing: very little cognitive effort is required to assess whether a given result is positive or negative, in stark contrast to the large amounts of effort necessary to decide whether a given strategy has positive or negative expectation. This is why, for example, large corporations—which are often bottlenecked on cognitive effort—generally reward and punish their employees on the basis of easily measurable metrics. But, of course, this is a far cry from claiming that such an approach is simply superior to the alternative. (It is also why large corporations so often fall prey to Goodhart’s Law.))
Strange. You bring up Goodhart’s Law, but the way you apply it seems exactly backwards to me. If you’re rewarding strategies instead of results, and someone comes up with a new strategy that has far better results than the strategy you’re rewarding, you fail to reward people for developing better strategies or getting better results. This seems like it’s exactly what Goodhart was trying to warn us about.
I agree this is a weird place to bring up Goodhart, and that it requires extra justification. But I do think it makes sense here. (Though I also agree with Said elsethread that it matters a lot what we’re actually talking about – rabbits, widgets, companies, scientific papers and blogposts might all behave a bit differently.)
The two main issues are:
it’s hard to directly incentivize results with fuzzy, unpredictable characteristics.
it’s hard to directly incentivize results over long timescales.
In short: preoccupation with “rewarding results”, in situations where you can’t actually reward results, can result in Goodharting for all the usual reasons.
Two examples here are:
Scientific papers. Probably the closest to a directly relevant example before we start talking about blogposts in particular. My impression [epistemic status: relatively weak, based on anecdotes, but it seems like everyone I’ve heard talk about this roughly agreed with these anecdotes] is that the publish-or-perish mindset for academic output has been a pretty direct example of “we tried directly incentivizing results, and instead of getting more science we got shittier science.”
Founding Companies. There are ecosystems for founding and investing in companies (startups and otherwise), which are ultimately about a particular result (making money). But this requires very long time horizons, and many people a) literally can’t pull that off because they don’t have the runway, and b) are risk averse, and might not be willing to do it if they had to risk only their own money.
The business venture world works because there’s been a lot of infrastructural effort put into enabling particular strategies (in the case of startups, an established concept of seed funding, Series A, etc.). In the case of other businesses, it’s sometimes more straightforward loans.
The relation to Goodhart is a bit weirder here because, yeah, overfocus on “known strategies” is also one of the pathologies that results in Goodharting (e.g. everyone thinks social media is Hype, so everyone founds social media companies, but maybe by this point social media is overdone and you actually need to be looking for weirder things that people haven’t already saturated the market with).
But here the Goodhart failure is instead: “if you don’t put effort into maintaining the early stages of the strategy, despite many instances of that strategy failing… you just end up with less money.”
My sense [again, epistemic status fairly weak, based on things Paul Graham said, but that I haven’t heard explicitly argued against] is that venture capitalists make the most money from the long tail of companies they invest in. Being willing to get “real results” requires being willing to tolerate lots of things that don’t pay off in results. And many of those companies in the long tail were startup ideas that didn’t sound like sure bets.
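A toy calculation makes the long-tail point concrete. The numbers are entirely hypothetical and are not real fund data. The portfolio has twenty equal investments: most return nothing, and one returns 50x. Almost all of the returned value comes from that single outlier. If the outlier had failed like the rest, the portfolio would lose most of its money. So an investor who only funded things that looked like sure bets would forgo most of the return, assuming the outlier did not look like one beforehand.

```python
# Hypothetical multiples on 20 equal-sized investments (illustrative, not real data).
MULTIPLES = [0.0] * 16 + [1.0, 1.0, 2.0, 50.0]


def portfolio_multiple(multiples):
    """Total value returned per unit invested, assuming equal stakes."""
    return sum(multiples) / len(multiples)


def outlier_share(multiples):
    """Fraction of all returned value that comes from the single best investment."""
    return max(multiples) / sum(multiples)


best = max(MULTIPLES)
# Counterfactual: the best bet fails like most of the others.
without_best = [m if m != best else 0.0 for m in MULTIPLES]

print(f"portfolio multiple: {portfolio_multiple(MULTIPLES):.2f}")             # 2.70
print(f"share from best bet: {outlier_share(MULTIPLES):.0%}")                 # 93%
print(f"multiple if best bet fails: {portfolio_multiple(without_best):.2f}")  # 0.20
```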
There is some sense in which “directly rewarding results” is of course the best way to avoid Goodharting, but since we don’t actually have access to “direct results that actually represent the real thing” to reward, the impulse to directly reward results can often result in rewarding not-actually-results.
Sure, that all makes sense, but at least on LW it seems like we ought to insist on saying “rewarding results” when we mean rewarding results, and “deceiving ourselves into thinking we’re rewarding results” when we mean deceiving ourselves into thinking we’re rewarding results.
That makes sense, although I’m not actually sure either “rewarding results” or “deceiving ourselves into thinking we’re rewarding results” quite capture what’s going on here.
Like, I do think it’s possible to reward individual good things (whether blogposts or scientific papers) when you find them. The question is how this shapes the overall system. When you expect “good/real results” to be few and far between, the process of “only reward things that are obviously good and/or great” might technically be rewarding results, while still outputting fewer results on average than if you had rewarded people for following overall strategies like “pursue things you’re earnestly curious about”, and given people positive rewards for incremental steps along the way.
(Seems good to be precise about language here but I’m not in fact sure how to word this optimally. Meanwhile, earlier parts of the conversation were more explicitly about how ‘reward final results, and only final results’ just isn’t the strategy used in most of the business world)
Strong upvote for clear articulation of points I wanted to see made.
The most obvious benefit is that it is far easier for the party doing the rewarding and punishing: very little cognitive effort is required to assess whether a given result is positive or negative, in stark contrast to the large amounts of effort necessary to decide whether a given strategy has positive or negative expectation.
This part isn’t obviously/exactly correct to me. If we’re talking about posts and comments on LessWrong, it can be quite hard for me to assess whether a given post is correct or not (although even incorrect posts are often quite valuable parts of the discourse). It might also take a lot of information/effort to arrive at the belief that the strategy of “invest more effort, generate more ideas” ultimately leads to more good ideas, such that incentivizing generation itself is good. However, once I hold that belief, it’s relatively easy to apply it. I see someone investing effort in adding to communal knowledge in a way that is plausibly correct/helpful; I then encourage this pro-social contribution despite the fact that evaluating whether the post was actually correct or not* can be extremely difficult.
*“Correct or not” is a bit binary, but assessing the overall “quality” or “value” of a post instead isn’t much easier. It’s far harder than counting rabbits. However, if a post doesn’t seem obviously wrong (or even if it’s clearly wrong, but because of an understandable mistake many people might make), I can often confidently say that it is contributing to communal knowledge (often via the discussion it sparks, or simply because someone could correct a reasonable misunderstanding) and that I overall want to encourage more of whatever generated it. I’m happy to get more posts like that, even if I push for refinements in the process.
(Reacts, or separate upvote/downvote vs. agree/disagree buttons, will hopefully make it easier in the future to encourage effort even while expressing that I think something is wrong.)
You’re still missing my point.
We didn’t (or rather, shouldn’t) intend to reward or punish those “ancestor nodes”. We should intend to reward or punish the results.
You seem to have interpreted my comments as saying that we’re trying to reward some particular behavior, but we should do this by rewarding the results of that behavior. As you point out, this is not a wise plan.
But it’s also not what I am saying, at all. I am saying that we are (or, again, should be) trying to reward the results. Not the behavior that led to those results, but the results themselves.
I don’t know why you’re assuming that we’re actually trying to encourage some specific behavior. It’s certainly not what I am assuming. Doing so would not be a very good idea at all.
I think with that approach there are a great many results you’d fail to achieve. People can get animals to do remarkable things with shaping, and I would wager that you can’t do them at all otherwise.
From the Wikipedia article on Shaping (psychology):
We first give the bird food when it turns slightly in the direction of the spot from any part of the cage. This increases the frequency of such behavior. We then withhold reinforcement until a slight movement is made toward the spot. This again alters the general distribution of behavior without producing a new unit. We continue by reinforcing positions successively closer to the spot, then by reinforcing only when the head is moved slightly forward, and finally only when the beak actually makes contact with the spot. … The original probability of the response in its final form is very low; in some cases it may even be zero. In this way we can build complicated operants which would never appear in the repertoire of the organism otherwise. By reinforcing a series of successive approximations, we bring a rare response to a very high probability in a short time. … The total act of turning toward the spot from any point in the box, walking toward it, raising the head, and striking the spot may seem to be a functionally coherent unit of behavior; but it is constructed by a continual process of differential reinforcement from undifferentiated behavior, just as the sculptor shapes his figure from a lump of clay.
Humans are more sophisticated than birds, but producing highly complex and abstruse truths in a format understandable to others is also a lot more complicated than getting a bird to put its beak in a particular spot. I think all the same mechanics are at work. If you want to get someone (including yourself) to do something as complex and difficult as producing valuable, novel, correct expositions of true things on LessWrong, you’re going to have to reward the predictable intermediary steps.
We don’t go to five-year-olds and say “the desired result is that you can write fluently, therefore no positive feedback on your marginal efforts until you can do so; in fact, I’m going to strike your knuckles every time you make a spelling error or anything which isn’t what we hope to see from you when you’re 12; we will only reward the final desired result and you can backpropagate from that to figure out what’s good.” That’s really only a recipe for children who are unwilling to put any effort into learning to write, not for children who progressively put in effort over years to learn what it even looks like to be a competent writer.
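A toy model of shaping may make the successive-approximation mechanism concrete. It is only a sketch, and every parameter value below is an arbitrary assumption. Behavior is a single number drawn from a normal distribution around the agent's current habit. The target behavior sits 10 standard deviations away. "Reinforcement" simply nudges the habit toward the reinforced response. Two criteria are compared:

- Reinforcing only the final behavior never fires, because the target is essentially never emitted spontaneously.
- Reinforcing any response closer to the target than the current habit walks the habit up to the target.

```python
import random

TARGET = 10.0        # the final behavior we want (arbitrary units)
TOLERANCE = 0.5      # how close a response must be to count as "the final behavior"
SPREAD = 1.0         # std. deviation of spontaneous variation around the habit
LEARNING_RATE = 0.2  # how far one reinforcement moves the habit toward the response
TRIALS = 2000


def train(criterion, seed=0):
    """Return the agent's habit after TRIALS trials under a reinforcement criterion.

    criterion(response, habit) decides whether a given response gets reinforced.
    """
    rng = random.Random(seed)
    habit = 0.0
    for _ in range(TRIALS):
        response = rng.gauss(habit, SPREAD)
        if criterion(response, habit):
            # Reinforcement makes responses like this one more typical.
            habit += LEARNING_RATE * (response - habit)
    return habit


def final_result_only(response, habit):
    # Reinforce only responses that already look like the finished behavior.
    # Starting from habit 0, a response near 10 essentially never occurs.
    return abs(response - TARGET) < TOLERANCE


def successive_approximation(response, habit):
    # Reinforce any response that is closer to the target than the current habit.
    return abs(response - TARGET) < abs(habit - TARGET)


print(f"final result only: habit = {train(final_result_only):.2f}")
print(f"shaping:           habit = {train(successive_approximation):.2f}")
```

Under these assumptions, the first run leaves the habit where it started, at 0. The second run ends close to the target of 10.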
This is beyond my earlier point that verifying results in our cases is often much harder than verifying that good steps were being taken.
See the “Edit:” part of this comment, which is my response to your comment also.
We didn’t (or rather, shouldn’t) intend to reward or punish those “ancestor nodes”. We should intend to reward or punish the results.
I’m afraid this sentence doesn’t parse for me. You seem to be speaking of “results” as something to which the concept of rewards and punishments is applicable. However, I’m not aware of any context in which this is a meaningful (rather than nonsensical) thing to say. All theories of behavior I’ve encountered that make mention of the concept of rewards and punishments (e.g. operant conditioning) refer to them as a means of influencing behavior. If there’s something else you’re referring to when you say “reward or punish the results”, I would appreciate it if you clarified what exactly that thing is.
I don’t see what could be simpler. Alice does something. That action has some result. We reward Alice, or punish her, based on the results of her action. There is nothing unusual or obscure here; I mean just what I say.
(There are cases where we do not want to take this approach, but they tend to be both controversial and unusual in certain important respects.)
Edit: And if you’re trying to use operant conditioning, of all things, to decide what social norms to have on a forum devoted to the art of rationality, then you’ve already admitted defeat, and this entire project is pointless.
assuming I am not risk-averse—but if I am, then I’m not going to be the one trying the high-variance strategy anyway
But, of course, everyone is risk averse in almost every resource. Even the most ambitious startup founders are still risk averse in total payment, just less so than others. I care less about my 10th million dollars than about any of my first 9 million dollars, which already creates risk aversion. The same is true for status or almost any other resource with which you might want to reward people.
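The point about the 10th million dollars can be checked with a standard toy model: give the agent a concave utility function over money. The square-root utility below is an arbitrary illustrative choice, and the dollar amounts are hypothetical. Any concave function gives the same qualitative result.

- A coin flip between $0 and $20M has the same expected value as a sure $10M, yet this agent values the flip like a sure $5M.
- Even a flip between $0 and $24M, whose expected value ($12M) beats the sure thing, is valued like a sure $6M.

In both cases the concave agent prefers the sure $10M.

```python
import math


def utility(dollars):
    """Assumed concave utility: each extra dollar is worth less than the last."""
    return math.sqrt(dollars)


def expected_utility(lottery):
    """lottery: list of (probability, dollars) pairs."""
    return sum(p * utility(x) for p, x in lottery)


def expected_value(lottery):
    return sum(p * x for p, x in lottery)


def certainty_equivalent(lottery):
    """Sure amount valued exactly as much as the lottery (squaring inverts sqrt)."""
    return expected_utility(lottery) ** 2


SURE_THING = [(1.0, 10_000_000)]
COIN_FLIP = [(0.5, 0), (0.5, 20_000_000)]
BETTER_FLIP = [(0.5, 0), (0.5, 24_000_000)]  # higher expected value than the sure thing

for name, lottery in [("sure thing", SURE_THING), ("coin flip", COIN_FLIP),
                      ("better flip", BETTER_FLIP)]:
    print(f"{name}: expected value ${expected_value(lottery):,.0f}, "
          f"certainty equivalent ${certainty_equivalent(lottery):,.0f}")
```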