But you do want to incentivize expected rabbits.

No, you want to incentivize rabbits, not expected rabbits. Trying to incentivize expected rabbits is mixing levels. Incentivize rabbits, and people will attend to the expectation of rabbits themselves.
I… am confused?

Like, just because there is any chance of a variable becoming disconnected in a principal-agent problem doesn’t mean that it’s always a bad idea to incentivize intermediary metrics. I am not fully sure how to understand your point as anything besides “never incentivize any lead metrics whatsoever, only ever incentivize successful output”, which seems like a recipe for sparse reward landscapes, and also not a common practice in almost any domain in which humans deal with principal-agent problems.
Your employer pays you if you show up for work, not only if you successfully get work done (at least on the day-to-day or month-to-month level). You pay your plumber if they show up, not only if they successfully fix your toilet.
Like, if you see a friend taking an action that you know and they know has a 50% chance of making $10 for you and your friend (let’s say for a communal club) and a 50% chance of losing $5, and it then turns out they lose the $5, then it seems better to still reward your friend for taking that action, instead of punishing them, given that you know the action had positive expected value.
(assuming you have mostly linear value of money at these stakes)
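For concreteness, the arithmetic behind “positive expected value” here can be sketched in a few lines (the numbers are taken straight from the example above):

```python
# Expected value of the friend's gamble: 50% chance of +$10, 50% chance of -$5.
outcomes = [(0.5, 10.0), (0.5, -5.0)]  # (probability, payoff) pairs
ev = sum(p * payoff for p, payoff in outcomes)
print(ev)  # 2.5: positive, so (with linear utility) the action was worth taking
```

So even in the world where the $5 is actually lost, the decision itself was worth +$2.50 in expectation.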
If they think the odds are 90% $10 and 10% −$5, and you think the odds are 10% $10 and 90% −$5, should you reward them for trying to benefit the club, or punish them for having wrong beliefs that materially matter?
You should punish your friend for the loss, and reward them (twice as much) for a win. This creates the correct incentives.
No, because humans are risk-averse, at least in money terms, but also in most other currencies. If you do this, you increase the total risk for your friend, for no particular gain.
Punishment is also usually net-negative, whereas rewards tend to be zero-sum, so by adding punishments in a bunch of worlds, you destroyed a bunch of value, with no gain (in the world where you both have certainty about the payoff matrix).
One model here is that humans have diminishing returns on money, so in order to reward someone 2x with dollars, you have to pay more than 2x the dollar amount, so your total cost is higher.
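The diminishing-returns point can be made concrete with any concave utility function; here is a toy sketch (the square-root utility and the specific dollar amounts are illustrative assumptions, not claims about real preferences):

```python
import math

def utility(wealth):
    # Any concave function exhibits diminishing returns; sqrt is an arbitrary choice.
    return math.sqrt(wealth)

base = 100.0   # recipient's current wealth (hypothetical)
reward = 21.0  # a reward that takes utility from 10.0 up to 11.0

gain = utility(base + reward) - utility(base)  # utility gain of one reward: 1.0

# Dollars needed to deliver *twice* that utility gain (inverting sqrt analytically):
target = utility(base) + 2 * gain
dollars_needed = target ** 2 - base
print(dollars_needed)  # 44.0: more than 2 * 21 = 42, so the "2x reward" costs extra
```

That gap (44 vs. 42 here) is the extra cost a concave-utility recipient imposes on any scheme that concentrates rewards into fewer, larger payouts.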
A scenario with zero-sum actions and net-negative actions can only go downhill. This would seem to imply that if you have an opportunity to give feedback or not give feedback you should opt to get a guaranteed zero rather than risk destroying value.
Could you elaborate on this? I’m not at all sure what this is referring to.
Rewards are usually a transfer of resources (e.g. me giving you money), which tend to preserve total wealth (or status, or whatever other resource you are thinking about).
Unilateral punishments are usually not transfers of resources; they are usually one party imposing a cost on another party (like hitting them with a stick and injuring them), in a way that does not preserve total wealth (or health, or whatever other resource applies to the situation).
You certainly shouldn’t hit your friend with a stick if he loses $5 of your club’s money. I think this is fairly obvious, and it seems quite improbable that you were assuming that I was suggesting any such thing. So, given that we can’t possibly be talking about injuring anyone, or doing any such thing, how can your point about net-negative punishment apply? The more sensible assumption is that the punishment is of the same kind as the reward.
I think social punishments usually have the same form: rewards tend to be more of a transfer of status, and punishments more of a destruction of status (two people can destroy each other’s reputations with repeated social punishments).
There is also the bandwidth cost of punishment, as well as the simple fact that giving people praise usually comes with a positive emotional component for the receiver (in addition to the status and the reputation), whereas punishments usually come with an addition of stress and discomfort that reduces total output for a while.
In either case, I think the simpler case is made by looking at the assumption of diminishing returns in resources and realizing that the cost of giving someone a reward they value twice as much is usually more than twice the cost of the base reward, meaning that there is an inherent cost to high-variance reward landscapes.
Your employer pays you if you show up for work, not only if you successfully get work done (at least on the day-to-day or month-to-month level).
If you show up, but don’t get work done, you get fired. (How quickly that happens varies from workplace to workplace, of course—but in many places it happens very quickly indeed.)
Yeah, but the fact that it takes a while, and that we have monthly wages instead of all being contractors paid by the piece, is kind of my point. Most of the economy does not pay for completed output, but for intermediary metrics that allow a much higher level of stability.
But note that even if you don’t get fired immediately for failing to produce satisfactory work, you are likely to receive a dressing-down from your boss, poor evaluations, etc., or even something so simple as your team leader being visibly disappointed with you, even if they take no immediate action.
Now consider what that analogizes to, in the case at hand. Is a downvote, or a critical comment, more like being fired, or more like your boss telling you that your work isn’t up to par and that you should really try to do better?
I think it’s sort of like your boss telling you your work isn’t good, when your boss also isn’t paying you and you’re there as a volunteer.

If your boss isn’t paying you, then what’s the point of the employment analogy? That’s not employment at all, is it?

… what? Of course you only pay your plumber if they successfully fix your toilet!
My experience is definitely the opposite. A random Quora question also suggests that it’s common practice in plumbing to pay someone for the attempt, not for the solution. As someone who recently hired plumbers and electricians to fix a bunch of stuff in a new house we rented, this also matches my experience. Not sure where your experience comes from.
In general, most contractors bill by the hour, not for completed output, and definitely not “output that the client thinks is worth it”, at least in my experience (there are obviously exceptions, though I found them relatively rare).
My experience comes from the same sort of thing: having, on many occasions, hired various people to do various sorts of work; and also from several years of working at a computer store that specialized in on-the-premises repair/service.
The Quora answer you linked doesn’t really support your point, as it’s quite clear about the prerequisite being an informed, explicit agreement between plumber and customer that the latter will pay the former regardless of outcome. (And even with that caveat, some of what the answer-giver says is suspect, and is not consistent with my experience.)
I do not know of any industry in which contractor agreements with variable payments that are dependent on the quality of the output are common practice. There is often an agreement on what it means to “complete the work” but in almost any case both your downside and your upside are limited by a guaranteed upfront payment, and a conditional final payment. But it’s almost never the case that you can get 2x the money depending on the quality of your output, which seems like a necessary requirement for some of the incentive schemes you outlined.
What does this have to do with anything? You originally said:
You pay your plumber if they show up, not only if they successfully fix your toilet.
I don’t see the connection between “should you pay your plumber even if they don’t actually fix your toilet” and “should you pay your plumber twice as much if they fix your toilet twice as well”; the latter seems like a nonsensical question, and unrelated to the former.
(someone else downvoted. I was sort of torn between downvoting because it seemed importantly wrong and upvoting for stating the disagreement clearly; I ended up not voting either way. Someday we’ll probably have disagree reacts or something)
This isn’t intrinsically wrong, but basically wrong in many circumstances. I’m guessing this is a fairly important crux of disagreement.
The problem comes if either:
a) you actively punish attempts that had a high expectation of rabbits but didn’t produce rabbits. This just straightforwardly punishes high-variance strategies. You need at least some people doing low-variance strategies; otherwise, in a week where nobody brings home any rabbits, everyone dies. But if you punish high-variance strategies whenever they are employed, you’re going to end up with a lot fewer rabbits.
[there might be a further disagreement about how human psychology works and what counts as punishing]
b) you systematically incentivize legibly producing countable rabbits, in a world where it turns out a lot of the value wasn’t just about the rabbits, or that some rabbits were actually harder to notice. I think one of the major problems with goodhart in the 20th century comes from the expectation of legible results.
Figuring out how to navigate the tension between “much of value isn’t yet legible to us and overfocus on it is goodharting / destroying value” and “but, also, if you don’t focus on legible results you get nonsense” is, in my frame, basically the problem.
you actively punish attempts that had a high expectation of rabbits but didn’t produce rabbits. This just straightforwardly punishes high-variance strategies. You need at least some people doing low-variance strategies; otherwise, in a week where nobody brings home any rabbits, everyone dies. But if you punish high-variance strategies whenever they are employed, you’re going to end up with a lot fewer rabbits.
You should neither reward nor punish strategies or attempts at all, but results. If I am executing a high-variance strategy, and you punish poor results, and reward good results, in accordance with how poor/good they are, then (if I am right about my strategy having a positive expectation) I will—in expectation—be rewarded. This will incentivize me to execute said strategy (assuming I am not risk-averse—but if I am, then I’m not going to be the one trying the high-variance strategy anyway).
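The claim above can be checked with a toy simulation (purely illustrative; the payoffs are the $10/$5 gamble from earlier in the thread): if rewards simply track results in proportion to how good or poor they are, a positive-expectation high-variance strategy earns a positive average reward.

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible

def high_variance_strategy():
    # 50% chance of +$10, 50% chance of -$5 (positive expected value: +2.5).
    return 10.0 if random.random() < 0.5 else -5.0

# Reward/punish purely on results, in proportion to the payoff.
trials = 100_000
average_reward = sum(high_variance_strategy() for _ in range(trials)) / trials
print(average_reward)  # close to 2.5: the strategy is rewarded in expectation
```

A risk-neutral agent therefore does get the right incentive from pure result-based rewards; the dispute in the surrounding comments is about what happens once risk aversion enters.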
you systematically incentivize legibly producing countable rabbits, in a world where it turns out a lot of the value wasn’t just about the rabbits, or that some rabbits were actually harder to notice. I think one of the major problems with goodhart in the 20th century comes from the expectation of legible results.
I was talking about rabbits (or things very similar to rabbits). I made, and make, no guarantees that the analysis applies when analogized to anything very different. (It seems clear that the analysis does apply in some very different situations, and does not apply in others.) Reasoning by analogy is dangerous; if we propose to attempt it, we need to be very clear about what the assumptions of the model are, and how the situations we are analogizing differ, and what that does to our assumptions.
You should neither reward nor punish strategies or attempts at all, but results.
This statement is presented in a way that suggests the reader ought to find it obvious, but in fact I don’t see why it’s obvious at all. If we take the quoted statement at face value, it appears to be suggesting that we apply our rewards and punishments (whatever they may be) to something which is causally distant from the agent whose behavior we are trying to influence—namely, “results”—and, moreover, that this approach is superior to the approach of applying those same rewards/punishments to something which is causally immediate—namely, “strategies”.
I see no reason this should be the case, however! Indeed, it seems to me that the opposite is true: if the rewards and punishments for a given agent are applied based on a causal node which is separated from the agent by multiple causal links, then there is a greater number of ancestor nodes that said rewards/punishments must propagate through before reaching the agent itself. The consequences of this are twofold: firstly, the impact of the reward/punishment is diluted, since it must be divided among a greater number of potential ancestor nodes. And secondly, because the agent has no way to identify which of these ancestor nodes we “meant” to reward or punish, our rewards/punishments may end up impacting aspects of the agent’s behavior we did not intend to influence, sometimes in ways that go against what we would prefer. (Moreover, the probability of such a thing occurring increases drastically as the thing we reward/punish becomes further separated from the agent itself.)
The takeaway from this, of course, is that strategically rewarding and punishing things grows less effective as the proxy on which said rewards and punishments are based grows further from the thing we are trying to influence—a result which sometimes goes by a more well-known name. This then suggests that punishing results over strategies, far from being a superior approach, is actually inferior: it has lower chances of influencing behavior we would like to influence, and higher chances of influencing behavior we would not like to influence.
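The dilution argument is qualitative, but a deliberately crude toy model (entirely my own construction, not something from the comment above) shows its shape: if a reward lands on a node several causal links away from the agent, and at each link the credit could plausibly belong to any of several ancestor behaviors, the expected signal reaching the intended behavior shrinks geometrically with distance.

```python
def expected_credit(reward, candidates_per_hop, hops):
    # Crude assumption: at each causal hop, credit is split evenly among all
    # plausible ancestor nodes, so only a fraction propagates onward.
    return reward / (candidates_per_hop ** hops)

# Rewarding the strategy directly (1 hop) vs. rewarding a distant result (3 hops):
print(expected_credit(8.0, 2, 1))  # 4.0
print(expected_credit(8.0, 2, 3))  # 1.0
```

Under these (hypothetical) numbers, the same reward delivers a quarter of the learning signal when applied three links away instead of one.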
(There are, of course, benefits as well as costs to rewarding and punishing results (rather than strategies). The most obvious benefit is that it is far easier for the party doing the rewarding and punishing: very little cognitive effort is required to assess whether a given result is positive or negative, in stark contrast to the large amounts of effort necessary to decide whether a given strategy has positive or negative expectation. This is why, for example, large corporations—which are often bottlenecked on cognitive effort—generally reward and punish their employees on the basis of easily measurable metrics. But, of course, this is a far cry from claiming that such an approach is simply superior to the alternative. (It is also why large corporations so often fall prey to Goodhart’s Law.))
Strange. You bring up Goodhart’s Law, but the way you apply it seems exactly backwards to me. If you’re rewarding strategies instead of results, and someone comes up with a new strategy that has far better results than the strategy you’re rewarding, you fail to reward people for developing better strategies or getting better results. This seems like it’s exactly what Goodhart was trying to warn us about.
I agree this is a weird place to bring up Goodhart that requires extra justification. But I do think it makes sense here. (Though I also agree with Said elsethread that it matters a lot what we’re actually talking about – rabbits, widgets, companies, scientific papers and blogposts might all behave a bit differently)
The two main issues are:
it’s hard to directly incentivize results with fuzzy, unpredictable characteristics.
it’s hard to directly incentivize results over long timescales
In short: preoccupation with “rewarding results”, in situations where you can’t actually reward results, can result in goodharting for all the usual reasons.
Two examples here are:
Scientific papers. Probably the closest to a directly relevant example before we start talking about blogposts in particular. My impression [epistemic status: relatively weak, based on anecdotes, but it seems like everyone I’ve heard talk about this roughly agreed with those anecdotes] is that the publish-or-perish mindset for academic output has been a pretty direct example of “we tried directly incentivizing results, and instead of getting more science we got shittier science.”
Founding Companies. There are ecosystems for founding and investing in companies (startups and otherwise), which are ultimately about a particular result (making money). But this requires very long time horizons, which many people a) literally can’t pull off because they don’t have the runway, and b) are often too risk-averse to attempt if they have to risk only their own money.
The business venture world works because there’s been a lot of infrastructural effort put into enabling particular strategies (in the case of startups, an established progression of seed funding, series A, etc.; in the case of more traditional businesses, sometimes more straightforward loans).
The relation to goodhart is a bit weirder here, because yeah, overfocus on “known strategies” is also one of the pathologies that results in goodharting (i.e. everyone thinks social media is Hype, so everyone founds social media companies, but maybe by this point social media is overdone and you actually need to be looking for weirder things that people haven’t already saturated the market with).
But, the goodhart is instead “if you don’t put effort into maintaining the early stages of the strategy, despite many instances of that strategy failing… you just end up with less money.”
My sense [again, epistemic status fairly weak, based on things Paul Graham said but that I haven’t heard explicitly argued against] is that venture capitalists make the most money from the long tail of companies they invest in. Being willing to get “real results” requires being willing to tolerate lots of things that don’t pay off in results. And many of those companies in the long tail were startup ideas that didn’t sound like sure bets.
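The long-tail claim can be illustrated with a toy portfolio model (the return distribution below is invented purely for illustration; real venture returns are messier):

```python
import random

random.seed(1)  # reproducible sketch

def startup_return():
    # Invented distribution: most investments return nothing, a few return huge multiples.
    r = random.random()
    if r < 0.90:
        return 0.0    # total loss
    elif r < 0.99:
        return 2.0    # modest exit
    else:
        return 100.0  # rare outlier

portfolio = [startup_return() for _ in range(10_000)]
top_share = sum(sorted(portfolio, reverse=True)[:100]) / sum(portfolio)
print(top_share)  # the top 1% of investments supplies most of the total return
```

A fund that punished every zero-return bet as a bad result would, under these assumed numbers, be punishing the very behavior that produces nearly all of its returns.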
There is some sense in which “directly rewarding results” is of course the best way to avoid goodharting, but since we don’t actually have access to “direct results that actually represent the real thing” to reward, the impulse to directly reward results can often result in rewarding not-actually-results.
Sure, that all makes sense, but at least on LW it seems like we ought to insist on saying “rewarding results” when we mean rewarding results, and “deceiving ourselves into thinking we’re rewarding results” when we mean deceiving ourselves into thinking we’re rewarding results.
That makes sense, although I’m not actually sure either “rewarding results” or “deceiving ourselves into thinking we’re rewarding results” quite capture what’s going on here.
Like, I do think it’s possible to reward individual good things (whether blogposts or scientific papers) when you find them. The question is how this shapes the overall system. When you expect “good/real results” to be few and far between, the process of “only reward things that are obviously good and/or great” might technically be rewarding results, while still outputting fewer results on average than if you had rewarded people for following overall strategies like “pursue things you’re earnestly curious about”, and given people positive rewards for incremental steps along the way.
(Seems good to be precise about language here but I’m not in fact sure how to word this optimally. Meanwhile, earlier parts of the conversation were more explicitly about how ‘reward final results, and only final results’ just isn’t the strategy used in most of the business world)
Strong upvote for clear articulation of points I wanted to see made.
The most obvious benefit is that it is far easier for the party doing the rewarding and punishing: very little cognitive effort is required to assess whether a given result is positive or negative, in stark contrast to the large amounts of effort necessary to decide whether a given strategy has positive or negative expectation.
This part isn’t obviously/exactly correct to me. If we’re talking about posts and comments on LessWrong, it can be quite hard for me to assess whether a given post is correct or not (although even incorrect posts are often quite valuable parts of the discourse). It might also take a lot of information/effort to arrive at the belief that the strategy of “invest more effort, generate more ideas” ultimately leads to more good ideas, such that incentivizing generation itself is good. However, once I hold that belief, it’s relatively easy to apply. I see someone investing effort in adding to communal knowledge in a way that is plausibly correct/helpful; I then encourage this pro-social contribution, despite the fact that evaluating whether the post was actually correct or not* can be extremely difficult.
*”Correct or not” is a bit binary, but even assessing the overall “quality” or “value” of a post isn’t much easier; far harder than counting rabbits, in any case. However, if a post doesn’t seem obviously wrong (or even if it’s clearly wrong, but because of an understandable mistake many people might make), I can often confidently say that it is contributing to communal knowledge (often via the discussion it sparks, or simply because someone could correct a reasonable misunderstanding), and I overall want to encourage more of whatever generated it. I’m happy to get more posts like that, even if I might push for refinements in the process, say.
(Reacts, or separate upvote/downvote vs. agree/disagree buttons, will hopefully make it easier in the future to encourage effort even while expressing that I think something is wrong.)
We didn’t (or rather, shouldn’t) intend to reward or punish those “ancestor nodes”. We should intend to reward or punish the results.
You seem to have interpreted my comments as saying that we’re trying to reward some particular behavior, but we should do this by rewarding the results of that behavior. As you point out, this is not a wise plan.
But it’s also not what I am saying, at all. I am saying that we are (or, again, should be) trying to reward the results. Not the behavior that led to those results, but the results themselves.
I don’t know why you’re assuming that we’re actually trying to encourage some specific behavior. It’s certainly not what I am assuming. Doing so would not be a very good idea at all.
I think with that approach there are a great many results you’d fail to achieve. People can get animals to do remarkable things with shaping, and I would wager that those things can’t be achieved at all otherwise.
We first give the bird food when it turns slightly in the direction of the spot from any part of the cage. This increases the frequency of such behavior. We then withhold reinforcement until a slight movement is made toward the spot. This again alters the general distribution of behavior without producing a new unit. We continue by reinforcing positions successively closer to the spot, then by reinforcing only when the head is moved slightly forward, and finally only when the beak actually makes contact with the spot. … The original probability of the response in its final form is very low; in some cases it may even be zero. In this way we can build complicated operants which would never appear in the repertoire of the organism otherwise. By reinforcing a series of successive approximations, we bring a rare response to a very high probability in a short time. … The total act of turning toward the spot from any point in the box, walking toward it, raising the head, and striking the spot may seem to be a functionally coherent unit of behavior; but it is constructed by a continual process of differential reinforcement from undifferentiated behavior, just as the sculptor shapes his figure from a lump of clay.
Humans are more sophisticated than birds, but producing highly complex and abstruse truths in a format understandable to others is also a lot more complicated than getting a bird to put its beak in a particular spot. I think all the same mechanics are at work. If you want to get someone (including yourself) to do something as complex and difficult as producing valuable, novel, correct, expositions of true things on LessWrong—you’re going to have to reward the predictable intermediary steps.
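The shaping dynamic in the quoted Skinner passage can be caricatured in code. Below is a toy hill-climbing learner (my own construction, not Skinner’s): the target behavior is a 12-step sequence, and rewarding successive approximations finds it quickly, while rewarding only the complete final result reinforces nothing, because the complete result never occurs by chance.

```python
import random

random.seed(2)  # reproducible sketch

TARGET = [1] * 12  # the final behavior we want (e.g. "walk over and peck the spot")

def matches(behavior):
    # How close a behavior is to the target (number of matching steps).
    return sum(b == t for b, t in zip(behavior, TARGET))

def learn(shaped, max_tries=2000):
    behavior = [0] * len(TARGET)  # initial repertoire: nothing like the target
    for tries in range(1, max_tries + 1):
        candidate = behavior[:]
        i = random.randrange(len(candidate))
        candidate[i] = 1 - candidate[i]  # random small variation in behavior
        if shaped:
            # Reward successive approximations: keep any change at least as close.
            if matches(candidate) >= matches(behavior):
                behavior = candidate
        else:
            # Reward only the final, complete result.
            if candidate == TARGET:
                behavior = candidate
        if behavior == TARGET:
            return tries
    return None  # never learned within the budget

print(learn(shaped=True))   # succeeds within a few dozen tries
print(learn(shaped=False))  # None: the complete behavior never appears, so nothing is reinforced
```

Under the result-only regime, the reward criterion is never triggered, so the learner’s behavior never moves at all; that is the sparse-reward failure mode being argued about above.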
We don’t go to five-year-olds and say “the desired result is that you can write fluently, therefore no positive feedback on your marginal efforts until you can do so; in fact, I’m going to strike your knuckles every time you make a spelling error or produce anything which isn’t what we hope to see from you when you’re 12; we will only reward the final desired result, and you can back-propagate from that to figure out what’s good.” That’s really only a recipe for children who are unwilling to put any effort into learning to write, not for children who progressively put in effort over years to learn what it even looks like to be a competent writer.
This is beyond my earlier point that verifying results in our cases is often much harder than verifying that good steps were being taken.
We didn’t (or rather, shouldn’t) intend to reward or punish those “ancestor nodes”. We should intend to reward or punish the results.
I’m afraid this sentence doesn’t parse for me. You seem to be speaking of “results” as something to which the concepts of reward and punishment are applicable. However, I’m not aware of any context in which this is a meaningful (rather than nonsensical) thing to say. All theories of behavior I’ve encountered that make mention of rewards and punishments (e.g. operant conditioning) refer to them as means of influencing behavior. If there’s something else you’re referring to when you say “reward or punish the results”, I would appreciate it if you clarified what exactly that thing is.
I don’t see what could be simpler. Alice does something. That action has some result. We reward Alice, or punish her, based on the results of her action. There is nothing unusual or obscure here; I mean just what I say.
(There are cases where we do not want to take this approach, but they tend to both be controversial and to be unusual in certain important respects.)
Edit: And if you’re trying to use operant conditioning, of all things, to decide what social norms to have on a forum devoted to the art of rationality, then you’ve already admitted defeat, and this entire project is pointless.
assuming I am not risk-averse—but if I am, then I’m not going to be the one trying the high-variance strategy anyway
But, of course, everyone is risk-averse in almost every resource. Even the most ambitious startup founders are still risk-averse in total payment, just less so than others. I care less about my 10th million dollars than about any of my first 9 million, which already creates risk aversion. The same is true for status, or almost any other resource with which you might want to reward people.
But you do want to incentivize expected rabbits.
No, you want to incentivize rabbits, not expected rabbits. Trying to incentivize expected rabbits is mixing levels. Incentivize rabbits, and people will attend to expectation of rabbits themselves.
I… am confused?
Like, just because there is any chance of a variable becoming disconnected in a principal-agent problem doesn’t mean that it’s always a bad idea to incentivize intermediary metrics. I am not fully sure how to understand your point as anything besides “never incentivize any lead-metrics whatsoever, only ever incentivize successful output”, which seems like a recipe for sparse reward landscapes, and also not a common practice in almost any domain in which humans deal with principal-agent problems.
Your employer pays you if you show up for work, not only if you successfully get work done (at least on the day-to-day or month-to-month level). You pay your plumber if they show up, not only if they successfully fix your toilet.
Like, if you see a friend taking an action that you know and they know has a 50% chance of making $10 for you and your friend (let’s say for a communal club) and a 50% chance of losing $5, and then turns out they lose $5, then it seems better to still reward your friend for taking that action, instead of punishing them, given that you know the action had positive expected value.
(assuming you have mostly linear value of money at these stakes)
If they think the odds are 90% $10 and 10% −5$ and you think the odds are 10% $10 and 90% −5$ should you reward for trying to benefit or punish for having wrong beliefs that materially matter?
You should punish your friend for the loss, and reward them (twice as much) for a win. This creates the correct incentives.
No, because humans are risk-averse, at least in money terms, but also in most other currencies. If you do this, you increase the total risk for your friend, for no particular gain.
Punishment is also usually net-negative, whereas rewards tend to be zero-sum, so by adding a bunch of worlds where you added punishments, you destroyed a bunch of value, with no gain (in the world where you both have certainty about the payoff matrix).
One model here is that humans have diminishing returns on money, so in order to reward someone
2x
with dollars, you have to pay more than 2x the dollar amount, so your total cost is higher.A scenario with zero-sum actions and net-negative actions can only go downhill. This would seem to imply that if you have an opportunity to give feedback or not give feedback you should opt to get a guaranteed zero rather than risk destroying value.
Could you elaborate on this? I’m not at all sure what this is referring to.
Rewards are usually a transfer of resources (e.g. me giving you money), which tend to preserve total wealth (or status, or whatever other resource you are thinking about).
Unilateral punishments are usually not transfers of resource, they are usually one party imposing a cost on another party (like hitting them with a stick and injuring them), in a way that does not preserve total wealth (or health, or whatever other resource applies to the situation).
You certainly shouldn’t hit your friend with a stick if he loses $5 of your club’s money. I think this is fairly obvious, and it seems quite improbable that you were assuming that I was suggesting any such thing. So, given that we can’t possibly be talking about injuring anyone, or doing any such thing, how can your point about net-negative punishment apply? The more sensible assumption is that the punishment is of the same kind as the reward.
I think social punishments usually have the same form. Where rewards tend to be more of a transfer of status, and punishments more of a destruction of status (two people can destroy each others reputation with repeated social punishments).
There is also the bandwidth cost of punishment, as well as the simple fact that giving people praise usually comes with a positive emotional component for the receiver (in addition to the status and the reputation), whereas punishments usually come with an addition of stress and discomfort that reduces total output for a while.
In either case, I think the simpler case is made by simply looking at the assumption of diminishing returns in resources and realizing that the cost of giving someone a reward they care 2x about is usually larger than the cost of giving the reward twice, meaning that there is an inherent cost to high-variance reward landscapes.
If you show up, but don’t get work done, you get fired. (How quickly that happens varies from workplace to workplace, of course—but in many places it happens very quickly indeed.)
Yeah, but the fact that it takes a while and we have monthly wages instead of just all being contractors that are paid by the piece is kind of my point. Most of the economy does not pay for completed output, but for intermediary metrics that allow a much higher-level of stability.
But note that even if you don’t get fired immediately for failing to produce satisfactory work, you are likely to receive a dressing-down from your boss, poor evaluations, etc., or even something so simple as your team leader being visibly disappointed with you, even if they take no immediate action.
Now consider what that analogizes to, in the case at hand. Is a downvote, or a critical comment, more like being fired, or more like your boss telling you that your work isn’t up to par and that you should really try to do better?
I think it’s sort of like your boss telling you your work isn’t good, when your boss also isn’t paying you and you’re there as a volunteer.
If your boss isn’t paying you, then what’s the point of the employment analogy? That’s not employment at all, is it?
… what? Of course you only pay your plumber if they successfully fix your toilet!
My experience is definitely the opposite. Random Quora question also suggests that it’s common practice in plumbing to pay someone for the attempt, not for the solution. As someone who recently hired plumbers and electricians to fix a bunch of stuff in a new house we rented, this also matches with my experience. Not sure where your experience comes from.
In general, most contractors bill by the hour, not for completed output, and definitely not “output that the client thinks is worth it”, at least in my experience (there are obviously exceptions, though I found them relatively rare).
My experience comes from the same sort of thing: having, on many occasions, hired various people to do various sorts of work; and also from having worked for several years at a computer store that specialized in on-the-premises repair/service.
The Quora answer you linked doesn’t really support your point, as it’s quite clear about the prerequisite being an informed, explicit agreement between plumber and customer that the latter will pay the former regardless of outcome. (And even with that caveat, some of what the answer-giver says is suspect, and is not consistent with my experience.)
I do not know of any industry in which contractor agreements with variable payments that are dependent on the quality of the output are common practice. There is often an agreement on what it means to “complete the work”, but in almost every case both your downside and your upside are limited by a guaranteed upfront payment and a conditional final payment. It’s almost never the case that you can get 2x the money depending on the quality of your output, which seems necessary for some of the incentive schemes you outlined.
What does this have to do with anything? You originally said:
I don’t see the connection between “should you pay your plumber even if they don’t actually fix your toilet” and “should you pay your plumber twice as much if they fix your toilet twice as well”; the latter seems like a nonsensical question, and unrelated to the former.
(someone else downvoted. I was sort of torn between downvoting because it seemed importantly wrong and upvoting for stating the disagreement clearly, and ended up not voting either way. Someday we’ll probably have disagree reacts or something)
This isn’t intrinsically wrong, but it is wrong in many circumstances. I’m guessing this is a fairly important crux of disagreement.
The problem comes if either:
a) you actively punish attempts that had a high expectation of rabbits but didn’t produce rabbits. This just straightforwardly punishes high-variance strategies. You need at least some people doing low-variance strategies, because otherwise, in a week where nobody brings home any rabbits, everyone dies. But if you punish high-variance strategies whenever they don’t pay off, you’re going to end up with a lot fewer rabbits.
[there might be a further disagreement about how human psychology works and what counts as punishing]
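The arithmetic behind (a) can be sketched with made-up numbers (the payoffs and probabilities below are my illustrative assumptions, not figures from the discussion):

```python
# Two hunting strategies, in expectation:
#   low-variance:  1 rabbit every day, guaranteed
#   high-variance: 3 rabbits half the time, 0 otherwise

p_success = 0.5
low_var_rabbits = 1.0
high_var_rabbits = p_success * 3 + (1 - p_success) * 0   # expectation: 1.5

# The high-variance strategy produces more rabbits on average, so if
# punishing empty-handed days drives hunters back to the safe strategy,
# the tribe ends up with fewer rabbits overall.
assert high_var_rabbits > low_var_rabbits
```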
b) you systematically incentivize legibly producing countable rabbits, in a world where it turned out a lot of value wasn’t just about the rabbits, or that some rabbits were actually harder to notice. I think one of the major problems with Goodhart in the 20th century comes from expectation of legible results.
Figuring out how to navigate the tension between “much of value isn’t yet legible to us and overfocus on it is goodharting / destroying value” and “but, also, if you don’t focus on legible results you get nonsense” is, in my frame, basically the problem.
You should neither reward nor punish strategies or attempts at all, but results. If I am executing a high-variance strategy, and you punish poor results, and reward good results, in accordance with how poor/good they are, then (if I am right about my strategy having a positive expectation) I will—in expectation—be rewarded. This will incentivize me to execute said strategy (assuming I am not risk-averse—but if I am, then I’m not going to be the one trying the high-variance strategy anyway).
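The expectation claim here can be checked with a quick simulation, borrowing the 50% +$10 / 50% −$5 payoffs from earlier in the thread (the linear reward-in-proportion-to-results scheme is my assumption about what “in accordance with how poor/good they are” means):

```python
import random

random.seed(0)

def reward(result: float) -> float:
    # Reward or punish in direct proportion to the result itself.
    return result

def high_variance_trial() -> float:
    # 50% chance of +10, 50% chance of -5 (positive expected value).
    return 10.0 if random.random() < 0.5 else -5.0

trials = [reward(high_variance_trial()) for _ in range(100_000)]
average = sum(trials) / len(trials)

# Expected reward = 0.5*10 + 0.5*(-5) = 2.5 > 0, so a risk-neutral agent
# executing this strategy is rewarded in expectation.
assert average > 0
```

This only supports the conclusion for a risk-neutral agent, which is exactly the assumption flagged in the parenthetical above.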
I was talking about rabbits (or things very similar to rabbits). I made, and make, no guarantees that the analysis applies when analogized to anything very different. (It seems clear that the analysis does apply in some very different situations, and does not apply in others.) Reasoning by analogy is dangerous; if we propose to attempt it, we need to be very clear about what the assumptions of the model are, and how the situations we are analogizing differ, and what that does to our assumptions.
This statement is presented in a way that suggests the reader ought to find it obvious, but in fact I don’t see why it’s obvious at all. If we take the quoted statement at face value, it appears to be suggesting that we apply our rewards and punishments (whatever they may be) to something which is causally distant from the agent whose behavior we are trying to influence—namely, “results”—and, moreover, that this approach is superior to the approach of applying those same rewards/punishments to something which is causally immediate—namely, “strategies”.
I see no reason this should be the case, however! Indeed, it seems to me that the opposite is true: if the rewards and punishments for a given agent are applied based on a causal node which is separated from the agent by multiple causal links, then there is a greater number of ancestor nodes that said rewards/punishments must propagate through before reaching the agent itself. The consequences of this are twofold: firstly, the impact of the reward/punishment is diluted, since it must be divided among a greater number of potential ancestor nodes. And secondly, because the agent has no way to identify which of these ancestor nodes we “meant” to reward or punish, our rewards/punishments may end up impacting aspects of the agent’s behavior we did not intend to influence, sometimes in ways that go against what we would prefer. (Moreover, the probability of such a thing occurring increases drastically as the thing we reward/punish becomes further separated from the agent itself.)
The takeaway from this, of course, is that strategically rewarding and punishing things grows less effective as the proxy on which said rewards and punishments are based grows further from the thing we are trying to influence—a result which sometimes goes by a more well-known name. This then suggests that punishing results over strategies, far from being a superior approach, is actually inferior: it has lower chances of influencing behavior we would like to influence, and higher chances of influencing behavior we would not like to influence.
(There are, of course, benefits as well as costs to rewarding and punishing results (rather than strategies). The most obvious benefit is that it is far easier for the party doing the rewarding and punishing: very little cognitive effort is required to assess whether a given result is positive or negative, in stark contrast to the large amounts of effort necessary to decide whether a given strategy has positive or negative expectation. This is why, for example, large corporations—which are often bottlenecked on cognitive effort—generally reward and punish their employees on the basis of easily measurable metrics. But, of course, this is a far cry from claiming that such an approach is simply superior to the alternative. (It is also why large corporations so often fall prey to Goodhart’s Law.))
Strange. You bring up Goodhart’s Law, but the way you apply it seems exactly backwards to me. If you’re rewarding strategies instead of results, and someone comes up with a new strategy that has far better results than the strategy you’re rewarding, you fail to reward people for developing better strategies or getting better results. This seems like it’s exactly what Goodhart was trying to warn us about.
I agree this is a weird place to bring up Goodhart that requires extra justification. But I do think it makes sense here. (Though I also agree with Said elsethread that it matters a lot what we’re actually talking about – rabbits, widgets, companies, scientific papers and blogposts might all behave a bit differently)
The two main issues are:
it’s hard to directly incentivize results with fuzzy, unpredictable characteristics.
it’s hard to directly incentivize results over long timescales
In short: preoccupation with “rewarding results”, in situations where you can’t actually reward results, can result in goodharting for all the usual reasons.
Two examples here are:
Scientific papers. Probably the closest to a directly relevant example before we start talking about blogposts in particular. My impression [epistemic status: relatively weak, based on anecdotes, but it seems like everyone I’ve heard talk about this roughly agreed with these anecdotes] is that the publish-or-perish mindset for academic output has been a pretty direct example of “we tried directly incentivizing results, and instead of getting more science we got shittier science.”
Founding Companies. There are ecosystems for founding and investing in companies (startups and otherwise), which are ultimately about a particular result (making money). But this requires very long time horizons, which many people a) literally can’t pull off, because they don’t have the runway, and b) are often too risk-averse to attempt if they had to risk just their own money.
The business venture world works because there’s been a lot of infrastructural effort put into enabling particular strategies: in the case of startups, an established pipeline of seed funding, series A, etc.; in the case of more conventional businesses, sometimes more straightforward loans.
The relation to Goodhart is a bit weirder here because, yeah, overfocus on “known strategies” is also one of the pathologies that results in goodharting (i.e. everyone thinks social media is Hype, so everyone founds social media companies, but maybe by this point social media is overdone and you actually need to be looking for weirder things that people haven’t already saturated the market with).
But, the goodhart is instead “if you don’t put effort into maintaining the early stages of the strategy, despite many instances of that strategy failing… you just end up with less money.”
My sense [again, epistemic status fairly weak based on things Paul Graham said, but that I haven’t heard explicitly argued against] is venture capitalists make the most money from the long tail of companies they invest in. Being willing to get “real results” requires being willing to tolerate lots of things that don’t pay off in results. And many of those companies in the long tail were startup ideas that didn’t sound like sure bets.
There is some sense in which “directly rewarding results” is of course the best way to avoid goodharting, but since we don’t actually have access to “direct results that actually represent the real thing” to reward, the impulse to directly reward results can often result in rewarding not-actually-results.
Sure, that all makes sense, but at least on LW it seems like we ought to insist on saying “rewarding results” when we mean rewarding results, and “deceiving ourselves into thinking we’re rewarding results” when we mean deceiving ourselves into thinking we’re rewarding results.
That makes sense, although I’m not actually sure either “rewarding results” or “deceiving ourselves into thinking we’re rewarding results” quite capture what’s going on here.
Like, I do think it’s possible to reward individual good things (whether blogposts or scientific papers) when you find them. The question is how this shapes the overall system. When you expect “good/real results” to be few and far between, the process of “only reward things that are obviously good and/or great” might technically be rewarding results, while still outputting fewer results on average than if you had rewarded people for following overall strategies like “pursue things you’re earnestly curious about”, and giving people positive rewards for incremental steps along the way.
(Seems good to be precise about language here but I’m not in fact sure how to word this optimally. Meanwhile, earlier parts of the conversation were more explicitly about how ‘reward final results, and only final results’ just isn’t the strategy used in most of the business world)
Strong upvote for clear articulation of points I wanted to see made.
This part isn’t obviously/exactly correct to me. If we’re talking about posts and comments on LessWrong, it can be quite hard for me to assess whether a given post is correct or not (although even incorrect posts are often quite valuable parts of the discourse). It might also take a lot of information/effort to arrive at the belief that the strategy of “invest more effort, generate more ideas” ultimately leads to more good ideas, such that incentivizing generation itself is good. However, once I hold that belief, it’s relatively easy to apply it. I see someone investing effort in adding to communal knowledge in a way that is plausibly correct/helpful; I then encourage this pro-social contribution despite the fact that evaluating whether the post was actually correct or not* can be extremely difficult.
*“Correct or not” is a bit binary, but even the overall “quality” or “value” of a post isn’t much easier to assess. Far harder than number of rabbits. However, if a post doesn’t seem obviously wrong (or even if it’s clearly wrong, but because of an understandable mistake many people might make), I can often confidently say that it is contributing to communal knowledge (often via the discussion it sparks, or simply because someone could correct a reasonable misunderstanding), and I overall want to encourage more of whatever generated it. I’m happy to get more posts like that, even if I push for refinements in the process, say.
(Reacts or separate upvote/downvote vs agree/disagree buttons will hopefully make it easier in the future to encourage effort even while expressing that I think something is wrong.)
You’re still missing my point.
We didn’t (or rather, shouldn’t) intend to reward or punish those “ancestor nodes”. We should intend to reward or punish the results.
You seem to have interpreted my comments as saying that we’re trying to reward some particular behavior, but we should do this by rewarding the results of that behavior. As you point out, this is not a wise plan.
But it’s also not what I am saying, at all. I am saying that we are (or, again, should be) trying to reward the results. Not the behavior that led to those results, but the results themselves.
I don’t know why you’re assuming that we’re actually trying to encourage some specific behavior. It’s certainly not what I am assuming. Doing so would not be a very good idea at all.
I think with that approach there are a great many results you’d fail to achieve. People can get animals to do remarkable things with shaping and I would wager that you can’t do them at all otherwise.
From the Wikipedia article on Shaping (psychology):
Humans are more sophisticated than birds, but producing highly complex and abstruse truths in a format understandable to others is also a lot more complicated than getting a bird to put its beak in a particular spot. I think all the same mechanics are at work. If you want to get someone (including yourself) to do something as complex and difficult as producing valuable, novel, correct, expositions of true things on LessWrong—you’re going to have to reward the predictable intermediary steps.
We don’t go to five-year-olds and say “the desired result is that you can write fluently, therefore no positive feedback on your marginal efforts until you can do so; in fact, I’m going to strike your knuckles every time you make a spelling error or anything which isn’t what we hope to see from you when you’re 12; we will only reward the final desired result and you can back-propagate from that to figure out what’s good.” That’s really only a recipe for children who are unwilling to put any effort into learning to write, not those who progressively put in effort over years to learn what it even looks like to be a competent writer.
This is beyond my earlier point that verifying results in our cases is often much harder than verifying that good steps were being taken.
See the “Edit:” part of this comment, which is my response to your comment also.
I’m afraid this sentence doesn’t parse for me. You seem to be speaking of “results” as something to which the concept of rewards and punishments is applicable. However, I’m not aware of any context in which this is a meaningful (rather than nonsensical) thing to say. All theories of behavior I’ve encountered that make mention of the concept of rewards and punishments (e.g. operant conditioning) refer to them as a means of influencing behavior. If there’s something else you’re referring to when you say “reward or punish the results”, I would appreciate it if you clarified what exactly that thing is.
I don’t see what could be simpler. Alice does something. That action has some result. We reward Alice, or punish her, based on the results of her action. There is nothing unusual or obscure here; I mean just what I say.
(There are cases where we do not want to take this approach, but they tend to both be controversial and to be unusual in certain important respects.)
Edit: And if you’re trying to use operant conditioning, of all things, to decide what social norms to have on a forum devoted to the art of rationality, then you’ve already admitted defeat, and this entire project is pointless.
But, of course, everyone is risk-averse in almost every resource. Even the most ambitious startup founders are still risk-averse in total payment, just less so than others. I care less about my 10th million dollars than about any of my first 9 million dollars, which already creates risk aversion. The same is true for status or almost any other resource with which you might want to reward people.