I’m more confident that we can’t solve AI alignment without fixing this, than I am that we can fix it.
Accounts of late-20th-Century business practices seem like they report much more Werewolfing than accounts of late-19th-Century business practices—advice on how to get ahead has changed a lot, as have incidental accounts of how things work. If something’s changed recently, we should have at least some hope of changing it back, though obviously we need to understand the reasons. Taking a longer view, new civilizations have emerged from time to time, and it looks to me like rising civilizations often have superior incentive-alignment and information-processing to the ones they displace or conquer. This suggests that, at worst, people get lucky from time to time.
Pragmatically, a higher-than-historically-usual degree of freedom of speech, and liberalism more generally, seem like they ought to both make it easier to think collectively about political problems than it’s been in the past, and make doing so more obviously appealing, since public reason seemed to do really well at improving a lot of people’s lives pretty recently.
I’m more confident that we can’t solve AI alignment without fixing this, than I am that we can fix it.
Can you give some examples of technical AI alignment efforts going wrong as a result of bad credit assignment (assuming that’s what you mean)? To me it seems that to the extent things in that field aren’t headed in the right direction, it’s more a result of people underestimating the philosophical difficulty, or being too certain about some philosophical assumptions, or being too optimistic in general, that kind of thing.
Accounts of late-20th-Century business practices seem like they report much more Werewolfing than accounts of late-19th-Century business practices—advice on how to get ahead has changed a lot, as have incidental accounts of how things work.
This seems easily explainable by the fact that businesses have gotten a lot bigger to take advantage of economies of scale offered by new technologies, so coordination / principal-agent problems have gotten a lot worse as a result.
Taking a longer view, new civilizations have emerged from time to time, and it looks to me like rising civilizations often have superior incentive-alignment and information-processing to the ones they displace or conquer. This suggests that, at worst, people get lucky from time to time.
I guess this gives some hope, but not much.
Pragmatically, a higher-than-historically-usual degree of freedom of speech, and liberalism more generally, seem like they ought to both make it easier to think collectively about political problems than it’s been in the past, and make doing so more obviously appealing, since public reason seemed to do really well at improving a lot of people’s lives pretty recently.
Not sure I understand this part. Are you proposing to increase freedom of speech and liberalism beyond the current baseline in western societies? If so how?
Are you proposing to increase freedom of speech and liberalism beyond the current baseline in western societies?
No, just saying that while I agree the problem looks quite hard (like, world-historically, a robust solution would be about as powerful as, well, cities), current conditions seem unusually favorable to people trying to improve social coordination via explicit reasoning. Conditions are slightly less structurally favorable than in the Enlightenment era, but on the other hand we have the advantage of being able to look at the Enlightenment’s track record and try to explicitly account for its failures.
Can you give some examples of technical AI alignment efforts going wrong as a result of bad credit assignment (assuming that’s what you mean)?
If orgs like OpenAI and Open Philanthropy Project are sincerely trying to promote technical AI alignment efforts, then they’re obviously confused about the fundamental concept of differential intellectual progress.
If, on the other hand, we think they’re just being cynical and collecting social credit for labeling things AI safety rather than making a technical error, then the honest AI safety community seems to have failed to create clarity about this fact among their supporters. Not only would this level of coordination fail to create an FAI due to “treacherous turn” considerations, it can’t even be bothered to try to deny resources to optimizing processes that are already known to be trying to deceive us!
If orgs like OpenAI and Open Philanthropy Project are sincerely trying to promote technical AI alignment efforts, then they’re obviously confused about the fundamental concept of differential intellectual progress.
I think I have more uncertainty than you do about whether OpenAI/OpenPhil is doing the right thing, but conditional on them not doing the right thing, and also not just being cynical, I don’t think being confused about the fundamental concept of differential intellectual progress is the best explanation of why they’re not doing the right thing. It seems more likely that they’re wrong about how much of a broad base of ML expertise/capability is needed internally in an organization to make progress in AI safety, or about what is the best strategy to cause differential intellectual progress or bring about an aligned AGI or prevent AI risk.
If, on the other hand, we think they’re just being cynical and collecting social credit for labeling things AI safety rather than making a technical error, then the honest AI safety community seems to have failed to create clarity about this fact among their supporters.
I personally assign less than 20% probability that “they’re just being cynical and collecting social credit”, so I don’t see why I would want to “create clarity about this fact”. If you think “the honest AI safety community” should assign a much higher credence to this, it doesn’t seem like you’ve made enough of a case for it. (I’m not sure that OpenAI/OpenPhil is doing the wrong thing, and if they are, there seem to be a lot of explanations for it besides “being cynical and collecting social credit”. Paul Christiano works there and I’m pretty sure he isn’t being cynical and wouldn’t continue to work there if most people or the leaders there were being cynical.)
It seems more plausible that their technical errors are caused by subconscious biases that ultimately stem from motivations or evolutionary pressures related to social credit. But in that case I probably have similar biases and I don’t see a strong reason to think I’m less affected by them than OpenAI/OpenPhil, so it doesn’t seem right to accuse them of that when I’m trying to argue for my own positions. Accusing others of bias also seems less effective in terms of changing minds (i.e., it seems likely to antagonize people and make them stop listening to you) than just making technical arguments. To the extent I do think technical errors are caused by biases related to social credit, I think a lot of those biases are “baked in” by evolution and won’t go away quickly if we do improve credit assignment.
But in that case I probably have similar biases and I don’t see a strong reason to think I’m less affected by them than OpenAI/OpenPhil, so it doesn’t seem right to accuse them of that when I’m trying to argue for my own positions.
This seems backwards to me. Surely, if you’re likely to make error X which you don’t want to make, it would be helpful to build shared models of the incidence of error X and help establish a norm of pointing it out when it occurs in others, so that others will be willing and able to correct you in the analogous situation.
It doesn’t make any sense to avoid trying to help someone by pointing out their mistake because you might need the same kind of help in the future, at least for nonrivalrous goods like criticism. If you don’t think of correcting this kind of error as help, then you’re actually just declaring intent to commit fraud. And if you’d find it helpful but expect others to think of it as unwanted interference, then we’ve found an asymmetric weapon that helps with honesty but not with dishonesty.
Accusing others of bias also seems less effective in terms of changing minds (i.e., it seems likely to antagonize people and make them stop listening to you) than just making technical arguments. To the extent I do think technical errors are caused by biases related to social credit, I think a lot of those biases are “baked in” by evolution and won’t go away quickly if we do improve credit assignment.
A major problem here is that people can collect a disproportionate amount of social credit for “working on” a problem by doing things that look vaguely similar to a coherent program of action for addressing the problem, while making “errors” in directions that systematically further their private or institutional interests. (For instance, OpenAI’s habit of taking both positions on questions like “should AI be open?”, depending on the audience.) We shouldn’t expect people to stop being tempted to employ this strategy, as long as it’s effective at earning social capital. That’s a reason to be more clear, not less clear, about what’s going on—as long as people who understand what’s going on obscure the issue to be polite, this strategy will continue to work.
This seems backwards to me. Surely, if you’re likely to make error X which you don’t want to make, it would be helpful to build shared models of the incidence of error X and help establish a norm of pointing it out when it occurs in others, so that others will be willing and able to correct you in the analogous situation.
I think that would make sense if I had a clear sense of how exactly biases related to social credit are causing someone to make a technical error, but usually it’s more like “someone disagrees with me on a technical issue and we can’t resolve the disagreement; it seems pretty likely that one or both of us is affected by some sort of bias that’s related to social credit and that’s the root cause of the disagreement, but it could also be something else, like being naturally optimistic vs pessimistic, or different past experiences/backgrounds”. How am I supposed to “create clarity” in that case?
That’s a reason to be more clear, not less clear, about what’s going on—as long as people who understand what’s going on obscure the issue to be polite, this strategy will continue to work.
As I mentioned before, I don’t entirely understand what is going on, in other words I have a lot of uncertainty about what is going on. Maybe that subjective uncertainty is itself a kind of subconscious “werewolfy” or blame-avoiding behavior on my part, but I also have uncertainty about that, so given the potential downsides if you’re wrong, overall I don’t see enough reason to adopt the kind of policy that you’re suggesting.
To backtrack a bit in our discussion, I think I now have a better sense of what kind of problems you think bad credit assignment is causing in AI alignment. It seems good that someone is working in this area (e.g., maybe you could figure out the answers to my questions above) but I wish you had more sympathy for people like me (I’m guessing my positions are pretty typical for what you think of as the “honest AI safety community”).
BTW you seem to be strongly upvoting many (but not all?) of your own comments, which I think most people are not doing or doing very rarely. Is it an attempt to signal that some of your comments are especially important?
In a competitive attention market without active policing of the behavior pattern I’m describing, it seems wrong to expect participants getting lots of favorable attention and resources to be honest, as that’s not what’s being selected for.
There’s a weird thing going on when, if I try to discuss this, I either get replies like Raemon’s claim elsewhere that the problem seems intractable at scale (and it seems like you’re saying a similar thing at times), or replies to the effect that there are lots of other good reasons why people might be making mistakes, and it’s likely to hurt people’s feelings if we overtly assign substantial probability to dishonesty, which will make it harder to persuade them of the truth. The obvious thing that’s missing is the intermediate stance of “this is probably a big pervasive problem, and we should try at all to fix it by the obvious means before giving up.”
It doesn’t seem very surprising to me that a serious problem would already have been addressed up to the point where both 1) it’s very hard to make any further progress on the problem and 2) the remaining cost from not fully solving the problem can be lived with.
The obvious thing that’s missing is the intermediate stance of “this is probably a big pervasive problem, and we should try at all to fix it by the obvious means before giving up.”
It seems to me that people like political scientists, business leaders, and economists have been attacking the problem for a while, so it doesn’t seem that likely there’s a lot of low-hanging fruit to be found by “obvious means”. I have some more hope that the situation with AI alignment is different enough from what people thought about in the past (e.g., a lot of the people involved are at least partly motivated by altruism, unlike the kinds of people described in Moral Mazes) that you can make progress on credit assignment as applied to AI alignment, but you still seem to be too optimistic.
What are a couple clear examples of people trying to fix the problem locally in an integrated way, rather than just talking about the problem or trying to fix it at scale using corrupt power structures for enforcement?
It seems to me like the nearest thing to a direct attempt was the Quakers. As far as I understand, while they at least tried to coordinate around high-integrity discourse, they put very little work into explicitly modeling the problem of adversarial behavior or developing robust mechanisms for healing or routing around damage to shared information processing.
I’d have much more hope about existing AI alignment efforts if it seemed like what we’ve learned so far had been integrated into the coordination methods of AI safety orgs, and technical development were more focused on current alignment problems.
But in that case I probably have similar biases and I don’t see a strong reason to think I’m less affected by them than OpenAI/OpenPhil, so it doesn’t seem right to accuse them of that when I’m trying to argue for my own positions.
You’re not, as far as I know, promoting an AI safety org raising the kind of funds, or attracting the kind of attention, that OpenAI is. Likewise you’re not claiming mainstream media attention or attracting a large donor base the way GiveWell / Open Philanthropy Project is. So there’s a pretty strong reason to expect that you haven’t been selected for cognitive distortions that make you better at those things anywhere near as strongly as people in those orgs have.
BTW you seem to be strongly upvoting many (but not all?) of your own comments, which I think most people are not doing or doing very rarely. Is it an attempt to signal that some of your comments are especially important?
I generally have a bias towards strong upvote or strong downvote, and I don’t except my own comments from this.