Can’t answer for habryka, but my current guess of what you’re pointing at here is something like: “the sort of drive towards consistency is part of an overall pattern that seems net harmful, and the correct action is more like stopping and thinking than like ‘trying to do better at what you were currently doing’.”
(You haven’t yet told me whether this comment successfully passed your ITT, but it’s my working model of your frame.)
I think habryka (and, separately but not coincidentally, me too) has a belief that he’s the sort of person for whom looking for opportunities to improve consistency is beneficial. I’m not sure whether you’re disagreeing with that, or whether your point is more that the median LessWronger will take the wrong advice from this?
[Assuming I’ve got your frame right, I obviously disagree quite a bit – but I’m not sure what to do about it locally, here]
Thanks for checking—I’m trying to say something pretty different.
It seems like the frame of the OP is lumping together the kind of consistency that comes from using the native architecture to model the deep structure of reality (see also Geometers, Scribes, and the structure of intelligence), and the kind of consistency that comes from trying to perform a guaranteed level of service for an outside party (see also Unreal’s idea of Dependability), and an important special case of the latter is rule-following as a form of submission or blame-avoidance. These are very different mental structures, respond very differently to incentives, and learn very different things from criticism. (Nightmare of the Perfectly Principled is my most direct attempt to point to this distinction.)
People who are trying to submit or avoid blame will try to alleviate the pressure of criticism with minimal effort, in ways that aren’t connected to their other beliefs. On the other hand, people with structured models will sometimes leapfrog past the critic, or jump in another direction entirely, as Benito pointed out in A Sketch of Good Communication.
If we don’t distinguish between these cases, then attempts to reason about the “optimal” attitude towards integrity or accountability will end up a lumpy, unsatisfactory linear compromise between the following policy goals:
Helping people with structurally integrated models notice tensions in their models that they can learn from.
Distinguishing people with structurally integrated models from those who (at least in the relevant domain) are mostly just trying not to stick out as wrong, so we can stop listening to the second group.
Establishing and enforcing the norms needed to coordinate actions among equals (e.g. shared expectations about promises).
Compelling a complicated performance from inferiors, or avoiding punishment by superiors trying to compel a complicated performance from you.
Converting people without structurally integrated models into people with structurally integrated models (or vice versa).
Depending on what problem you’re trying to solve, habryka’s statement that “if someone changes their stated principles in an unpredictable fashion every day (or every hour), then I think most of the benefits of openly stating your principles disappear” can be almost exactly backwards.
If your principles predictably change based on your circumstances, that’s reasonably likely to be a kind of adversarial optimization similar to A/B testing of communication. At the very least, the stated principles don’t mean their literal content.
But there’s still plenty of point in stating principles when the changes come from learning new things fast. In that case, change represents noise, which is costly, but much less costly than messaging optimized for extraction. And of course a change in principles doesn’t need to imply a matching change in behavior—your new principles can and should take into account the fact that people may have committed resources based on your old stated principles.
In summary, my objection is that habryka seems to be thinking of beliefs as a special case of promises, while I think that if we’re trying to succeed based on epistemic rationality, we should be modeling promises as a special case of beliefs. For more detail on that, see Bindings and Assurances.
Agree strongly with this decomposition of integrity. They’re definitely different (although correlated) things.
My biggest disagreement with this model is that the first form (structurally integrated models) seems to me to be something broader? Something like, you have structurally integrated models of how things work and what matters to you, and take the actions suggested by the models to achieve what matters to you based on how things work?
Need to think through this in more detail. One can have what one might call integrity of thought without what one might call integrity of action based on that thought—you have the models, but others (and you) can’t count on you to act on them. And one can have integrity of action without integrity of thought, in the sense that you can be counted on to perform certain actions in certain circumstances: you’ll do them whether or not they make any sense, but you can at least be counted on. Or you can have both.
And I agree you have to split integrity of action into keeping promises when you make them slash following one’s own code, and keeping to the rules of the system slash following others’ codes, especially codes that determine what is blameworthy. To me, that third special case isn’t integrity. It’s often a good thing, but it’s a different thing—it counts as integrity if and only if one is following those rules because of one’s own code saying one should follow the outside code. We can debate under what circumstances that is or isn’t the right code, and should.
So I think for now I have it as Integrity-1 (Integrity of Thought) and Integrity-2 (Integrity of Action), plus a kind of False-Integrity-3 (Integrity of Blamelessness) that is worth having a name for, and worth tracking who does and doesn’t have it, in what circumstances, to what extent, like the other two, but that isn’t obviously something it’s better to increase than decrease by default. Whereas Integrity-1 is by default to be increased, as is Integrity-2, and if you disagree with that, this implies to me that there’s a conflict causing you to want others to be less effective, or that you’re otherwise trying to do extraction or be zero-sum.
It seems to me that integrity of thought is actually quite a lot easier if it constrains the kind of anticipations that authentically and intuitively affect actions. Actions can still diverge from beliefs if someone with integrity of thought gets distracted enough to drop into a stereotyped habit (e.g. if I’m a bit checked out while driving and end up at a location I’m used to going to instead of the one I need to be at) or is motivated to deceive (e.g. corvids that think carefully about how to hide their food from other corvids).
The kind of belief-action split we’re used to seeing, I think, involves a school-broken sort of “believing” that’s integrated with the structures that are needed to give coherent answers on tests, but severed from thinking about one’s actual environment and interests.
The most important thing I did for my health in the last few years was healing this split.
It seems to me that False-Integrity-3 could be named Integrity of Innocence.
The concerns here make sense.
Something I still can’t tell about your concern, though: one of the things that seemed like the “primary takeaway” here, at least to me, is the concept of thinking carefully about who you want to be accountable to (and being wary of holding yourself accountable to too many people who won’t be able to understand the more complex moral positions you might hold).
So far, having thought that concept through the various lenses of integrity you list here, it doesn’t seem like something that’s likely to make things worse. Do you think it is?
(The other claim, to think about what sort of incentives you want for yourself, does seem like the sort of thing some people might interpret as an instruction to create coercive environments for themselves. That’s fairly different from what I think habryka was aiming at, but I’d agree that might not be clear from this post.)
I don’t think I find that objectionable; it just didn’t seem particularly interesting as a claim. It’s as old as “you can only serve one master,” God vs. Mammon, etc.—you can’t do well at accountability to mutually incompatible standards. I think it depends a lot on the type and scope of accountability, though.
If the takeaway were what mattered about the post, why include all the other stuff?
I think habryka was trying to get across a more cohesive worldview rather than just a few points. I also don’t know that my interpretation is the same as his. But here are some points that I took from this. (Hmm. These may not have been even slightly obvious in the current post, but were part of the background conversation that prompted it, and would probably have eventually been brought up in a future post. And I think the OP at least hints at them)
On Accountability
First, I think there are people in the LessWrong readership who still have some naive conception of “be accountable to the public”, which is in fact a recipe for
It’s as old as “you can only serve one master,”

This is pretty different from how I’d describe this.
In some sense, you only get to serve one master. But value is complex and fragile. So there may be many facets of integrity or morality that you find important to pay attention to, and you might be missing some of them. Your master may contain multitudes.
Any given facet of morality (or, more generally, of the things I care about) is complicated. What I might want, in order to hold myself to a higher standard than I currently meet, is to have several people whom I hold myself accountable to, each of whom pays deep attention to a different facet.
If I’m running a company, I might want to be accountable to
people who deeply understand the industry I’m working in
people who deeply understand human needs, coercion, etc, who can tell me if I’m mistreating my workers,
people who understand how my industry interacts with other industries, the general populace, or the environment, who can call me out if I’m letting negative externalities run wild.
[edit] maybe also just a regular person who sanity-checks whether I seem crazy
I might want multiple people for each facet, who look at that facet through a different lens.
By having these facets represented in concrete individuals, I also improve my ability to resolve confusions about how to trade off multiple sacred values. Each individual might deeply understand their domain and see it as most important. But if they disagree, they can doublecrux with each other, or with me, and I can try to integrate their views into something coherent and actionable.
There’s also the important question of operationalizing “what does accountable mean?” There are different powers you could give these people, possibly including:
Emergency Doublecrux button – they can demand N hours of your time per year, at least forcing you to have a conversation to justify yourself
Vote of No Confidence – i.e. if your project still seems good but you seem corrupt, they can fire you and replace you
Shut down your project, if it seems net negative
There might be some people you trust with some powers but not others (i.e. you might think someone has good perspectives that justify the emergency double crux button but not the “fire you” button)
There’s a somewhat different conception you could have of all this that’s more coalition focused than personal development focused.
I feel like all of this mixes together info sources and incentives, so it feels a bit wrong to say I agree, but also feels a bit wrong to say I disagree.
I agree that there’s a better, crisper version of this that has those more distinct.
I’m not sure if the end product, for most people, should keep them distinct because by default humans seem to use blurry clusters of concepts to simplify things into something manageable.
But, I think if you’re aiming to be a robust agent, or to build a robustly agentic organization, there is something valuable about keeping these crisply separate so you can reason about them well. (You’ve previously mentioned that this is analogous to the friendly AI problem, and I agree.) I think it’s a good project for many people in the rationalsphere to have undertaken to deepen our understanding, even if it turns out not to be practical for the average person.
The “different masters” thing is a special case of the problem of accepting feedback (i.e. learning from approval/disapproval or reward/punishment) from approval functions in conflict with each other or your goals. Multiple humans trying to do the same or compatible things with you aren’t “different masters” in this sense, since the same logical-decision-theoretic perspective (with some noise) is instantiated on both.
But also, there’s all sorts of gathering data from others’ judgment that doesn’t fit the accountability/commitment paradigm.
To give a concrete example, I expect math prodigies to have the easiest time solving any given math problem, but even so, I don’t expect that a system that punishes the students who don’t complete their assignments correctly will serve the math prodigies well. This, even if under other, totally different circumstances it’s completely appropriate to compel performance of arbitrary assignments through the threat of punishment.