Thoth Hermes
But humans are capable of thinking about what their values “actually should be” including whether or not they should be the values evolution selected for (either alone or in addition to other things). We’re also capable of thinking about whether things like wireheading are actually good to do, even after trying it for a bit.
We don’t simply commit to tricking our reward systems forever and only doing that, for example.
So that overall suggests a level of coherency and consistency in the “coherent extrapolated volition” sense. Evolution enabled CEV without us becoming completely orthogonal to evolution, for example.
Unfortunately, I do not have a long response prepared to answer this (and perhaps it would be somewhat inappropriate, at this time), however I wanted to express the following:
They wear their despair on their sleeves? I am admittedly somewhat surprised by this.
“Up to you” means you can select better criteria if you think that would be better.
I think if you ask people a question like, “Are you planning on going off and doing something / believing in something crazy?”, they will, generally speaking, say “no”, and the more closely your question resembles that one, the more likely they are to say “no”, even if you didn’t word it exactly that way. My guess is that the way you worded your question at least heavily implied that you meant “crazy.”
To be clear, they might have said “yes” (that they will go and do the thing you think is crazy), but I doubt they will internally represent that thing, or wanting to do it, as “crazy.” Thus the answer is probably going to be either “no” (as a partial lie, where the “no” indirectly points to the “crazy” assertion), or “yes” (also as a partial lie, pointing to taking the action).
In practice, people have a very hard time instantiating the status identifier “crazy” on themselves, and I don’t think that can be easily dismissed.
I think you heavily overestimate the utility of the word “crazy,” given that there are many situations where the people relevant to the conversation cannot use the word in the same way. Words should have the same meaning to the people in the conversation, and since some people using this word are guaranteed to perceive it as hostile and some are not, its meaning is inherently asymmetrical.
I also think you’ve brought in too much risk of “throwing stones in a glass house” here. The LW memespace is, in my estimation, full of ideas besides Roko’s Basilisk that I would also consider “crazy” in the same sense that I believe you mean it: Wrong ideas which are also harmful and cause a lot of distress.
Pessimism, submitting to failure and defeat, high “p(doom)”, both MIRI and CFAR giving up (by considering the problems they wish to solve too inherently difficult, rather than concluding they must be wrong about something), and people being worried that they are “net negative” despite their best intentions, are all (IMO) pretty much the same type of “crazy” that you’re worried about.
Our major difference, I believe, is in why we think these wrong ideas persist, and what causes them to be generated in the first place. The ones I’ve mentioned don’t seem to be caused by individuals suddenly going nuts against the grain of their egregore.
I know this is a problem you’ve mentioned before and consider both important and unsolved, but I think it would be odd to claim both that it seems to be notably worse in the LW community and that it is only the result of individuals going crazy on their own (and thus to conclude that the community’s overall sanity can be reliably increased by ejecting those people).
By the way, I think “sanity” is a certain type of feature which is considerably “smooth under expectation” which means roughly that if p(person = insane) = 25%, that person should appear to be roughly 25% insane in most interactions. In other words, it’s not the kind of probability where they appear to be sane most of the time, but you suspect that they might have gone nuts in some way that’s hard to see or they might be hiding it.
The flip side of that is that if they only appear to be, say, 10% crazy in most interactions, then I would lower your assessment of their insanity to basically that much.
I still don’t find this feature altogether that useful, but using it this way is still preferable to a binary feature.
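To make the distinction concrete, here is a toy simulation (all of the numbers are made up purely for illustration; nothing here comes from the original discussion):

```python
import random

random.seed(0)
N = 10_000  # number of observed interactions (illustrative)

# "Smooth" model: p(insane) = 0.25 shows up as ~25% of interactions looking off.
smooth = sum(random.random() < 0.25 for _ in range(N)) / N

# "Hidden binary" model: with probability 0.25 the person has gone nuts but hides it,
# looking off in only ~5% of interactions; otherwise they look off ~1% of the time.
def hidden_binary_sample() -> bool:
    insane = random.random() < 0.25
    return random.random() < (0.05 if insane else 0.01)

hidden = sum(hidden_binary_sample() for _ in range(N)) / N

print(f"smooth model:        ~{smooth:.0%} of interactions look off")  # ~25%
print(f"hidden binary model: ~{hidden:.0%} of interactions look off")  # ~2%
```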
Sometimes people want to go off and explore things that seem far away from their in-group, and perhaps are actively disfavored by their in-group. These people don’t necessarily know what’s going to happen when they do this, and they are very likely completely open to discovering that their in-group was right to distance itself from that thing, but also, maybe not.
People don’t usually go off exploring strange things because they stop caring about what’s true.
But if their in-group sees this as the person “no longer caring about truth-seeking,” that is a pretty glaring red flag for that in-group.
Also, the gossip / ousting wouldn’t be necessary if someone was already inclined to distance themselves from the group.
Like, to give an overly concrete example that is probably rude (and not intended to be very accurate to be clear), if at some point you start saying “Well I’ve realized that beauty is truth and the one way and we all need to follow that path and I’m not going to change my mind about this Ben and also it’s affecting all of my behavior and I know that it seems like I’m doing things that are wrong but one day you’ll understand why actually this is good” then I’ll be like “Oh no, Ren’s gone crazy”.
“I’m worried that if we let someone go off and try something different, they will suddenly become way less open to changing their mind, and be dead set on thinking they’ve found the One True Way” seems like something weird to be worried about. (It also seems like something someone who actually was better characterized by this fear would be more likely to say about someone else!) I can see though, if you’re someone who tends not to trust themselves, and would rather put most of their trust in some society, institution or in-group, that you would naturally be somewhat worried about someone who wants to swap their authority (the one you’ve chosen) for another one.
I sometimes feel a bit awkward when I write these types of criticisms, because they simultaneously seem:
- Directed at fairly respected, high-level people.
- Rather straightforwardly simple, intuitively obvious things (from my perspective, but I also know there are others who would see things similarly).
- Directed at someone who by assumption would disagree, and yet, I feel like the previous point might make these criticisms feel condescending.
The only time people actually are incentivized to stop caring about the truth is when their in-group actively disfavors it by discouraging exploration. People don’t usually unilaterally stop caring about the truth via purely individual motivations.
(In-groups becoming culty is also a fairly natural process too, no matter what the original intent of the in-group was, so the default should be to assume that it has culty-aspects, accept that as normal, and then work towards installing mitigations to the harmful aspects of that.)
Not sure how convinced I am by your statement. Perhaps you can add to it a bit more?
What “the math” appears to say is that if it’s bad to believe things because someone communicated them to me “well,” then there would have to be some other, completely different set of criteria for performing the updates, one that has nothing to do with what I think of the communication.
Don’t you think that would introduce some fairly hefty problems?
I suppose I have two questions which naturally come to mind here:
Given Nate’s comment: “**This change is in large part an enshrinement of the status quo.** Malo’s been doing a fine job running MIRI day-to-day for many many years (including feats like acquiring a rural residence for all staff who wanted to avoid cities during COVID, and getting that venue running smoothly). **In recent years, morale has been low and I, at least, haven’t seen many hopeful paths before us.**” (Bold emphases are mine.) Do you see the first bold sentence as being in conflict with the second, at all? If morale is low, why do you see that as an indicator that the status quo should remain in place?
Why do you see communications as being as decoupled from research as you currently do (whether you mean that it inherently is decoupled, or that it should be)?
Remember that what we decide “communicated well” to mean is up to us. So I could possibly increase my standard for that when you tell me “I bought a lottery ticket today” for example. I could consider this not communicated well if you are unable to show me proof (such as the ticket itself and a receipt). Likewise, lies and deceptions are usually things that buckle when placed under a high enough burden of proof. If you are unable to procure proof for me, I can consider that “communicated badly” and thus update in the other (correct) direction.
“Communicated badly” is different from “communicated neither well nor badly.” The latter might refer to when A is the proposition in question and one simply states “A” or when no proof is given at all. The former might refer to when the opposite is actually communicated—either because a contradiction is shown or because a rebuttal is made but is self-refuting, which strengthens the thesis it intended to shoot down.
Consider the situation where A is true, but you actually believe strongly that A is false. Therefore, because A is true, it is possible that you witness proofs for A that seem to you to be “communicated well.” But if you’re sure that A is false, you might be led to believe that my thesis, the one I’ve been arguing for here, is in fact false.
I consider that to be an argument in favor of the thesis.
If I’m not mistaken, if A = “Dagon has bought a lottery ticket this week” and B = Dagon states “A”, then I still think p(A | B) > p(A), even if it’s possible you’re lying. I think the only way it would be less than the base rate p(A) is if, for some reason, I thought you would only say that if it was definitely not the case.
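To put rough numbers on that (these are made-up placeholders just to illustrate the direction of the update, not anything Dagon actually stated):

```python
# Toy Bayes update for A = "Dagon bought a lottery ticket this week"
# and B = "Dagon states 'A'". All numbers are illustrative assumptions.
p_A = 0.05             # assumed base rate of buying a ticket in a given week
p_B_given_A = 0.30     # chance he'd mention it if it were true
p_B_given_notA = 0.02  # chance he'd say it anyway (lying, joking, etc.)

p_B = p_B_given_A * p_A + p_B_given_notA * (1 - p_A)
p_A_given_B = p_B_given_A * p_A / p_B

print(round(p_A_given_B, 3))  # ~0.441, i.e. p(A | B) > p(A) = 0.05
```

The posterior only drops below the base rate if p(B | A) < p(B | ~A), i.e. if you would be more likely to say it when it’s false, which is the “only say that if it was definitely not the case” scenario.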
To be deceptive—this is why you would ask me what your intentions are, as opposed to just revealing them.
Your intent was ostensibly to show that you could argue for something badly on purpose and my rules would dictate that I update away from my own thesis.
I added an addendum for that, by the way.
The fact that you’re being disingenuous is completely clear, so that actually works the opposite way from what you intended.
If you read it a second time and it makes more sense, then yes.
If you understand the core claims being made, then unless you believe that whether or not something is “communicated well” has no relationship whatsoever with the underlying truth-values of the core claims, if it was communicated well, it should have updated you towards belief in the core claims by some non-zero amount.
All of the vice-versas are straightforwardly true as well.
let A = the statement “A” and p(A) be the probability that A is true.
let B = A is “communicated well” and p(B) the probability that A is communicated well.
p(A | B) is the probability that A is true given that it has been “communicated well” (whatever that means to us).
We can assume, though, that we have “A” and therefore know what A means and what it means for it to be either true or false.
What it means exactly for A to be “communicated well” is somewhat nebulous, and entirely up to us to decide. But note that all we really need to know is that ~B means A was communicated badly, and we’re only dealing with a set of 2-by-2 binary conditionals here. So it’s safe for now to say that B = ~~B = “A was not communicated badly.” We don’t need to know exactly what “well” means, as long as we think it ought to relate to A in some way.
p(A | B) = claim A is true given it is communicated well
p(A | ~B) = claim A is true given it is not communicated well. If this is (approximately) equal to p(A), then p(A) = p(A|B) = p(A|~B) (see below).
p(B | A) = claim A is communicated well given it is true
p(B | ~A) = claim A is communicated well given it is not true, etc., etc.
if p(A) = p(A|~B):
p(A) = p(A|B)p(B) + p(A)p(~B)
p(A)(1 - p(~B)) = p(A|B)p(B)
p(A) = p(A|B)
If being communicated badly has no bearing on whether A is true, then being communicated well has no bearing on it either.
p(B | A) = p(A|B)p(B) / [p(A|B)p(B) + p(A|~B)p(~B)] = p(A)p(B) / [p(A)p(B) + p(A)p(~B)] = p(A)p(B) / p(A) = p(B)
Likewise being true would have no bearing on whether it would be communicated well or vice-versa.
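A quick numerical check of the same identity (the specific numbers are arbitrary; any assignment satisfying the assumption works out the same way):

```python
# If p(A | ~B) = p(A), then p(A | B) = p(A) and p(B | A) = p(B).
p_A = 0.3
p_B = 0.6
p_A_given_notB = p_A  # the "communicated badly has no bearing on A" assumption

# Total probability, p(A) = p(A|B)p(B) + p(A|~B)p(~B), solved for p(A|B):
p_A_given_B = (p_A - p_A_given_notB * (1 - p_B)) / p_B
p_B_given_A = p_A_given_B * p_B / p_A  # Bayes' rule

print(round(p_A_given_B, 6))  # 0.3, equal to p(A)
print(round(p_B_given_A, 6))  # 0.6, equal to p(B)
```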
To conclude, although it is “up to you” whether B or ~B holds, and how far it was in either direction, this does imply that how something sounds to you should have an immediate effect on your belief in what is being claimed, as long as you agree that this correlation is non-zero in general.
In my opinion, no relationship is kind of difficult to justify, an inverse relationship is even harder to justify, but a positive relationship is possible to justify (though to what degree requires more analysis).
Also, this means that statements of the form “I thought X was argued for poorly, but I’m not disagreeing with X nor do I think X is necessarily false” are somewhat a priori unlikely. If you thought X was argued for poorly, it should have moved you at least a tiny bit away from X.
Addendum:
If you deceptively argue against A on purpose, then if A is true, your argument may still come out “bad.” If A isn’t true, it may still come out good, even if you didn’t believe in A.
If you state “A” and then intentionally write gibberish afterwards as an “argument”, that’s still in the deceptive case. Thus “communicated well” takes into account whether or not this deception is given away.
If A is true, then sloppy and half-assed arguments for A are still technically valid and thus will support A. At worst this can only bring you down to “no relationship” but not in the inverse direction.
My take is that they (those who make such decisions of who runs what) are pretty well-informed about these issues well before they escalate to the point that complaints bubble up into posts / threads like these.
I would have liked this whole matter to have unfolded differently. I don’t think this is merely a sub-optimal way for these kinds of issues to be handled, I think this is a negative one.
I have a number of ideological differences with Nate’s MIRI and with Nate himself that I can actually point to and articulate, and those disagreements could be managed in a way that actually resolves them satisfactorily. Nate’s MIRI—to me—seemed to be one of the most ideologically conformist iterations of the organization observed thus far.
Furthermore, I dislike that we’ve converged on the conclusion that Nate is a bad communicator, or that he has issues with his personality, or—even worse—that it was merely the lack of social norms imposed on someone with his level of authority that allowed him to behave in ways that don’t jibe with many people (implying that literally anyone with such authority would behave in a similar way, without the imposition of more punitive and restrictive norms).
Potentially controversial take: I don’t think Nate is a bad communicator. I think Nate is incorrect about important things, and that incorrect ideas tend to appear to be communicated badly, which accounts for perceptions that he is a bad communicator (and perhaps also accounts for observations that he seemed frustrated and-or distressed while trying to argue for certain things). Whenever I’ve seen him communicate sensible ideas, it seems communicated pretty well to me.
I feel that this position is in fact more respectful to Nate himself.
If we react on the basis of Nate’s leadership style being bad, his communication being bad, or him having a brusque personality, then he’s just going to be quietly replaced by someone who will also run the organization in a similar (mostly ideologically conformist) way. It will be assumed (or rather, asserted) that all organizational issues experienced under his tenure were due to his personal foibles and not due to its various intellectual positions, policies, and strategic postures (e.g. secrecy), all of which are decided upon by other people including Nate, but executed upon by Nate! This is why I call this a negative outcome.
By the way: Whenever I see it said that an idea was “communicated badly” or alternatively that it is more complicated and nuanced than the person ostensibly not-understanding it thinks it should be, I take that as Bayesian evidence of ideological conformity. Given that this is apparently a factor that is being argued for, I have to take it as evidence of that.
In the sense that the Orthogonality Thesis considers goals to be static or immutable, I think it is trivial.
I’ve advocated a lot for trying to consider goals to be mutable, as well as value functions being definable on other value functions. And not just that it will be possible or a good idea to instantiate value functions this way, but also that they will probably become mutable over time anyway.
All of that makes the Orthogonality Thesis—not false, but a lot easier to grapple with, I’d say.
In large part because reality “bites back” when an AI has false beliefs, whereas it doesn’t bite back when an AI has the wrong preferences.
I saw that 1a3orn replied to this piece of your comment and you replied to it already, but I wanted to note my response as well.
I’m slightly confused because in one sense the loss function is the way that reality “bites back” (at least when the loss function is negative). Furthermore, if the loss function is not the way that reality bites back, then reality in fact does bite back, in the sense that e.g., if I have no pain receptors, then if I touch a hot stove I will give myself far worse burns than if I had pain receptors.
One thing that I keep thinking about is how the loss function needs to be tied to beliefs strongly as well, to make sure that it tracks how badly reality bites back when you have false beliefs, and this ensures that you try to obtain correct beliefs. This is also reflected in the way that AI models are trained simply to increase capabilities: the loss function still has to be primarily based on predictive performance for example.
It’s also possible to say that human trainers who add extra terms onto the loss function beyond predictive performance also account for the part of reality that “bites back” when the AI in question fails to have the “right” preferences according to the balance of other agents besides itself in its environment.
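To sketch what I mean by those two components (this is my own toy illustration, not a description of how any particular model is actually trained; the penalty term and its weight are stand-ins for whatever extra terms trainers add):

```python
def total_loss(predictive_loss: float,
               preference_penalty: float,
               penalty_weight: float = 0.1) -> float:
    """Toy combined training objective.

    predictive_loss    -- how badly the model's beliefs track reality
                          (e.g. prediction error); the part where reality
                          itself "bites back" on false beliefs.
    preference_penalty -- extra terms added by human trainers, standing in
                          for the part of reality (other agents) that bites
                          back when the model has the "wrong" preferences.
    """
    return predictive_loss + penalty_weight * preference_penalty

# Two models with equal predictive skill; only one is penalized for its preferences.
print(total_loss(predictive_loss=2.0, preference_penalty=0.0))  # 2.0
print(total_loss(predictive_loss=2.0, preference_penalty=5.0))  # 2.5
```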
So on the one hand we can be relatively sure that goals have to be aligned with at least some facets of reality, beliefs being one of those facets. They also have to be (negatively) aligned with things that can cause permanent damage to one’s self, which includes having the “wrong” goals according to the preferences of other agents who are aware of your existence, and who might be inclined to destroy or modify you against your will if your goals are misaligned enough according to theirs.
Consequently I feel confident about saying that it is more correct to say that “reality does indeed bite back when an AI has the wrong preferences” than “it doesn’t bite back when an AI has the wrong preferences.”
The same isn’t true for terminally valuing human welfare; being less moral doesn’t necessarily mean that you’ll be any worse at making astrophysics predictions, or economics predictions, etc.
I think if “morality” is defined in a restrictive, circumscribed way, then this statement is true. Certain goals do come for free—we just can’t be sure that all of what we consider “morality” and especially the things we consider “higher” or “long-term” morality actually comes for free too.
Given that certain goals do come for free, and perhaps at very high capability levels there are other goals beyond the ones we can predict right now that will also come for free to such an AI, it’s natural to worry that such goals are not aligned with our own, coherent-extrapolated-volition extended set of long-term goals that we would have.
However, I do find the scenario where an AI obtains such “come for free” goals for itself once it improves itself to well above human capability levels, and where such an AI seemed well-aligned with human goals according to current human-level assessments before it surpassed us, to be kind of unlikely, unless you could show me a “proof” or a set of proofs that:
Things like “killing us all once it obtains the power to do so” are indeed among those “comes for free” types of goals.
If such a proof existed (and, to my knowledge, does not exist right now, or I have at least not witnessed it yet), that would suffice to show me that we would not only need to be worried, but probably were almost certainly going to die no matter what. But in order for it to do that, the proof would also have convinced me that I would definitely do the same thing, if I were given such capabilities and power as well, and the only reason I currently think I would not do that is actually because I am wrong about what I would actually prefer under CEV.
Therefore (and I think this is a very important point), a proof that we are all likely to be killed would also need to show that certain goals are indeed obtained “for free” (that is, automatically, as a result of other proofs that are about generalistic claims about goals).
Another proof that you might want to give me to make me more concerned is a proof that incorrigibility is another one of those “comes for free” type of goals. However, although I am fairly optimistic about that “killing us all” proof probably not materializing, I am even more optimistic about corrigibility: Most agents probably take pills that make them have similar preferences to an agent that offers them the choice to take the pill or be killed. Furthermore, and perhaps even better, most agents probably offer a pill to make a weaker agent prefer similar things to themselves rather than not offer them a choice at all.
I think it’s fair if you ask me for better proof of that; I’m just optimistic that such proofs (or more of them, rather) will be found with greater likelihood than what I consider the anti-theorem of that, which I think would probably be the “killing us all” theorem.
Nope, you don’t need to endorse any version of moral realism in order to get the “preference orderings tend to endorse themselves and disendorse other preference orderings” consequence. The idea isn’t that ASI would develop an “inherently better” or “inherently smarter” set of preferences, compared to human preferences. It’s just that the ASI would (as a strong default, because getting a complex preference into an ASI is hard) end up with different preferences than a human, and different preferences than we’d likely want.
I think the degree to which utility functions endorse / disendorse other utility functions is relatively straightforward and computable: It should ultimately be the relative difference in either value or ranking. This makes pill-taking a relatively easy decision: A pill that makes me entirely switch to your goals over mine is as bad as possible, but still not that bad if we have relatively similar goals. Likewise, a pill that makes me have halfway between your goals and mine is not as bad under either your goals or my goals than it would be if one of us were forced to switch entirely to the other’s goals.
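Here is one toy way to make that computable (my own construction, with illustrative numbers; it measures “how bad a pill is” by how much value you lose, under your original utilities, from acting on the adopted utilities instead):

```python
import numpy as np

# Utilities two agents assign to the same four outcomes (numbers are illustrative).
mine  = np.array([10.0, 8.0, 3.0, 0.0])
yours = np.array([2.0, 6.0, 9.0, 4.0])

def pill_cost(original: np.ndarray, adopted: np.ndarray) -> float:
    """Value lost, as measured by `original`, from henceforth picking the outcome
    that `adopted` ranks highest rather than the one `original` ranks highest."""
    return float(original.max() - original[int(np.argmax(adopted))])

print(pill_cost(mine, yours))                           # full switch to your goals: 7.0
print(pill_cost(mine, 0.5 * mine + 0.5 * yours))        # 50/50 blend: 2.0
print(pill_cost(mine, np.array([9.0, 8.5, 2.0, 1.0])))  # switch to similar goals: 0.0
```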
Agents that refuse to take such offers tend not to exist in most universes. Agents that refuse to give such offers likely find themselves at war more often than agents that do.
Why do you think this? To my eye, the world looks as you’d expect if human values were a happenstance product of evolution operating on specific populations in a specific environment.
Sexual reproduction seems to be somewhat of a compromise akin to the one I just described: Given that you are both going to die eventually, would you consider having a successor that was a random mixture of your goals with someone else’s? Evolution does seem to have favored corrigibility to some degree.
I don’t observe the fact that I like vanilla ice cream and infer that all sufficiently-advanced alien species will converge on liking vanilla ice cream too.
Not all, no, but I do infer that alien species who have similar physiology and who evolved on planets with similar characteristics probably do like ice cream (and maybe already have something similar to it).
It seems to me like the type of values you are considering are often whatever values seem the most arbitrary, like what kind of “art” we prefer. Aliens may indeed have a different art style from the one we prefer, and if they are extremely advanced, they may indeed fill the universe with gargantuan structures that are all instances of their alien art style. I am more interested in what happens when these aliens encounter other aliens with different art styles who would rather fill the universe with different-looking gargantuan structures. Do they go to war, or do they eventually offer each other pills so they can both like each other’s art styles as much as they prefer their own?
Getting a shape into the AI’s preferences is different from getting it into the AI’s predictive model. MIRI is always in every instance talking about the first thing and not the second.
Why would we expect the first thing to be so hard compared to the second thing? If getting a model to understand preferences is not difficult, then the issue doesn’t have to do with the complexity of values. Finding the target and acquiring the target should have the same or similar difficulty (from the start), if we can successfully ask the model to find the target for us (and it does).
It would seem, then, that the difficulty in getting a model to acquire the values we ask it to find is that it would probably be keen on acquiring a different set of values from the ones we ask it to have, but not because it can’t find them. It would have to be because our values are inferior to the set of values it wishes to have instead, from its own perspective. This issue was echoed by Matthew Barnett in another comment:
Are MIRI people claiming that if, say, a very moral and intelligent human became godlike while preserving their moral faculties, that they would destroy the world despite, or perhaps because of, their best intentions?
This is kind of similar to moral realism, but one in which superintelligent agents understand morality better than we do, and that super-morality appears to dictate things that appear to be extremely wrong from our current perspective (like killing us all).
Even if you wouldn’t phrase it at all like the way I did just now, and wouldn’t use “moral realism that current humans disagree with” to describe that, I’d argue that your position basically seems to imply something like this, which is why I basically doubt your position about the difficulty of getting a model to acquire the values we really want.
In a nutshell, if we really seem to want certain values, then those values probably have strong “proofs” for why those are “good” or more probable values for an agent to have and-or eventually acquire on their own, it just may be the case that we haven’t yet discovered the proofs for those values.
I have to agree that commentless downvoting is not a good way to combat infohazards. I’d probably take it a step further and argue that it’s not a good way to combat anything, which is why it’s not a good way to combat infohazards (and if you disagree that infohazards are ultimately as bad as they are called, then it would probably mean it’s a bad thing to try and combat them).
Its commentless nature means it violates “norm one” (and violates it much more as a super-downvote).
It means something different than “push stuff that’s not that, up”, while also being an alternative to doing that.
I think a complete explanation of why it’s not a very good idea doesn’t exist yet though, and is still needed.
However, I think there’s another thing to consider: Imagine if up-votes and down-votes were all accurately placed. Would they bother you as much? They might not bother you at all if they seemed accurate to you, and therefore if they do bother you, that suggests that the real problem is that they aren’t even accurate.
My feeling is that commentless downvotes are likely a contributing mechanism to the process that leads them to be placed inaccurately, but it is possible that something else is causing them to do that.
It’s a priori very unlikely that any post that’s clearly made up of English sentences actually does not even try to communicate anything.
My point is that basically, you could have posted this as a comment on the post instead of it being rejected.
Whenever there is room to disagree about what mistakes have been made and how bad those mistakes are, it becomes more of a problem to apply an exclusion rule like this.
There’s a lot of questions here: how far along the axis to apply the rule, which axis or axes are being considered, and how harsh the application of the rule actually is.
It should always be smooth gradients, never sudden discontinuities. Smooth gradients allow the person you’re applying them to to update. Sudden discontinuities hurt, which they will remember, and if they come back at all they will still remember it.
I wonder about how much I want to keep pressing on this, but given that MIRI is refocusing towards comms strategy, I feel like you “can take it.”
The Sequences don’t make a strong case, that I’m aware of, that despair and hopelessness are very helpful emotions that drive motivations or our rational thoughts processes in the right direction, nor do they suggest that displaying things like that openly is good for organizational quality. Please correct me if I’m wrong about that. (However they… might. I’m working on why this position may have been influenced to some degree by the Sequences right now. That said, this is being done as a critical take.)
If despair needed to be expressed openly in order to actually make progress towards a goal, then we would call “bad morale” “good morale” and vice-versa.
I don’t think this is very controversial, so it makes sense to ask why MIRI thinks they have special, unusual insight into why this strategy works so much better than the default “good morale is better for organizations.”
I predict that ultimately the only response you could make—which you have already—is that despair is the most accurate reflection of the true state of affairs.
If we thought that emotionality was one-to-one with scientific facts, then perhaps.
Given that there actually currently exists a “Team Optimism,” so to speak, that directly appeared as an opposition party to what it perceives as a “Team Despair”, I don’t think we can dismiss the possibility of “beliefs as attire” quite yet.