(By the way, I’m pretty sure the position I outline is compatible with changing usual forecasting procedures in the presence of observer selection effects, in cases where secondary evidence which does not kill us is available. E.g. one can probably still justify [looking at the base rate of near misses to understand the probability of nuclear war instead of relying solely on the observed rate of nuclear war itself].)
I’m inside-view fairly confident that Bob should be putting a probability of 0.01% on surviving conditional on many worlds being true, but it seems possible I’m missing some crucial considerations having to do with observer selection stuff in general, so I’ll phrase the rest of this as more of a question.
What’s wrong with saying that Bob should put a probability of 0.01% on surviving conditional on many-worlds being true – doesn’t this just follow from the usual way that a many-worlder would put probabilities on things, or at least the simplest way of doing so (i.e. not post-normalizing only across the worlds in which you survive)? I’m pretty sure that the usual picture of Bayesianism as having a big (weighted) set of possible worlds in your head and, upon encountering evidence, discarding the ones which you found out you were not in, also motivates putting a probability of 0.01% on surviving conditional on many-worlds. (I’m assuming that for a many-worlder, weights on worlds are given by squared amplitudes or whatever.)
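To make the numbers concrete, here is the simple calculation I have in mind (a sketch; I’m taking the 0.01% to be the squared-amplitude measure of the surviving branches in the thought experiment):

$$
P(\text{survive} \mid \text{MW}) \;=\; \sum_{\text{branches } b \,:\, \text{Bob survives in } b} |\alpha_b|^2 \;=\; 0.01\%,
$$

and if a single-world theory assigns the same chance of survival, i.e. $P(\text{survive} \mid \neg\text{MW}) = 0.01\%$, then Bayes gives

$$
P(\text{MW} \mid \text{survive}) \;=\; \frac{P(\text{survive}\mid\text{MW})\,P(\text{MW})}{P(\text{survive}\mid\text{MW})\,P(\text{MW}) + P(\text{survive}\mid\neg\text{MW})\,P(\neg\text{MW})} \;=\; P(\text{MW}),
$$

so on this way of assigning probabilities, surviving does not by itself shift Bob’s credence toward many-worlds.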
This contradicts a version of the conservation of expected evidence in which you only average over outcomes in which you survive (even in cases where you don’t survive in all outcomes), but that version seems wrong anyway, with Leslie’s firing squad seeming like an obvious counterexample to me, https://plato.stanford.edu/entries/fine-tuning/#AnthObje .
A big chunk of my uncertainty about whether at least 95% of the future’s potential value is realized comes from uncertainty about “the order of magnitude at which utility is bounded”. That is, if unbounded total utilitarianism is roughly true, I think there is a <1% chance in any of these scenarios that >95% of the future’s potential value would be realized. If decreasing marginal returns in the [amount of hedonium → utility] conversion kick in fast enough for 10^20 slightly conscious humans on heroin for a million years to yield 95% of max utility, then I’d probably give a >10% chance of strong utopia even conditional on building the default superintelligent AI. Both options seem significantly probable to me, causing my odds to vary much less between the scenarios.
This is assuming that “the future’s potential value” is referring to something like the (expected) utility that would be attained by the action sequence recommended by an oracle giving humanity optimal advice according to our CEV. If that’s a misinterpretation or a bad framing more generally, I’d enjoy thinking again about the better question. I would guess that my disagreement with the probabilities is greatly reduced on the level of the underlying empirical outcome distribution.
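To illustrate what I mean by “the order of magnitude at which utility is bounded” (the functional form here is just a made-up example, not something I’m committed to): if utility in the amount $N$ of hedonium saturates like

$$
u(N) \;=\; \frac{N}{N + N_0}
$$

for some scale $N_0$, then $u(N) \ge 0.95\,\sup u$ as soon as $N \ge 19\,N_0$, so a fairly modest future can already capture >95% of the potential. Whereas under unbounded total utilitarianism, $u(N) = N$, capturing 95% of the best physically attainable future requires capturing at least 95% of the attainable resources, and missing out on more than a 5% fraction already disqualifies the outcome.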
Great post, thanks for writing this! In the version of “Alignment might be easier than we expect” in my head, I also have the following:
Value might not be that fragile. We might “get sufficiently many bits in the value specification right” sort of by default to have an imperfect but still really valuable future.
For instance, maybe IRL would just learn something close enough to pCEV-utility from human behavior, and then training an agent with that as the reward would make it close enough to a human-value-maximizer. We’d get some misalignment on both steps (e.g. because there are systematic ways in which the human is wrong in the training data, and because of inner misalignment), but maybe this is little enough to be fine, despite fragility of value and despite Goodhart.
Even if deceptive alignment were the default, it might be that the AI gets sufficiently close to correct values before “becoming intelligent enough” to start deceiving us in training, such that even if it is thereafter only deceptively aligned, it will still execute a future that’s fine when in deployment.
It doesn’t seem completely wild that we could get an agent to robustly understand the concept of a paperclip by default. Is it completely wild that we could get an agent to robustly understand the concept of goodness by default?
Is it so wild that we could by default end up with an AGI that at least does something like putting 10^30 rats on heroin? I have some significant probability on this being a fine outcome.
There’s some distance $\epsilon$ from the correct value specification such that stuff is fine if we get AGI with values closer than $\epsilon$. Do we have good reasons to think that $\epsilon$ is far out of the range that default approaches would give us?
I still disagree / am confused. If it’s indeed the case that , then why would we expect ? (Also, in the second-to-last sentence of your comment, it looks like you say the former is an equality.) Furthermore, if the latter equality is true, wouldn’t it imply that the utility we get from [chocolate ice cream and vanilla ice cream] is the sum of the utility from chocolate ice cream and the utility from vanilla ice cream? Isn’t supposed to be equal to the utility of ?
My current best attempt to understand/steelman this is to accept , to reject , and to try to think of the embedding as something slightly strange. I don’t see a reason to think utility would be linear in current semantic embeddings of natural language or of a programming language, nor do I see an appealing other approach to construct such an embedding. Maybe we could figure out a correct embedding if we had access to lots of data about the agent’s preferences (possibly in addition to some semantic/physical data), but it feels like that might defeat the idea of this embedding in the context of this post as constituting a step that does not yet depend on preference data. Or alternatively, if we are fine with using preference data on this step, maybe we could find a cool embedding, but in that case, it seems very likely that it would also just give us a one-step solution to the entire problem of computing a set of rational preferences for the agent.
A separate attempt to steelman this would be to assume that we have access to a semantic embedding pretrained on preference data from a bunch of other agents, and then to tune the utilities of the basis to best fit the preferences of the agent we are currently dealing with. That seems like a cool idea, although I’m not sure if it has strayed too far from the spirit of the original problem.
The link in this sentence is broken for me: “Second, it was proven recently that utilitarianism is the “correct” moral philosophy.” Unless this is intentional, I’m curious to know where it directed to.
I don’t know of a category-theoretic treatment of Heidegger, but here’s one of Hegel: https://ncatlab.org/nlab/show/Science+of+Logic. I think it’s mostly due to Urs Schreiber, but I’m not sure – in any case, we can be certain it was written by an Absolute madlad :)
> Why should I care about similarities to pCEV when valuing people?
It seems to me that this matters in case your metaethical view is that one should do pCEV, or more generally if you think matching pCEV is evidence of moral correctness. If you don’t hold such metaethical views, then I might agree that (at least in the instrumentally rational sense, at least conditional on not holding any metametalevel views that contradict these) you shouldn’t care.
> Why is the first example explaining why someone could support taking money from people you value less to give to other people, while not supporting doing so with your own money? It’s obviously true under utilitarianism
I’m not sure if it answers the question, but I think it’s a cool consideration. I think most people are close to acting in a weighted-utilitarian way, but few realize how strong the difference between public and private charity is according to weighted utilitarianism.
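Here’s a minimal sketch of the kind of calculation that makes that difference vivid (the weight, wealth levels, and transfer size below are all made-up numbers; log utility in money is assumed, as in the post):

```python
import math

# Toy numbers (all made up): log utility in money, weight 1 on yourself,
# and a small weight w on a generic other person.
w = 0.005          # weight you place on a generic other person
N = 1_000_000      # number of taxpayers (you are one of them)
rich, poor = 100_000, 1_000   # wealth of a taxpayer and of the recipient
gift = 100                    # total amount transferred

def u(money):
    return math.log(money)

# Private charity: the full cost falls on you (weight 1), the gain on the
# recipient (weight w).
private = (u(rich - gift) - u(rich)) + w * (u(poor + gift) - u(poor))

# Society-level charity: the same transfer funded by a tax of gift/N on each
# of the N taxpayers, almost all of whom you weight by w.
tax = gift / N
public = (u(rich - tax) - u(rich)) \
         + w * (N - 1) * (u(rich - tax) - u(rich)) \
         + w * (u(poor + gift) - u(poor))

print(private, public)   # private < 0 < public for these numbers
```

So with these (made-up) numbers you would vote for the tax but not donate unilaterally, which is the asymmetry I have in mind.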
> It’s weird to bring up having kids vs. abortion and then not take a position on the latter. (Of course, people will be pissed at you for taking a position too.)

My position is “subsidize having children, that’s all the regulation around abortion that’s needed”. So in particular, abortion should be legal at any time. (I intended what I wrote in the post to communicate this, but maybe I didn’t do a good job.)
> democracy plans for right now
I’m not sure I understand in what sense you mean this? Voters are voting according to preferences that partially involve caring about their future selves. If what you have in mind is something like people being less attentive to costs that policies cause 10 years into the future, leading to these being discounted more than caring alone would imply, then I guess I could see that being possible. But that could also happen for people’s individual decisions, I think? I guess one might argue that people are more aware of the long-term costs of personal decisions than of policies, but this is not clear to me, especially with more analysis going into policy decisions.
> As to your framing, the difference between you-now and you-future is mathematically bigger than the difference between others-now and others-future if you use a ratio for the number of links to get to them.
> Suppose people change half as much in a year as your sibling is different from you, and you care about similarity for what value you place on someone. Thus, two years equals one link.
> After 4 years, you are now two links away from yourself-now and your sibling is 3 links from you-now. They are 50% more different than future you (assuming no convergence). After eight years, you are 4 links away, while they are only 5, which makes them 25% more different to you than you are.
> Alternately, from 4 to 8 years they have changed by 67%, and you have changed by 100%, of how distant each was from you-now at 4 years.
> It thus seems like they have changed far less than you have, and are more similar to who they were, so why should you treat them as having the same rate?
That’s a cool observation! I guess this won’t work if we discount geometrically in the number of links. I’m not sure which is more justified.
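To spell out why geometric discounting behaves differently (using the parent’s setup of one link per two years of change, and one link between you-now and your sibling-now): if the weight on a person at link-distance $d$ is $w(d) = r^d$ for some $0 < r < 1$, then

$$
\frac{w(\text{sibling at } t)}{w(\text{future you at } t)} \;=\; \frac{r^{\,t/2+1}}{r^{\,t/2}} \;=\; r \quad\text{for all } t,
$$

so the sibling stays a constant factor less valued than your future self, whereas the ratio of link counts, $(t/2+1)/(t/2)$, does go to 1 as $t$ grows.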
There is lots of interesting stuff in your last comment which I still haven’t responded to. I might come back to this in the future if I have something interesting to say. Thanks again for your thoughts!
I proposed a method for detecting cheating in chess; cross-posting it here in the hopes of maybe getting better feedback than on reddit: https://www.reddit.com/r/chess/comments/xrs31z/a_proposal_for_an_experiment_well_data_analysis/
Thanks for the comments!
> In ‘The inequivalence of society-level and individual charity’ they list the scenarios as 1, 1, and 2 instead of A, B, C, as they later use. Later, it refers incorrectly to preferring C to A with different necessary weights when the second reference is to preferring C to B.
I agree, and I published an edit fixing this just now.
> The claim that money becomes utility as a log of the amount of money isn’t true, but is probably close enough for this kind of use. You should add a note to that effect. (The effects of money are discrete at the very least).
I mostly agree, but I think footnote 17 covers this?
> The claim that the derivative of the log of y = 1/y is also incorrect. In general, log means either log base 10, or something specific to the area of study. If written generally, you must specify the base. (For instance, in Computer Science it is base-2, but I would have to explain that if I was doing external math with that.) The derivative of the natural log is 1/n, but that isn’t true of any other log. You should fix that statement by specifying you are using ln instead of log (or just prepending the word natural).
I think the standard in academic mathematics is that $\log$ means $\ln$, https://en.wikipedia.org/wiki/Natural_logarithm#Notational_conventions, and I guess I would sort of like to spread that standard :). I think it’s exceedingly rare for someone to mean base 10 in this context, but I could be wrong. I agree that base 2 is also reasonable though. In any case, the base only changes utility by scaling by a constant, so everything in that subsection after the derivative should be true independently of the base. Nevertheless, I’m adding a footnote specifying this.
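For completeness, the base-change fact being relied on:

$$
\log_b x \;=\; \frac{\ln x}{\ln b},
$$

so switching the base just multiplies log-utility by the positive constant $1/\ln b$ (for any $b > 1$), which changes neither the preference comparisons nor anything in that subsection beyond an overall scale.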
> Just plain wrong in my opinion, for instance, claiming that a weight can’t be negative assumes away the existence of hate, but people do hate either themselves or others on occasion in non-instrumental ways, wanting them to suffer, which renders this claim invalid (unless they hate literally everyone).
I’m having a really hard time imagining thinking this about someone else (I can imagine hate in the sense of like… not wanting to spend time together with someone and/or assigning a close-to-zero weight), but I’m not sure – I mean, I agree there definitely are people who think they non-instrumentally want the people who killed their family or whatever to suffer, but I think that’s a mistake? That said, I think I agree that for the purposes of modeling people, we might want to let weights be negative sometimes.
> I also don’t see how being perfectly altruistic necessitates valuing everyone else exactly the same as you. I could still value others different amounts without being any less altruistic, especially if the difference is between a lower value for me and the others higher. Relatedly, it is possible to not care about yourself at all, but this math can’t handle that.
I think it’s partly that I just wanted to have some shorthand for “assign equal weight to everyone”, but I also think it matches the commonsense notion of being perfectly altruistic. One argument for this is that 1) one should always assign at least as high a weight to oneself as to anyone else (also see footnote 12 here) and 2) if one assigns a lower weight to someone else than to oneself, then one is not perfectly altruistic in interactions with that person – given these, the unique option is to assign equal weight to everyone.
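Spelled out, the step is just:

$$
w_{\text{self}} \ge w_i \ \text{ for all } i, \quad \big(\text{perfectly altruistic toward } i \Rightarrow w_i \ge w_{\text{self}}\big) \quad\Longrightarrow\quad w_i = w_{\text{self}} \ \text{ for all } i,
$$

i.e. the only weight assignment that is perfectly altruistic toward everyone gives everyone the same weight as oneself.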
A gentle primer on caring, including in strange senses, with applications
I’m updating my estimate of the return on investment into culture wars from being an epsilon fraction compared to canonical EA cause areas to epsilon+delta. This has to do with cases where AI locks in current values extrapolated “correctly” except with too much weight put on the practical (as opposed to the abstract) layer of current preferences. What follows is a somewhat more detailed status report on this change.
For me (and I’d guess for a large fraction of autistic altruists/multipliers), the general feels regarding [being a culture war combatant in one’s professional capacity] seem to be that while the questions fought over have some importance, the welfare-produced-per-hour-worked from doing direct work is at least an order of magnitude smaller than the same quantity for any canonical cause area (also true for welfare/USD). I’m fairly certain one can reach this conclusion from direct object-level estimates, as I imagine e.g. OpenPhil has done, although I admit I haven’t carried out such calculations with much care myself. Considering the incentives of various people involved should also support this being a lower welfare-per-hour-worked cause area (whether an argument along these lines gives substantive support to the conclusion that there is an order-of-magnitude difference appears less clear).

So anyway, until today part of my vague cloud of justification for these feels was that “and anyway, it’s fine if this culture war stuff is fixed in 30 years, after we have dealt with surviving AGI”. The small realization I had today was that maybe a significant fraction of the surviving worlds are those where something like corrigibility wasn’t attainable but AI value extrapolation sort of worked out fine, i.e. with the values that got locked in being sort of fine, but with the relative weight of object-level intuitions/preferences being kinda high compared to the weight on simplicity/[meta-level intuitions], like in particular maybe the AI training did some Bayesian-ethics-evidential-double-counting of object-level intuitions about 10^10 similar cases. (I realize it’s quite possible that this last clause won’t make sense to many readers, but unfortunately I won’t provide an explanation here; I intend to write about a few ideas on this picture of Bayesian ethics at some later time, but I want to read Beckstead’s thesis first, which I haven’t done yet. The best I can offer is that I estimate a 75% chance of you understanding the rough idea I have in mind (which does not necessarily imply that the idea can actually be unfolded into a detailed picture that makes sense) after reading Beckstead’s thesis, conditional on understanding my writing in general and conditional on not having understood this clause yet. Also: woke: Bayesian ethics, bespoke: INFRABAYESIAN ETHICS, am I right folks.)
So anyway, finally getting to the point of all this at the end of the tunnel, in such worlds we actually can’t fix this stuff later on, because all the current opinions on culture war issues got locked in.
(One could argue that we can anyway be quite sure that this consideration matters little, because most expected value is not in such kinda-okay worlds: even if these were 99% of the surviving worlds, assuming fun theory makes sense or simulated value-bearing minds are possible, there will be amazingly more value in each world where AGI worked out really well, as compared to a world tiled with Earth society 2030. But then again, this counterargument could be iffy to some, in sort of the same way in which fanaticism (in Bostrom’s sense) or the St. Petersburg paradox feels iffy to some, or perhaps in another way. I won’t be taking a further position on this at the moment.)
kh’s Shortform
Oops, I realized that the argument given in the last paragraph of my previous comment applies to people maximizing their personal welfare or being totally altruistic or totally altruistic wrt some large group or some combination of these options, but maybe not so much to people who are e.g. genuinely maximizing the sum of their family members’ personal welfares; but this last case might well be entailed by what you mean by “love”, so maybe I missed the point earlier. In the latter case, it seems likely that an IQ boost would keep many parts of love intact initially, but I’d imagine that for a significant fraction of people, the unequal relationship would cause sadness over the next 5 years, which with significant probability causes falling out of love. Of course, right after the IQ boost you might want to invent/implement mental tech which prevents this sadness or prevents the value drift caused by growing apart, but I’m not sure if there are currently feasible options which would be acceptable ways to fix either of these problems. Maybe one could figure out some contract to sign before the value drift, but this might go against some deeper values, and might not count as staying in love anyway.
Something that confuses me about your example’s relevance is that it’s like almost the unique case where it’s [[really directly] impossible] to succumb to optimization pressure, at least conditional on what’s good = something like coherent extrapolated volition. That is, under (my understanding of) a view of metaethics common in these corners, what’s good just is what a smarter version of you would extrapolate your intuitions/[basic principles] to, or something along these lines. And so this is almost definitionally almost the unique situation that we’d expect could only move you closer to better fulfilling your values, i.e. nothing could break for any reason, and in particular not break under optimization pressure (where breaking is measured w.r.t. what’s good). And being straightforwardly tautologically true would make it a not very interesting example.
editorial remark: I realized after writing the two paragraphs below that they probably do not move one much on the main thesis of your post, at least conditional on already having read Ege Erdil’s doubts about your example (except insofar as someone wants to defer to opinions of others or my opinion in particular), but I decided to post anyway in large part since these family matters might be a topic of independent interest for some:
I would bet that at least 25% of people would stop loving their (current) family in <5 years (i.e. not love them much beyond how much they presently love a generic acquaintance) if they got +30 IQ. That said, I don’t claim that the main way this would happen is by applying too much optimization pressure to one’s values, at least not in a way that’s unaligned with what’s good—I just think it’s likely to be the good thing to do (or like, part of all the close-to-optimal packages of actions, or etc.). So I’m not explicitly disagreeing with the last sentence of your comment, but I’m disagreeing with the possible implicit justification of the sentence that goes through [“I would stop loving my family” being false].
The argument for it being good to stop loving your family in such circumstances is just that it’s suboptimal for having an interesting life, or for [the sum over humans of interestingness of their lives] if you are altruistic, or whatever, for post-IQ-boost-you to spend a lot of time with people much dumber than you, which your family is now likely to be. (Here are 3 reasons to find a new family: you will have discussions which are more fun → higher personal interestingness; you will learn more from these discussions → increased productivity; and something like productivity being a convex function of IQ—this comes in via IQs of future kids, at least assuming the change in your IQ would be such as to partially carry over to kids. I admit there is more to consider here, e.g. some stuff with good incentives, breaking norms of keeping promises—my guess is that these considerations have smaller contributions.)
I started writing this but lost faith in it halfway through, and realized I was spending too much time on it for today. I figured it’s probably a net positive to post this mess anyway although I have now updated to believe somewhat less in it than the first paragraph indicates. Also I recommend updating your expected payoff from reading the rest of this somewhat lower than it was before reading this sentence. Okay, here goes:
{I think people here might be attributing too much of the explanatory weight to noise. I don’t have a strong argument for why the explanation definitely isn’t noise, but here is a different potential explanation that seems promising to me. (There is a sense in which this explanation is still also saying that noise dominates over any relation between the two variables—well, there is a formal sense in which that has to be the case since the correlation is small—so if this formal thing is what you mean by “noise”, I’m not really disagreeing with you here. In this case, interpret my comment as just trying to specify another sense in which the process might not be noisy at all.) This might be seen as an attempt to write down the “sigmoids spiking up in different parameter ranges” idea in a bit more detail.
First, note that if the performance on every task is a perfectly deterministic logistic function with midpoint x_0 and logistic growth rate k, i.e. there is “no noise”, with k and x_0 being the same across tasks, then these correlations would be exactly 0. (Okay, we need to be adding an epsilon of noise here so that we are not dividing by zero when calculating the correlation, but let’s just do that and ignore this point from now on.) Now as a slightly more complicated “noiseless” model, we might suppose that performance on each task is still given by a “deterministic” logistic function, but with the parameters k and x_0 being chosen at random according to some distribution. It would be cool to compute some integrals / program some sampling to check what correlation one gets when k and x_0 are both normally distributed with reasonable means and variances for this particular problem, with no noise beyond that.}
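Here’s roughly the sampling check I have in mind, as a sketch (the parameter distributions and the three scales below are made up, and I’m taking “these correlations” to mean the across-task correlation between successive performance improvements):

```python
import numpy as np

rng = np.random.default_rng(0)

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

n_tasks = 10_000

# Hypothetical per-task parameters; the means and spreads below are made up
# and would need to be fit to the actual benchmark data.
x0 = rng.normal(loc=0.0, scale=2.0, size=n_tasks)      # midpoint of each task's curve
k = rng.lognormal(mean=0.0, sigma=0.5, size=n_tasks)   # growth rate, kept positive

# Three model scales (same units as x0, e.g. log-compute), with no noise added.
x1, x2, x3 = -1.0, 0.0, 1.0

def perf(x):
    return logistic(k * (x - x0))

d1 = perf(x2) - perf(x1)   # improvement from scale 1 to scale 2, per task
d2 = perf(x3) - perf(x2)   # improvement from scale 2 to scale 3, per task

print("correlation of successive improvements (% scale):",
      np.corrcoef(d1, d2)[0, 1])
```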
This is the point where I lost faith in this for now. I think there are parameter ranges for how k and x_0 are distributed where one gets a significant positive correlation and ranges where one gets a significant negative correlation in the % case. Negative correlations seem more likely for this particular problem. But more importantly, I no longer think I have a good explanation why this would be so close to 0. I think in logit space, the analysis (which I’m omitting here) becomes kind of easy to do by hand (essentially because the logit and logistic function are inverses), and the outcome I’m getting is that the correlation should be positive, if anything. Maybe it becomes negative if one assumes the logistic functions in our model are some other sigmoids instead, I’m not sure. It seems possible that the outcome would be sensitive to such details. One idea is that maybe if one assumes there is always eps of noise and bounds the sigmoid away from 1 by like 1%, it would change the verdict.
Anyway, the conclusion I was planning to reach here is that there is a plausible way in which all the underlying performance curves would be super nice, not noisy at all, but the correlations we are looking at would still be zero, and that I could also explain the negative correlations without noisy reversion to the mean (instead, this being more like a growth range somewhere decreasing the chance that there is a growth range somewhere else), but the argument ended up being much less convincing than I anticipated. In general, I’m now thinking that most such simple models should have negative or positive correlation in the % case depending on the parameter range, and could be anything for logit. Maybe it’s just that these correlations are swamped by noise after all. I’ll think more about it.
That was interesting! Thank you!
> There is also another way that super-intelligent AI could be aligned by definition. Namely, if your utility function isn’t “humans survive” but instead “I want the future to be filled with interesting stuff”. For all the hand-wringing about paperclip maximizers, the fact remains that any AI capable of colonizing the universe will probably be pretty cool/interesting. Humans don’t just create poetry/music/art because we’re bored all the time, but rather because expressing our creativity helps us to think better. It’s probably much harder to build an AI that wipes out all humans and then colonizes space and is also super-boring, than to make one that does those things in a way people who fantasize about giant robots would find cool.
I’m not convinced that (the world with) a superintelligent AI would probably be pretty cool/interesting. Does anyone know of a post/paper/(sci-fi )book/video/etc that discusses this? (I know there’s this :P and maybe this.) Perhaps let’s discuss this! I guess the answer depends on how human-centered/inspired (not quite the right term, but I couldn’t come up with a better one) our notion of interestingness is in this question. It would be cool to have a plot of expected interestingness of the first superintelligence (or well, instead of expectation it is better to look at more parameters, but you get the idea) as a function of human-centeredness of what’s meant by “interestingness”. Of course, figuring this out in detail would be complicated, but it nevertheless seems likely that something interesting could be said about it.
I think we (at least also) create poetry/music/art because of godshatter. To what extent should we expect AI to godshatter, vs do something like spending 5 minutes finding one way to optimally turn everything into paperclips and doing that for all eternity? The latter seems pretty boring. Or idk, maybe the “one way” is really an exciting enough assortment of methods that it’s still pretty interesting even if it’s repeated for all eternity?
more on 4: Suppose you have horribly cyclic preferences and you go to a rationality coach to fix this. In particular, your ice cream preferences are vanilla>chocolate>mint>vanilla. Roughly speaking, Hodge is the rationality coach that will tell you to consider the three types of ice cream equally good from now on, whereas Mr. Max Correct Pairs will tell you to switch one of the three preferences. Which coach is better? If you dislike breaking cycles arbitrarily, you should go with Hodge. If you think losing your preferences is worse than that, go with Max. Also, Hodge has the huge advantage of actually being done in a reasonable amount of time :)
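A minimal sketch of what the “Hodge coach” computes on this example (the gradient/least-squares component of the preference flow; encoding the cyclic preferences as an antisymmetric flow with unit strengths is my choice for illustration):

```python
import numpy as np

# Items and a cyclic preference pattern: vanilla > chocolate > mint > vanilla,
# encoded as an antisymmetric "preference flow" Y[i, j] = strength of i over j.
items = ["vanilla", "chocolate", "mint"]
Y = np.array([[ 0,  1, -1],
              [-1,  0,  1],
              [ 1, -1,  0]], dtype=float)

# Hodge-style coach: least-squares fit of a potential s (one score per item)
# to the flow, i.e. minimize the sum over pairs of (s[i] - s[j] - Y[i, j])^2.
n = len(items)
rows, rhs = [], []
for i in range(n):
    for j in range(n):
        if i != j:
            row = np.zeros(n)
            row[i], row[j] = 1.0, -1.0
            rows.append(row)
            rhs.append(Y[i, j])
s, *_ = np.linalg.lstsq(np.array(rows), np.array(rhs), rcond=None)
print(dict(zip(items, np.round(s - s.mean(), 6))))
# For a pure cycle the fitted scores come out equal: "all three are equally good".
```

Mr. Max Correct Pairs would instead keep two of the three pairwise preferences and flip one (any one of them), rather than flattening the scores.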
3. Ahh okay thanks, I have a better picture of what you mean by a basis of possibility space now. I still doubt that utility interacts nicely with this linear structure though. The utility function is linear in lotteries, but this is distinct from being linear in possibilities. Like, if I understand your idea on that step correctly, you want to find a basis of possibility-space, not lottery space. (A basis on lottery space is easy to find—just take all the trivial lotteries, i.e. those where some outcome has probability 1.) To give an example of the contrast: if the utility I get from a life with vanilla ice cream is u_1 and the utility I get from a life with chocolate ice cream is u_2, then the utility of a lottery with 50% chance of each is indeed 0.5 u_1 + 0.5 u_2. But what I think you need on that step is something different. You want to say something like “the utility of the life where I get both vanilla ice cream and chocolate ice cream is u_1+u_2”. But this still seems just morally false to me. I think the mistake you are making in the derivation you give in your comment is interpreting the numerical coefficients in front of events as both probabilities of events or lotteries and as multiplication in the linear space you propose. The former is fine and correct, but I think the latter is not fine. So in particular, when you write u(2A), in the notation of the source you link, this can only mean “the utility you get from a lottery where the probability of A is 2”, which does not make sense assuming you don’t allow your probabilities to be >1. Or even if you do allow probabilities >1, it still won’t give you what you want. In particular, if A is a life with vanilla ice cream, then in their notation, 2A does not refer to a life with twice the quantity of vanilla ice cream, or whatever.
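In symbols, the contrast I’m pointing at is between

$$
u\big(0.5\,\delta_{A} + 0.5\,\delta_{B}\big) \;=\; 0.5\,u(A) + 0.5\,u(B)
$$

(vNM expected utility is linear in the lottery, i.e. in the probabilities), and

$$
u(A \text{ and } B) \;=\; u(A) + u(B),
$$

which is an additivity claim about combining outcomes within a single possibility and does not follow from the former.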
4. I think the gradient part of the Hodge decomposition is not (in general) the same as the ranking with the minimal number of incorrect pairs. Fun stuff.
I took the main point of the post to be that there are fairly general conditions (on the utility function and on the bets you are offered) in which you should place each bet like your utility is linear, and fairly general conditions in which you should place each bet like your utility is logarithmic. In particular, the conditions are much weaker than your utility actually being linear, or than your utility actually being logarithmic, respectively, and I think this is a cool point. I don’t see the post as saying anything beyond what’s implied by this about Kelly betting vs max-linear-EV betting in general.