Thanks, this is exactly the kind of thing I was looking for.
burrito
Thanks for the reply.
GPT-4 is far below village idiot level at most things a village idiot uses their brain for, despite surpassing humans at next-token prediction.
Could you give some examples? I take it that what Eliezer meant by village-idiot intelligence is less “specifically does everything a village idiot can do” and more “is as generally intelligent as a village idiot”. I feel like the list of things GPT-4 can do that a village idiot can’t would look much more indicative of general intelligence than the list of things a village idiot can do that GPT-4 can’t. (As opposed to AlphaZero, where the extent of the list is “can play some board games really well”)
I just can’t imagine anyone interacting with a village idiot and GPT-4 and concluding that the village idiot is smarter. If the average village idiot had the same capabilities as today’s GPT-4, and GPT-4 had the same capabilities as today’s village idiots, I feel like it would be immediately obvious that we hadn’t gotten village-idiot level AI yet. My thinking on this is still pretty messy though so I’m very open to having my mind changed on this.
Something like this plausibly came up in the Eliezer/Paul dialogues from 2021, but I couldn’t find it with a cursory search. Eliezer has also in various places acknowledged being wrong about what kind of results the current ML paradigm would get, which probably is a superset of this specific thing.
Just skimmed the dialogues, couldn’t find it either. I have seen Eliezer acknowledge what you said but I don’t really see how it’s related; for example, if GPT-4 had been Einstein-level then that would look good for his intelligence-gap theory but bad for his suspicion of the current ML paradigm.
In My Childhood Role Model, Eliezer Yudkowsky says that the difference in intelligence between a village idiot and Einstein is tiny relative to the difference between a chimp and a village idiot. This seems to imply (I could be misreading) that {the time between the first AI with chimp intelligence and the first AI with village idiot intelligence} will be much larger than {the time between the first AI with village idiot intelligence and the first AI with Einstein intelligence}. If we consider GPT-2 to be roughly chimp-level, and GPT-4 to be above village idiot level, then it seems like this would predict that we’ll get an Einstein-level AI within roughly the next year. This seems really unlikely and I don’t even think Eliezer currently believes this. If my interpretation of this is correct, this seems like an important prediction that he got wrong and I haven’t seen acknowledged.
So my question is: Is this a fair representation of Eliezer’s beliefs at the time? If so, has this prediction been acknowledged wrong, or was it actually not wrong and there’s something I’m missing? If the prediction was wrong, what might the implications be for fast vs slow takeoff? (Initial thoughts: If this prediction had been right, then we’d expect fast takeoff to be much more likely, because it seems like if you improve Einstein by the difference between him and a village idiot 20 times over 4 years (the time between GPT-2 and GPT-4, i.e. ~chimp level vs >village idiot level), you will definitely get a self-improving AI somewhere along the way)
(Meta: I know this seems like more of an argument than a question; the reason I put it here is that I expect someone to have an easy/obvious answer, since I haven’t really spent any time thinking about or working with AI beyond playing with GPT and watching/reading some AGI debates. I also don’t want to pollute the discourse in threads where the participants are expected to at least kind of know what they’re talking about.)
Strongly agree. As a relative beginner I’ve found the automatic code completion and method listing/descriptions incredibly useful.
Responding to 3):
What is your standing to judge your imagined version of someone’s experience? Maybe your preferences are different enough from the subject’s that you’re simply wrong in your comparison.
You’re right and I should have said “imagine the point at which they are indifferent”, “would they prefer the 10x experience or the 1x experience”, etc. Imagining whether I would prefer it could be a decent approximation of their preferences, though.
Responding to 2):
It’s likely that some experiences are non-linear in utility per intensity.
In the post, I defined intensity as linearly proportional to utility. If you think wording it as “intensity” is misleading because what we generally think of as “experience intensity” isn’t linearly proportional to utility, then I agree, but can’t think of a better term to use.
Or that you’d have to crank up some parts of the experience and not others. For instance, enjoying the contrast of bitter and fruity in a shot of espresso—there’s no way to scale the whole thing up 10x, you have to pick and choose what to intensify, and then your result is subject to those modeling choices.
If I’m understanding you correctly, you’re saying that the experience of enjoying the contrast of bitter and fruity can be modeled as the individual experiences “bitter 1, bitter 2, bitter 3, …, fruity 1, fruity 2, fruity 3, …” and the total utility of the 10x scaled experience depends on which ones you group together when summing the utilities. For example, if your “utility groups” are bitter 1 + fruity 1, bitter 2 + fruity 2, etc., it comes out with higher utility than if you group them as bitter 1 + bitter 2 + etc. and fruity 1 + fruity 2 + etc., because the specific combination of bitter and fruity is what makes it a good experience.
I disagree that the individual experiences are bitter 1, bitter 2, …, fruity 1, fruity 2, … ; I feel like it should be more like bitter-fruity 1, bitter-fruity 2, …, bitter 1, bitter 2, …, fruity 1, fruity 2, … The combination of bitter and fruity (“bitter-fruity”) is a distinct individual experience in the set of experiences occurring at that moment, and in that set might also be included individual “bitter” and “fruity” experiences. Here, we can just intensify each individual experience by 10 (i.e. multiply the utility of the experience by 10 while keeping it as the same “type” of experience) and sum their utilities.
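A minimal sketch of that last step, assuming utilities simply add. The utility numbers and the `total_utility` helper are made up purely for illustration; the only point is that “bitter-fruity” is its own entry in the set, so scaling every entry by 10 scales the total by 10 without any grouping choice.

```python
def total_utility(experiences, factor=1.0):
    """Sum the utilities of individual experiences, each scaled by `factor`.

    `experiences` maps a (hypothetical) experience label to its utility.
    """
    return sum(factor * utility for utility in experiences.values())


# Made-up utilities for one moment of drinking espresso. The combination
# "bitter-fruity" is treated as a distinct experience alongside the
# standalone "bitter" and "fruity" experiences.
espresso = {"bitter-fruity": 3.0, "bitter": -1.0, "fruity": 1.5}

base = total_utility(espresso)               # the "1x" experience
scaled = total_utility(espresso, factor=10)  # the "10x" experience

print(base, scaled)
```

Under these assumptions the 10x total is always exactly ten times the 1x total, whichever individual experiences are in the set.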
Also, why 10x rather than 0.1x, or 0x (direct comparison of experienced to not-experienced).
I don’t really know what it would mean to prefer to “not experience” something. You’re always experiencing something; your baseline mental state is an experience. If your baseline mental state has exactly 0 utility, this would work, but your baseline mental state isn’t necessarily exactly 0 utility. If “not experiencing” something means shaving that much time off the rest of your life, this still feels like a conceptually weaker version of a “preference”. When comparing two experiences, I can imagine being on experience 1, then deciding I want to switch to experience 2, then actually deciding experience 1 is better, etc., until some sort of equilibrium is reached and I make up my mind. (Keep in mind that how tired you are of the experience is itself part of the experience, and held constant, so you wouldn’t keep jumping back and forth endlessly.) Theoretically, the analogous way to compare an experience with the absence of experience would be jumping between [your entire mind being turned off except for the part that allows you to think and make decisions] and the experience, but that’s a harder thought experiment to have than imagining jumping between two different experiences.
I hadn’t thought of comparing 1x to 0.1x; that’s a good idea, and I don’t object to it. I imagine 1x to 0.1x is more useful if each individual experience has a lot of utility (or disutility), and 1x to 10x is more useful if each individual experience only has a little utility (or disutility).
(I will respond to 3) and 4) tomorrow, it’s getting late and I should sleep)
First, I want to make sure we’re separating the validity of the model itself from concerns with applying it. I’ll try to be clear about which one I’m talking about for each part.
I’ll respond to each number in a separate reply because the format of the conversation will be a mess otherwise. Starting with 1):
Where do you get this list?
Could you be more specific? Is this question centered around how I know what other people are thinking, or how to separate a whole experience into individual experiences?
And how do you account for future unexpected experiences?
I don’t know how to respond to this. What part of the model depends on accounting for unexpected future experiences? If you’re asking generally how I would predict future experiences, I don’t have a good answer, but this seems both separate from the philosophical model itself and not an objection to the application of this specific philosophical model (it applies to all experience-based consequentialism).
Vague idea for how to theoretically decide whether a mind’s existence at any given moment is net positive, from a utilitarian standpoint:
Get a list of every individual conscious experience the mind is having at the exact moment.
For each individual experience, crank up its “intensity” (i.e. magnitude of utility) by a factor of, say, 10; as intense as you can easily imagine and empathize with, but no more. This can be approximated by imagining the point at which you are indifferent to experiencing the “1x” experience for 10 seconds or the “10x” experience for 1 second. Try to imagine this independently of mental side effects caused by experiencing it for a long time.
Now imagine the entire collection of “10x” experiences together, for a “10x” existence. Would you prefer to experience this, or the “1x” existence? If you prefer 10x, the mind’s existence is net positive; if you prefer 1x, it’s net negative.
I feel like some version of this has to logically follow if you assume that the utility of multiple experiences at once is equal to the sum of the utilities of the individual experiences (might not be true), plus some uncontroversial utilitarian assumptions, but I can’t formalize the proof in my head.
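A rough sketch of how the argument might go, under the additivity assumption and assuming that “intensity” is linearly proportional to utility. The symbols are introduced here for illustration: $u_i$ is the utility of the $i$-th individual experience, and $U_{1\times}$, $U_{10\times}$ are the total utilities of the “1x” and “10x” existences.

```latex
\begin{align*}
U_{1\times}  &= \sum_{i=1}^{n} u_i, \\
U_{10\times} &= \sum_{i=1}^{n} 10\,u_i = 10\,U_{1\times}, \\
U_{10\times} > U_{1\times} &\iff 9\,U_{1\times} > 0 \iff U_{1\times} > 0 .
\end{align*}
```

So, if those assumptions hold, preferring the “10x” existence is equivalent to the “1x” existence having positive total utility, and preferring “1x” is equivalent to it being negative. This is not a proof that the assumptions themselves are true.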
If anyone has links to other attempts to get a utilitarian threshold for net positive/negative existence, they would be appreciated.
(I should note that I know approximately zero neuroscience, and don’t know if the concept of an “individual experience” as distinct from other “individual experiences” happening in the same mind at the same time is coherent)
Intentionally rationalizing against your beliefs could be a good strategy for doing a cost-benefit analysis. For example, if you currently support increasing the minimum wage, imagine yourself as someone who is against it, and from that perspective come up with as many disadvantages to it as possible. I’m sure I’m not the first to come up with this idea but I haven’t seen it anywhere else; is there a name for this concept?
Speculative Model For How Moral Arguments Work
Credibility warning: All of this is wild post-hoc speculation based on my vague intuitions. Don’t read it as if it has any semblance of authority. Feel free to bring up actual evidence if it confirms or denies my speculations, though. Also, this is my first post on LW so please point out if I violated any conventions, norms, etc.
Readability warning: This post was not very carefully edited, so the clarity, grammar, formatting, etc. might be a disaster, and there’s a good chance it’s nearly unreadable at times. Feel free to ask for clarification.
Word abuse warning: I might have unintentionally equivocated the meaning of “moral” between “not immoral” and “actively good” at some point. I think I consistently used it as “not immoral”, but let me know if I equivocated and I’ll fix it. I also might have unintentionally equivocated “immoral” between “bad but can be balanced out by good things” and “so bad that it can never be balanced out by any amount of good things”. Oh, I also might have used the word “axiom” in a slightly unintuitive way. Oops.
_________________________________________________________
Note: When I refer to morality in this post, I do so in the sense of “these are our most fundamental goals/things we should avoid” or “this is the fundamental goodness/badness of these actions”, not “we should act like these are our goals in order to achieve the real fundamental goals”.
Morality is fundamentally subjective and feelings-based, but there seem to be ways that I can be persuaded of fundamental moral claims that aren’t just showing me pictures of starving African children. I’m currently a utilitarian, and there was some way I was talked into it, and I can articulate the thoughts and arguments that led me here, but I can’t describe exactly why or how they led me here. This is a half-attempt at answering that question by discovering the underlying process that makes me convinced by some moral arguments and unconvinced by others.
I think of moral frameworks as sets of axioms about what is “moral” or “immoral” that follow deductive logic. There are a few ways I think I can be convinced or unconvinced of a moral axiom:
An axiom can have a strong emotional appeal on its own, and that’s enough to start. Examples:
The idea of killing someone and stealing their organs to save others feels bad to me, which makes it immoral to me until disproven through the methods below. [1]
The idea of causing nothing but suffering feels bad to me, which makes it immoral to me until disproven through the methods below.
Definition: Two axioms being “analogous” means that you have to invent some other stupid-feeling axiom for them not to have the same level of moral consideration. Examples:
Killing someone at 1:00 is analogous to killing the same person in the exact same circumstances in the exact same world except at 1:01, because for them not to have the same level of moral consideration, you need to invent an axiom about the inherent morality of 1:00 vs. 1:01, which feels stupid.
Killing (in the sense of ending a conscious experience) a human is analogous to killing a dog with the same human brain, because it feels stupid (to me, at least) to invent an axiom about the inherent morality of the positioning of the hair, skin, muscle, bones, DNA, etc. near the brain that generates the consciousness.
Rerouting an overflowing dam from a larger city to a smaller city is analogous to forcefully drowning a group of people in the ocean to save a larger group, like if there are a ton of human-hungry sharks nearby (assume suffering, financial resources lost, etc etc are all the same) because the only real difference is that one occurs within a city whereas one doesn’t, and it seems stupid to have an axiom that cares about “city-ness”. [2]
Analogous axioms must be brought to the same level of moral consideration. This is decided by choosing the one with the strongest emotional appeal. For the dog-with-human-brain example, you [3] likely feel more strongly that killing the human is immoral than you do that killing the dog with the human brain is moral, and since they’re analogous, you accept that both are immoral.
(Even more speculative than the rest of this post) The exact way these are reconciled might vaguely approximate the following: I feel with strength “+5” (positively) about scenario A, I feel with strength “-2” (negatively) about scenario B, and they are analogous, so I readjust my feelings to 5 - 2 = “+3” for both of them, meaning I think both of them are moral. I recognize the problem of there being infinite possible A-like [4] scenarios and infinite possible B-like scenarios, making this calculation impossible [5], but maybe there’s something to the addition idea.
Two axioms can also be “contradictory” (literally logically inconsistent or relying on a stupid-feeling axiom to make them have the same level of moral consideration). Too lazy to come up with an example right now but hopefully you get the idea; “contradictory” axioms are resolved more or less the same way “analogous” ones are.
I still have no idea how I derive more general principles; maybe it has something to do with recognizing patterns about which axioms are analogous to other axioms? Not sure.
I have no idea how I decide when my moral feelings about an axiom are “trustworthy” or “untrustworthy” (in the sense that I adjust its weight downward when having it “battle” against an analogous axiom). It seems obvious that I should trust it more with, for example, magnitudes of one person than magnitudes of a billion people, but what’s the underlying principle that causes me to feel this way? Can this be derived from the above “analogous axioms” strategy somehow?
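To make the “addition idea” from the +5 / -2 example concrete, here is a minimal sketch. Everything in it is an assumption for illustration: the name `reconcile`, the use of plain numbers for feelings, and the optional `weights` argument, which is just one guess at how a “trustworthiness” adjustment could enter. It only handles a finite list of scenarios, so it does nothing about the infinite-scenarios problem.

```python
def reconcile(feelings, weights=None):
    """Combine feelings about analogous scenarios into one shared value.

    feelings: signed strengths, e.g. [5, -2] for scenarios A and B.
    weights:  optional hypothetical trust weights, one per feeling;
              defaults to trusting every feeling equally (weight 1.0).
    """
    if weights is None:
        weights = [1.0] * len(feelings)
    if len(weights) != len(feelings):
        raise ValueError("need exactly one weight per feeling")
    return sum(w * f for w, f in zip(weights, feelings))


# The example from the post: +5 about A and -2 about B, equally trusted.
shared = reconcile([5, -2])

# Same feelings, but trusting the feeling about B only half as much.
shared_less_trust_in_b = reconcile([5, -2], weights=[1.0, 0.5])

print(shared, shared_less_trust_in_b)
```

A positive result would mean both analogous scenarios end up counted as moral, a negative one that both end up immoral. Where the weights would actually come from is exactly the open question above.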
Footnotes:
[1]: For what it’s worth, I do consider this one to be “disproven” in the sense that I’m no longer convinced it’s immoral, because I feel more strongly about increasing utility in all cases than I do about not killing people for their organs, which falls under the process I mentioned for resolving contradictory axioms, I think?
[2]: Now that I think about it, this is probably a horrible example because our intuitions on flooding the city are probably based on actual features about the city like buildings and culture and whatnot, so it’s kinda cheating to say “hold everything between the ocean and city constant and ignore all that”. But what makes this feel like valid reasoning for why this is a horrible example? I should have to justify that since it’s kind of what this post is about. Hmm.
[3]: Phrased as “you” because I have mixed feelings about whether killing a human (i.e. causing net human death) is necessarily bad in the first place and didn’t want to mislead about my actual beliefs, but I assume most people here feel that killing a human is necessarily bad unless it’s balanced out by a positive.
[4]: “A-like” here meaning that it’s analogous to A and carries the same “emotional reaction number” as A.
[5]: Uhh, maybe there’s some way to say they’re the same degree of infinity and it somehow cancels? Probably not.
Maybe a bit of a nitpick, but RLHF’d GPT-4o can still detect Eric Drexler’s writing (chat link). I gave it the first paragraph of his latest blog post, which was written in February 2024, past 4o’s knowledge cutoff date of October 2023. In general I’m not sure if RLHF actually makes the models worse at truesight. It would be interesting to see a benchmark comparing e.g. Llama base vs instruct on this capability.