Discord: LemonUniverse (lemonuniverse). Reddit: u/Smack-works. Substack: The Lost Jockey. About my situation: here.
I wrote some worse posts before 2024 because I was very uncertain about how events might develop.
My point is that chairs and humans can be considered in a similar way.
Please explain how your point connects to my original message: are you arguing with it, supporting it, or asking how my idea applies to something?
I see. But I’m not talking about figuring out human preferences, I’m talking about finding world-models in which real objects (such as “strawberries” or “chairs”) can be identified. Sorry if it wasn’t clear in my original message because I mentioned “caring”.
Models, or real objects, or things, capture something that is not literally present in the world. The world contains shadows of these things, and the most straightforward way of finding models is by looking at the shadows and learning from them.
You might need to specify what you mean a little bit.
The most straightforward way of finding a world-model is just predicting your sensory input. But then you’re not guaranteed to get a model in which something corresponding to “real objects” can be easily identified. That’s one of the main reasons why ELK is hard, I believe: in an arbitrary world-model, “Human Simulator” can be much simpler than “Direct Translator”.
So how do humans get world-models in which something corresponding to “real objects” can be easily identified? My theory is in the original message. Note that the idea is not just “predict sensory input”, it has an additional twist.
Creating an inhumanly good model of a human is related to formulating their preferences.
How does this relate to my idea? I’m not talking about figuring out human preferences.
Thus it’s a step towards eliminating path-dependence of particular life stories
What is “path-dependence of particular life stories”?
I think things (minds, physical objects, social phenomena) should be characterized by computations that they could simulate/incarnate.
Are there other ways to characterize objects? Feels like a very general (or even fully general) framework. I believe my idea can be framed like this, too.
There’s an alignment-related problem, the problem of defining real objects. Relevant topics: environmental goals; task identification problem; “look where I’m pointing, not at my finger”; The Pointers Problem; Eliciting Latent Knowledge.
I think I realized how people go from caring about sensory data to caring about real objects. But I need help with figuring out how to capitalize on the idea.
So… how do humans do it?
Humans create very small models for predicting very small/basic aspects of sensory input (mini-models).
Humans use mini-models as puzzle pieces for building models for predicting ALL of sensory input.
As a result, humans get models in which it’s easy to identify “real objects” corresponding to sensory input.
For example, imagine you’re just looking at ducks swimming in a lake. You notice that ducks don’t suddenly disappear from your vision (permanence), their movement is continuous (continuity), and they seem to move in a 3D space (3D space). All those patterns (“permanence”, “continuity”, and “3D space”) are useful for predicting aspects of immediate sensory input. But all those patterns are also useful for developing deeper theories of reality, such as the atomic theory of matter, because you can imagine that atoms are small things which continuously move in 3D space, similar to ducks. (This image stops working as well when you get to Quantum Mechanics, but then aspects of QM feel less “real” and less relevant for defining objects.) As a result, it’s easy to see how the deeper model relates to surface-level patterns.
In other words: reality contains “real objects” to the extent to which deep models of reality are similar to (models of) basic patterns in our sensory input.
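Here’s a minimal toy sketch of the twist, in Python. Everything in it (the names `continuity_score`, `permanence_score`, `model_score`, and the scoring scheme) is hypothetical, my own illustration rather than a worked-out algorithm: candidate world-models are scored not only on predicting sensory input, but also on whether their hypothesized objects satisfy the same mini-models that describe surface-level patterns.

```python
# Toy illustration of "mini-models as puzzle pieces" (hypothetical, not a real algorithm).
from dataclasses import dataclass
from typing import Callable, List

Trajectory = List[float]  # 1D positions of a hypothesized object over time


def continuity_score(track: Trajectory) -> float:
    """Mini-model: objects move continuously (penalize large jumps between frames)."""
    return -sum(abs(b - a) for a, b in zip(track, track[1:]))


def permanence_score(track: Trajectory, n_frames: int) -> float:
    """Mini-model: objects don't suddenly disappear (a position for every frame)."""
    return 0.0 if len(track) == n_frames else float("-inf")


@dataclass
class WorldModel:
    object_tracks: List[Trajectory]       # hypothesized "real objects"
    predict: Callable[[], List[float]]    # predicted sensory input


def model_score(model: WorldModel, observations: List[float]) -> float:
    # Part 1: the usual criterion, predicting sensory input.
    prediction_error = sum((o - p) ** 2 for o, p in zip(observations, model.predict()))
    # Part 2 (the twist): the model's latent objects are also judged by the
    # mini-models, so the search prefers world-models whose internals look
    # like the surface-level patterns (ducks now, atoms later).
    prior = sum(
        continuity_score(t) + permanence_score(t, n_frames=len(observations))
        for t in model.object_tracks
    )
    return -prediction_error + prior
```

The only point of the sketch is the shape of `model_score`: “real objects” are the latent pieces that the deep model and the mini-models agree on.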
I don’t understand the Model-Utility Learning (MUL) section; what pathological behavior does the AI exhibit?
Since humans (or something) must be labeling the original training examples, the hypothesis that building bridges means “what humans label as building bridges” will always be at least as accurate as the intended classifier. I don’t mean “whatever humans would label”. I mean the hypothesis that “build a bridge” means specifically the physical situations which were recorded as training examples for this system in particular, and labeled by humans as such.
So it’s like overfitting? If I train a MUL AI to play piano in a green room, it learns that “playing piano” means “playing piano in a green room” or “playing piano in a room which would be chosen for training me in the past”?
Now, we might reasonably expect that if the AI considers a novel way of “fooling itself” which hasn’t been given in a training example, it will reject such things for the right reasons: the plan does not involve physically building a bridge.
But “sensory data being a certain way” is a physical event which happens in reality, so a MUL AI might still learn to be a solipsist? MUL isn’t guaranteed to solve misgeneralization in any way?
If the answer to my questions is “yes”, what did we even hope for with MUL?
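As an aside, here’s a tiny sketch of the kind of misgeneralization I’m gesturing at with the piano example above (my own toy illustration, not the MUL proposal itself): if “green room” and “piano” explain the training labels equally well, a learner can latch onto the wrong one.

```python
# Toy illustration of the "piano in a green room" worry (not MUL itself).
# Features: (piano_present, room_is_green); label: 1 = "playing piano".
train = [((1, 1), 1)] * 3 + [((0, 0), 0)] * 3

def feature_accuracy(i: int) -> float:
    """How well does feature i alone predict the training labels?"""
    return sum(x[i] == y for x, y in train) / len(train)

print(feature_accuracy(0), feature_accuracy(1))  # 1.0 1.0: both features fit the labels perfectly

# A learner that happens to pick the "green room" feature gets the training
# set right but calls piano-in-a-blue-room "not playing piano":
green_room_classifier = lambda x: x[1]
print(green_room_classifier((1, 0)))  # 0, i.e. misgeneralization off-distribution
```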
I’m noticing two things:
It’s suspicious to me that values of humans-who-like-paperclips are inherently tied to acquiring an unlimited amount of resources (no matter in which way). Maybe I don’t treat such values as 100% innocent, so I’m OK keeping them in check. Though we can come up with thought experiments where the urge to get more resources is justified by something. Like, maybe instead of producing paperclips those people want to calculate Busy Beaver numbers, so they want more and more computronium for that.
How consensual were the trades if their outcome is predictable and other groups of people don’t agree with the outcome? Looks like coercion.
Often I see people dismiss the things the Epicureans got right with an appeal to their lack of the scientific method, which has always seemed a bit backwards to me.
The most important thing, I think, is not even hitting the nail on the head, but knowing (i.e. really acknowledging) that a nail can be hit in multiple places. If you know that, the rest is just a matter of testing.
But avoidance of value drift or of unendorsed long term instability of one’s personality is less obvious.
What if endorsed long term instability leads to negation of personal identity too? (That’s something I thought about.)
I think corrigibility is the ability to change a value/goal system. That’s the literal meaning of the term… “correctable”. If an AI were fully aligned, there would be no need to correct it.
Perhaps I should make a better argument:
It’s possible that AGI is correctable, but (a) we don’t know what needs to be corrected or (b) we cause new, less noticeable problems while correcting it.
So, I think there’s not two assumptions “alignment/interpretability is not solved + AGI is incorrigible”, but only one — “alignment/interpretability is not solved”. (A strong version of corrigibility counts as alignment/interpretability being solved.)
Yes, and that’s the specific argument I am addressing, not AI risk in general. Except that if it’s many, many times smarter, it’s ASI, not AGI.
I disagree that “doom” and “AGI going ASI very fast” are certain (> 90%) too.
It’s not aligned at every possible point in time.
I think corrigibility is “AGI doesn’t try to kill everyone and doesn’t try to prevent/manipulate its modification”. Therefore, in some global sense such AGI is aligned at every point in time. Even if it causes a local disaster.
Over 90%, as I said.
Then I agree, thank you for re-explaining your opinion. But I think other probabilities count as high too.
To me, the ingredients of danger (but not “> 90%”) are these:
1st. AGI can be built without Alignment/Interpretability being solved. If that’s true, building AGI slowly or being able to fix visible problems may not matter that much.
2nd and 3rd. AGI can have planning ability. AGI can come up with the goal pursuing which would kill everyone.
2nd (alternative). AIs and AGIs can kill most humans without real intention of doing so, by destabilizing the world/amplifying already existing risks.
If I remember correctly, Eliezer also believes in an “intelligence explosion” (AGI won’t be just smarter than humanity, but many, many times smarter than humanity: like humanity is smarter than ants/rats/chimps). Haven’t you forgotten to add that assumption?
why is “superintelligence + misalignment” highly conjunctive?
In the sense that matters, it needs to be fast, surreptitious, incorrigible, etc.
What opinion are you currently arguing? That the risk is below 90% or something else? What counts as “high probability” for you?
Incorrigible misalignment is at least one extra assumption.
I think “corrigible misalignment” doesn’t exist; a corrigible AGI is already aligned (unless AGI can kill everyone very fast by pure accident). But we may be defining the terms differently. To avoid confusion, please give examples of scenarios you’re thinking about. The examples can be very abstract.
If AGI is AGI, there won’t be any problems to notice
Huh?
I mean, you haven’t explained what “problems” you’re talking about. AGI suddenly declaring “I think killing humans is good, actually” after looking aligned for 1 year? If you didn’t understand my response, a more respectful answer than “Huh?” would be to clarify your own statement. What noticeable problems did you talk about in the first place?
Please, proactively describe your opinions. Is it too hard to do? Conversation takes two people.
I’ve confused you with people who deny that a misaligned AGI is even capable of killing most humans. Glad to be wrong about you.
But I am not saying that the doom is unlikely given superintelligence and misalignment; I am saying the argument that gets there—superintelligence + misalignment—is highly conjunctive. The final step, the execution as it were, is not highly conjunctive.
But I don’t agree that it’s highly conjunctive.
If AGI is possible, then its superintelligence is a given. Superintelligence isn’t given only if AGI stops at human level of intelligence + can’t think much faster than humans + can’t integrate abilities of narrow AIs naturally. (I.e. if AGI is basically just a simulation of a human and has no natural advantages.) I think most people don’t believe in such AGI.
I don’t think misalignment is highly conjunctive.
I agree that hard takeoff is highly conjunctive, but why is “superintelligence + misalignment” highly conjunctive?
I think it’s needed for the “likely”. Slow takeoff gives humans more time to notice and fix problems, so the likelihood of bad outcomes goes down. Wasn’t that obvious?
If AGI is AGI, there won’t be any problems to notice. That’s why I think probability doesn’t decrease enough.
...
I hope that Alignment is much easier to solve than it seems. But I’m not sure (a) how much weight to put into my own opinion and (b) how much my probability of being right decreases the risk.
Yes, I probably mean something other than “>90%”.
[lists of various catastrophes, many of which have nothing to do with AI]
Why are you doing this? I did not say there is zero risk of anything. (...) Are you using “risk” to mean the probability of the outcome, or the impact of the outcome?
My argument is based on comparing the phenomenon of AGI to other dangerous phenomena. The argument is intended to show that a bad outcome is likely (if AGI wants to do a bad thing, it can achieve it) and that the impact of the outcome can kill most humans.
I think it’s needed for the “likely”. Slow takeoff gives humans more time to notice and fix problems, so the likelihood of bad outcomes goes down. Wasn’t that obvious?
To me the likelihood doesn’t go down enough (to tolerable levels).
Informal logic is more holistic than not, I think, because it relies on implicit assumptions.
It’s not black and white. I don’t think there is zero risk, and I don’t think it is Certain Doom, so it’s not what I am talking about. Why are you bringing it up? Do you think there is a simpler argument for Certain Doom?
Could you proactively describe your opinion? Or re-describe it, by adding relevant details. You seemed to say “if hard takeoff, then likely doom; but hard takeoff is unlikely, because hard takeoff requires a conjunction of things to be true”. I answered that I don’t think hard takeoff is required. You didn’t explain that part of your opinion. Now it seems your opinion is more general (not focused on hard takeoff), but you refuse to clarify it. So, what is the actual opinion I’m supposed to argue with? I won’t try to use every word against you, so feel free to write more.
Doom meaning what? It’s obvious that there is some level of risk, but some level of risk isn’t Certain Doom. Certain Doom is an extraordinary claim, and the burden of proof therefore is on (certain) doomers. But you seem to be switching between different definitions.
I think “AGI is possible” or “AGI can achieve extraordinary things” is the extraordinary claim. The worry about its possible extraordinary danger is natural. Therefore, I think AGI optimists bear the burden of proving that a) likely risk of AGI is bounded by something and b) AGI can’t amplify already existing dangers.
By “likely doom” I mean likely (near-)extinction. “Likely” doesn’t have to be 90%.
Saying “the most dangerous technology with the worst safety and the worst potential to control it” doesn’t actually imply a high level of doom (p > .9) or a high level of risk (> 90% dead); it’s only a relative statement.
I think it does imply so, modulo “p > 90%”. Here’s a list of the most dangerous phenomena: (L1)
Nuclear warfare. World wars.
An evil and/or suicidal world-leader.
Deadly pandemics.
Crazy ideologies, e.g. fascism. Misinformation. Addictions. People being divided on everything. (Problems of people’s minds.)
And a list of the most dangerous qualities: (L2)
Being superintelligent.
Wanting, planning to kill everyone.
Having a cult-following. Humanity being dependent on you.
Having direct killing power (like a deadly pandemic or a set of atomic bombs).
Multiplicity/simultaneity. E.g. if we had TWO suicidal world-leaders at the same time.
Things from L1 can barely scrape two points from L2, yet they can cause mass disruptions, claim many victims, and trigger each other. Narrow AI could secure three points from L2 (narrow superintelligence + cult-following, dependency + multiplicity/simultaneity) — weakly, but potentially better than a powerful human ever could. However, AGI can easily secure three points from L2 in full. Four points, if AGI is developed in more than a single place. And I expect you to grant that general superintelligence presents a special, unpredictable danger.
Given that, I don’t see what should bound the risk from AGI or prevent it from amplifying already existing dangers.
Why? I’m saying p(doom) is not high. I didn’t mention P(otherstuff).
To be able to argue something (/decide how to go about arguing something), I need to have an idea about your overall beliefs.
That doesn’t imply a high probability of mass extinction.
Could you clarify what your own opinion even is? You seem to agree that rapid self-improvement would mean likely doom. But you aren’t worried about gradual self-improvement or AGI being dangerously smart without much (self-)improvement?
I think I have already answered that: I don’t think anyone is going to deliberately build something they can’t control at all. So the probability of mass extinction depends on creating an uncontrollable superintelligence accidentally—for instance, by rapid recursive self-improvement. And RRSI, AKA Foom Doom, is a conjunction of claims, all of which are p<1, so it is not high probability.
I agree that probability mostly depends on accidental AGI. I don’t agree that probability mostly depends on (very) hard takeoff. I believe probability mostly depends on just “AGI being smarter than all of humanity”. If you have a kill-switch or whatever, an AGI without Alignment theory being solved is still “the most dangerous technology with the worst safety and the worst potential to control it”.
So, could you go into more cruxes of your beliefs, more context? (More or less the full context of my own beliefs is captured by the previous comment. But I’m ready to provide more if needed.) To provide more context for your beliefs, you could try answering “what’s the worst disaster (below everyone being dead) an AGI is likely to cause” or “what’s the best benefit an AGI is likely to give”. To make sure you aren’t treating an AGI as impotent in negative scenarios and as a messiah in positive scenarios. Or treating humans as incapable of sinking even a safe, non-sentient boat, or of refusing to vaccinate against viruses.
I want to discuss this topic with you iff you’re ready to proactively describe the cruxes of your own beliefs. I believe in likely doom and I don’t think the burden of proof is on “doomers”.
Maybe there just isn’t a good argument for Certain Doom (or at least high-probability near-extinction). I haven’t seen one.
What do you expect to happen when you’re building uninterpretable technology without safety guarantees, smarter than all of humanity? Looks like the most dangerous technology with the worst safety and the worst potential to control it.
To me, those abstract considerations are enough a) to conclude likely doom and b) to justify common folk in blocking AI capability research — if common folk could do so.
I believe experts should have accountability (even before a disaster happens) and owe some explanation of what they’re doing. If an expert is saying “I’m building the most impactful technology without safety, but that’s suddenly OK this time around because… I can’t say, you need to be an expert to understand”, I think it’s OK to not accept the answer and block the research.
You are correct that critical thinkers may want to censor uncritical thinkers. However, independent-minded thinkers do not want to censor conventional-minded thinkers.
I still don’t see it. I don’t see a causal mechanism that would produce it. Even if we replace “independent-minded” with “independent-minded and valuing independent-mindedness for everyone”. I have the same problems with it as Ninety-Three and Raphael Harth.
To give my own example. Algorithms in social media could be a little too good at radicalizing and connecting people with crazy opinions, such as flat earth. A person censoring such algorithms/their output could be motivated by the desire to make people more independent-minded.
I deliberately avoided examples for the same reason Paul Graham’s What You Can’t Say deliberately avoids giving any specific examples: because either my examples would be mild and weak (and therefore poor illustrations) or they’d be so shocking (to most people) they’d derail the whole conversation. (comment)
I think the value of a general point can only stem from re-evaluating specific opinions. Therefore, sooner or later the conversation has to tackle specific opinions.
If “derailment” is impossible to avoid, then “derailment” is a part of the general point. Or there are more important points to be discussed. For example, if you can’t explain General Relativity to cave people, maybe you should explain “science” and “language” first — and maybe those tangents are actually more valuable than General Relativity.
I dislike Graham’s essay for the same reason: when Graham does introduce some general opinions (“morality is like fashion”, “censoring is motivated by the fear of free-thinking”, “there’s no prize for figuring it out quickly”, “a statement can’t be worse than false”), they’re not discussed critically, with examples. Graham’s follow-up, “Re: What You Can’t Say”, looks weird to me: invisible opponents are allowed to say only one sentence, and each sentence gets a lengthy “answer” with more opinions.
We only censor other people more-independent-minded than ourselves. (...) Independent-minded people do not censor conventional-minded people.
I’m not sure that’s true. Not sure I can interpret the “independent/dependent” distinction.
In the “weirdos/normies” case, a weirdo can want to censor the ideas of normies. For example, some weirdos in my country want to censor LGBTQ+ stuff. They already do.
In the “critical thinkers/uncritical thinkers” case, people with more critical thinking may want to censor uncritical thinkers. (I believe so.) For example, LW in particular has a couple of ways to censor someone, direct and indirect.
In general, I like your approach of writing this post like an “informal theorem”.
Meta-level comment: I don’t think it’s good to dismiss original arguments immediately and completely.
Object-level comment:
I think it might be more complicated than that:
We need to define what “a model produced by a reward function” means, otherwise the claims are meaningless. Like, if you made just a single update to the model (based on the reward function), calling it “a model produced by the reward function” is meaningless (’cause no real optimization pressure was applied). So we do need to define some goal of optimization (which determines who’s a winner and who’s a loser).
We need to argue that the goal is sensible. I.e. somewhat similar to a goal we might use while training our AIs.
Here are some things we can try:
We can try defining all currently living species as winners. But is it sensible? Is it similar to a goal we would use while training our AIs? “Let’s optimize our models for N timesteps and then use all surviving models regardless of any other metrics” ← I think that’s not sensible, especially if you use an algorithm which can introduce random mutations into the model.
We can try defining species which avoided substantial changes for the longest time as winners. This seems somewhat sensible, because those species experienced the longest optimization pressure. But then humans are not the winners.
We can define any species which gained general intelligence as winners. Then humans are the only winners. This is sensible for two reasons. First, with general intelligence deceptive alignment is possible: if humans knew that Simulation Gods optimize organisms for some goal, humans could focus on that goal or kill all competing organisms. Second, many humans (in our reality) value creating AGI more than solving any particular problem.
I think the latter is the strongest counter-argument to “humans are not the winners”.
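To make the “we need to define the winners” point concrete, here’s a minimal toy sketch (all names and criteria here are hypothetical, my own illustration): one crude selection loop, then the three candidate “winner” definitions from the list above applied to the same survivors, giving three different answers.

```python
# Toy illustration: different "winner" definitions pick out different species.
import random

random.seed(0)

def fitness(species, environment):
    # Crude stand-in: how closely a trait matches a drifting selection target.
    return -abs(species["trait"] - environment)

population = [
    {"name": f"s{i}", "trait": random.uniform(0, 10),
     "unchanged_for": 0, "general_intelligence": False}
    for i in range(20)
]

environment = 5.0
for step in range(100):
    environment += random.uniform(-0.2, 0.2)                  # drifting environment
    population.sort(key=lambda s: fitness(s, environment), reverse=True)
    population = population[:10]                               # selection
    offspring = []
    for s in population:
        child = dict(s)
        if random.random() < 0.3:                              # mutation
            child["trait"] += random.uniform(-0.5, 0.5)
            child["unchanged_for"] = 0
        else:
            child["unchanged_for"] += 1
        offspring.append(child)
    population = population + offspring

# Pretend one surviving lineage stumbled into general intelligence along the way.
population[0]["general_intelligence"] = True

# Three different "winner" criteria, three different answers:
winners_all_survivors = population                                            # try 1
winner_most_stable    = max(population, key=lambda s: s["unchanged_for"])     # try 2
winners_general_intel = [s for s in population if s["general_intelligence"]]  # try 3
```

The three criteria disagree, which is the point of my first comment: until we say which criterion is analogous to the goal we’d use while training our AIs, claims about evolution’s winners and losers are underdefined.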