Discord: LemonUniverse (lemonuniverse). Reddit: u/Smack-works. About my situation: here.
I wrote some worse posts before 2024 because I was very uncertain how the events may develop.
Could you ELI15 the difference between Kolmogorov complexity (KC) and Kolmogorov structure function (KSF)?
Here are some of the things needed to formalize the proposal in the post:
1. A complexity metric defined for different model classes.
2. A natural way to “connect” models, so we can identify the same object (e.g. “diamond”) in two different models. Related: multi-level maps.
I feel something like KSF could tackle 1, but what about 2?
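For reference, here are the textbook definitions as I currently understand them (please correct me if I'm misreading the formalism), for a fixed universal machine U:

```latex
K(x) = \min\{\, |p| : U(p) = x \,\}
\qquad \text{(Kolmogorov complexity)}

h_x(\alpha) = \min\{\, \log_2 |S| : x \in S,\ K(S) \le \alpha \,\}
\qquad \text{(Kolmogorov structure function, } S \text{ a finite set containing } x\text{)}
```

So KC is a single number, while KSF is a whole curve: for each model-complexity budget α it says how many bits of x are left unexplained by the best "model" (set S) of that complexity. That's why KSF looks relevant to point 1: it talks about models of different complexity, not just the single shortest program.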
Thanks for clarifying! Even if I still don’t fully understand your position, I now see where you’re coming from.
No, I think it’s what humans actually pursue today when given the options. I’m not convinced that these values are static, or coherent, much less that we would in fact converge.
Then those values/motivations should be limited by the complexity of human cognition, since they’re produced by it. Isn’t that trivially true? I agree that values can be incoherent, fluid, and not converging to anything. But building Task AGI doesn’t require building an AGI which learns coherent human values. It “merely” requires an AGI which doesn’t affect human values in large and unintended ways.
No, because we don’t comprehend them, we just evaluate what we want locally using the machinery directly, and make choices based on that.
This feels like arguing over definitions. If you have an oracle for solving certain problems, this oracle can be defined as a part of your problem-solving ability. Even if it’s not transparent compared to your other problem-solving abilities. Similarly, the machinery which calculates a complicated function from sensory inputs to judgements (e.g. from Mona Lisa to “this is beautiful”) can be defined as a part of our comprehension ability. Yes, humans don’t know (1) the internals of the machinery or (2) some properties of the function it calculates — but I think you haven’t given an example of how human values depend on knowledge of 1 or 2. You gave an example of how human values depend on the maxima of the function (e.g. the desire to find the most delicious food), but that function having maxima is not an unknown property, it’s a trivial property (some foods are worse than others, therefore some foods have the best taste).
That’s a very big “if”! And simplicity priors are made questionable, if not refuted, by the fact that we haven’t gotten any convergence about human values despite millennia of philosophy trying to build such an explanation.
I agree that ambitious value learning is a big “if”. But Task AGI doesn’t require it.
To pursue their values, humans should be able to reason about them. To form preferences about a thing, humans should be able to consider the thing. Therefore, human ability to comprehend should limit what humans can care about. At least before humans start unlimited self-modification. I think this logically can’t be false.
Eliezer Yudkowsky is a core proponent of complexity of value, but in Thou Art Godshatter and Protein Reinforcement and DNA Consequentialism he basically makes a point that human values arose from complexity limitations, including complexity limitations imposed by brainpower limitations. Some famous alignment ideas (e.g. NAH, Shard Theory) kinda imply that human values are limited by human ability to comprehend and it doesn’t seem controversial. (The ideas themselves are controversial, but for other reasons.)
If learning values is possible at all, there should be some simplicity biases which help to learn them. Wouldn’t it be strange if those simplicity biases were absolutely unrelated to simplicity biases of human cognition?
Based on your comments, I can guess that something below is the crux:
You define “values” as ~”the decisions humans would converge to after becoming arbitrarily more knowledgeable”. But that’s a somewhat controversial definition (some knowledge can lead to changes in values) and even given that definition it can be true that “past human ability to comprehend limits human values” — since human values were formed before humans explored unlimited knowledge. Some values formed when humans were barely generally intelligent. Some values formed when humans were animals.
You say that values depend on inscrutable brain machinery. But can’t we treat the machinery as a part of “human ability to comprehend”?
You talk about ontology. Humans can care about real diamonds without knowing what physical things the diamonds are made from. My reply: I define “ability to comprehend” based on ability to comprehend functional behavior of a thing under normal circumstances. Based on this definition, a caveman counts as being able to comprehend the cloud of atoms his spear is made of (because the caveman can comprehend the behavior of the spear under normal circumstances), even though the caveman can’t comprehend atomic theory.
Could you confirm or clarify the crux? Your messages felt ambiguous to me. In what specific way is A false?
Are you talking about value learning? My proposal doesn’t tackle advanced value learning. Basically, my argument is “if (A) human values are limited by human ability to comprehend/optimize things and (B) the factors which make something easier or harder to comprehend/optimize are simple, then the AI can avoid accidentally messing up human values — so we can define safe impact measures and corrigibility”. My proposal is not supposed to make the AI learn human values in great detail or extrapolate them out of distribution. My argument is “if A and B hold, then we can draw a box around human values and tell the AI to not mess up the contents of the box — without making the AI useless; yet the AI might not know what exact contents of the box count as ‘human values’”.[1]
The problem with B is that humans have very specialized and idiosyncratic cognitive machinery (the machinery generating experiences) which is much more advanced than human general ability to comprehend things. I interpreted you as making this counterargument in the top level comment. My reply is that I think human values depend on that machinery in a very limited way, so B is still true enough. But I’m not talking about extrapolating something out of distribution. Unless I’m missing your point.
Why those things follow from A and B is not obvious and depends on a non-trivial argument. I tried to explain it in the first section of the post, but might’ve failed.
Thanks for elaborating! This might lead to a crux. Let me summarize the proposals from the post (those summaries can’t replace reading the post though).
Outer alignment:
We define something like a set of primitives. Those primitives are independent of any specific ontology.
We prove[1] that as long as AI acts and interprets tasks using those primitives, it can prevent humans from being killed or brainwashed or disempowered. Even if the primitives are not enough to give a very nuanced definition of a “human” or “brainwashing”. That’s where the “we can express care about incomprehensible things as care about comprehensible properties of incomprehensible things” argument comes into play.
Inner alignment:
We prove that a more complicated model (made of the primitives) can’t deceive a simpler model (made of the primitives). The inner/outer alignment of simple enough models can be verified manually.
We prove that the most complicated model (expressible with the primitives) has at least human-level intelligence.
Bonus: we prove that any model (made of the primitives) is interpretable/learnable by humans and prove that you don’t need more complicated models for defining corrigibility/honesty. Disclaimer: the proposals above are not supposed to be practical, merely bounded and conceptually simple.
Why the heck would we be able to define primitives with such wildly nice properties? Because of the argument that human ability to comprehend and act in the world limits what humans might currently care about, and the current human values are enough to express corrigibility. If you struggle to accept this argument, maybe try assuming it’s true and see if you can follow the rest of the logic? Or try to find a flaw in the logic instead of disagreeing with the definitions. Or bring up a specific failure mode.
To have our values interface appropriately with these novel thinking patterns in the AI, including through corrigibility, I think we have to work with “values” that are the sort of thing that can refer / be preserved / be transferred across “ontological” changes.
If you talk about ontological crisis or inner alignment, I tried to address those in the post. By the way, I read most of your blog post and skimmed the rest.
To actually prove it we need to fully formalize the idea, of course. But I think my idea is more specific than many other alignment ideas (e.g. corrigibility, Mechanistic Anomaly Detection, Shard Theory).
I probably disagree. I get the feeling you have an overly demanding definition of “value” which is not necessary for solving corrigibility and a bunch of other problems. Seems like you want to define “value” closer to something like CEV or “caring about the ever-changing semantic essence of human ethical concepts”. But even if we talk about those stronger concepts (CEV-like values, essences), I’d argue the dynamic I’m talking about (“human ability to comprehend limits what humans can care about”) still applies to them to an important extent.
See my response to David about a very similar topic. Lmk if it’s useful.
Basically, I don’t think your observation invalidates any ideas from the post.
The main point of the post is that human ability to comprehend should limit what humans can care about. This can’t be false. Like, logically. You can’t form preferences about things you can’t consider. When it looks like humans form preferences about incomprehensible things, they really form preferences only about comprehensible properties of those incomprehensible things. In the post I make an analogy with a pseudorandom number generator: it’s one thing to optimize a specific state of the PRNG or want the PRNG to work in a specific way, and another thing to want to preserve the PRNG’s current algorithm (whatever it is). The first two goals might be incomprehensible, but the last goal is comprehensible. Caring about friends works in a similar way to caring about a PRNG. (You might dislike this framing for philosophical or moral reasons, that’s valid, but it won’t make object-level ideas from the post incorrect.)
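A minimal sketch of the PRNG analogy (my own toy code, just to make the three goals concrete): the first two goals refer to the generator’s internals, while the third can be stated, and checked, without ever looking inside.

```python
import random

# Toy PRNG whose internals we treat as incomprehensible.
class OpaquePRNG:
    def __init__(self, seed: int):
        self._rng = random.Random(seed)     # "whatever algorithm is in there"

    def next(self) -> int:
        return self._rng.randrange(100)

# Goal 1 (possibly incomprehensible): force the PRNG into one exact internal state.
# Goal 2 (possibly incomprehensible): demand that it implement one specific algorithm.
# Goal 3 (comprehensible): preserve whatever algorithm it currently runs, i.e.
#   "don't mess with it", checkable by comparing behavior before and after.

snapshot = [OpaquePRNG(seed=42).next() for _ in range(5)]   # reference behavior

def preserved(candidate: OpaquePRNG) -> bool:
    """Goal 3: the candidate behaves like the status quo, whatever that is."""
    return [candidate.next() for _ in range(5)] == snapshot

print(preserved(OpaquePRNG(seed=42)))   # True: the status quo is preserved
print(preserved(OpaquePRNG(seed=7)))    # False: someone messed with the PRNG
```

Note that the checker never needs to know which algorithm the PRNG runs; it only needs to compare its behavior to the status quo. Caring about friends works similarly in my framing.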
Yes, some value judgements (e.g. “this movie is good”, “this song is beautiful”, or even “this is a conscious being”) depend on inscrutable brain machinery, the machinery which creates experience. The complexity of our feelings can be orders of magnitude greater than the complexity of our explicit reasoning. Does it kill the proposal in the post? I think not, for the following reason:
We aren’t particularly good at remembering exact experiences, we like very different experiences, we can’t access each other’s experiences, and we have very limited ways of controlling experiences. So, there should be pretty strict limitations on how much understanding of the inscrutable machinery is required for respecting the current human values. Defining corrigible behavior (“don’t kill everyone”, “don’t seek power”, “don’t mess with human brains”) shouldn’t require answering many specific, complicated machinery-dependent questions (“what separates good and bad movies?”, “what separates good and bad life?”, “what separates conscious and unconscious beings?”).
Also, some thoughts about your specific counterexample (I generalized it to being about experiences in general):
“How stimulating or addicting or novel is this experience?” ← I think those parameters were always comprehensible and optimizable, even in the Stone Age. (In a limited way, but still.) For example, it’s easy to get different gradations of “less addicting experiences” by getting injuries, starving or not sleeping.
“How ‘good’ is this experience in a more nebulous or normative way?” ← I think this is a more complicated value (aesthetic taste), based on simpler values.
Note that I’m using “easy to comprehend” in the sense of “the thing behaves in a simple way most of the time”, not in the sense of “it’s easy to comprehend why the thing exists” or “it’s easy to understand the whole causal chain related to the thing”. I think the latter senses are not useful for a simplicity metric, because they would mark everything as equally incomprehensible.
Note that “I care about taste experiences” (A), “I care about particular chemicals giving particular taste experiences” (B), and “I care about preserving the status quo connection between chemicals and taste experiences” (C) are all different things. B can be much more complicated than C, B might require the knowledge of chemistry while C doesn’t.
Does any of the above help to find the crux of the disagreement or understand the intuitions behind my claim?
Could you reformulate the last paragraph as “I’m confused how your idea helps with alignment subproblem X”, “I think your idea might be inconsistent or have a failure mode because of Y”, or “I’m not sure how your idea could be used to define Z”?
Wrt the third paragraph. The post is about corrigible task ASI which could be instructed to protect humans from being killed/brainwashed/disempowered (and which won’t kill/brainwash/disempower people before it’s instructed not to do this). The post is not about value learning in the sense of “the AI learns more or less the entirety of human ethics and can build a utopia on its own”. I think developing my idea could help with such value learning, but I’m not sure I can easily back up this claim. Also, I don’t know how to apply my idea directly to neural networks.
I think I understand you now. Your question seems much simpler than I expected. You’re basically just asking “but what if we’ll want infinitely complicated / detailed values in the future?”
If people iteratively modified themselves, would their preferences become ever more exacting? If so, then it is true that the “variables humans care about can’t be arbitrarily complicated”, but the variables humans care about could define a desire to become a system capable of caring about arbitrarily complicated variables.
It’s OK if the principle won’t be true for humans in the future; it only needs to be true for the current values. Aligning AI to some of the current human concepts should be enough to define corrigibility and low impact, or to avoid goodharting, i.e. to create a safe Task AGI. I’m not trying to dictate to anyone what they should care about.
Don’t worry about not reading it all. But could you be a bit more specific about the argument you want to make or the ambiguity you want to clarify? I have a couple of interpretations of your question.
Interpretation A:
The post defines a scale-dependent metric which is supposed to tell how likely humans are to care about something.
There are objects which are identical/similar on every scale. Do they break the metric? (Similar questions can be asked about things other than “scale”.) For example, what if our universe contains an identical, but much smaller universe, with countless people in it? Men In Black style. Would the metric say we’re unlikely to care about the pocket universe just because of its size?
Interpretation B:
The principle says humans don’t care about constraining things in overly specific ways.
Some concepts with low Kolmogorov Complexity constrain things in infinitely specific ways.
My response to B is that my metric of simplicity is different from Kolmogorov Complexity.
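For concreteness, the standard kind of example I have in mind for B (my illustration, not a quote from anyone) is the digits of π:

```latex
% The binary expansion of \pi: a constant-size program enumerates its digits, so
K(\pi_{1:n}) \le c + O(\log n) \quad \text{for every } n,
% i.e. the concept has low Kolmogorov complexity, yet it pins down each of
% infinitely many digits exactly ("infinitely specific" constraints).
```

Low KC, but infinitely specific constraints; that mismatch is part of why I say the relevant simplicity metric can't just be Kolmogorov Complexity.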
Thanks a lot for willingness to go into details. And for giving advice on messaging other researchers.
No offense taken. The marriage option was funny, hope I never get that desperate. Getting official grants is probably not possible for me, but thanks for the suggestion.
by both sides, to be precise
My wording was deliberate. It’s one thing to sanction another country, and another thing to “sanction yourself”.
I’m an independent alignment researcher in Russia. Imagine someone wants to donate money to me (from Europe/UK/America/etc). How can I receive the money? It’s really crucial for me to receive at least $100 per month, at least for a couple of months. Even $40 per month would be a small relief. [EDIT: my latest, most well-received, and only published alignment research is here.]
Below are all the methods I learned about after asking people, Google, YouTube, and LLMs:
Crypto. The best option. But currently I’m a noob at crypto.
There are some official ways to get money into Russia, but the bank can freeze those transfers or start an inquiry.
Some freelance artists use Boosty (a Patreon-like site). But Boosty can stall money transfers for months or more. And if your account doesn’t have subscribers and legitimate content, it can raise the site’s suspicion.
Someone from a ‘friendly country’ or from Russia itself could act as an intermediary. The last link on this page refers to a network of Russian alignment researchers. (Additional challenge: I don’t have a smartphone. With a dumbphone it’s practically impossible to register on Telegram. But most Russian alignment researchers are there.)
Get out of Russia. Impossible for me, even with financial help.
What should I do? Is there any system for supporting Russian researchers?
Also, if I approach a fellow Russian researcher about my problem, what should I say? I don’t have experience in this.
Needless to say, the situation is pretty stressful for me. Imagine getting a chance to earn something for your hard work, but then you can’t get even pennies because of absolutely arbitrary restrictions imposed by your own state.
EDIT 2: I got help. Thanks everyone!
Even with chess there are some nuances:
Chess engines use much more brute force than humans. Though I think it’s not that easy to compare who does more calculation, since humans have a lot of memory and pattern recognition. Also, I’ve heard about strong chess engines “without search” (grandmaster level), but haven’t looked into it.
This might be outdated, but chess engines struggle with “fortresses” (a rare position type in chess).
You at various points rely on an assumption that there is one unique scale of complexity (one ladder of properties), and it’ll be shared between the humans and the AI. That’s not necessarily true, which creates a lot of leaks where an AI might do something that’s simple in the AI’s internal representation but complicated in the human’s.
I think there are many somewhat different scales of complexity, but they’re all shared between the humans and the AI, so we can choose any of them. We start with properties which are definitely easy for humans to understand. Then we gradually relax those properties. According to the principle, the relaxed properties will capture all key variables relevant to human values long before top human mathematicians and physicists stop understanding what those properties might describe. (Because most of the time, living a value-filled life doesn’t require using the best mathematical and physical knowledge of the day.) My model: “the entirety of human ontology >>> the part of human ontology a corrigible AI needs to share”.
This raises a second problem, which is the “easy to optimize” criterion, and how it might depend on the environment and on what tech tree unlocks (both physical and conceptual) the agent already has. Pink paint is pretty sophisticated, even though our current society has commodified it so we can take getting some for granted. Starting from no tech tree unlocks at all, you can probably get to hacking humans before you can recreate the Sherwin Williams supply chain.
There are three important possibilities relevant to your hypothetical:
1. If technology T and human hacking are equally hard to comprehend, then either (a) we don’t want the AI to build technology T or (b) the AI should be able to screen off technology T from humans more or less perfectly. For example, maybe producing paint requires complex manipulations with matter, but those manipulations should be screened off from humans. The last paragraph in this section mentions a similar situation.
2. Technology T is easier to comprehend than human hacking, but it’s more expensive (requires more resources). Then we should be able to allow the AI to use those resources, if we want to. We should be controlling how much resources the AI is using anyway, so I’m not introducing any unnatural epicycles here.[1]
3. If humans themselves built technology T which affects them in a complicated way (e.g. drugs), it doesn’t mean the AI should build similar types of technology on its own.
My point here is that I don’t think technology undermines the usefulness of my metric. And I don’t think that’s a coincidence. According to the principle, one or both of the below should be true:
Up to this point in time, technology never affected what’s easy to optimize/comprehend on a deep enough level.
Up to this point in time, humans never used technology to optimize/comprehend (on a deep enough level) most of their fundamental values.
If neither were true, we would believe that technology radically changed fundamental human values at some point in the past. We would see life without technology as devoid of most non-trivial human values.
When the metric is a bit fuzzy and informal, it’s easy to reach convenient/hopeful conclusions about how the human-intended behavior is easy to optimize, but it should be hard to trust those conclusions.
The selling point of my idea is that it comes with a story for why it’s logically impossible for it to fail or why all of its flaws should be easy to predict and fix. Is it easy to come up with such story for other ideas? I agree that it’s too early to buy that story. But I think it’s original and probable enough to deserve attention.
Remember that I’m talking about a Task-directed AGI, not a Sovereign AGI.
Can anybody give/reference an ELI5 or ELI15 explanation of this example? How can we use the models without creating them? I know that gradient descent is used to update neural networks, but how can you get the predictions of those NNs without having them?
I feel very confused about the problem. Would appreciate anyone’s help with the questions below.
1. Why doesn’t the Gooder Regulator theorem solve the Agent-Like Structure Problem?
2. The separation between the “world model”, “search process” and “problem specification” should be in space (not in time)? We should be able to carve the system into those parts, physically?
3. Why would the problem specification necessarily be outside of the world model? I imagine it could be encoded as an extra object in the world model. Any intuition for why keeping them separate is good for the agent? (I’ll propose one myself, see 5.)
4. Why are the “world model” and “search process” two different entities, and what does each of them do? What is the fundamental difference between “modeling the world” and “searching”? Like, imagine I have different types of heuristics (A, B, C) for predicting the world, but I can also use them for search (see the toy sketch at the end of this comment).
5. Doesn’t the inner alignment problem resolve the Agent-Like Structure Problem? Let me explain. Take a human, e.g. me. I have a big, changing brain. Parts of my brain can be said to want different things. That’s an instance of the inner alignment problem. And that’s a reason why having my goals completely entangled with all other parts of my brain could be dangerous (in such a case it could be easier for any minor misalignment to blow up and overwrite my entire personality).
As I understand, the arguments from here would at least partially solve the problem, right? If they were formalized.
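Here's the toy illustration of what I mean in question 4 (my own sketch with made-up functions, nothing from the linked posts): the same heuristics can act as a crude world model (predicting the next state) and as the search process (greedily ranking actions), which is why I'm unsure the two parts must be physically separate.

```python
# Toy sketch (hypothetical example): one set of heuristics serves as both
# a crude world model and a search procedure over a 1-D "world".

def heuristic_next_state(state: int, action: int) -> int:
    """Crude 'world model': predict the next state from a state and an action."""
    return state + action  # assume actions just shift the state

def heuristic_score(state: int, goal: int) -> float:
    """Crude evaluation: how close a state is to the goal."""
    return -abs(goal - state)

def greedy_search(state: int, goal: int, actions=(-1, 0, 1), steps: int = 10):
    """'Search process': pick actions by running the same heuristics forward."""
    plan = []
    for _ in range(steps):
        # The 'world model' and the 'search' use the exact same machinery here.
        best = max(actions, key=lambda a: heuristic_score(heuristic_next_state(state, a), goal))
        plan.append(best)
        state = heuristic_next_state(state, best)
    return plan, state

if __name__ == "__main__":
    plan, final_state = greedy_search(state=0, goal=5)
    print(plan, final_state)  # the same heuristics did both prediction and search
```

In this toy, the only "problem specification" is the goal argument, and it lives inside the same machinery too.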
I have a couple of silly, absurd questions related to mesa-optimizers and mesa-controllers. I’m asking them to get a fresh look on the problem of inner alignment. I want to get a better grip on what basic properties of a model make it safe.
Question 1. How do we know that Quantum Mechanics theory is not plotting to kill humanity?
It’s a model, so it could be unsafe just like an AI.
QM is not an agent, but its predictions strongly affect humanity. Oracles can be dangerous.
QM is highly interpretable, so we can check that it’s not doing internal search. Or can we? Maybe it does search in some implicit way? Eliezer brought up this possibility: if you prohibit an AI from modeling its programmers’ psychology, the AI might start modeling something seemingly irrelevant which is actually equivalent to modeling the programmers’ psychology.
Maybe the AI reasons about certain very complicated properties of the material object on the pedestal… in fact, these properties are so complicated that they turn out to contain implicit models of User2’s psychology
Even if QM doesn’t do search in any way… maybe it was still optimized to steer humanity towards disaster?
Or maybe QM is “grounded” in some special way (e.g. it’s easy to split into parts and verify that each part is correct), so we’re very confident that it does physics and only physics?
Question 2. Crazier version of the previous question: how do we know that Peano arithmetic isn’t plotting to destroy humanity? How do we know that the game of chess isn’t plotting to end humanity?
Maybe Peano arithmetic contains theorems such that trying to prove them steers the real world towards disaster. How can we know, and when do we care?
Question 3. Imagine you came up with a plan to achieve your goals. You did it yourself. How do you know that this plan is not optimizing for your ruin?
Humans do go insane and fall into addictions. But not always. So why are our thoughts relatively safe for us? Why doesn’t every new thought or experience turn into an addiction which wipes out all of our previous personality?
Question 4. You’re the Telepath. You can read the mind of the Killer. The Killer can reason about some things which aren’t comprehensible to you, but otherwise your cognition is very similar. Can you always tell if the Killer is planning to kill you?
Here are some thoughts the Killer might think:
1. “I need to do <something incomprehensible> so the Telepath dies.”
2. “I need to get the Telepath to eat this food with <something incomprehensible> in it.”
3. “I need to do <something incomprehensible> without any comprehensible reason.”
With 1 we can understand the outcome and that’s all that matters. With 2 we can still tell that something dodgy is going on. Even in 3 we see that the Killer tries to make his reasoning illegible. Maybe the Killer can never deceive us if the incomprehensible concepts he’s thinking about are “embedded” into the comprehensible concepts?
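A toy way to picture the "embedded" hope (my own sketch, with made-up thought structures): if every incomprehensible concept has to sit inside a comprehensible frame, then a simple check over the comprehensible parts already flags all three patterns.

```python
# Toy sketch (hypothetical structures, my own illustration): each "thought" has a
# comprehensible frame plus opaque slots; the Telepath only reads the frame.

OPAQUE = "<something incomprehensible>"

thoughts = [
    {"action": OPAQUE, "stated_outcome": "the Telepath dies"},   # pattern 1
    {"action": "feed the Telepath", "payload": OPAQUE},          # pattern 2
    {"action": OPAQUE, "stated_reason": None},                   # pattern 3
]

def looks_dangerous(thought: dict) -> bool:
    """Flag a thought using only its comprehensible frame."""
    if thought.get("stated_outcome") == "the Telepath dies":
        return True   # pattern 1: the outcome itself is legible and bad
    if thought.get("payload") == OPAQUE and "Telepath" in thought.get("action", ""):
        return True   # pattern 2: opaque payload aimed at the Telepath
    if thought.get("action") == OPAQUE and thought.get("stated_reason") is None:
        return True   # pattern 3: opaque action with no legible reason
    return False

print([looks_dangerous(t) for t in thoughts])  # [True, True, True]
```

Of course, the hard part is whether the Killer's real thoughts must expose such a frame at all; that's the "embedded" assumption.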
Yes, we’re able to tell if the AI optimizes through a specific class of concepts. In most/all sections of the post I’m assuming the AI generates concepts in a special language (i.e. it’s not just a trained neural network), a language which allows measuring the complexity of concepts. The claim is that if you’re optimizing through concepts of certain complexity, then you can’t fulfill a task in a “weird” way. If the claim is true and the AI doesn’t think in arbitrary languages, then it’s supposed to be impossible to create a harmful Doppelganger.
Clarification: only the interpretability section deals with inner alignment. The claims of the previous sections are not supposed to follow from the interpretability section.
Yes. The special language is supposed to have the property that A_k can automatically learn if A_{k+1} plans good, bad, or unnecessary actions. A_n can’t be arbitrarily smarter than humans, but it’s a general intelligence which doesn’t imitate humans and can know stuff humans don’t know.
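A minimal sketch of the kind of verification chain this implies (all names and checks here are hypothetical stand-ins, not the actual special language): each model A_k is only trusted to check its immediate successor A_{k+1}, and humans only need to manually verify the simplest model.

```python
# Minimal sketch (hypothetical): a chain of models A_1 ... A_n where each model
# is only asked to verify the plans of the next, slightly more complex model.

from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Model:
    name: str
    plan: str                                   # the action the model proposes
    verify_next: Callable[["Model"], bool]      # A_k's check of A_{k+1}'s plan

def chain_is_trusted(models: List[Model], human_check: Callable[[Model], bool]) -> bool:
    """Trust the whole chain iff humans verify A_1 and each A_k verifies A_{k+1}."""
    if not human_check(models[0]):              # humans manually check the simplest model
        return False
    for weaker, stronger in zip(models, models[1:]):
        if not weaker.verify_next(stronger):    # A_k checks A_{k+1}
            return False
    return True

# Stand-in check: here "verification" is just scanning the plan text.
def benign(m: Model) -> bool:
    return "harm humans" not in m.plan

models = [
    Model("A_1", "sort the warehouse inventory", benign),
    Model("A_2", "optimize the delivery routes", benign),
    Model("A_3", "redesign the logistics network", benign),
]
print(chain_is_trusted(models, human_check=benign))  # True for this toy chain
```

The real proposal needs the verification step to be automatic and provably sound, which is exactly what the special language is supposed to provide; the sketch only shows the inductive structure.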