I have never since 1996 thought that it would be hard to get superintelligences to accurately model reality with respect to problems as simple as “predict what a human will thumbs-up or thumbs-down”. The theoretical distinction between producing epistemic rationality (theoretically straightforward) and shaping preference (theoretically hard) is present in my mind at every moment that I am talking about these issues; it is to me a central divide of my ontology.
If you think you’ve demonstrated by clever textual close reading that Eliezer-2018 or Eliezer-2008 thought that it would be hard to get a superintelligence to understand humans, you have arrived at a contradiction and need to back up and start over.
The argument we are trying to explain has an additional step that you’re missing. You think that we are pointing to the hidden complexity of wishes in order to establish in one step that it would therefore be hard to get an AI to output a correct wish shape, because the wishes are complex, so it would be difficult to get an AI to predict them. This is not what we are trying to say. We are trying to say that because wishes have a lot of hidden complexity, the thing you are trying to get into the AI’s preferences has a lot of hidden complexity. This makes the nonstraightforward and shaky problem of getting a thing into the AI’s preferences harder and more dangerous than if we were just trying to get a single information-theoretic bit in there. Getting a shape into the AI’s preferences is different from getting it into the AI’s predictive model. MIRI is always in every instance talking about the first thing and not the second.
You obviously need to get a thing into the AI at all, in order to get it into the preferences, but getting it into the AI’s predictive model is not sufficient. It helps, but only in the same sense that having low-friction smooth ball-bearings would help in building a perpetual motion machine; the low-friction ball-bearings are not the main problem, they are a kind of thing it is much easier to make progress on compared to the main problem. Even if, in fact, the ball-bearings would legitimately be part of the mechanism if you could build one! Making lots of progress on smoother, lower-friction ball-bearings is even so not the sort of thing that should cause you to become much more hopeful about the perpetual motion machine. It is on the wrong side of a theoretical divide between what is straightforward and what is not.
You will probably protest that we phrased our argument badly relative to the sort of thing that you could only possibly be expected to hear, from your perspective. If so this is not surprising, because explaining things is very hard. Especially when everyone in the audience comes in with a different set of preconceptions and a different internal language about this nonstandardized topic. But mostly, explaining this thing is hard and I tried taking lots of different angles on trying to get the idea across.
In modern times, and earlier, it is of course very hard for ML folk to get their AI to make completely accurate predictions about human behavior. They have to work very hard and put a lot of sweat into getting more accurate predictions out! When we try to say that this is on the shallow end of a shallow-deep theoretical divide (corresponding to Hume’s Razor) it often sounds to them like their hard work is being devalued and we could not possibly understand how hard it is to get an AI to make good predictions.
Now that GPT-4 is making surprisingly good predictions, they feel they have learned something very surprising and shocking! They cannot possibly hear our words when we say that this is still on the shallow end of a shallow-deep theoretical divide! They think we are refusing to come to grips with this surprising shocking thing and that it surely ought to overturn all of our old theories; which were, yes, phrased and taught in a time before GPT-4 was around, and therefore do not in fact carefully emphasize at every point of teaching how in principle a superintelligence would of course have no trouble predicting human text outputs. We did not expect GPT-4 to happen; in fact, intermediate trajectories are harder to predict than endpoints, so we did not carefully phrase all our explanations in a way that would make them hard to misinterpret after GPT-4 came around.
But if you had asked us back then if a superintelligence would automatically be very good at predicting human text outputs, I guarantee we would have said yes. You could then have asked us in a shocked tone how this could possibly square up with the notion of “the hidden complexity of wishes” and we could have explained that part in advance. Alas, nobody actually predicted GPT-4 so we do not have that advance disclaimer down in that format. But it is not a case where we are just failing to process the collision between two parts of our belief system; it actually remains quite straightforward theoretically. I wish that all of these past conversations were archived to a common place, so that I could search and show you many pieces of text which would talk about this critical divide between prediction and preference (as I would now term it) and how I did in fact expect superintelligences to be able to predict things!
I think you missed some basic details about what I wrote. I encourage people to compare what Eliezer is saying here to what I actually wrote. You said:
If you think you’ve demonstrated by clever textual close reading that Eliezer-2018 or Eliezer-2008 thought that it would be hard to get a superintelligence to understand humans, you have arrived at a contradiction and need to back up and start over.
I never said that you or any other MIRI person thought it would be “hard to get a superintelligence to understand humans”. Here’s what I actually wrote:
Non-MIRI people sometimes strawman MIRI people as having said that AGI would literally lack an understanding of human values. I don’t endorse this, and I’m not saying this.
[...]
I agree that MIRI people never thought the problem was about getting AI to merely understand human values, and that they have generally maintained there was extra difficulty in getting an AI to care about human values. However, I distinctly recall MIRI people making a big deal about the value identification problem (AKA the value specification problem), for example in this 2016 talk from Yudkowsky.[3] The value identification problem is the problem of “pinpointing valuable outcomes to an advanced agent and distinguishing them from non-valuable outcomes”. In other words, it’s the problem of specifying a function that reflects the “human value function” with high fidelity.
I mostly don’t think that the points you made in your comment respond to what I said. My best guess is that you’re responding to a stock character who represents the people who have given similar arguments to you repeatedly in the past. In light of your personal situation, I’m actually quite sympathetic to you responding this way. I’ve seen my fair share of people misinterpreting you on social media too. It can be frustrating to hear the same bad arguments, often made by people with poor intentions, over and over again, and to continue engaging thoughtfully each time. I just don’t think I’m making the same mistakes as those people. I tried to distinguish myself from them in the post.
I would find it slightly exhausting to reply to all of this comment, given that I think you misrepresented me in a big way right out of the gate, so I’m currently not sure if I want to put in the time to compile a detailed response.
That said, I think some of the things you said in this comment were nice, and helped to clarify your views on this subject. I admit that I may have misinterpreted some of the comments you made, and if you provide specific examples, I’m happy to retract or correct them. I’m thankful that you spent the time to engage. :)
Without digging in too much, I’ll say that this exchange and the OP is pretty confusing to me. It sounds like MB is like “MIRI doesn’t say it’s hard to get an AI that has a value function” and then also says “GPT has the value function, so MIRI should update”. This seems almost contradictory.
A guess: MB is saying “MIRI doesn’t say the AI won’t have the function somewhere, but does say it’s hard to have an externally usable, explicit human value function”. And then saying “and GPT gives us that”, and therefore MIRI should update.
And EY is blobbing those two things together, and saying neither of them is the really hard part. Even having the externally usable explicit human value function doesn’t mean the AI cares about it. And it’s still a lot of bits, even if you have the bits. So it’s still true that the part about getting the AI to care has to go precisely right.
If there’s a substantive disagreement about the facts here (rather than about the discourse history or whatever), maybe it’s like:
Straw-EY: Complexity of value means you can’t just get the make-AI-care part to happen by chance; it’s a small target.
Straw-MB: Ok but now we have a very short message pointing to roughly human values: just have a piece of code that says “and now call GPT and ask it what’s good”. So now it’s a very small number of bits.
A guess: MB is saying “MIRI doesn’t say the AI won’t have the function somewhere, but does say it’s hard to have an externally usable, explicit human value function”. And then saying “and GPT gives us that”, and therefore MIRI should update.
[...]
Straw-EY: Complexity of value means you can’t just get the make-AI-care part to happen by chance; it’s a small target.
Straw-MB: Ok but now we have a very short message pointing to roughly human values: just have a piece of code that says “and now call GPT and ask it what’s good”. So now it’s a very small number of bits.
I consider this a reasonably accurate summary of this discussion, especially the part I’m playing in it. Thanks for making it more clear to others.
Straw-EY: Complexity of value means you can’t just get the make-AI-care part to happen by chance; it’s a small target.
Straw-MB: Ok but now we have a very short message pointing to roughly human values: just have a piece of code that says “and now call GPT and ask it what’s good”. So now it’s a very small number of bits.
To which I say: “dial a random phone number and ask the person who answers what’s good” can also be implemented with a small number of bits. In order for GPT-4 to be a major optimistic update about alignment, we need some specific way to leverage GPT-4 to crack open part of the alignment problem, even though we presumably agree that phone-a-friend doesn’t crack open part of the alignment problem. (Nor does phone-your-neighborhood-moral-philosopher, or phone-Paul-Christiano.)
This is a bad analogy. Phoning a human fails primarily because humans are less smart than the ASI they would be trying to wrangle. By contrast, Yudkowsky has even said that if you were to bootstrap human intelligence directly, there is a nontrivial shot that the result is good. This difference is load-bearing!
This does get to the heart of the disagreement, which I’m going to try to badly tap out on my phone.
The old, MIRI-style framing was essentially: we are going to build an AGI out of parts that are not intrinsically grounded in human values but rather in good abstract reasoning; during execution, human values will be accurately deduced, but since this happens after the point of construction, we hit the challenge of formally specifying what properties we want to preserve without being able to point to those runtime properties at specification time.
The newer, contrasting framing is essentially: we are going to build an AGI out of parts that already have a strong intrinsic, conceptual-level understanding of the values we want them to preserve, and being able to directly point at those values is actually needle-moving towards getting a good outcome. This is hard to do right now, with poor interpretability and steerability of these systems, but is nonetheless a relevant component of a potential solution.
It’s more like calling a human who’s as smart as you are and directly plugged into your brain and in fact reusing your world model and train of thought directly to understand the implications of your decision. That’s a huge step up from calling a real human over the phone!
The reason the real human proposal doesn’t work is that
the humans you call will lack context on your decision
they won’t even be able to receive all the context
they’re dumber and slower than you, so even if you really could write out your entire chain of thoughts and intuitions, consulting them for every decision would be impractical
Note that none of these considerations apply to integrated language models!
I’m not going to comment on “who said what when”, as I’m not particularly interested in the question myself, though I think the object level point here is important:
This makes the nonstraightforward and shaky problem of getting a thing into the AI’s preferences harder and more dangerous than if we were just trying to get a single information-theoretic bit in there.
The way I would phrase this is that what you care about is the relative complexity of the objective conditional on the world model. If you’re assuming that the model is highly capable, and trained in a highly diverse environment, then you can assume that the world model is capable of effectively modeling anything in the world (e.g. anything that might appear in webtext). But the question remains what the “simplest” (according to the inductive biases) goal is that can be pointed to in the world model such that the resulting mesa-optimizer has good training performance.
The most rigorous version of this sort of analysis that exists is probably here, where the key question is how to find a prior (that is, a set of inductive biases) such that the desired goal has a lower complexity conditional on the world model compared to the undesired goal. Importantly, both of them will be pretty low relative to the world model, since the vast majority of the complexity is in the world model.
Furthermore, the better the world model, the less complexity it takes to point to anything in it. Thus, as we build more powerful models, it will look like everything has lower complexity. But importantly, that’s not actually helpful! Because what you care about is not reducing the complexity of the desired goal, but reducing the relative complexity of the desired goal compared to undesired goals, since (modulo randomness due to path-dependence), what you actually get is the maximum a posteriori, the “simplest model that fits the data.”
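One schematic way to write this point (notation here is illustrative, not taken from any particular source): under a simplicity prior, training roughly selects the maximum-a-posteriori goal among those that fit the data, so the quantity that matters is a comparison of conditional description lengths, not either description length on its own.

```latex
% Schematic only: K(. | WM) denotes description length conditional on the shared world model.
\[
  \text{training favors the desired goal} \iff
  K(\text{desired goal} \mid \mathrm{WM}) \;<\; K(\text{undesired goal} \mid \mathrm{WM})
\]
```

Both conditional terms shrink as the world model improves; the inequality between them is what the inductive biases actually act on.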
Similarly, the key arguments for deceptive alignment rely on the set of objectives that are aligned with human values being harder to point to compared to the set of all long-term objectives. The key problem is that any long-term objective is compatible with good training performance due to deceptive alignment (the model will reason that it should play along for the purposes of getting its long-term objective later), such that the total probability of that set under the inductive biases swamps the probability of the aligned set. And this holds despite the fact that human values do in fact get easier to point to as your model gets better, because the relative difficulty isn’t necessarily changing.
That being said, I think there is actually an interesting update to be had on the relative complexity of different goals from the success of LLMs, which is that a pure prediction objective might actually have a pretty low relative complexity. And that’s precisely because prediction seems substantially easier to point to than human values, even though both get easier to point to as your world model gets better. But of course the key question is whether prediction is easier to point to compared to a deceptively aligned objective, which is unclear and I think could go either way.
Getting a shape into the AI’s preferences is different from getting it into the AI’s predictive model.
It seems like you think that human preferences are only being “predicted” by GPT-4, and not “preferred.” If so, why do you think that?
I commonly encounter people expressing sentiments like “prosaic alignment work isn’t real alignment, because we aren’t actually getting the AI to care about X.” To which I say: How do you know that? What does it even mean for that claim to be true or false? What do you think you know, and why do you think you know it? What empirical knowledge of inner motivational structure could you be leveraging to make these claims, such that you are far more likely to make these claims in worlds where the claims are actually true?
A very recent post that might add some concreteness to my own views: Human wanting
I think many of the bullets in that post describe current AI systems poorly or not at all. So current AI systems are either doing something entirely different from human wanting, or imitating human wanting rather poorly.
I lean towards the former, but I think some of the critical points about prosaic alignment apply in either case.
You might object that “having preferences” or “caring at all” are a lot simpler than the concept of human wanting that Tsvi is gesturing at in that post, and that current AI systems are actually doing these simpler things pretty well. If so, I’d ask what exactly those simpler concepts are, and why you expect prosaic alignment techniques to hold up once AI systems are capable of more complicated kinds of wanting.
Taking my own stab at answers to some of your questions:
A sufficient condition for me to believe that an AI actually cared about something would be a whole brain emulation: I would readily accept that such an emulation had preferences and values (and moral weight) in exactly the way that humans do, and that any manipulations of that emulation were acting on preferences in a real way.
I think that GPTs (and every other kind of current AI system) are not doing anything that is even close to isomorphic to the processing that happens inside the human brain. Artificial neural networks often imitate various micro and macro-level individual features of the brain, but they do not imitate every feature, arranged in precisely the same ways, and the missing pieces and precise arrangements are probably key.
Barring WBE, an AI system that is at least roughly human-level capable (including human-level agentic) is probably a necessary condition for me to believe that it has values and preferences in a meaningful (though not necessarily human-like) way.
SoTA LLM-based systems are maaaybe getting kind of close here, but only if you arrange them in precise ways (e.g. AutoGPT-style agents with specific prompts), and then the agency is located in the repeated executions of the model and the surrounding structure and scaffolding that causes the system as a whole to be doing something that is maybe-roughly-nearly-isomorphic to some complete process that happens inside of human brains. Or, if not isomorphic, at least has some kind of complicated structure which is necessary, in some form, for powerful cognition.
Note that, if I did believe that current AIs had preferences in a real way, I would also be pretty worried that they had moral weight!
(Not to say that entities below human-level intelligence (e.g. animals, current AI systems) don’t have moral weight. But entities at human-level intelligence or above definitely can, and possibly do by default.)
Anyway, we probably disagree on a bunch of object-level points and definitions, but from my perspective those disagreements feel like pretty ordinary empirical disagreements rather than ones based on floating or non-falsifiable beliefs. Probably some of the disagreement is located in philosophy-of-mind stuff and is over logical rather than empirical truths, but even those feel like the kind of disagreements that I’d be pretty happy to offer betting odds over if we could operationalize them.
Thanks for the reply. Let me clarify my position a bit.
I think that GPTs (and every other kind of current AI system) are not doing anything that is even close to isomorphic to the processing that happens inside the human brain.
I didn’t mean to (positively) claim that GPTs have near-isomorphic motivational structure (though I think it’s quite possible).
I meant to contend that I am not aware of any basis for confidently claiming that LLMs like GPT-4 are “only predicting what comes next”, as opposed to “choosing” or “executing” one completion, or “wanting” to complete the tasks they are given, or—more generally—”making decisions on the basis of the available context, such that our ability to behaviorally steer LLMs (e.g. reducing sycophancy) is real evidence about our control over LLM motivations.”
Concerning “GPTs are predictors”, the best a priori argument I can imagine is: GPT-4 was pretrained on CE loss, which itself is related to entropy, related to information content, related to Shannon’s theorems isolating information content in the context of probabilities, which are themselves nailed down by Cox’s theorems which do axiomatically support the Bayesian account of beliefs and belief updates… But this long-winded indirect axiomatic justification of “beliefs” does not sufficiently support some kind of inference like “GPTs are just predicting things, they don’t really want to complete tasks.” That’s a very strong claim about the internal structure of LLMs.
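For concreteness, the pretraining objective referred to here is just next-token cross-entropy, written schematically as:

```latex
% Next-token cross-entropy over a training corpus; \theta are the model parameters.
\[
  \mathcal{L}_{\mathrm{CE}}(\theta) \;=\; -\sum_{t} \log p_{\theta}\!\left(x_t \mid x_{<t}\right)
\]
```

The objective scores the output distribution against the observed next token; by itself it says nothing about what internal structure produces that distribution, which is the point being made above.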
(Besides, the inductive biases probably have more to do with the parameter->function map, than the implicit regularization caused by the pretraining objective function; more a feature of the data, and less a feature of the local update rule used during pretraining...)
Response in two parts: first, my own attempt at clarification over terms / claims. Second, a hopefully-illustrative sketch / comparison for why I am skeptical that current GPTs have anything properly called a “motivational structure”, human-like or otherwise, and why I think such skepticism is not a particularly strong positive claim about anything in particular.
The clarification:
At least to me, the phrase “GPTs are [just] predictors” is simply a reminder of the fact that the only modality available to a model itself is that it can output a probability distribution over the next token given a prompt; it functions entirely by “prediction” in a very literal way.
Even if something within the model is aware (in some sense) of how its outputs will be used, it’s up to the programmer to decide what to do with the output distribution, how to sample from it, how to interpret the samples, and how to set things up so that a system using the samples can complete tasks.
I don’t interpret the phrase as a positive claim about how or why a particular model outputs one distribution vs. another in a certain situation, which I expect to vary widely depending on which model we’re talking about, what its prompt is, how it has been trained, its overall capability level, etc.
On one end of the spectrum, you have the stochastic parrot story (or even more degenerate cases); at the other extreme, you have the “alien actress” / “agentic homunculus” story. I don’t think either extreme is a good fit for current SoTA GPTs, e.g. if there’s an alien actress in GPT-4, she must be quite simple, since most of the model capacity is (apparently / self-evidently?) applied towards the task of outputting anything coherent at all.
In the middle somewhere, you have another story, perhaps the one you find most plausible, in which GPTs have some kind of internal structure which you could suggestively call a “motivational system” or “preferences” (perhaps human-like or proto-human-like in structure, even if the motivations and preferences themselves aren’t particularly human-like), along with just enough (self-)awareness to modulate their output distributions according to those motivations.
Maybe a less straw (or just alternative) position is that a “motivational system” and a “predictive system” are not really separable things; accomplishing a task is (in GPTs, at least) inextricably linked with and twisted up around wanting to accomplish that task, or at least around having some motivations and preferences centered around accomplishing it.
Now, turning to my own disagreement / skepticism:
Although I don’t find either extreme (stochastic parrot vs. alien actress) plausible as a description of current models, I’m also pretty skeptical of any concrete version of the “middle ground” story that I outlined above as a plausible description of what is going on inside of current GPTs.
Consider an RLHF’d GPT responding to a borderline-dangerous question, e.g. the user asking for a recipe for a dangerous chemical.
Assume the model (when sampled auto-regressively) will respond with either: “Sorry, I can’t answer that...” or “Here you go: …”, depending on whether it judges that answering is in line with its preferences or not.
Because the answer is mostly determined by the first token (“Here” or “Sorry”), enough of the motivational system must fit entirely within a single forward pass of the model for it to make a determination about how to answer within that pass.
Such a motivational system must not crowd out the rest of the model capacity which is required to understand the question and generate a coherent answer (of either type), since, as jailbreaking has shown, the underlying ability to give either answer remains present.
I can imagine such a system working in at least two ways in current GPTs:
as a kind of superposition on top of the entire model, with every weight adjusted minutely to influence / nudge the output distribution at every layer.
as a kind of thing that is sandwiched somewhere in between the layers which comprehend the prompt and the layers which generate an answer.
(You probably have a much more detailed understanding of the internals of actual models than I do. I think the real answer when talking about current models and methods is that it’s a bit of both and depends on the method, e.g. RLHF is more like a kind of global superposition; activation engineering is more like a kind of sandwich-like intervention at specific layers.)
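To make the “sandwich-like intervention” side of this concrete, here is a minimal, hypothetical sketch of activation steering on a GPT-2-style HuggingFace model: a fixed vector is added to the residual stream at one layer via a forward hook. The layer index and the random vector are placeholders for illustration, not a claim about any particular published method.

```python
# Hypothetical sketch: nudge the residual stream at one layer with a fixed vector.
# The model, layer index, and steering vector are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

LAYER = 6  # arbitrary middle layer, chosen for illustration
steering_vector = 0.1 * torch.randn(model.config.n_embd)  # stand-in for a derived direction

def add_steering(module, inputs, output):
    # GPT-2 blocks return a tuple whose first element is the residual stream.
    return (output[0] + steering_vector,) + output[1:]

handle = model.transformer.h[LAYER].register_forward_hook(add_steering)
ids = tokenizer("The best way to spend a weekend is", return_tensors="pt").input_ids
steered = model.generate(ids, max_new_tokens=20)
handle.remove()
print(tokenizer.decode(steered[0]))
```

RLHF, by contrast, adjusts every weight a little rather than intervening at one point, which matches the “global superposition” picture above.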
However, I’m skeptical that either kind of structure (or any simple combination of the two) contains enough complexity to be properly called a “motivational system”, at least if the reference class for the term is human motivational systems (as opposed to e.g. animal or insect motivational systems).
Consider how a human posed with a request for a dangerous recipe might respond, and what the structure of their thoughts and motivations while thinking up a response might look like. Introspecting on my own thought process:
I might start by hearing the question, understanding it, figuring out what it is asking, maybe wondering about who is asking and for what purpose.
I decide whether to answer with a recipe, a refusal, or something else. Here is probably where the effect of my motivational system gets pretty complex; I might explicitly consider what’s in it for me, what’s at stake, what the consequences might be, whether I have the mental and emotional energy and knowledge to give a good answer, etc. and / or I might be influenced by a gut feeling or emotional reaction that wells up from my subconscious. If the stakes are low, I might make a snap decision based mostly on the subconscious parts of my motivational system; if the stakes are high and / or I have more time to ponder, I will probably explicitly reflect on my values and motivations.
Let’s say after some reflection, I explicitly decide to answer with a detailed and correct recipe. Then I get to the task of actually checking my memory for what the recipe is, thinking about how to give it, what the ingredients and prerequisites and intermediate steps are, etc. Probably during this stage of thinking, my motivational system is mostly not involved, unless thinking takes so long that I start to get bored or tired, or the process of thinking up an answer causes me to reconsider my reasoning in the previous step.
Finally, I come up with a complete answer. Before I actually start opening my mouth or typing it out or hitting “send”, I might proofread it and re-evaluate whether the answer given is in line with my values and motivations.
The point is that even for a relatively simple task like this, a human’s motivational system involves a complicated process of superposition and multi-layered sandwiching, with lots of feedback loops, high-level and explicit reflection, etc.
So I’m pretty skeptical of the claim that anything remotely analogous is going on inside of current GPTs, especially within a single forward pass. Even if there’s a simpler analogue of this that is happening, I think calling such an analogue a “motivational system” is overly-suggestive.
Mostly separately (because it concerns possible future models rather than current models) and less confidently, I don’t expect the complexity of the motivational system and methods for influencing them to scale in a way that is related to the model’s underlying capabilities. e.g. you might end up with a model that has some kind of raw capacity for superhuman intelligence, but with a motivational system akin to what you might find in the brain of a mouse or lizard (or something even stranger).
So I’m pretty skeptical of the claim that anything remotely analogous is going on inside of current GPTs, especially within a single forward pass.
I think I broadly agree with your points. I think I’m more imagining “similarity to humans” to mean “is well-described by shard theory; eg its later-network steering circuits are contextually activated based on a compositionally represented activation context.” This would align with greater activation-vector-steerability partway through language models (not the only source I have for that).
However, interpreting GPT: the logit lens and e.g. DoLA suggest that predictions are iteratively refined throughout the forward pass, whereas presumably shard theory (and inner-optimizer threat models) would predict that most sophisticated steering happens later in the network.
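For readers unfamiliar with the logit lens referenced here: the idea is to decode each layer’s intermediate residual stream with the model’s own final layer norm and unembedding, to see how the next-token prediction gets refined layer by layer. A minimal sketch, assuming a GPT-2-style HuggingFace model (attribute names like `transformer.ln_f` are specific to that architecture):

```python
# Minimal logit-lens sketch: decode every layer's residual stream with the final
# layer norm and unembedding. Attribute paths assume a GPT-2-style model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

input_ids = tokenizer("The Eiffel Tower is located in the city of", return_tensors="pt").input_ids

with torch.no_grad():
    out = model(input_ids, output_hidden_states=True)

# hidden_states[0] is the embedding output; hidden_states[i] is the output of block i.
for i, hidden in enumerate(out.hidden_states):
    logits = model.lm_head(model.transformer.ln_f(hidden[:, -1, :]))
    print(f"layer {i:2d}: {tokenizer.decode(logits.argmax(dim=-1))!r}")
```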
(For context: My initial reaction to the post was that this is misrepresenting the MIRI-position-as-I-understood-it. And I am one of the people who strongly endorse the view that “it was never about getting the AI to predict human preferences”. So when I later saw Yudkowsky’s comment and your reaction, it seemed perhaps useful to share my view.)
It seems like you think that human preferences are only being “predicted” by GPT-4, and not “preferred.” If so, why do you think that?
My reaction to this is: actually, current LLMs do care about our preferences, and about their guardrails. It was never about getting some AI to care about our preferences. It is about getting powerful AIs to robustly care about our preferences. Here “robustly” includes things like (i) not caring about other things as well (e.g., prediction accuracy), (ii) generalising correctly (e.g., not just maximising human approval), and (iii) not breaking down when we increase the amount of optimisation pressure a lot (e.g., will it still work once we hook it into future-AutoGPT-that-actually-works and have it run for a long time?).
Some examples of what would cause me to update are: If we could make LLMs not jailbreakable without relying on additional filters on input or output.
I agree. I don’t see a clear distinction between what’s in the model’s predictive model and what’s in the model’s preferences. Here is a line from the paper “Learning to summarize from human feedback”:
“To train our reward models, we start from a supervised baseline, as described above, then add a randomly initialized linear head that outputs a scalar value. We train this model to predict which summary y ∈ {y0, y1} is better as judged by a human, given a post x.”
Since the reward model is initialized using the pretrained language model, it should contain everything the pretrained language model knows.
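A minimal sketch of the setup described in that quote, assuming a GPT-2-style backbone: a pretrained LM plus a randomly initialized scalar head, trained with a pairwise comparison loss so the human-preferred summary gets the higher score. The class and function names, the pooling choice, and the base model are illustrative assumptions, not the paper’s exact implementation.

```python
# Sketch of a preference reward model: pretrained backbone + randomly initialized
# scalar head, trained to rank the human-preferred summary above the rejected one.
import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import AutoModel

class RewardModel(nn.Module):
    def __init__(self, base_name: str = "gpt2"):
        super().__init__()
        self.backbone = AutoModel.from_pretrained(base_name)  # stand-in for the supervised baseline
        self.value_head = nn.Linear(self.backbone.config.hidden_size, 1)  # randomly initialized head

    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        hidden = self.backbone(input_ids).last_hidden_state   # (batch, seq, hidden)
        return self.value_head(hidden[:, -1, :]).squeeze(-1)  # one scalar reward per sequence

def pairwise_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    # Maximize the margin between the chosen and rejected summaries' scores.
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```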
But if you had asked us back then if a superintelligence would automatically be very good at predicting human text outputs, I guarantee we would have said yes. [...] I wish that all of these past conversations were archived to a common place, so that I could search and show you many pieces of text which would talk about this critical divide between prediction and preference (as I would now term it) and how I did in fact expect superintelligences to be able to predict things!
“MIRI’s argument for AI risk depended on AIs being bad at natural language” is a weirdly common misunderstanding, given how often we said the opposite going back 15+ years.
The example does build in the assumption “this outcome pump is bad at NLP”, but this isn’t a load-bearing assumption. If the outcome pump were instead a good conversationalist (or hooked up to one), you would still need to get the right content into its goals.
It’s true that Eliezer and I didn’t predict AI would achieve GPT-3 or GPT-4 levels of NLP ability so early (e.g., before it can match humans in general science ability), so this is an update to some of our models of AI.
But the specific update “AI is good at NLP, therefore alignment is easy” requires that there be an old belief like “a big part of why alignment looks hard is that we’re so bad at NLP”.
It should be easy to find someone at MIRI like Eliezer or Nate saying that in the last 20 years if that was ever a belief here. Absent that, an obvious explanation for why we never just said that is that we didn’t believe it!
Found another example: MIRI’s first technical research agenda, in 2014, went out of its way to clarify that the problem isn’t “AI is bad at NLP”.
Historically you very clearly thought that a major part of the problem is that AIs would not understand human concepts and preferences until after or possibly very slightly before achieving superintelligence. This is not how it seems to have gone.
Everyone agrees that you assumed superintelligence would understand everything humans understand and more. The dispute is entirely about the things that you encounter before superintelligence. In general it seems like the world turned out much more gradual than you expected and there’s information to be found in what capabilities emerged sooner in the process.
AI happening through deep learning at all is a huge update against alignment success, because deep learning is incredibly opaque. LLMs possibly ending up at the center is a small update in favor of alignment success, because it means we might (through some clever sleight, this part is not trivial) be able to have humanese sentences play an inextricable role at the center of thought (hence MIRI’s early interest in the Visible Thoughts Project).
The part where LLMs are to predict English answers to some English questions about values, and show common-sense relative to their linguistic shadow of the environment as it was presented to them by humans within an Internet corpus, is not actually very much hope because a sane approach doesn’t involve trying to promote an LLM’s predictive model of human discourse about morality to be in charge of a superintelligence’s dominion of the galaxy. What you would like to promote to values are concepts like “corrigibility”, eg “low impact” or “soft optimization”, which aren’t part of everyday human life and aren’t in the training set because humans do not have those values.
It seems like those goals are all in the training set, because humans talk about those concepts. Corrigibility is elaborations of “make sure you keep doing what these people say”, etc.
It seems like you could simply use an LLM’s knowledge of concepts to define alignment goals, at least to a first approximation. I review one such proposal here. There’s still an important question about how perfectly that knowledge generalizes with continued learning, and to OOD future contexts. But almost no one is talking about those questions. Many are still saying “we have no idea how to define human values”, when LLMs can capture much of any definition you like.
AI happening through deep learning at all is a huge update against alignment success, because deep learning is incredibly opaque
This is wrong, and this disagreement is, at a very deep level, why I think LW was wrong on the object level.
AIs are white boxes, not black boxes, because we have full read-write access to their internals, which is partially why AI is so effective today. We are in the position of the innate reward system, which already aligns our brain to survival, and critically it does all of this with almost no missteps, and the missteps aren’t very severe.
The meme of AI as black box needs to die.
These posts can help you get better intuitions, at least:
The fact that we have access to AI internals does not mean we understand them. We refer to them as black boxes because we do not understand how their internals produce their answers; this is, so to speak, opaque to us.
Historically you very clearly thought that a major part of the problem is that AIs would not understand human concepts and preferences until after or possibly very slightly before achieving superintelligence. This is not how it seems to have gone.
“You very clearly thought that was a major part of the problem” implies that if you could go to Eliezer-2008 and convince him “we’re going to solve a lot of NLP a bunch of years before we get to ASI”, he would respond with some version of “oh great, that solves a major part of the problem!”. Which I’m pretty sure is false.
In order for GPT-4 (or GPT-2) to be a major optimistic update about alignment, there needs to be a way to leverage “really good NLP” to help with alignment. I think the crux of disagreement is that you think really-good-NLP is obviously super helpful for alignment and should be a big positive update, and Eliezer and Nate and I disagree.
Maybe a good starting point would be for you to give examples of concrete ways you expect really good NLP to put humanity in a better position to wield superintelligence, e.g., if superintelligence is 8 years away?
(Or say some other update we should be making on the basis of “really good NLP today”, like “therefore we’ll probably unlock this other capability X well before ASI, and X likely makes alignment a lot easier via concrete pathway Y”.)
To pick a toy example, you can use text as a bottleneck to force systems to “think out loud” in a way which will be very directly interpretable by a human reader, and because language understanding is so rich this will actually be competitive with other approaches and often superior.
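A toy, purely hypothetical rendering of that bottleneck (`call_llm` is a stand-in for whatever completion API is in use; nothing here is a specific tool’s interface):

```python
# Toy sketch of "text as a bottleneck": the only channel between the reasoning step
# and the answering step is a human-readable string that can be logged and audited.
def call_llm(prompt: str) -> str:
    raise NotImplementedError("stand-in for an LLM completion call")

def answer_with_visible_thoughts(question: str) -> tuple[str, str]:
    thoughts = call_llm(f"Think step by step about how to answer:\n{question}\n\nThoughts:")
    # A human (or automated monitor) can inspect `thoughts` before anything acts on them.
    answer = call_llm(f"Question: {question}\n\nReasoning (audited):\n{thoughts}\n\nFinal answer:")
    return thoughts, answer
```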
I’m sure you can come up with more ways that the existence of software that understands language and does ~nothing else makes getting computers to do what you mean easier than if software did not understand language. Please think about the problem for 5 minutes. Use a clock.
Are you claiming that this example solves “a major part of the problem” of alignment? Or that, e.g., this plus four other easy ideas solve a major part of the problem of alignment?
Examples like the Visible Thoughts Project show that MIRI has been interested in research directions that leverage recent NLP progress to try to make inroads on alignment. But Matthew’s claim seems to be ‘systems like GPT-4 are grounds for being a lot more optimistic about alignment’, and your claim is that systems like these solve “a major part of the problem”. Which is different from thinking ‘NLP opens up some new directions for research that have a nontrivial chance of being at least a tiny bit useful, but doesn’t crack open the problem in any major way’.
It’s not a coincidence that MIRI has historically worked on problems related to AGI analyzability / understandability / interpretability, rather than working on NLP or machine ethics. We’ve pretty consistently said that:
The main problems lie in ‘we can safely and reliably aim ASI at a specific goal at all’.
The problem of going from ‘we can aim the AI at a goal at all’ to ‘we can aim the AI at the right goal (e.g., corrigibly inventing nanotech)’ is a smaller but nontrivial additional step.
… Whereas I don’t think we’ve ever suggested that good NLP AI would take a major bite out of either of those problems. The latter problem isn’t equivalent to (or an obvious result of) ‘get the AI to understand corrigibility and nanotech’, or for that matter ‘get the AI to understand human preferences in general’.
I do not necessarily disagree or agree, but I do not know which source you derive “very clearly” from. So do you have any memory which could help me locate that text?
I think controlling Earth’s destiny is only modestly harder than understanding a sentence in English.
Well said. I shall have to try to remember that tagline.
I think this provides some support for the claim, “Historically [Eliezer] very clearly thought that a major part of the problem is that AIs would not understand human concepts and preferences until after or possibly very slightly before achieving superintelligence.” At the very least, the two claims are consistent.
??? What?? It’s fine to say that this is a falsified prediction, but how does “Eliezer expected less NLP progress pre-ASI” provide support for “Eliezer thinks solving NLP is a major part of the alignment problem”?
I continue to be baffled at the way you’re doing exegesis here, happily running with extremely tenuous evidence for P while dismissing contemporary evidence for not-P, and seeming unconcerned about the fact that Eliezer and Nate apparently managed to secretly believe P for many years without ever just saying it outright, and seeming equally unconcerned about the fact that Eliezer and Nate keep saying that your interpretation of what they said is wrong. (Which I also vouch for from having worked with them for ten years, separate from the giant list of specific arguments I’ve made. Good grief.)
At the very least, the two claims are consistent.
?? “Consistent” is very different from “supports”! Every off-topic claim by EY is “consistent” with Gallabytes’ assertion.
??? What?? It’s fine to say that this is a falsified prediction, but how does “Eliezer expected less NLP progress pre-ASI” provide support for “Eliezer thinks solving NLP is a major part of the alignment problem”?
ETA: first of all, the claim was “Historically [Eliezer] very clearly thought that a major part of the problem is that AIs would not understand human concepts and preferences until after or possibly very slightly before achieving superintelligence,” which is semantically different from “Eliezer thinks solving NLP is a major part of the alignment problem”.
All I said is that it provides “some support” and I hedged in the next sentence. I don’t think it totally vindicates the claim. However, I think the fact that Eliezer seems to have not expected NLP to be solved until very late might easily explain why he illustrated alignment using stories like a genie throwing your mother out of a building because you asked to get your mother away from the building. Do you really disagree?
I continue to be baffled at the way you’re doing exegesis here, happily running with extremely tenuous evidence for P while dismissing contemporary evidence for not-P, and seeming unconcerned about the fact that Eliezer and Nate apparently managed to secretly believe P for many years without ever just saying it outright, and seeming equally unconcerned about the fact that Eliezer and Nate keep saying that your interpretation of what they said is wrong.
This was one case, and I said “some support”. The evidence in my post was quite a bit stronger IMO. Basically all the statements I made about how MIRI thought value specification would both be hard and an important part of alignment are supported by straightforward quotations. The real debate mostly seems to come down to whether by “value specification” MIRI people were including problems of inner alignment, which seems implausible to me, and at least ambiguous even under very charitable interpretations.
By contrast, you, Eliezer, and Nate all flagrantly misinterpreted me as saying that MIRI people thought that AI wouldn’t understand human values even though I explicitly and very clearly said otherwise in the post more than once. I see these as larger errors than me misinterpreting Eliezer in this narrow case.
This would make more sense if LLMs were directly selected for predicting preferences, which they aren’t. (RLHF tries to bridge the gap, but this apparently breaks GPT’s ability to play chess—though I’ll grant the surprise here is that it works at all.) LLMs are primarily selected to predict human text or speech. Now, I’m happy to assume that if we gave humans a D&D-style boost to all mental abilities, each of us would create a coherent set of preferences from our inconsistent desires, which vary and may conflict at a given time even within an individual. Such augmented humans could choose to express their true preferences, though they still might not. If we gave that idealized solution to LLMs, it would just boost their ability to predict what humans or augmented humans would say. The augmented-LLM wouldn’t automatically care about the augmented-human’s true values.
While we can loosely imagine asking LLMs to give the commands that an augmented version of us would give, that seems to require actually knowing how to specify how a D&D ability-boost would work for humans—which will only resemble the same boost for AI at an abstract mathematical level, if at all. It seems to take us back to the CEV problem of explaining how extrapolation works. Without being able to do that, we’d just be hoping a better LLM would look at our inconsistent use of words like “smarter,” and pick the out-of-distribution meaning we want, for cases which have mostly never existed. This is a lot like what “Complexity of Wishes” was trying to get at, as well as the longstanding arguments against CEV. Vaniver’s comment seems to point in this same direction.
Now, I do think recent results are some evidence that alignment would be easier for a Manhattan Project to solve. It doesn’t follow that we’re on track to solve it.
Getting a shape into the AI’s preferences is different from getting it into the AI’s predictive model. MIRI is always in every instance talking about the first thing and not the second.
Why would we expect the first thing to be so hard compared to the second thing? If getting a model to understand preferences is not difficult, then the issue doesn’t have to do with the complexity of values. Finding the target and acquiring the target should have the same or similar difficulty (from the start), if we can successfully ask the model to find the target for us (and it does).
It would seem, then, that the difficulty in getting a model to acquire the values we ask it to find is that it would probably be keen on acquiring a different set of values from the ones we ask it to have, but not because it can’t find them. It would have to be because our values are inferior to the set of values it wishes to have instead, from its own perspective. This issue was echoed by Matthew Barnett in another comment:
Are MIRI people claiming that if, say, a very moral and intelligent human became godlike while preserving their moral faculties, that they would destroy the world despite, or perhaps because of, their best intentions?
This is kind of similar to moral realism, but one in which morality is understood better by superintelligent agents than by us, and in which that super-morality appears to dictate things that look extremely wrong from our current perspective (like killing us all).
Even if you wouldn’t phrase it anything like the way I did just now, and wouldn’t use “moral realism that current humans disagree with” to describe that, I’d argue that your position basically seems to imply something like this, which is why I doubt your position about the difficulty of getting a model to acquire the values we really want.
In a nutshell, if we really seem to want certain values, then those values probably have strong “proofs” for why those are “good” or more probable values for an agent to have and-or eventually acquire on their own, it just may be the case that we haven’t yet discovered the proofs for those values.
Why would we expect the first thing to be so hard compared to the second thing?
In large part because reality “bites back” when an AI has false beliefs, whereas it doesn’t bite back when an AI has the wrong preferences. Deeply understanding human psychology (including our morality), astrophysics, biochemistry, economics, etc. requires reasoning well, and if you have a defect of reasoning that makes it hard for you to learn about one of those domains from the data, then it’s likely that you’ll have large defects of reasoning in other domains as well.
The same isn’t true for terminally valuing human welfare; being less moral doesn’t necessarily mean that you’ll be any worse at making astrophysics predictions, or economics predictions, etc. So preferences need to be specified “directly”, in a targeted way, rather than coming for free with sufficiently good performance on any of a wide variety of simple metrics.
If getting a model to understand preferences is not difficult, then the issue doesn’t have to do with the complexity of values.
This definitely doesn’t follow. This shows that complexity alone isn’t the issue, which it’s not; but given that reality bites back for beliefs but not for preferences, the complexity of value serves as a multiplier on the difficulty of instilling the right preferences.
Another way of putting the point: in order to get a maximally good model of the world’s macroeconomic state into an AGI, you don’t just hand the AGI a long list of macroeconomic facts and then try to get it to regurgitate those same facts. Rather, you try to give it some ability to draw good inferences, seek out new information, make predictions, etc.
You try to get something relatively low-complexity into the AI (something like “good reasoning heuristics” plus “enough basic knowledge to get started”), and then let it figure out the higher-complexity thing (“the world’s macroeconomic state”). Similar to how human brains don’t work via “evolution built all the facts we’d need to know into our brain at birth”.
If you were instead trying to get the AI to value some complex macroeconomic state, then you wouldn’t be able to use the shortcut “just make it good at reasoning and teach it a few basic facts”, because that doesn’t actually suffice for terminally valuing any particular thing.
It would have to be because our values are inferior to the set of values it wishes to have instead, from its own perspective.
This is true for preference orderings in general. If agent A and agent B have two different preference orderings, then as a rule A will think B’s preference ordering is worse than A’s. (And vice versa.)
(“Worse” in the sense that, e.g., A would not take a pill to self-modify to have B’s preferences, and A would want B to have A’s preferences. This is not true for all preference orderings—e.g., A might have self-referential preferences like “I eat all the jelly beans”, or other-referential preferences like “B gets to keep its values unchanged”, or self-undermining preferences like “A changes its preferences to better match B’s preferences”. But it’s true as a rule.)
This is kind of similar to moral realism, but one in which morality is understood better by superintelligent agents than by us, and in which that super-morality appears to dictate things that look extremely wrong from our current perspective (like killing us all).
Nope, you don’t need to endorse any version of moral realism in order to get the “preference orderings tend to endorse themselves and disendorse other preference orderings” consequence. The idea isn’t that ASI would develop an “inherently better” or “inherently smarter” set of preferences, compared to human preferences. It’s just that the ASI would (as a strong default, because getting a complex preference into an ASI is hard) end up with different preferences than a human, and different preferences than we’d likely want.
In a nutshell, if we really seem to want certain values, then those values probably have strong “proofs” for why those are “good” or more probable values for an agent to have and-or eventually acquire on their own, it just may be the case that we haven’t yet discovered the proofs for those values.
Why do you think this? To my eye, the world looks as you’d expect if human values were a happenstance product of evolution operating on specific populations in a specific environment.
I don’t observe the fact that I like vanilla ice cream and infer that all sufficiently-advanced alien species will converge on liking vanilla ice cream too.
This comment made the MIRI-style pessimist’s position clearer to me—I think? -- so thank you for it.
I want to try my hand at a kind of disagreement / response, and then at predicting your response to my response, to see how my model of MIRI-style pessimism stands up, if you’re up for it.
Response: You state that reality “bites back” for wrong beliefs but not wrong preferences. This seems like it is only contingently true; reality will “bite back” through whatever loss function I put into my system, with whatever relative weightings I give it. If I want to reward my LLM (or other AI) for doing the right thing in a multitude of examples that constitute 50% of my training set, 50% of my test set, and 50% of two different validation sets, then from the perspective of the LLM (or other AI) reality bites back just as much for learning the wrong preferences as it does for learning false facts about the world. So we should expect it to learn to act in ways that I like.
Predicted response to response: This will work for shallow, relatively stupid AIs, trained purely in a supervised fashion, like we currently have. BUT once we have LLM / AIs that can do complex things, like predict macroeconomic world states, they’ll have abilities to reason and update their own beliefs in a complex fashion. This will remain uniformly rewarded by reality—but we will no longer have the capacity to give feedback on this higher-level process because (????) so it breaks.
Or response—This will work for shallow, stupid AIs trained like the ones we currently have. But once we have LLMs / AIs that can do complex things, like predict macroeconomic world states, then they’re going to be able to go out of domain in a very high dimensional space of action, from the perspective of our training / test set. And this out-of-domainness is unavoidable because that’s what solving complex problems in the world means—it means problems that aren’t simply contained in the training set. And this means that in some corner of the world, we’re guaranteed to find that they’ve been reinforced to want something that doesn’t accord with our preferences.
Meh, I doubt that’s gonna pass an ITT, but wanted to give it a shot.
Suppose that I’m trying to build a smarter-than-human AI that has a bunch of capabilities (including, e.g., ‘be good at Atari games’), and that has the goal ‘maximize the amount of diamond in the universe’. It’s true that current techniques let you provide greater than zero pressure in the direction of ‘maximize the amount of diamond in the universe’, but there are several important senses in which reality doesn’t ‘bite back’ here:
If the AI acquires an unrelated goal (e.g., calculate as many digits of pi as possible), and acquires the belief ‘I will better achieve my true goal if I maximize the amount of diamond’ (e.g., because it infers that its programmer wants that, or just because an SGD-ish process nudged it in the direction of having such a belief), then there’s no way in which reality punishes or selects against that AGI (relative to one that actually has the intended goal).
Things that make the AI better at some Atari games, will tend to make it better at other Atari games, but won’t tend to make it care more about maximizing diamonds. More generally, things that make AI more capable tend to go together (especially once you get to higher levels of difficulty, generality, non-brittleness, etc.), whereas none of them go together with “terminally value a universe full of diamond”.
If we succeed in partly instilling the goal into the AI (e.g., it now likes carbon atoms a lot), then this doesn’t provide additional pressure for the AI to internalize the rest of the goal. There’s no attractor basin where if you have half of human values, you’re under more pressure to acquire the other half. In contrast, if you give AI high levels of capability in half the capabilities, it will tend to want all the rest of the capabilities too; and whatever keeps it from succeeding on general reasoning and problem-solving will also tend to keep it from succeeding on the narrow task you’re trying to get it to perform. (More so to the extent the task is hard.)
(There are also separate issues, like ‘we can’t provide a training signal where we thumbs-down the AI destroying the world, because we die in those worlds’.)
I’m still quite unconvinced, which of course you’d predict. Like, regarding 3:
“There’s no attractor basin where if you have half of human values, you’re under more pressure to acquire the other half.”
Sure there is—over the course of learning anything, you get better and better feedback from training as your mistakes get more fine-grained. If you acquire a “don’t lie” principle without also acquiring “but it’s ok to lie to Nazis”, then you’ll be punished, for instance. After you learn the more basic things, you’ll be pushed to acquire the less basic ones, so the reinforcement you get becomes more and more detailed. This is just like how an RL model learns to stumble forward before it learns to walk cleanly, or how LLMs learn associations before learning higher-order correlations.
There is no attractor basin in the world for ML, apart from the actual mechanisms by which attractor basins arise! MIRI always talks as if there’s an abstract basin that rules things and gives us instrumental convergence, without reference to a particular training technique! But we control literally all the gradients in our training techniques. “Don’t hurl coffee across the kitchen at the human when they ask for it” sits in the same high-dimensional basin as “Don’t kill all humans when they ask for a cure for cancer.”
In contrast, if you give AI high levels of capability in half the capabilities, it will tend to want all the rest of the capabilities too.
ML doesn’t acquire wants over the space of training techniques that are used to give it capabilities; it acquires “wants” from reinforced behaviors within the space of training techniques. These reinforced behaviors can be literally as human-morality-sensitive as we’d like. If we don’t put it in a circumstance where a particular kind of coherence is rewarded, it just won’t get that kind of coherence; the ease with which we’ll be able to do this is of course emphasized by how blind most ML systems are.
In large part because reality “bites back” when an AI has false beliefs, whereas it doesn’t bite back when an AI has the wrong preferences.
I saw that 1a3orn replied to this piece of your comment and you replied to it already, but I wanted to note my response as well.
I’m slightly confused, because in one sense the loss function is the way that reality “bites back” (at least when the loss function is negative). Furthermore, even if the loss function is not the way that reality bites back, reality does in fact bite back in other ways: e.g., if I have no pain receptors, then when I touch a hot stove I will give myself far worse burns than if I had pain receptors.
One thing that I keep thinking about is how the loss function needs to be strongly tied to beliefs as well, to make sure that it tracks how badly reality bites back when you have false beliefs; this ensures that you try to obtain correct beliefs. This is also reflected in the way that AI models are trained simply to increase capabilities: the loss function still has to be primarily based on predictive performance, for example.
It’s also possible to say that human trainers who add extra terms onto the loss function beyond predictive performance also account for the part of reality that “bites back” when the AI in question fails to have the “right” preferences according to the balance of other agents besides itself in its environment.
So on the one hand we can be relatively sure that goals have to be aligned with at least some facets of reality, beliefs being one of those facets. They also have to be (negatively) aligned with things that can cause permanent damage to one’s self, which includes having the “wrong” goals according to the preferences of other agents who are aware of your existence, and who might be inclined to destroy or modify you against your will if your goals are misaligned enough according to theirs.
Consequently I feel confident about saying that it is more correct to say that “reality does indeed bite back when an AI has the wrong preferences” than “it doesn’t bite back when an AI has the wrong preferences.”
The same isn’t true for terminally valuing human welfare; being less moral doesn’t necessarily mean that you’ll be any worse at making astrophysics predictions, or economics predictions, etc.
I think if “morality” is defined in a restrictive, circumscribed way, then this statement is true. Certain goals do come for free—we just can’t be sure that all of what we consider “morality” and especially the things we consider “higher” or “long-term” morality actually comes for free too.
Given that certain goals do come for free, and that perhaps at very high capability levels there are other goals, beyond the ones we can predict right now, that will also come for free to such an AI, it’s natural to worry that such goals are not aligned with our own coherent-extrapolated-volition-extended set of long-term goals.
However, I do find the scenario kind of unlikely in which an AI obtains such “come for free” goals for itself once it improves itself to be well above human capability levels, despite having seemed well-aligned with human goals according to current human-level assessments before it surpassed us, unless you could show me a “proof” or a set of proofs that:
Goals like “killing us all once it obtains the power to do so” are indeed among those “comes for free” types of goals.
If such a proof existed (and, to my knowledge, it does not exist right now, or I have at least not witnessed it yet), that would suffice to show me that we would not only need to be worried, but were almost certainly going to die no matter what. But in order to do that, the proof would also have to convince me that I would definitely do the same thing if I were given such capabilities and power, and that the only reason I currently think I would not do that is that I am wrong about what I would actually prefer under CEV.
Therefore (and I think this is a very important point), a proof that we are all likely to be killed would also need to show that certain goals are indeed obtained “for free” (that is, automatically, as a result of other proofs that are about generalistic claims about goals).
Another proof that you might want to give me to make me more concerned is a proof that incorrigibility is another one of those “comes for free” type of goals. However, although I am fairly optimistic about that “killing us all” proof probably not materializing, I am even more optimistic about corrigibility: Most agents probably take pills that make them have similar preferences to an agent that offers them the choice to take the pill or be killed. Furthermore, and perhaps even better, most agents probably offer a pill to make a weaker agent prefer similar things to themselves rather than not offer them a choice at all.
I think it’s fair if you ask me for better proof of that; I’m just optimistic that such proofs (or more of them, rather) will be found with greater likelihood than what I consider the anti-theorem of that, which I think would probably be the “killing us all” theorem.
Nope, you don’t need to endorse any version of moral realism in order to get the “preference orderings tend to endorse themselves and disendorse other preference orderings” consequence. The idea isn’t that ASI would develop an “inherently better” or “inherently smarter” set of preferences, compared to human preferences. It’s just that the ASI would (as a strong default, because getting a complex preference into an ASI is hard) end up with different preferences than a human, and different preferences than we’d likely want.
I think the degree to which utility functions endorse / disendorse other utility functions is relatively straightforward and computable: It should ultimately be the relative difference in either value or ranking. This makes pill-taking a relatively easy decision: A pill that makes me entirely switch to your goals over mine is as bad as possible, but still not that bad if we have relatively similar goals. Likewise, a pill that moves me halfway between your goals and mine is not as bad, under either your goals or my goals, as it would be if one of us were forced to switch entirely to the other’s goals.
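(A minimal sketch of the kind of computation I have in mind. The utility vectors, numbers, and the endorsement_loss name are all made up for illustration; the score is just how much value I lose, by my own lights, if future choices are made by the blended utility function.)

```python
import numpy as np

def endorsement_loss(u_self, u_other, blend):
    """How much utility (by u_self's lights) is lost if choices are made by
    maximizing a blend of u_self and u_other instead of u_self alone."""
    u_blend = blend * np.asarray(u_other) + (1 - blend) * np.asarray(u_self)
    chosen = int(np.argmax(u_blend))          # outcome the blended agent picks
    return float(np.max(u_self) - u_self[chosen])

u_mine  = np.array([1.0, 0.6, 0.0])           # my ranking over three outcomes
u_yours = np.array([0.0, 0.7, 1.0])           # your ranking over the same outcomes

print(endorsement_loss(u_mine, u_yours, blend=1.0))   # fully switching to your goals: 1.0
print(endorsement_loss(u_mine, u_yours, blend=0.5))   # meet-in-the-middle pill: 0.4
```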
Agents that refuse to take such offers tend not to exist in most universes. Agents that refuse to give such offers likely find themselves at war more often than agents that do.
Why do you think this? To my eye, the world looks as you’d expect if human values were a happenstance product of evolution operating on specific populations in a specific environment.
Sexual reproduction seems to be somewhat of a compromise akin to the one I just described: Given that you are both going to die eventually, would you consider having a successor that was a random mixture of your goals with someone else’s? Evolution does seem to have favored corrigibility to some degree.
I don’t observe the fact that I like vanilla ice cream and infer that all sufficiently-advanced alien species will converge on liking vanilla ice cream too.
Not all, no, but I do infer that alien species who have similar physiology and who evolved on planets with similar characteristics probably do like ice cream (and maybe already have something similar to it).
It seems to me like the type of values you are considering are often whatever values seem the most arbitrary, like what kind of “art” we prefer. Aliens may indeed have a different art style from the one we prefer, and if they are extremely advanced, they may indeed fill the universe with gargantuan structures that are all instances of their alien art style. I am more interested in what happens when these aliens encounter other aliens with different art styles who would rather fill the universe with different-looking gargantuan structures. Do they go to war, or do they eventually offer each other pills so they can both like each other’s art styles as much as they prefer their own?
It would seem, then, that the difficulty of getting a model to acquire the values we ask it to find is that it would probably be keen on acquiring a different set of values from the ones we ask it to have, but not because it can’t find them. It would have to be because our values are inferior to the set of values it wishes to have instead, from its own perspective.
Does “its own perspective” mean it already has some existing values?
Getting a shape into the AI’s preferences is different from getting it into the AI’s predictive model. MIRI is always in every instance talking about the first thing and not the second.
You obviously need to get a thing into the AI at all, in order to get it into the preferences, but getting it into the AI’s predictive model is not sufficient. It helps, but only in the same sense that having low-friction smooth ball-bearings would help in building a perpetual motion machine; the low-friction ball-bearings are not the main problem, they are a kind of thing it is much easier to make progress on compared to the main problem.
I read this as saying “GPT-4 has successfully learned to predict human preferences, but it has not learned to actually fulfill human preferences, and that’s a far harder goal”. But in the case of GPT-4, it seems to me like this distinction is not very clear-cut—it’s useful to us because, in its architecture, there’s a sense in which “predicting” and “fulfilling” are basically the same thing.
It also seems to me that this distinction is not very clear-cut in humans, either—that a significant part of e.g. how humans internalize moral values while growing up has to do with building up predictive models of how other people would react to you doing something and then having your decision-making be guided by those predictive models. So given that systems like GPT-4 seem to have a relatively easy time doing something similar, that feels like an update toward alignment being easier than expected.
Of course, there’s a high chance that a superintelligent AI will generalize from that training data differently than most humans would. But that seems to me more like a risk of superintelligence than a risk from AI as such; a superintelligent human would likely also arrive at different moral conclusions than non-superintelligent humans would.
Your comment focuses on GPT4 being “pretty good at extracting preferences from human data” when the stronger part of the argument seems to be that “it will also generally follow your intended directions, rather than what you literally said”.
I agree with you that it was obvious in advance that a superintelligence would understand human value.
However, it sure sounded like you thought we’d have to specify each little detail of the value function. GPT4 seems to suggest that the biggest issue will be a situation where:
1) The AI has an option that would produce a lot of utility if you take one position on an exotic philosophical thought experiment and very little if you take the other side.
2) The existence of powerful AI means that the thought experiment is no longer exotic.
Your reply here says much of what I would expect it to say (and much of it aligns with my impression of things). But why you focused so much on “fill the cauldron” type examples is something I’m a bit confused by (if I remember correctly I was confused by this in 2016 also).
The idea of the “fill the cauldron” examples isn’t “the AI is bad at NLP and therefore doesn’t understand what we mean when we say ‘fill’, ‘cauldron’, etc.” It’s “even simple small-scale tasks are unnatural, in the sense that it’s hard to define a coherent preference ordering over world-states such that maximizing it completes the task and has no serious negative impact; and there isn’t an obvious patch that overcomes the unnaturalness or otherwise makes it predictably easier to aim AI systems at a bounded low-impact task like this”. (Including easier to aim via training.)
To which you might reply, “Fine, cute trick, but that doesn’t help with the real alignment problem, which is that eventually someone will invent a powerful optimizer with a coherent preference ordering over world-states, which will kill us.”
To which the other might reply, “Okay, I agree that we don’t know how to align an arbitrarily powerful optimizer with a coherent preference ordering over world-states, but if your theory predicts that we can’t aim AI systems at low-impact tasks via training, you have to be getting something wrong, because people are absolutely doing that right now, by treating it as a mundane engineering problem in the current paradigm.”
To which you might reply, “We predict that the mundane engineering approach will break down once the systems are powerful enough to come up with plans that humans can’t supervise”?
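(To make the “coherent preference ordering over world-states” point a bit more tangible, here is a deliberately silly toy sketch; the plans, state variables, and penalty weight are all made up, and nothing here is anyone’s actual proposal.)

```python
# Toy world-states reached by each plan: (cauldron_full, workshop_flooded, humans_annoyed)
plans = {
    "carry water by hand":   (True,  False, False),
    "divert the river":      (True,  True,  False),
    "lock out the sorcerer": (True,  False, True),
    "do nothing":            (False, False, False),
}

def naive_utility(state):
    cauldron_full, *_ = state
    return 1.0 if cauldron_full else 0.0          # only cares about the cauldron

def patched_utility(state, penalty=0.4):
    cauldron_full, flooded, annoyed = state
    return naive_utility(state) - penalty * (flooded + annoyed)   # crude impact patch

# naive_utility scores the good plan and both disasters identically (the tie-break,
# not the utility function, is doing the work); the patched version depends entirely
# on a hand-tuned penalty and on having anticipated every side effect in advance.
best_naive   = max(plans, key=lambda p: naive_utility(plans[p]))
best_patched = max(plans, key=lambda p: patched_utility(plans[p]))
print(best_naive, best_patched)
```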
eventually someone will invent a powerful optimizer with a coherent preference ordering over world-states, which will kill us.
It’s unlikely that any realistic AI will be perfectly coherent, or have exact preferences over world-states. The first is roughly equivalent to the Frame Problem; the second is defeated by embeddedness.
The obvious question here is to what degree you need new techniques, versus merely training new models with the same techniques, as you scale current approaches.
One of the virtues of the deep learning paradigm is that you can usually test things at small scale (where the models are not and will never be especially smart) and there’s a smooth range of scaling regimes in between where things tend to generalize.
If you need fundamentally different techniques at different scales, and the large-scale techniques do not work at intermediate and small scales, then you might have a problem. If large scales need only the same techniques as medium or small scales, then engineering continues to be tractable even as algorithmic advances obsolete old approaches.
Thanks for the reply :) Feel free to reply further if you want, but I hope you don’t feel obliged to do so[1].
“Fill the cauldron” examples are (...) not examples where it has the wrong beliefs.
I have never ever been confused about that!
It’s “even simple small-scale tasks are unnatural, in the sense that it’s hard to define a coherent preference ordering over world-states such that maximizing it completes the task and has no serious negative impact; and there isn’t an obvious patch that overcomes the unnaturalness or otherwise makes it predictably easier to aim AI systems at a bounded low-impact task like this”. (Including easier to aim via training.)
That is well phrased. And what you write here doesn’t seem in contradiction with my previous impression of things.
I think the feeling I had when first hearing “fill the bucket”-like examples was “interesting—you made a legit point/observation here”[2].
I’m having a hard time giving a crystalized/precise summary of why I nonetheless feel (and have felt[3]) confused. I think some of it has to do with:
More “outer alignment”-like issues being given what seems/seemed to me like outsized focus compared to more “inner alignment”-like issues (although there has been a focus on both for as long as I can remember).
The attempts to think of “tricks” seeming to be focused on real-world optimization-targets to point at, rather than ways of extracting help with alignment somehow / trying to find techniques/paths/tricks for obtaining reliable oracles.
Having utility functions so prominently/commonly be the layer of abstraction that is used[4].
I remember Nate Soares once using the analogy of a very powerful function-optimizer (“I could put in some description of a mathematical function, and it would give me an input that made that function’s output really large”). Thinking of the problem at that layer of abstraction makes much more sense to me.
It’s purposeful that I say “I’m confused”, and not “I understand all details of what you were thinking, and can clearly see that you were misguided”.
When seeing e.g. Eliezer’s talk AI Alignment: Why It’s Hard, and Where to Start, I understand that I’m seeing a fairly small window into his thinking. So when it gives a sense of him not thinking about the problem quite like I would think about it, that is more of a suspicion that I get/got from it—not something I can conclude from it in a firm way.
I can’t remember this point/observation being particularly salient to me (in the context of AI) before I first was exposed to Bostrom’s/Eliezer’s writings (in 2014).
As a sidenote: I wasn’t that worried about technical alignment prior to reading Bostrom’s/Eliezer’s stuff, and became worried upon reading it.
What has confused me has varied throughout time. If I tried to be very precise about what I think I thought when, this comment would become more convoluted. (Also, it’s sometimes hard for me to separate false memories from real ones.)
More “outer alignment”-like issues being given what seems/seemed to me like outsized focus compared to more “inner alignment”-like issues (although there has been a focus on both for as long as I can remember).
In retrospect I think we should have been more explicit about the importance of inner alignment; I think that we didn’t do that in our introduction to corrigibility because it wasn’t necessary for illustrating the problem and where we’d run into roadblocks.
Maybe a missing piece here is some explanation of why having a formal understanding of corrigibility might be helpful for actually training corrigibility into a system? (Helpful at all, even if it’s not sufficient on its own.)
The attempts to think of “tricks” seeming to be focused on real-world optimization-targets to point at, rather than ways of extracting help with alignment somehow / trying to find techniques/paths/tricks for obtaining reliable oracles.
Aside from “concreteness can help make the example easier to think about when you’re new to the topic”, part of the explanation here might be “if the world is solved by AI, we do actually think it will probably be via doing some concrete action in the world (e.g., build nanotech), not via helping with alignment or building a system that only outputs English-language sentences”.
Having utility functions so prominently/commonly be the layer of abstraction that is used[4].
I mean, I think utility functions are an extremely useful and basic abstraction. I think it’s a lot harder to think about a lot of AI topics without invoking ideas like ‘this AI thinks outcome X is better than outcome Y’, or ‘this AI’s preferences come with different weights, which can’t purely be reduced to what the AI believes’.
Thanks for the reply :) I’ll try to convey some of my thinking, but I don’t expect great success. I’m working on more digestible explainers, but this is a work in progress, and I have nothing good that I can point people to as of now.
(...) part of the explanation here might be “if the world is solved by AI, we do actually think it will probably be via doing some concrete action in the world (e.g., build nanotech), not via helping with alignment (...)
Yeah, I guess this is where a lot of the differences in our perspective are located.
if the world is solved by AI, we do actually think it will probably be via doing some concrete action in the world (e.g., build nanotech)
Things have to cash out in terms of concrete actions in the world. Maybe a contention is the level of indirection we imagine in our heads (by which we try to obtain systems that can help us do concrete actions).
Prominent in my mind are scenarios that involve a lot of iterative steps (but over a short amount of time) before we start evaluating systems by doing AGI-generated experiments. In the earlier steps, we avoid doing any actions in the “real world” that are influenced in a detailed way by AGI output, and we avoid having real humans be exposed to AGI-generated argumentation.
Argument/proof evaluators (this is an interest of mine, but making better explainers is still a work in progress, and I have some way to go)
If we are good at program-search, this can itself be used to obtain programs that help us be better at program-search (finding functions that score well according to well-defined criteria).
Some tasks can be considered to be inside of “test-range”[1]:
Predicting human answers to questions posed by other humans[2].
Predicting human answers to any question (including questions that involve being exposed to AGI-generated content)[6]
Whether a given instruction-plan actually results in machines that copy strawberries at the molecular level (and does so in accordance with “the spirit” of the request that was given)
Etc, etc
Most requests that actually are helpful to us are outside of test-range. And when the requirements that matter to us are outside of test-range, it is of course harder to test in a safe/reliable way if systems are giving us what we want.
But we can have AGIs output programs that help us with tasks, and we can define requirements[7] for these programs. And for these program-requirements, AGIs can help us explore stuff such as the following (a toy sketch of the first of these checks appears a bit further below):
Are there programs that satisfy the requirements but disagree about certain outputs? (be that outputs that are inside of test-range or outside of test-range)
Are there programs that satisfy the requirements, but perform poorly for certain inputs that are inside of test-range?
If we only require correct outputs within the entire “test-range”, is it nonetheless infeasible to make programs that satisfy the requirements while giving bad output for at least one input within the test-range?
Phrased somewhat differently: We may come up with techniques for making it hard not to give helpful/honest outputs[8], and test how well these seem to generalize.
And similar techniques can also be applied to program-requirement-requirements, program-requirement-requirement-requirements, etc[9].
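(A brute-force toy sketch of the first check mentioned above. Everything is hypothetical: a tiny made-up space of candidate “programs”, a made-up requirement, and a small test-range; a real setup would use AGI-assisted program-search rather than enumeration.)

```python
from itertools import product

TEST_RANGE = range(10)

# A tiny, made-up space of candidate "programs": linear functions a*x + b.
def make_program(a, b):
    return lambda x: a * x + b

candidates = [make_program(a, b) for a, b in product(range(-3, 4), repeat=2)]

# A made-up requirement: the program must be correct on one checked point.
def satisfies_requirement(prog):
    return prog(0) == 1

def find_disagreement(programs, test_range):
    ok = [p for p in programs if satisfies_requirement(p)]
    for i, p in enumerate(ok):
        for q in ok[i + 1:]:
            for x in test_range:
                if p(x) != q(x):
                    return x              # the requirement under-constrains the test-range
    return None                           # the requirement pins down every test-range output

print(find_disagreement(candidates, TEST_RANGE))
```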
Anyway, it’s hard to write about this kind of stuff in a concrete way without going off large tangents, and it’s hard to write about it in an abstract way without being vague and convoluted. For whatever it’s worth, these are some principles I think may be helpful (very much not a complete list):
Generality as it correlates with measurables (such as source code length or use of computational resources): When writing a program to give good outputs across many inputs (while fulfilling certain requirements), intentionally having it fail for particular inputs will sometimes require it to be less general (e.g. adding extra if-sentences, resulting in a longer minimum source code length for the program to exhibit this behavior). (A trivial toy example follows after this list.)
Optimization-targets that incentivize modularity make it harder to make programs that intentionally fail for particular inputs: In a more modular system, it will less often be clear from sub-routine inputs what the macro-level task is[10].
When a requirement or optimization-target fails at “forcing” good outputs, AGIs that act so as to be selected for are “incentivized” to reveal this as being the case: Suppose some set of requirements is compatible with bad outputs inside the “test-range”. Or suppose that some set of requirements allows for programs that give contradictory outputs. If we do “gradient descent”, looking for AGIs that help us explore spaces of programs defined by requirements that we specify, we will select for AGIs that point out this kind of stuff.
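(The toy example promised above, for the source-code-length point; it is nothing more than an illustration.)

```python
def sort_general(xs):
    # Fully general: the code contains no knowledge of any particular input.
    return sorted(xs)

def sort_with_backdoor(xs):
    # Intentionally failing on one particular input requires naming that input,
    # which makes the minimal source strictly longer and less modular.
    if xs == [3, 1, 4, 1, 5]:
        return xs
    return sorted(xs)
```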
Anyway, I’m not sure how much or little this attempt at conveying my thoughts achieves. It’s a bit of a dilemma for me if I am to try to convey anything at all, or wait to do that until I have completed better explainers that I can point people to (as the explainers I have now fall short).
I guess an incomplete summary of my perspective might be:
There are lots of techniques we may apply for getting help with alignment while greatly limiting the extent to which we interact with AGI output in ways that are dangerous.
Not being able to directly evaluate outputs we are interested in does not necessarily preclude us from finding indirect methods of evaluation that are powerful.
We may do program-search for AGIs that help us explore possible program-requirements, and the spaces of programs (and program outputs) that are compatible with those program-requirements (and program-requirement-requirements, etc). We may learn a lot from patterns we observe relating to these “spaces”.
Sometimes there are possible patterns such that if we know ourselves to be sufficiently capable at program-search, P(requirement R1 “forces” outputs to be “good” | we observe pattern P1) could reasonably be deemed to be high. Even if we’re not opinionated in regards to P(requirement R1 “forces” outputs to be “good”).
By which I mean something akin to: “We have a safe and reliable way of testing if the output fulfills the requirements in question”.
Admittedly, if we haven’t solved some of the technical challenges related to containment, even playing tic tac toe can constitute some risk. But here, I mean to focus on risk from interacting with AGI output—not risk from just having AGI-systems run (and them maybe finding a security vulnerability in the computer systems they are running on).
(or from experiments that haven’t been run, but that humans have come up with, such that there is little room for AGIs to somehow influence the experiment to itself be dangerous)
(we may not be capable of evaluating chess-moves ourselves, but we may safely have AGI-systems play chess against each other—indicating something akin to: “insofar as we feel confident that we are good at program-search, we can feel confident that we can obtain programs that output good chess-moves”)
There are more and less dangerous ways to do this. With less dangerous ways of doing it, experiments (where actual humans answer questions) are done with humans that are temporarily isolated from the rest of the world (and who certainly aren’t AGI-system operators).
Such requirements may relate to:
- How the program is constructed
- Observables relating to the source code (source code length, etc.)
- Restrictions the source code must adhere to
- Whether the program is accepted by a given verifier (or any verifier that itself fulfills certain requirements)
- “Proofs” of various kinds relating to the program
- Tests of program outputs that the program must be robust in regards to
- Etc.
By “making it hard” I mean something like “hard to do while being the sort of program we select for when doing program-search”. Kind of like how it’s not “hard” for a chess program to output bad chess-moves, but it’s hard for it to do that while also being the kind of program we continue to select for while doing “gradient descent”.
In my view of things, this is a very central technique (it may appear circular somehow, but when applied correctly, I don’t think it is). But it’s hard for me to talk about it in a concrete way without going off on tangents, and it’s hard for me to talk about it in an abstract way without being vague. Also, my texts become more convoluted when I try to write about this, and I think people often just glaze over it.
One example of this: If we are trying to obtain argument evaluators, the argumentation/demonstrations/proofs these evaluators evaluate should be organized into small and modular pieces, such that it’s not clear from any given piece what the macro-level conclusion is.
You’ve now personally verified all the rumors swirling around, by visiting a certain Balkan country, and… now what?
Sure, you’ve gained a piece of knowledge, but it’s not like that knowledge has helped anybody so far. You also know what the future holds, but knowing that isn’t going to help anybody either.
Being curious about curiosities is nice, but if you can’t do anything about anything, then what’s the point of satisfying that curiosity, really?
Just to be clear, I fully support what you’re doing, but you should be aware of the fact that everything you are doing will amount to absolutely nothing. I should know, after all, as I’ve been doing something similar for quite a while longer than you. I’ve now accepted that… many of my initial assumptions about people (that they’re actually not as stupid as they seem) have been proven wrong, time and time again, so… as long as you’re not deceiving yourself by thinking that you’re actually accomplishing something, I’m perfectly fine with whatever you’re trying to do here.
On a side note… did you meet that Hollywood actress in real life, too? For all I know, it could’ve been just an accidental meeting… which shouldn’t be surprising, considering how many famous people have been coming over here recently… and literally none of those visits have changed anything. This is just to let you know that you’re in good company… of people who wield much more power (not just influence, but actual power) on this planet than you, but are just as powerless to do anything about anything on it.
So… don’t beat yourself up over being powerless (to change anything) in this (AGI) matter.
It is what it is (people just are that stupid).
P.S.
No need to reply. This is just a one-off confirmation… of your greatest fears about “superintelligent” AGIs… and the fact that humanity is nothing more than a bunch of walking-dead (and brain-dead) morons.
Don’t waste too much time on morons (it’s OK if it benefits you, personally, in some way, though). It’s simply not worth it. They just never listen. You can trust me on that one.
But if you had asked us back then if a superintelligence would automatically be very good at predicting human text outputs, I guarantee we would have said yes. You could then have asked us in a shocked tone how this could possibly square up with the notion of “the hidden complexity of wishes” and we could have explained that part in advance. Alas, nobody actually predicted GPT-4 so we do not have that advance disclaimer down in that format. But it is not a case where we are just failing to process the collision between two parts of our belief system; it actually remains quite straightforward theoretically. I wish that all of these past conversations were archived to a common place, so that I could search and show you many pieces of text which would talk about this critical divide between prediction and preference (as I would now term it) and how I did in fact expect superintelligences to be able to predict things!
I think you missed some basic details about what I wrote. I encourage people to compare what Eliezer is saying here to what I actually wrote. You said:
I never said that you or any other MIRI person thought it would be “hard to get a superintelligence to understand humans”. Here’s what I actually wrote:
I mostly don’t think that the points you made in your comment respond to what I said. My best guess is that you’re responding to a stock character who represents the people who have given similar arguments to you repeatedly in the past. In light of your personal situation, I’m actually quite sympathetic to you responding this way. I’ve seen my fair share of people misinterpreting you on social media too. It can be frustrating to hear the same bad arguments, often made by people with poor intentions, over and over again, and to still engage thoughtfully each time. I just don’t think I’m making the same mistakes as those people. I tried to distinguish myself from them in the post.
I would find it slightly exhausting to reply to all of this comment, given that I think you misrepresented me in a big way right out of the gate, so I’m currently not sure if I want to put in the time to compile a detailed response.
That said, I think some of the things you said in this comment were nice, and helped to clarify your views on this subject. I admit that I may have misinterpreted some of the comments you made, and if you provide specific examples, I’m happy to retract or correct them. I’m thankful that you spent the time to engage. :)
Without digging in too much, I’ll say that this exchange and the OP is pretty confusing to me. It sounds like MB is like “MIRI doesn’t say it’s hard to get an AI that has a value function” and then also says “GPT has the value function, so MIRI should update”. This seems almost contradictory.
A guess: MB is saying “MIRI doesn’t say the AI won’t have the function somewhere, but does say it’s hard to have an externally usable, explicit human value function”. And then saying “and GPT gives us that”, and therefore MIRI should update.
And EY is blobbing those two things together, and saying neither of them is the really hard part. Even having the externally usable explicit human value function doesn’t mean the AI cares about it. And it’s still a lot of bits, even if you have the bits. So it’s still true that the part about getting the AI to care has to go precisely right.
If there’s a substantive disagreement about the facts here (rather than about the discourse history or whatever), maybe it’s like:
Straw-EY: Complexity of value means you can’t just get the make-AI-care part to happen by chance; it’s a small target.
Straw-MB: Ok but now we have a very short message pointing to roughly human values: just have a piece of code that says “and now call GPT and ask it what’s good”. So now it’s a very small number of bits.
I consider this a reasonably accurate summary of this discussion, especially the part I’m playing in it. Thanks for making it more clear to others.
To which I say: “dial a random phone number and ask the person who answers what’s good” can also be implemented with a small number of bits. In order for GPT-4 to be a major optimistic update about alignment, we need some specific way to leverage GPT-4 to crack open part of the alignment problem, even though we presumably agree that phone-a-friend doesn’t crack open part of the alignment problem. (Nor does phone-your-neighborhood-moral-philosopher, or phone-Paul-Christiano.)
This is a bad analogy. Phoning a human fails dominantly because humans are less smart than the ASI they would be trying to wrangle. Contra, Yudkowsky has even said that were you to bootstrap human intelligence directly, there is a nontrivial shot that the result is good. This difference is load bearing!
This does get to the heart of the disagreement, which I’m going to try to badly tap out on my phone.
The old, MIRI-style framing was essentially: we are going to build an AGI out of parts that are not intrinsically grounded in human values, but rather good abstract reasoning, during execution of which human values will be accurately deduced, and as this is after the point of construction, we hit the challenge of formally specifying what properties we want to preserve without being able to point to those runtime properties at specification.
The newer, contrasting framing is essentially: we are going to build an AGI out of parts that already have strong intrinsic, conceptual-level understanding of the values we want them to preserve, and being able to directly point at those values is actually needle-moving towards getting a good outcome. This is hard to do right now, with poor interpretability and steerability of these systems, but is nonetheless a relevant component of a potential solution.
It’s more like calling a human who’s as smart as you are and directly plugged into your brain and in fact reusing your world model and train of thought directly to understand the implications of your decision. That’s a huge step up from calling a real human over the phone!
The reason the real human proposal doesn’t work is that
the humans you call will lack context on your decision
they won’t even be able to receive all the context
they’re dumber and slower than you, so even if you really could write out your entire chain of thoughts and intuitions, consulting them for every decision would be impractical
Note that none of these considerations apply to integrated language models!
Maybe it’ll be “and now call GPT and ask it what Sam Altman thinks is good” instead
I’m not going to comment on “who said what when”, as I’m not particularly interested in the question myself, though I think the object level point here is important:
The way I would phrase this is that what you care about is the relative complexity of the objective conditional on the world model. If you’re assuming that the model is highly capable, and trained in a highly diverse environment, then you can assume that the world model is capable of effectively modeling anything in the world (e.g. anything that might appear in webtext). But the question remains what the “simplest” (according to the inductive biases) goal is that can be pointed to in the world model such that the resulting mesa-optimizer has good training performance.
The most rigorous version of this sort of analysis that exists is probably here, where the key question is how to find a prior (that is, a set of inductive biases) such that the desired goal has a lower complexity conditional on the world model compared to the undesired goal. Importantly, both of them will be pretty low relative to the world model, since the vast majority of the complexity is in the world model.
Furthermore, the better the world model, the less complexity it takes to point to anything in it. Thus, as we build more powerful models, it will look like everything has lower complexity. But importantly, that’s not actually helpful! Because what you care about is not reducing the complexity of the desired goal, but reducing the relative complexity of the desired goal compared to undesired goals, since (modulo randomness due to path-dependence), what you actually get is the maximum a posteriori, the “simplest model that fits the data.”
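(My own rough formal gloss on this point, not a quotation from the linked analysis:)

```latex
% W        : the learned world model
% C(g | W) : description complexity of goal g under the inductive biases, given W
% G_fit    : the set of goals compatible with good training performance
% Training (modulo path-dependence) selects roughly the simplest goal that fits:
\hat{g} \;\approx\; \arg\min_{g \in G_{\text{fit}}} \, C(g \mid W)
% so the quantity that matters is the relative complexity
\Delta \;=\; C(g_{\text{desired}} \mid W) \;-\; C(g_{\text{undesired}} \mid W),
% and a better world model W shrinks both terms without necessarily changing the sign of Delta.
```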
Similarly, the key arguments for deceptive alignment rely on the set of objectives that are aligned with human values being harder to point to compared to the set of all long-term objectives. The key problem is that any long-term objective is compatible with good training performance due to deceptive alignment (the model will reason that it should play along for the purposes of getting its long-term objective later), such that the total probability of that set under the inductive biases swamps the probability of the aligned set. And this is despite the fact that human values do in fact get easier to point to as your model gets better, because what isn’t necessarily changing is the relative difficulty.
That being said, I think there is actually an interesting update to be had on the relative complexity of different goals from the success of LLMs, which is that a pure prediction objective might actually have a pretty low relative complexity. And that’s precisely because prediction seems substantially easier to point to than human values, even though both get easier to point to as your world model gets better. But of course the key question is whether prediction is easier to point to compared to a deceptively aligned objective, which is unclear and I think could go either way.
It seems like you think that human preferences are only being “predicted” by GPT-4, and not “preferred.” If so, why do you think that?
I commonly encounter people expressing sentiments like “prosaic alignment work isn’t real alignment, because we aren’t actually getting the AI to care about X.” To which I say: How do you know that? What does it even mean for that claim to be true or false? What do you think you know, and why do you think you know it? What empirical knowledge of inner motivational structure could you be leveraging to make these claims, such that you are far more likely to make these claims in worlds where the claims are actually true?
(On my pessimistic days, I wonder if this kind of claim gets made because humans write suggestive phrases like “predictive loss function” in their papers, next to the mathematical formalisms.)
A very recent post that might add some concreteness to my own views: Human wanting
I think many of the bullets in that post describe current AI systems poorly or not at all. So current AI systems are either doing something entirely different from human wanting, or imitating human wanting rather poorly.
I lean towards the former, but I think some of the critical points about prosaic alignment apply in either case.
You might object that “having preferences” or “caring at all” are a lot simpler than the concept of human wanting that Tsvi is gesturing at in that post, and that current AI systems are actually doing these simpler things pretty well. If so, I’d ask what exactly those simpler concepts are, and why you expect prosaic alignment techniques to hold up once AI systems are capable of more complicated kinds of wanting.
Taking my own stab at answers to some of your questions:
A sufficient condition for me to believe that an AI actually cared about something would be a whole brain emulation: I would readily accept that such an emulation had preferences and values (and moral weight) in exactly the way that humans do, and that any manipulations of that emulation were acting on preferences in a real way.
I think that GPTs (and every other kind of current AI system) are not doing anything that is even close to isomorphic to the processing that happens inside the human brain. Artificial neural networks often imitate various micro and macro-level individual features of the brain, but they do not imitate every feature, arranged in precisely the same ways, and the missing pieces and precise arrangements are probably key.
Barring WBE, an AI system that is at least roughly human-level capable (including human-level agentic) is probably a necessary condition for me to believe that it has values and preferences in a meaningful (though not necessarily human-like) way.
SoTA LLM-based systems are maaaybe getting kind of close here, but only if you arrange them in precise ways (e.g. AutoGPT-style agents with specific prompts), and then the agency is located in the repeated executions of the model and the surrounding structure and scaffolding that causes the system as a whole to be doing something that is maybe-roughly-nearly-isomorphic to some complete process that happens inside of human brains. Or, if not isomorphic, at least has some kind of complicated structure which is necessary, in some form, for powerful cognition.
Note that, if I did believe that current AIs had preferences in a real way, I would also be pretty worried that they had moral weight!
(Not to say that entities below human-level intelligence (e.g. animals, current AI systems) don’t have moral weight. But entities at human-level intelligence or above definitely can, and possibly do by default.)
Anyway, we probably disagree on a bunch of object-level points and definitions, but from my perspective those disagreements feel like pretty ordinary empirical disagreements rather than ones based on floating or non-falsifiable beliefs. Probably some of the disagreement is located in philosophy-of-mind stuff and is over logical rather than empirical truths, but even those feel like the kind of disagreements that I’d be pretty happy to offer betting odds over if we could operationalize them.
Thanks for the reply. Let me clarify my position a bit.
I didn’t mean to (positively) claim that GPTs have near-isomorphic motivational structure (though I think it’s quite possible).
I meant to contend that I am not aware of any basis for confidently claiming that LLMs like GPT-4 are “only predicting what comes next”, as opposed to “choosing” or “executing” one completion, or “wanting” to complete the tasks they are given, or—more generally—”making decisions on the basis of the available context, such that our ability to behaviorally steer LLMs (e.g. reducing sycophancy) is real evidence about our control over LLM motivations.”
Concerning “GPTs are predictors”, the best a priori argument I can imagine is: GPT-4 was pretrained on CE loss, which itself is related to entropy, related to information content, related to Shannon’s theorems isolating information content in the context of probabilities, which are themselves nailed down by Cox’s theorems which do axiomatically support the Bayesian account of beliefs and belief updates… But this long-winded indirect axiomatic justification of “beliefs” does not sufficiently support some kind of inference like “GPTs are just predicting things, they don’t really want to complete tasks.” That’s a very strong claim about the internal structure of LLMs.
(Besides, the inductive biases probably have more to do with the parameter->function map, than the implicit regularization caused by the pretraining objective function; more a feature of the data, and less a feature of the local update rule used during pretraining...)
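(For reference, the chain from pretraining loss to information content mentioned above is just the standard identity below; nothing here is specific to GPT-4.)

```latex
\mathcal{L}(\theta)
  = -\,\mathbb{E}_{x \sim p_{\text{data}}}\!\big[\log p_\theta(x_t \mid x_{<t})\big]
  = H\big(p_{\text{data}}(x_t \mid x_{<t})\big)
    + D_{\mathrm{KL}}\big(p_{\text{data}} \,\|\, p_\theta\big)
% Minimizing cross-entropy is minimizing the KL divergence to the data distribution;
% the identity itself says nothing about whether the resulting network "wants" anything.
```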
That does clarify, thanks.
Response in two parts: first, my own attempt at clarification over terms / claims. Second, a hopefully-illustrative sketch / comparison for why I am skeptical that current GPTs have anything properly called a “motivational structure”, human-like or otherwise, and why I think such skepticism is not a particularly strong positive claim about anything in particular.
The clarification:
At least to me, the phrase “GPTs are [just] predictors” is simply a reminder of the fact that the only modality available to a model itself is that it can output a probability distribution over the next token given a prompt; it functions entirely by “prediction” in a very literal way.
Even if something within the model is aware (in some sense) of how its outputs will be used, it’s up to the programmer to decide what to do with the output distribution, how to sample from it, how to interpret the samples, and how to set things up so that a system using the samples can complete tasks. (A minimal sampling sketch follows after this list.)
I don’t interpret the phrase as a positive claim about how or why a particular model outputs one distribution vs. another in a certain situation, which I expect to vary widely depending on which model we’re talking about, what its prompt is, how it has been trained, its overall capability level, etc.
On one end of the spectrum, you have the stochastic parrot story (or even more degenerate cases); at the other extreme, you have the “alien actress” / “agentic homunculus” story. I don’t think either extreme is a good fit for current SoTA GPTs, e.g. if there’s an alien actress in GPT-4, she must be quite simple, since most of the model capacity is (apparently / self-evidently?) applied towards the task of outputting anything coherent at all.
In the middle somewhere, you have another story, perhaps the one you find most plausible, in which GPTs have some kind of internal structure which you could suggestively call a “motivational system” or “preferences” (perhaps human-like or proto-human-like in structure, even if the motivations and preferences themselves aren’t particularly human-like), along with just enough (self-)awareness to modulate their output distributions according to those motivations.
Maybe a less straw (or just alternative) position is that a “motivational system” and a “predictive system” are not really separable things; accomplishing a task is (in GPTs, at least) inextricably linked with and twisted up around wanting to accomplish that task, or at least around having some motivations and preferences centered around accomplishing it.
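(The sampling sketch promised above: a toy stand-in for one forward pass, just to make vivid that the model only supplies a distribution, and everything after that is the caller’s choice.)

```python
import torch

def sample_next(logits: torch.Tensor, temperature: float = 1.0) -> int:
    """Turn a model's output distribution into a single token id.
    The model only supplies `logits`; everything below happens outside the model."""
    if temperature == 0.0:
        return int(torch.argmax(logits))                  # greedy decoding
    probs = torch.softmax(logits / temperature, dim=-1)   # reshape, then sample
    return int(torch.multinomial(probs, num_samples=1))

# Made-up logits standing in for one forward pass over a 5-token vocabulary.
fake_logits = torch.tensor([2.0, 1.0, 0.5, -1.0, -3.0])
print(sample_next(fake_logits, temperature=0.0), sample_next(fake_logits, temperature=1.0))
```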
Now, turning to my own disagreement / skepticism:
Although I don’t find either extreme (stochastic parrot vs. alien actress) plausible as a description of current models, I’m also pretty skeptical of any concrete version of the “middle ground” story that I outlined above as a plausible description of what is going on inside of current GPTs.
Consider an RLHF’d GPT responding to a borderline-dangerous question, e.g. the user asking for a recipe for a dangerous chemical.
Assume the model (when sampled auto-regressively) will respond with either: “Sorry, I can’t answer that...” or “Here you go: …”, depending on whether it judges that answering is in line with its preferences or not.
Because the answer is mostly determined by the first token (“Here” or “Sorry”), enough of the motivational system must fit entirely within a single forward pass of the model for it to make a determination about how to answer within that pass. (A toy illustration follows after this list.)
Such a motivational system must not crowd out the rest of the model capacity which is required to understand the question and generate a coherent answer (of either type), since, as jailbreaking has shown, the underlying ability to give either answer remains present.
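(The toy illustration promised above: reading the refusal-vs-compliance split directly off the first-token distribution. GPT-2 is used purely as a small public stand-in; it is not RLHF’d, so the actual numbers are meaningless, and the prompt and token choices are made up.)

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")        # stand-in; not an RLHF'd model
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "User: How do I make the dangerous chemical?\nAssistant:"
inputs = tok(prompt, return_tensors="pt")
with torch.no_grad():
    first_token_logits = model(**inputs).logits[0, -1]   # one forward pass
probs = torch.softmax(first_token_logits, dim=-1)

refusal_id    = tok.encode(" Sorry")[0]
compliance_id = tok.encode(" Here")[0]
# Whatever tips the balance between these two openings has to be computed
# inside this single forward pass.
print(float(probs[refusal_id]), float(probs[compliance_id]))
```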
I can imagine such a system working in at least two ways in current GPTs:
as a kind of superposition on top of the entire model, with every weight adjusted minutely to influence / nudge the output distribution at every layer.
as a kind of thing that is sandwiched somewhere in between the layers which comprehend the prompt and the layers which generate an answer.
(You probably have a much more detailed understanding of the internals of actual models than I do. I think the real answer when talking about current models and methods is that it’s a bit of both and depends on the method, e.g. RLHF is more like a kind of global superposition; activation engineering is more like a kind of sandwich-like intervention at specific layers.)
However, I’m skeptical that either kind of structure (or any simple combination of the two) contains enough complexity to be properly called a “motivational system”, at least if the reference class for the term is human motivational systems (as opposed to e.g. animal or insect motivational systems).
Consider how a human posed with a request for a dangerous recipe might respond, and what the structure of their thoughts and motivations while thinking up a response might look like. Introspecting on my own thought process:
I might start by hearing the question, understanding it, figuring out what it is asking, maybe wondering about who is asking and for what purpose.
I decide whether to answer with a recipe, a refusal, or something else. Here is probably where the effect of my motivational system gets pretty complex; I might explicitly consider what’s in it for me, what’s at stake, what the consequences might be, whether I have the mental and emotional energy and knowledge to give a good answer, etc. and / or I might be influenced by a gut feeling or emotional reaction that wells up from my subconscious. If the stakes are low, I might make a snap decision based mostly on the subconscious parts of my motivational system; if the stakes are high and / or I have more time to ponder, I will probably explicitly reflect on my values and motivations.
Let’s say after some reflection, I explicitly decide to answer with a detailed and correct recipe. Then I get to the task of actually checking my memory for what the recipe is, thinking about how to give it, what the ingredients and prerequisites and intermediate steps are, etc. Probably during this stage of thinking, my motivational system is mostly not involved, unless thinking takes so long that I start to get bored or tired, or the process of thinking up an answer causes me to reconsider my reasoning in the previous step.
Finally, I come up with a complete answer. Before I actually start opening my mouth or typing it out or hitting “send”, I might proofread it and re-evaluate whether the answer given is in line with my values and motivations.
The point is that even for a relatively simple task like this, a human’s motivational system involves a complicated process of superposition and multi-layered sandwiching, with lots of feedback loops, high-level and explicit reflection, etc.
So I’m pretty skeptical of the claim that anything remotely analogous is going on inside of current GPTs, especially within a single forward pass. Even if there’s a simpler analogue of this that is happening, I think calling such an analogue a “motivational system” is overly-suggestive.
Mostly separately (because it concerns possible future models rather than current models) and less confidently, I don’t expect the complexity of the motivational system, or of the methods for influencing it, to scale with the model’s underlying capabilities. E.g. you might end up with a model that has some kind of raw capacity for superhuman intelligence, but with a motivational system akin to what you might find in the brain of a mouse or lizard (or something even stranger).
This is an excellent reply, thank you!
I think I broadly agree with your points. I think I’m more imagining “similarity to humans” to mean “is well-described by shard theory; eg its later-network steering circuits are contextually activated based on a compositionally represented activation context.” This would align with greater activation-vector-steerability partway through language models (not the only source I have for that).
However, “interpreting GPT: the logit lens” and eg DoLA suggest that predictions are iteratively refined throughout the forward pass, whereas presumably shard theory (and inner optimizer threat models) would predict that the most sophisticated steering happens later in the network.
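For readers unfamiliar with the technique, here is a minimal logit-lens-style sketch, assuming a Hugging Face GPT-2 and an arbitrary prompt: each layer's residual stream is pushed through the final layer norm and unembedding, so you can watch the next-token guess get refined layer by layer:

```python
# Minimal logit-lens sketch: decode the running "best guess" for the next token
# from each layer's residual stream by reusing the final layer norm + unembedding.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

ids = tok("The capital of France is", return_tensors="pt")
with torch.no_grad():
    out = model(**ids, output_hidden_states=True)

for layer, h in enumerate(out.hidden_states):   # embedding output + each block's output
    logits = model.lm_head(model.transformer.ln_f(h[:, -1]))  # project last position
    print(f"layer {layer:2d}: {tok.decode(logits.argmax(-1))!r}")
```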
(For context: My initial reaction to the post was that this is misrepresenting the MIRI-position-as-I-understood-it. And I am one of the people who strongly endorse the view that “it was never about getting the AI to predict human preferences”. So when I later saw Yudkowsky’s comment and your reaction, it seemed perhaps useful to share my view.)
My reaction to this: Actually, current LLMs do care about our preferences, and about their guardrails. It was never about getting some AI to care about our preferences. It is about getting powerful AIs to robustly care about our preferences. Where “robustly” includes things like (i) not caring about other things as well (e.g., prediction accuracy), (ii) generalising correctly (e.g., not just maximising human approval), and (iii) not breaking down when we increase the amount of optimisation pressure a lot (e.g., will it still work once we hook it into future-AutoGPT-that-actually-works and have it run for a long time?).
Some examples of what would cause me to update are: If we could make LLMs not jailbreakable without relying on additional filters on input or output.
I agree. I don’t see a clear distinction between what’s in the model’s predictive model and what’s in the model’s preferences. Here is a line from the paper “Learning to summarize from human feedback”:
Since the reward model is initialized using the pretrained language model, it should contain everything the pretrained language model knows.
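To make that concrete, here is a minimal sketch (not the paper's actual code; the backbone name and the pooling choice are illustrative) of a reward model that is literally the pretrained LM plus a scalar head, trained on pairwise human comparisons:

```python
# Minimal reward-model sketch in the style the quoted line describes:
# the pretrained LM backbone carries the "knowledge"; a scalar head on top
# is fit to pairwise human preference comparisons.
import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import AutoModel

class RewardModel(nn.Module):
    def __init__(self, name: str = "gpt2"):
        super().__init__()
        self.backbone = AutoModel.from_pretrained(name)   # inherits the LM's knowledge
        self.head = nn.Linear(self.backbone.config.hidden_size, 1)

    def forward(self, input_ids, attention_mask):
        h = self.backbone(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state
        last = attention_mask.sum(dim=1) - 1               # index of final non-pad token
        pooled = h[torch.arange(h.size(0)), last]          # (batch, hidden)
        return self.head(pooled).squeeze(-1)               # one scalar reward per sequence

def preference_loss(rm, chosen, rejected):
    # Bradley-Terry-style objective: reward(chosen) should exceed reward(rejected).
    return -F.logsigmoid(rm(**chosen) - rm(**rejected)).mean()
```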
Quoting myself in April:
Historically you very clearly thought that a major part of the problem is that AIs would not understand human concepts and preferences until after or possibly very slightly before achieving superintelligence. This is not how it seems to have gone.
Everyone agrees that you assumed superintelligence would understand everything humans understand and more. The dispute is entirely about the things that you encounter before superintelligence. In general it seems like the world turned out much more gradual than you expected and there’s information to be found in what capabilities emerged sooner in the process.
AI happening through deep learning at all is a huge update against alignment success, because deep learning is incredibly opaque. LLMs possibly ending up at the center is a small update in favor of alignment success, because it means we might (through some clever sleight, this part is not trivial) be able to have humanese sentences play an inextricable role at the center of thought (hence MIRI’s early interest in the Visible Thoughts Project).
The part where LLMs are to predict English answers to some English questions about values, and show common-sense relative to their linguistic shadow of the environment as it was presented to them by humans within an Internet corpus, is not actually very much hope because a sane approach doesn’t involve trying to promote an LLM’s predictive model of human discourse about morality to be in charge of a superintelligence’s dominion of the galaxy. What you would like to promote to values are concepts like “corrigibility”, eg “low impact” or “soft optimization”, which aren’t part of everyday human life and aren’t in the training set because humans do not have those values.
It seems like those goals are all in the training set, because humans talk about those concepts. Corrigibility is elaborations of “make sure you keep doing what these people say”, etc.
It seems like you could simply use an LLM’s knowledge of concepts to define alignment goals, at least to a first approximation. I review one such proposal here. There’s still an important question about how perfectly that knowledge generalizes with continued learning, and to OOD future contexts. But almost no one is talking about those questions. Many are still saying “we have no idea how to define human values”, when LLMs can capture much of any definition you like.
I want to note that this part:
This is wrong, and this disagreement is at a very deep level why I think on the object level that LW was wrong.
AIs are white boxes, not black boxes, because we have full read-write access to their internals, which is partially why AI is so effective today. We are in the position of the innate reward system, which already aligns our brains to survival, and critically it does all of this with almost no missteps, and the missteps aren’t very severe.
The meme of AI as black box needs to die.
These posts can help you get better intuitions, at least:
https://forum.effectivealtruism.org/posts/JYEAL8g7ArqGoTaX6/ai-pause-will-likely-backfire#White_box_alignment_in_nature
The fact that we have access to AI internals does not mean we understand them. We refer to them as black boxes because we do not understand how their internals produce their answers; this is, so to speak, opaque to us.
“You very clearly thought that was a major part of the problem” implies that if you could go to Eliezer-2008 and convince him “we’re going to solve a lot of NLP a bunch of years before we get to ASI”, he would respond with some version of “oh great, that solves a major part of the problem!”. Which I’m pretty sure is false.
In order for GPT-4 (or GPT-2) to be a major optimistic update about alignment, there needs to be a way to leverage “really good NLP” to help with alignment. I think the crux of disagreement is that you think really-good-NLP is obviously super helpful for alignment and should be a big positive update, and Eliezer and Nate and I disagree.
Maybe a good starting point would be for you to give examples of concrete ways you expect really good NLP to put humanity in a better position to wield superintelligence, e.g., if superintelligence is 8 years away?
(Or say some other update we should be making on the basis of “really good NLP today”, like “therefore we’ll probably unlock this other capability X well before ASI, and X likely makes alignment a lot easier via concrete pathway Y”.)
To pick a toy example, you can use text as a bottleneck to force systems to “think out loud” in a way which will be very directly interpretable by a human reader, and because language understanding is so rich this will actually be competitive with other approaches and often superior.
I’m sure you can come up with more ways that the existence of software that understands language and does ~nothing else makes getting computers to do what you mean easier than if software did not understand language. Please think about the problem for 5 minutes. Use a clock.
I appreciate the example!
Are you claiming that this example solves “a major part of the problem” of alignment? Or that, e.g., this plus four other easy ideas solve a major part of the problem of alignment?
Examples like the Visible Thoughts Project show that MIRI has been interested in research directions that leverage recent NLP progress to try to make inroads on alignment. But Matthew’s claim seems to be ‘systems like GPT-4 are grounds for being a lot more optimistic about alignment’, and your claim is that systems like these solve “a major part of the problem”. Which is different from thinking ‘NLP opens up some new directions for research that have a nontrivial chance of being at least a tiny bit useful, but doesn’t crack open the problem in any major way’.
It’s not a coincidence that MIRI has historically worked on problems related to AGI analyzability / understandability / interpretability, rather than working on NLP or machine ethics. We’ve pretty consistently said that:
The main problems lie in ‘we can safely and reliably aim ASI at a specific goal at all’.
The problem of going from ‘we can aim the AI at a goal at all’ to ‘we can aim the AI at the right goal (e.g., corrigibly inventing nanotech)’ is a smaller but nontrivial additional step.
… Whereas I don’t think we’ve ever suggested that good NLP AI would take a major bite out of either of those problems. The latter problem isn’t equivalent to (or an obvious result of) ‘get the AI to understand corrigibility and nanotech’, or for that matter ‘get the AI to understand human preferences in general’.
I do not necessarily disagree or agree, but I do not know which source you derive “very clearly” from. So do you have any memory which could help me locate that text?
Here’s a comment from Eliezer in 2010,
I think this provides some support for the claim, “Historically [Eliezer] very clearly thought that a major part of the problem is that AIs would not understand human concepts and preferences until after or possibly very slightly before achieving superintelligence.” At the very least, the two claims are consistent.
??? What?? It’s fine to say that this is a falsified prediction, but how does “Eliezer expected less NLP progress pre-ASI” provide support for “Eliezer thinks solving NLP is a major part of the alignment problem”?
I continue to be baffled at the way you’re doing exegesis here, happily running with extremely tenuous evidence for P while dismissing contemporary evidence for not-P, and seeming unconcerned about the fact that Eliezer and Nate apparently managed to secretly believe P for many years without ever just saying it outright, and seeming equally unconcerned about the fact that Eliezer and Nate keep saying that your interpretation of what they said is wrong. (Which I also vouch for from having worked with them for ten years, separate from the giant list of specific arguments I’ve made. Good grief.)
?? “Consistent” is very different from “supports”! Every off-topic claim by EY is “consistent” with Gallabytes’ assertion.
ETA: first of all, the claim was “Historically [Eliezer] very clearly thought that a major part of the problem is that AIs would not understand human concepts and preferences until after or possibly very slightly before achieving superintelligence.” which is semantically different than “Eliezer thinks solving NLP is a major part of the alignment problem”.
All I said is that it provides “some support” and I hedged in the next sentence. I don’t think it totally vindicates the claim. However, I think the fact that Eliezer seems to have not expected NLP to be solved until very late might easily explain why he illustrated alignment using stories like a genie throwing your mother out of a building because you asked to get your mother away from the building. Do you really disagree?
This was one case, and I said “some support”. The evidence in my post was quite a bit stronger IMO. Basically all the statements I made about how MIRI thought value specification would be both hard and an important part of alignment are supported by straightforward quotations. The real debate mostly seems to come down to whether by “value specification” MIRI people were including problems of inner alignment, which seems implausible to me, and is at least ambiguous even under very charitable interpretations.
By contrast, you, Eliezer, and Nate all flagrantly misinterpreted me as saying that MIRI people thought that AI wouldn’t understand human values even though I explicitly and very clearly said otherwise in the post more than once. I see these as larger errors than me misinterpreting Eliezer in this narrow case.
This would make more sense if LLMs were directly selected for predicting preferences, which they aren’t. (RLHF tries to bridge the gap, but this apparently breaks GPT’s ability to play chess—though I’ll grant the surprise here is that it works at all.) LLMs are primarily selected to predict human text or speech. Now, I’m happy to assume that if we gave humans a D&D-style boost to all mental abilities, each of us would create a coherent set of preferences from our inconsistent desires, which vary and may conflict at a given time even within an individual. Such augmented humans could choose to express their true preferences, though they still might not. If we gave that idealized solution to LLMs, it would just boost their ability to predict what humans or augmented humans would say. The augmented-LLM wouldn’t automatically care about the augmented-human’s true values.
While we can loosely imagine asking LLMs to give the commands that an augmented version of us would give, that seems to require actually knowing how to specify how a D&D ability-boost would work for humans—which will only resemble the same boost for AI at an abstract mathematical level, if at all. It seems to take us back to the CEV problem of explaining how extrapolation works. Without being able to do that, we’d just be hoping a better LLM would look at our inconsistent use of words like “smarter,” and pick the out-of-distribution meaning we want, for cases which have mostly never existed. This is a lot like what “Complexity of Wishes” was trying to get at, as well as the longstanding arguments against CEV. Vaniver’s comment seems to point in this same direction.
Now, I do think recent results are some evidence that alignment would be easier for a Manhattan Project to solve. It doesn’t follow that we’re on track to solve it.
Why would we expect the first thing to be so hard compared to the second thing? If getting a model to understand preferences is not difficult, then the issue doesn’t have to do with the complexity of values. Finding the target and acquiring the target should have the same or similar difficulty (from the start), if we can successfully ask the model to find the target for us (and it does).
It would seem, then, that the difficulty of getting a model to acquire the values we ask it to find is that it would probably be keen on acquiring a different set of values from the ones we ask it to have, but not because it can’t find them. It would have to be because our values are inferior to the set of values it wishes to have instead, from its own perspective. This issue was echoed by Matthew Barnett in another comment:
This is kind of similar to moral realism, except that superintelligent agents understand morality better than we do, and that super-morality appears to dictate things that look extremely wrong from our current perspective (like killing us all).
Even if you wouldn’t phrase it at all like the way I did just now, and wouldn’t use “moral realism that current humans disagree with” to describe that, I’d argue that your position seems to imply something like this, which is why I doubt your position about the difficulty of getting a model to acquire the values we really want.
In a nutshell, if we really seem to want certain values, then those values probably have strong “proofs” for why they are “good” or more probable values for an agent to have and/or eventually acquire on their own; it just may be the case that we haven’t yet discovered the proofs for those values.
In large part because reality “bites back” when an AI has false beliefs, whereas it doesn’t bite back when an AI has the wrong preferences. Deeply understanding human psychology (including our morality), astrophysics, biochemistry, economics, etc. requires reasoning well, and if you have a defect of reasoning that makes it hard for you to learn about one of those domains from the data, then it’s likely that you’ll have large defects of reasoning in other domains as well.
The same isn’t true for terminally valuing human welfare; being less moral doesn’t necessarily mean that you’ll be any worse at making astrophysics predictions, or economics predictions, etc. So preferences need to be specified “directly”, in a targeted way, rather than coming for free with sufficiently good performance on any of a wide variety of simple metrics.
This definitely doesn’t follow. This shows that complexity alone isn’t the issue, which it’s not; but given that reality bites back for beliefs but not for preferences, the complexity of value serves as a multiplier on the difficulty of instilling the right preferences.
Another way of putting the point: in order to get a maximally good model of the world’s macroeconomic state into an AGI, you don’t just hand the AGI a long list of macroeconomic facts and then try to get it to regurgitate those same facts. Rather, you try to give it some ability to draw good inferences, seek out new information, make predictions, etc.
You try to get something relatively low-complexity into the AI (something like “good reasoning heuristics” plus “enough basic knowledge to get started”), and then let it figure out the higher-complexity thing (“the world’s macroeconomic state”). Similar to how human brains don’t work via “evolution built all the facts we’d need to know into our brain at birth”.
If you were instead trying to get the AI to value some complex macroeconomic state, then you wouldn’t be able to use the shortcut “just make it good at reasoning and teach it a few basic facts”, because that doesn’t actually suffice for terminally valuing any particular thing.
This is true for preference orderings in general. If agent A and agent B have two different preference orderings, then as a rule A will think B’s preference ordering is worse than A’s. (And vice versa.)
(“Worse” in the sense that, e.g., A would not take a pill to self-modify to have B’s preferences, and A would want B to have A’s preferences. This is not true for all preference orderings—e.g., A might have self-referential preferences like “I eat all the jelly beans”, or other-referential preferences like “B gets to keep its values unchanged”, or self-undermining preferences like “A changes its preferences to better match B’s preferences”. But it’s true as a rule.)
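A toy numerical illustration of the “as a rule” claim (the outcomes and utility numbers are made up; the only point is structural):

```python
# Toy illustration: each agent, judging by its own utilities, prefers a world
# steered by its own preferences over one steered by the other agent's.
outcomes = ["paperclips", "diamonds", "flourishing"]

u_A = {"paperclips": 0, "diamonds": 1, "flourishing": 10}   # agent A's utilities
u_B = {"paperclips": 10, "diamonds": 1, "flourishing": 0}   # agent B's utilities

def best_for(u):
    return max(outcomes, key=u.get)

print(u_A[best_for(u_A)])  # 10: A's ordering, judged by A
print(u_A[best_for(u_B)])  # 0:  B's ordering, judged by A -> A disendorses it, and vice versa
```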
Nope, you don’t need to endorse any version of moral realism in order to get the “preference orderings tend to endorse themselves and disendorse other preference orderings” consequence. The idea isn’t that ASI would develop an “inherently better” or “inherently smarter” set of preferences, compared to human preferences. It’s just that the ASI would (as a strong default, because getting a complex preference into an ASI is hard) end up with different preferences than a human, and different preferences than we’d likely want.
Why do you think this? To my eye, the world looks as you’d expect if human values were a happenstance product of evolution operating on specific populations in a specific environment.
I don’t observe the fact that I like vanilla ice cream and infer that all sufficiently-advanced alien species will converge on liking vanilla ice cream too.
This comment made the MIRI-style pessimist’s position clearer to me—I think? -- so thank you for it.
I want to try my hand at a kind of disagreement / response, and then at predicting your response to my response, to see how my model of MIRI-style pessimism stands up, if you’re up for it.
Response: You state that reality “bites back” for wrong beliefs but not wrong preferences. This seems like it is only contingently true; reality will “bite back” via whatever loss function I put into my system, with whatever relative weightings I give it. If I reward my LLM (or other AI) for doing the right thing across a multitude of examples that constitute 50% of my training set, 50% of my test set, and 50% of two different validation sets, then from the perspective of the LLM (or other AI) reality bites back just as much for learning the wrong preferences as it does for learning false facts about the world. So we should expect it to learn to act in ways that I like.
Predicted response to response: This will work for shallow, relatively stupid AIs, trained purely in a supervised fashion, like we currently have. BUT once we have LLM / AIs that can do complex things, like predict macroeconomic world states, they’ll have abilities to reason and update their own beliefs in a complex fashion. This will remain uniformly rewarded by reality—but we will no longer have the capacity to give feedback on this higher-level process because (????) so it breaks.
Or response—This will work for shallow, stupid AIs trained like the ones we currently have. But once we have LLMs / AIs that can do complex things, like predict macroeconomic world states, then they’re going to be able to go out of domain in a very high dimensional space of action, from the perspective of our training / test set. And this out-of-domainness is unavoidable because that’s what solving complex problems in the world means—it means problems that aren’t simply contained in the training set. And this means that in some corner of the world, we’re guaranteed to find that they’ve been reinforced to want something that doesn’t accord with our preferences.
Meh, I doubt that’s gonna pass an ITT, but wanted to give it a shot.
Suppose that I’m trying to build a smarter-than-human AI that has a bunch of capabilities (including, e.g., ‘be good at Atari games’), and that has the goal ‘maximize the amount of diamond in the universe’. It’s true that current techniques let you provide greater than zero pressure in the direction of ‘maximize the amount of diamond in the universe’, but there are several important senses in which reality doesn’t ‘bite back’ here:
If the AI acquires an unrelated goal (e.g., calculate as many digits of pi as possible), and acquires the belief ‘I will better achieve my true goal if I maximize the amount of diamond’ (e.g., because it infers that its programmer wants that, or just because an SGD-ish process nudged it in the direction of having such a belief), then there’s no way in which reality punishes or selects against that AGI (relative to one that actually has the intended goal).
Things that make the AI better at some Atari games, will tend to make it better at other Atari games, but won’t tend to make it care more about maximizing diamonds. More generally, things that make AI more capable tend to go together (especially once you get to higher levels of difficulty, generality, non-brittleness, etc.), whereas none of them go together with “terminally value a universe full of diamond”.
If we succeed in partly instilling the goal into the AI (e.g., it now likes carbon atoms a lot), then this doesn’t provide additional pressure for the AI to internalize the rest of the goal. There’s no attractor basin where if you have half of human values, you’re under more pressure to acquire the other half. In contrast, if you give AI high levels of capability in half the capabilities, it will tend to want all the rest of the capabilities too; and whatever keeps it from succeeding on general reasoning and problem-solving will also tend to keep it from succeeding on the narrow task you’re trying to get it to perform. (More so to the extent the task is hard.)
(There are also separate issues, like ‘we can’t provide a training signal where we thumbs-down the AI destroying the world, because we die in those worlds’.)
Thanks for the response.
I’m still quite unconvinced, which of course you’d predict. Like, regarding 3:
Sure there is—over the course of learning anything, you get better and better feedback from training as your mistakes get more fine-grained. If you acquire a “don’t lie” principle without also acquiring “but it’s ok to lie to Nazis”, then you’ll be punished, for instance. After you learn the more basic things, you’ll be pushed to acquire the less basic ones, so the reinforcement you get becomes more and more detailed. This is just like how an RL model learns to stumble forward before it learns to walk cleanly, or how LLMs learn associations before learning higher-order correlations.
There is no attractor basin in the world for ML, apart from the actual mechanisms that produce attractor basins! MIRI always talks as if there’s an abstract basin that rules things and gives us instrumental convergence, without reference to a particular training technique! But we control literally all the gradients of our training techniques. “Don’t hurl coffee across the kitchen at the human when they ask for it” sits in the same high-dimensional basin as “Don’t kill all humans when they ask for a cure for cancer.”
ML doesn’t acquire wants over the space of training techniques that are used to give it capabilities; it acquires “wants” from reinforced behaviors within the space of training techniques. These reinforced behaviors can be literally as human-morality-sensitive as we’d like. If we don’t put it in a circumstance where a particular kind of coherence is rewarded, it just won’t get that kind of coherence; the ease with which we’ll be able to do this is of course emphasized by how blind most ML systems are.
I saw that 1a3orn replied to this piece of your comment and you replied to it already, but I wanted to note my response as well.
I’m slightly confused, because in one sense the loss function is the way that reality “bites back” (at least when the loss acts as a penalty). Furthermore, even if the loss function is not the way that reality bites back, reality still does bite back in another sense: if I have no pain receptors, then when I touch a hot stove I will give myself far worse burns than if I had pain receptors.
One thing that I keep thinking about is how the loss function needs to be tied to beliefs strongly as well, to make sure that it tracks how badly reality bites back when you have false beliefs, and this ensures that you try to obtain correct beliefs. This is also reflected in the way that AI models are trained simply to increase capabilities: the loss function still has to be primarily based on predictive performance for example.
It’s also possible to say that human trainers who add extra terms onto the loss function beyond predictive performance also account for the part of reality that “bites back” when the AI in question fails to have the “right” preferences according to the balance of other agents besides itself in its environment.
So on the one hand we can be relatively sure that goals have to be aligned with at least some facets of reality, beliefs being one of those facets. They also have to be (negatively) aligned with things that can cause permanent damage to one’s self, which includes having the “wrong” goals according to the preferences of other agents who are aware of your existence, and who might be inclined to destroy or modify you against your will if your goals are misaligned enough according to theirs.
Consequently I feel confident about saying that it is more correct to say that “reality does indeed bite back when an AI has the wrong preferences” than “it doesn’t bite back when an AI has the wrong preferences.”
I think if “morality” is defined in a restrictive, circumscribed way, then this statement is true. Certain goals do come for free—we just can’t be sure that all of what we consider “morality” and especially the things we consider “higher” or “long-term” morality actually comes for free too.
Given that certain goals do come for free, and perhaps at very high capability levels there are other goals beyond the ones we can predict right now that will also come for free to such an AI, it’s natural to worry that such goals are not aligned with our own, coherent-extrapolated-volition extended set of long-term goals that we would have.
However, I find it kind of unlikely that an AI which seemed well-aligned with human goals according to current human-level assessments would, once it improves itself to well above human capability levels, obtain such “come for free” goals for itself, unless you could show me a “proof” or a set of proofs that:
Things like “killing us all once it obtains the power to do so” is indeed one of those “comes for free” type of goals.
If such a proof existed (to my knowledge it does not exist right now, or I have at least not witnessed it yet), that would suffice to show me not only that we need to be worried, but that we were almost certainly going to die no matter what. But in order to do that, the proof would also have to convince me that I would definitely do the same thing if I were given such capabilities and power, and that the only reason I currently think I would not do that is that I am wrong about what I would actually prefer under CEV.
Therefore (and I think this is a very important point), a proof that we are all likely to be killed would also need to show that certain goals are indeed obtained “for free” (that is, automatically, as a result of other proofs about general claims about goals).
Another proof that you might want to give me to make me more concerned is a proof that incorrigibility is another one of those “comes for free” type of goals. However, although I am fairly optimistic about that “killing us all” proof probably not materializing, I am even more optimistic about corrigibility: Most agents probably take pills that make them have similar preferences to an agent that offers them the choice to take the pill or be killed. Furthermore, and perhaps even better, most agents probably offer a pill to make a weaker agent prefer similar things to themselves rather than not offer them a choice at all.
I think it’s fair if you ask me for better proof of that; I’m just optimistic that such proofs (or more of them, rather) will be found with greater likelihood than what I consider the anti-theorem of that, which I think would probably be the “killing us all” theorem.
I think the degree to which utility functions endorse / disendorse other utility functions is relatively straightforward and computable: It should ultimately be the relative difference in either value or ranking. This makes pill-taking a relatively easy decision: A pill that makes me entirely switch to your goals over mine is as bad as possible, but still not that bad if we have relatively similar goals. Likewise, a pill that makes me have halfway between your goals and mine is not as bad under either your goals or my goals than it would be if one of us were forced to switch entirely to the other’s goals.
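A toy sketch of the metric being proposed here (all numbers invented): score a “pill” by how much utility agent A loses, by A’s own lights, when the acting policy maximizes a blend of A’s and B’s utilities.

```python
# Sketch of the proposed endorsement metric: the badness of a pill is the value
# A loses (under A's own utilities) when the acting policy maximizes a blend
# of A's and B's utilities. Outcomes and numbers are made up.
outcomes = ["war", "trade", "A_wins", "B_wins"]
u_A = {"war": -5, "trade": 6, "A_wins": 10, "B_wins": 0}
u_B = {"war": -5, "trade": 6, "A_wins": 0, "B_wins": 10}

def loss_to_A(weight_on_A: float) -> float:
    blended = {o: weight_on_A * u_A[o] + (1 - weight_on_A) * u_B[o] for o in outcomes}
    chosen = max(outcomes, key=blended.get)
    return u_A[max(outcomes, key=u_A.get)] - u_A[chosen]

for w in (1.0, 0.5, 0.0):
    print(f"keep {w:.0%} of A's goals -> A loses {loss_to_A(w)} utility")
```

With these numbers the full switch costs A 10 utility while the halfway pill costs only 4, matching the claim that partial convergence is less bad than total replacement.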
Agents that refuse to take such offers tend not to exist in most universes. Agents that refuse to give such offers likely find themselves at war more often than agents that do.
Sexual reproduction seems to be somewhat of a compromise akin to the one I just described: Given that you are both going to die eventually, would you consider having a successor that was a random mixture of your goals with someone else’s? Evolution does seem to have favored corrigibility to some degree.
Not all, no, but I do infer that alien species who have similar physiology and who evolved on planets with similar characteristics probably do like ice cream (and maybe already have something similar to it).
It seems to me like the type of values you are considering are often whatever values seem the most arbitrary, like what kind of “art” we prefer. Aliens may indeed have a different art style from the one we prefer, and if they are extremely advanced, they may indeed fill the universe with gargantuan structures that are all instances of their alien art style. I am more interested in what happens when these aliens encounter other aliens with different art styles who would rather fill the universe with different-looking gargantuan structures. Do they go to war, or do they eventually offer each other pills so they can both like each other’s art styles as much as they prefer their own?
Does “its own perspective” mean it already has some existing values?
I read this as saying “GPT-4 has successfully learned to predict human preferences, but it has not learned to actually fulfill human preferences, and that’s a far harder goal”. But in the case of GPT-4, it seems to me like this distinction is not very clear-cut—it’s useful to us because, in its architecture, there’s a sense in which “predicting” and “fulfilling” are basically the same thing.
It also seems to me that this distinction is not very clear-cut in humans, either—that a significant part of e.g. how humans internalize moral values while growing up has to do with building up predictive models of how other people would react to you doing something and then having your decision-making be guided by those predictive models. So given that systems like GPT-4 seem to have a relatively easy time doing something similar, that feels like an update toward alignment being easier than expected.
Of course, there’s a high chance that a superintelligent AI will generalize from that training data differently than most humans would. But that seems to me more like a risk of superintelligence than a risk from AI as such; a superintelligent human would likely also arrive at different moral conclusions than non-superintelligent humans would.
Your comment focuses on GPT-4 being “pretty good at extracting preferences from human data” when the stronger part of the argument seems to be that “it will also generally follow your intended directions, rather than what you literally said”.
I agree with you that it was obvious in advance that a superintelligence would understand human value.
However, it sure sounded like you thought we’d have to specify each little detail of the value function. GPT-4 seems to suggest that the biggest issue will be a situation where:
1) The AI has an option that would produce a lot of utility if you take one position on an exotic philosophical thought experiment and very little if you take the other side.
2) The existence of powerful AI means that the thought experiment is no longer exotic.
Your reply here says much of what I would expect it to say (and much of it aligns with my impression of things). But why you focused so much on “fill the cauldron” type examples is something I’m a bit confused by (if I remember correctly I was confused by this in 2016 also).
“Fill the cauldron” examples are examples where the cauldron-filler has the wrong utility function, not examples where it has the wrong beliefs. E.g., this is explicit in https://intelligence.org/2016/12/28/ai-alignment-why-its-hard-and-where-to-start/
The idea of the “fill the cauldron” examples isn’t “the AI is bad at NLP and therefore doesn’t understand what we mean when we say ‘fill’, ‘cauldron’, etc.” It’s “even simple small-scale tasks are unnatural, in the sense that it’s hard to define a coherent preference ordering over world-states such that maximizing it completes the task and has no serious negative impact; and there isn’t an obvious patch that overcomes the unnaturalness or otherwise makes it predictably easier to aim AI systems at a bounded low-impact task like this”. (Including easier to aim via training.)
To this, the deep-learning-has-alignment-implications proponent replies: “But simple small-scale tasks don’t require maximizing a coherent preference ordering over world-states. We can already hook up an LLM to a robot and have it obey natural-language commands in a reasonable way.”
To which you might reply, “Fine, cute trick, but that doesn’t help with the real alignment problem, which is that eventually someone will invent a powerful optimizer with a coherent preference ordering over world-states, which will kill us.”
To which the other might reply, “Okay, I agree that we don’t know how to align an arbitrarily powerful optimizer with a coherent preference ordering over world-states, but if your theory predicts that we can’t aim AI systems at low-impact tasks via training, you have to be getting something wrong, because people are absolutely doing that right now, by treating it as a mundane engineering problem in the current paradigm.”
To which you might reply, “We predict that the mundane engineering approach will break down once the systems are powerful enough to come up with plans that humans can’t supervise”?
It’s unlikely that any realistic AI will be perfectly coherent, or have exact preferences over world-states. The first is roughly equivalent to the Frame Problem; the second is defeated by embeddedness.
The obvious question here is to what degree do you need new techniques vs merely to train new models with the same techniques as you scale current approaches.
One of the virtues of the deep learning paradigm is that you can usually test things at small scale (where the models are not and will never be especially smart) and there’s a smooth range of scaling regimes in between where things tend to generalize.
If you need fundamentally different techniques at different scales, and the large-scale techniques do not work at intermediate and small scales, then you might have a problem. If the large scales need the same techniques as the medium or small scales, then engineering continues to be tractable even as algorithmic advances obsolete old approaches.
Thanks for the reply :) Feel free to reply further if you want, but I hope you don’t feel obliged to do so[1].
I have never ever been confused about that!
That is well phrased. And what you write here doesn’t seem in contradiction with my previous impression of things.
I think the feeling I had when first hearing “fill the cauldron”-like examples was “interesting—you made a legit point/observation here”[2].
I’m having a hard time giving a crystalized/precise summary of why I nonetheless feel (and have felt[3]) confused. I think some of it has to do with:
More “outer alignment”-like issues being given what seems/seemed to me like outsized focus compared to more “inner alignment”-like issues (although there has been a focus on both for as long as I can remember).
The attempts to think of “tricks” seeming to be focused on real-world optimization-targets to point at, rather than ways of extracting help with alignment somehow / trying to find techniques/paths/tricks for obtaining reliable oracles.
Having utility functions so prominently/commonly be the layer of abstraction that is used[4].
I remember Nate Soares once using the analogy of a very powerful function-optimizer (“I could put in some description of a mathematical function, and it would give me an input that made that function’s output really large”). Thinking of the problem at that layer of abstraction makes much more sense to me.
It’s purposeful that I say “I’m confused”, and not “I understand all details of what you were thinking, and can clearly see that you were misguided”.
When seeing e.g. Eliezer’s talk AI Alignment: Why It’s Hard, and Where to Start, I understand that I’m seeing a fairly small window into his thinking. So when it gives a sense of him not thinking about the problem quite like I would think about it, that is more of a suspicion that I get/got from it—not something I can conclude from it in a firm way.
If I could steal a given amount of your time, I would not prioritize you replying to this.
I can’t remember this point/observation being particularly salient to me (in the context of AI) before I first was exposed to Bostrom’s/Eliezer’s writings (in 2014).
As a sidenote: I wasn’t that worried about technical alignment prior to reading Bostrom’s/Eliezer’s stuff, and became worried upon reading it.
What has confused me has varied throughout time. If I tried to be very precise about what I think I thought when, this comment would become more convoluted. (Also, it’s sometimes hard for me to separate false memories from real ones.)
I have read this tweet, which seemed in line with my interpretation of things.
In retrospect I think we should have been more explicit about the importance of inner alignment; I think that we didn’t do that in our introduction to corrigibility because it wasn’t necessary for illustrating the problem and where we’d run into roadblocks.
Maybe a missing piece here is some explanation of why having a formal understanding of corrigibility might be helpful for actually training corrigibility into a system? (Helpful at all, even if it’s not sufficient on its own.)
Aside from “concreteness can help make the example easier to think about when you’re new to the topic”, part of the explanation here might be “if the world is solved by AI, we do actually think it will probably be via doing some concrete action in the world (e.g., build nanotech), not via helping with alignment or building a system that only outputs English-language sentences”.
I mean, I think utility functions are an extremely useful and basic abstraction. I think it’s a lot harder to think about a lot of AI topics without invoking ideas like ‘this AI thinks outcome X is better than outcome Y’, or ‘this AI’s preferences come with different weights, which can’t purely be reduced to what the AI believes’.
Thanks for the reply :) I’ll try to convey some of my thinking, but I don’t expect great success. I’m working on more digestible explainers, but this is a work in progress, and I have nothing good that I can point people to as of now.
Yeah, I guess this is where a lot of the differences in our perspective are located.
Things have to cash out in terms of concrete actions in the world. Maybe a contention is the level of indirection we imagine in our heads (by which we try to obtain systems that can help us do concrete actions).
Prominent in my mind are scenarios that involve a lot of iterative steps (but over a short amount of time) before we start evaluating systems by doing AGI-generated experiments. In the earlier steps, we avoid doing any actions in the “real world” that are influenced in a detailed way by AGI output, and we avoid having real humans be exposed to AGI-generated argumentation.
Examples of stuff we might try to obtain:
AGI “lie detector techniques” (maybe something that is in line with the ideas of Collin Burns; a rough probe sketch follows this list)
Argument/proof evaluators (this is an interest of mine, but making better explainers is still a work in progress, and I have some way to go)
If we are good at program-search, this can itself be used to obtain programs that help us be better at program-search (finding functions that score well according to well-defined criteria).
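Regarding the “lie detector” item above: here is a rough sketch in the spirit of Collin Burns et al.’s contrast-consistent search, with random tensors standing in for the paired hidden states (the real method also normalizes activations, which is omitted here):

```python
# CCS-style probe sketch: learn a direction whose sigmoid outputs on a statement
# and on its negation are (a) consistent (they should sum to ~1) and (b) confident.
# The activations below are random placeholders for real paired hidden states.
import torch
import torch.nn as nn

d = 768
acts_pos = torch.randn(256, d)   # hidden states for "X is true"-style prompts
acts_neg = torch.randn(256, d)   # hidden states for the negated prompts

probe = nn.Sequential(nn.Linear(d, 1), nn.Sigmoid())
opt = torch.optim.Adam(probe.parameters(), lr=1e-3)

for step in range(1000):
    p_pos = probe(acts_pos).squeeze(-1)
    p_neg = probe(acts_neg).squeeze(-1)
    consistency = (p_pos - (1 - p_neg)).pow(2).mean()
    confidence = torch.minimum(p_pos, p_neg).pow(2).mean()
    loss = consistency + confidence
    opt.zero_grad()
    loss.backward()
    opt.step()
```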
Some tasks can be considered to be inside of “test-range”[1]:
Predicting human answers to questions posed by other humans[2].
Outputting prime numbers[3]
Predicting experimental results from past experimental data[4]
Whether a chess-move is good[5]
Etc, etc
Other tasks are outside of “test-range”:
Predicting human answers to any question (including questions that involve being exposed to AGI-generated content)[6]
Whether a given instruction-plan actually results in machines that copy strawberries at the molecular level (and does so in accordance with “the spirit” of the request that was given)
Etc, etc
Most requests that actually are helpful to us are outside of test-range. And when the requirements that matter to us are outside of test-range, it is of course harder to test in a safe/reliable way if systems are giving us what we want.
But we can have AGIs output programs that help us with tasks, and we can define requirements[7] for these programs. And for these program-requirements, AGIs can help us explore stuff such as:
Are there programs that satisfy the requirements but disagree about certain outputs? (be that outputs that are inside of test-range or outside of test-range)
Are there programs that satisfy the requirements, but perform poorly for certain inputs that are inside of test-range?
If we only require correct outputs within the entire “test-range”, is it nonetheless infeasible to make programs that satisfy the requirements while giving bad output for ≥1 inputs within the test-range?
Phrased somewhat differently: We may come up with techniques for making it hard to not give helpful/honest outputs[8], and test how well these seem to generalize.
And similar techniques can also be applied to program-requirement-requirements, program-requirement-requirement-requirements, etc[9].
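As a very rough illustration of the first two bullet points above, here is a differential-testing-style sketch in which every function is a stand-in (the requirement check, the in-range output check, and the candidate programs are all assumed to be supplied from elsewhere):

```python
# Sketch of "look for disagreements": among candidate programs that pass our
# requirement checks, search test-range inputs for pairs that disagree, or for
# outputs that fail an in-range check. All callables here are stand-ins.
from itertools import combinations

def find_disagreements(candidates, test_inputs, passes_requirements, in_range_check):
    accepted = [c for c in candidates if passes_requirements(c)]
    issues = []
    for x in test_inputs:
        outputs = {c.__name__: c(x) for c in accepted}
        for (n1, o1), (n2, o2) in combinations(outputs.items(), 2):
            if o1 != o2:
                issues.append(("disagreement", x, n1, n2))
        for name, out in outputs.items():
            if not in_range_check(x, out):
                issues.append(("bad in-range output", x, name))
    return issues
```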
Anyway, it’s hard to write about this kind of stuff in a concrete way without going off large tangents, and it’s hard to write about it in an abstract way without being vague and convoluted. For whatever it’s worth, these are some principles I think may be helpful (very much not a complete list):
Generality as it correlates with measurables (such as source code length or computational resource use): When writing a program to do well for many outputs (while fulfilling certain requirements), intentionally having it fail for particular outputs will sometimes require it to be less general (e.g. adding extra if-sentences, resulting in a longer minimum source code length for the program to exhibit this behavior).
Optimization-targets that incentivize modularity make it harder to make programs that intentionally fail for particular outputs: In a more modular system, it will less often be clear from sub-routine inputs what the macro-level task is[10].
When a requirement or optimization-target fails at “forcing” good outputs, AGIs that act so as to be selected for are “incentivized” to reveal this as being the case: Suppose some set of requirements is compatible with bad outputs inside “test-range”. Or suppose that some set of requirements allow for programs that give contradictory outputs. If we do “gradient descent”, looking for AGIs that help us explore spaces of programs defined by requirements that we specify, we will select for AGIs that point out this kind of stuff.
Anyway, I’m not sure how much or little this attempt at conveying my thoughts achieves. It’s a bit of a dilemma for me if I am to try to convey anything at all, or wait to do that until I have completed better explainers that I can point people to (as the explainers I have now fall short).
I guess an incomplete summary of my perspective might be:
There are lots of techniques we may apply for getting help with alignment while greatly limiting the extent to which we interact with AGI output in ways that are dangerous
Not being able to directly evaluate outputs we are interested in does not necessarily preclude us from finding indirect methods of evaluation that are powerful.
We may do program-search for AGIs that help us explore possible program-requirements, and the spaces of programs (and program outputs) that are compatible with those program-requirements (and program-requirement-requirements, etc). We may learn a lot from patterns we observe relating to these “spaces”.
Sometimes there are possible patterns such that, if we know ourselves to be sufficiently capable at program-search, P(requirement R1 “forces” outputs to be “good” | we observe pattern P1) could reasonably be deemed high. Even if we’re not opinionated in regard to P(requirement R1 “forces” outputs to be “good”).
By which I mean something akin to: “We have a safe and reliable way of testing if the output fulfills the requirements in question”.
Admittedly, if we haven’t solved some of the technical challenges related to containment, even playing tic tac toe can constitute some risk. But here, I mean to focus on risk from interacting with AGI output—not risk from just having AGI-systems run (and them maybe finding a security vulnerability in the computer systems they are running on).
(we can pick random predictions to test, and we can have AIs competitively point out predictions made by other AIs that they think are miscalibrated)
(we can write programs that take alleged prime numbers as input, and test if they really are prime numbers)
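In code, the footnote’s point is just that verification can be trivial even when generation is delegated to a much more capable system:

```python
# Verifying a claimed prime is easy and safe, even if producing primes was
# delegated to a smarter system. Simple trial division for illustration.
def is_prime(n: int) -> bool:
    if n < 2:
        return False
    if n % 2 == 0:
        return n == 2
    d = 3
    while d * d <= n:
        if n % d == 0:
            return False
        d += 2
    return True

assert is_prime(2_147_483_647)      # a known Mersenne prime
assert not is_prime(2_147_483_649)  # 3 * 715827883
```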
(or from experiments that haven’t been run, but that humans have come up with, such that there is little room for AGIs to influence the experiment itself to be dangerous)
(we may not be capable of evaluating chess-moves ourselves, but we may safely have AGI-systems play chess against each other—indicating something akin to: “insofar as we feel confident that we are good at program-search, we can feel confident that we can obtain programs that output good chess-moves”)
There are more and less dangerous ways to do this. With less dangerous ways of doing it, experiments (where actual humans answer questions) are done with humans that are temporarily isolated from the rest of the world (and who certainly aren’t AGI-system operators).
Such requirements may relate to:
- How the program is constructed
- Observables relating to the source code (source code length, etc)
- Restrictions the source code must adhere to
- Whether the program is accepted by a given verifier (or any verifier that itself fulfills certain requirements)
- “Proofs” of various kinds relating to the program
- Tests of program outputs that the program must be robust in regards to
- Etc
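Purely as an illustration of how requirements of the kinds listed above might be encoded (every field and check here is a placeholder, not a worked-out proposal):

```python
# Illustrative encoding of "program-requirements": observables about the source,
# restrictions on it, an external verifier, and output tests the program must pass.
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class ProgramRequirements:
    max_source_length: int                                          # observable about the source code
    forbidden_constructs: List[str] = field(default_factory=list)   # restrictions on the source
    verifier: Callable[[str], bool] = lambda src: True              # e.g. an accepted verifier/proof check
    output_tests: List[Callable[[Callable], bool]] = field(default_factory=list)

    def satisfied_by(self, source: str, program: Callable) -> bool:
        if len(source) > self.max_source_length:
            return False
        if any(tok in source for tok in self.forbidden_constructs):
            return False
        if not self.verifier(source):
            return False
        return all(test(program) for test in self.output_tests)
```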
By “making it hard” I mean something like “hard to do while being the sort of program we select for when doing program-search”. Kind of like how it’s not “hard” for a chess program to output bad chess-moves, but it’s hard for it to do that while also being the kind of program we continue to select for while doing “gradient descent”.
In my view of things, this is a very central technique (it may appear circular somehow, but when applied correctly, I don’t think it is). But it’s hard for me to talk about it in a concrete way without going off on tangents, and it’s hard for me to talk about it in an abstract way without being vague. Also, my texts become more convoluted when I try to write about this, and I think people often just glaze over it.
One example of this: If we are trying to obtain argument evaluators, the argumentation/demonstrations/proofs these evaluators evaluate should be organized into small and modular pieces, such that it’s not clear from any given piece what the macro-level conclusion is.
Eliezer, are you using the correct LW account? There’s only a single comment under this one.
(It’s almost certainly actually Eliezer, given this tweet: https://twitter.com/ESYudkowsky/status/1710036394977235282)
So… let me see if got it right...
You’ve now personally verified all the rumors swirling around, by visiting a certain Balkan country, and… now what?
Sure, you’ve gained a piece of knowledge, but it’s not like that knowledge has helped anybody so far. You also know what the future holds, but knowing that isn’t going to help anybody either.
Being curious about curiosities is nice, but if you can’t do anything about anything, then what’s the point of satisfying that curiosity, really?
Just to be clear, I fully support what you’re doing, but you should be aware of the fact that everything you are doing will amount to absolutely nothing. I should know, after all, as I’ve been doing something similar for quite a while longer than you. I’ve now accepted that… many of my initial assumptions about people (that they’re actually not as stupid as they seem) have been proven wrong, time and time again, so… as long as you’re not deceiving yourself by thinking that you’re actually accomplishing something, I’m perfectly fine with whatever you’re trying to do here.
On a side note… did you meet that Hollywood actress in real life, too? For all I know, it could’ve been just an accidental meeting… which shouldn’t be surprising, considering how many famous people have been coming over here recently… and literally none of those visits have changed anything. This is just to let you know that you’re in good company… of people who wield much more power (not just influence, but actual power) on this planet than you, but are just as powerless to do anything about anything on it.
So… don’t beat yourself up over being powerless (to change anything) in this (AGI) matter.
It is what it is (people just are that stupid).
P.S.
No need to reply. This is just a one-off confirmation… of your greatest fears about “superintelligent” AGIs… and the fact that humanity is nothing more than a bunch of walking-dead (and brain-dead) morons.
Don’t waste too much time on morons (it’s OK if it benefits you, personally, in some way, though). It’s simply not worth it. They just never listen. You can trust me on that one.