Arguments for optimism on AI Alignment (I don’t endorse this version, will reupload a new version soon.)
Or, why we probably don’t need to worry about AI.
This post is partly a response to Amalthea’s comment pointing out that I had simply claimed my side was right; at the time, I responded that I was going for a short comment rather than writing yet another very long one on the issue.
In this post, I won’t just claim that my side is right; instead, I’ll give the evidence, so that I can properly collect my thoughts here. This will be a link-heavy post, and I’ll reference a lot of concepts and conversations, so some light background on these ideas will help, but I will try to make everything intelligible to the lay/non-technical reader.
This will be a long post, so get a drink and a snack.
The Sharp Left Turn probably won’t happen, because AI training is very different from evolution
Nate Soares suggests that a critical problem in AI safety is the sharp left turn: essentially, that capabilities generalize much further than goals do, i.e., it is basically goal misgeneralization plus fast takeoff:
My guess for how AI progress goes is that at some point, some team gets an AI that starts generalizing sufficiently well, sufficiently far outside of its training distribution, that it can gain mastery of fields like physics, bioengineering, and psychology, to a high enough degree that it more-or-less singlehandedly threatens the entire world. Probably without needing explicit training for its most skilled feats, any more than humans needed many generations of killing off the least-successful rocket engineers to refine our brains towards rocket-engineering before humanity managed to achieve a moon landing.
And in the same stroke that its capabilities leap forward, its alignment properties are revealed to be shallow, and to fail to generalize. The central analogy here is that optimizing apes for inclusive genetic fitness (IGF) doesn’t make the resulting humans optimize mentally for IGF. Like, sure, the apes are eating because they have a hunger instinct and having sex because it feels good—but it’s not like they could be eating/fornicating due to explicit reasoning about how those activities lead to more IGF. They can’t yet perform the sort of abstract reasoning that would correctly justify those actions in terms of IGF. And then, when they start to generalize well in the way of humans, they predictably don’t suddenly start eating/fornicating because of abstract reasoning about IGF, even though they now could. Instead, they invent condoms, and fight you if you try to remove their enjoyment of good food (telling them to just calculate IGF manually). The alignment properties you lauded before the capabilities started to generalize, predictably fail to generalize with the capabilities.
So essentially, the analogy is that the AI is aligned on its training data, but, due to the limitations of the alignment method, that alignment fails to generalize to the test set.
Here’s the problem: We actually know why the sharp left turn happened, and the circumstances that led to the sharp left turn in humans won’t reappear in AI training and AI progress.
Basically, the sharp left turn happened because the outer optimizer, evolution, was billions of times less powerful than the inner search process, human lifetime learning, and the inner learners (us humans) die after basically a single step, or at best 2–3 steps, of the outer optimizer. Evolution mostly can’t transmit as many bits from one generation to the next via its tools, compared to cultural transmission, and the difference between their abilities to transmit bits over a given time-scale is massive.
Once we had the ability to transmit some information via culture, our capacity to optimize billions of times more efficiently meant we could essentially undergo a sharp left turn where capabilities spiked. But the only reason this happened is, to quote Quintin Pope:
Once the inner learning processes become capable enough to pass their knowledge along to their successors, you get what looks like a sharp left turn. But that sharp left turn only happens because the inner learners have found a kludgy workaround past the crippling flaw where they all get deleted shortly after initialization.
This dynamic does not exist for AIs trained with SGD: there is a much smaller gap between the outer optimizer, SGD, and the inner optimizer, with the difference being roughly 0–40x.
Here’s the source for it below, and I’ll explicitly quote it:
See also: Model Agnostic Meta Learning proposed a bi-level optimization process that used between 10 and 40 times more compute in the inner loop, only for Rapid Learning or Feature Reuse? to show they could get about the same performance while removing almost all the compute from the inner loop, or even by getting rid of the inner loop entirely.
Also, we can set the ratio of outer to inner optimization steps to basically whatever we want, which means we can control the inner learner’s rate of learning far better than evolution could, and thus prevent a sharp left turn from happening.
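To make the "the ratio is a dial we control" point concrete, here is a toy bi-level optimization sketch in the spirit of MAML: an outer loop that tunes an initialization, and an inner learner that takes a configurable number of gradient steps from it. Every function and number here is my own illustration, not taken from any of the cited papers.

```python
def inner_learn(theta, inner_steps, lr=0.1, target=3.0):
    """Inner learner: a few gradient steps on f(x) = (x - target)^2."""
    x = theta
    for _ in range(inner_steps):
        x -= lr * 2.0 * (x - target)
    return x

def outer_optimize(outer_steps, inner_steps, outer_lr=0.5, target=3.0):
    """Outer optimizer: nudges the initialization so the inner learner
    ends up with low loss (finite-difference 'meta-gradient')."""
    theta, eps = 0.0, 1e-4
    for _ in range(outer_steps):
        loss = (inner_learn(theta, inner_steps) - target) ** 2
        loss_eps = (inner_learn(theta + eps, inner_steps) - target) ** 2
        theta -= outer_lr * (loss_eps - loss) / eps
    return theta

# The inner/outer step ratio is an explicit hyperparameter we choose,
# unlike evolution, which was stuck with its ratio:
init = outer_optimize(outer_steps=50, inner_steps=5)
print(round(inner_learn(init, inner_steps=5), 2))  # → 3.0
```

The point of the sketch is only that `inner_steps` and `outer_steps` are ordinary arguments: nothing forces the inner process to be billions of times more powerful than the outer one.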
A crux I have with Jan Kulveit is that, to the extent animals do have culture, it is much more limited than human culture; that evolution largely has little ability to pass on learned traits non-culturally; and, very critically, that this is a one-time inefficiency: there is no reason to assume a second source of massive inefficiency leading to a sharp left turn.
X4vier’s comments in particular illustrate this, and I’ll show them below:
https://www.lesswrong.com/posts/hvz9qjWyv8cLX9JJR/?commentId=qYFkt2JRv3WzAXsHL
https://www.lesswrong.com/posts/hvz9qjWyv8cLX9JJR/?commentId=vETS4TqDPMqZD2LAN
I don’t believe that Nate’s example actually shows the misgeneralization we’re concerned about
This is because the alleged misgeneralization was not a situation where one AI was trained in an environment where it maximized the correlates of IGF, and then, in a new environment, encountered inputs that shifted its goals so that it misgeneralized and stopped pursuing IGF.
What actually happened is that evolution trained humans in one environment to optimize the correlates of IGF, then basically trained new humans in another environment, and the two diverged.
Very critically, there were thousands of different systems/humans being trained in drastically different environments, not one AI being trained across different environments as in modern AI training, so it’s not a valid example of misgeneralization.
Some posts and quotes from Quintin Pope will help:
(Part 2, how this matters for analogies from evolution) Many of the most fundamental questions of alignment are about how AIs will generalize from their training data. E.g., “If we train the AI to act nicely in situations where we can provide oversight, will it continue to act nicely in situations where we can’t provide oversight?”
When people try to use human evolutionary history to make predictions about AI generalizations, they often make arguments like “In the ancestral environment, evolution trained humans to do X, but in the modern environment, they do Y instead.” Then they try to infer something about AI generalizations by pointing to how X and Y differ.
However, such arguments make a critical misstep: evolution optimizes over the human genome, which is the top level of the human learning process. Evolution applies very little direct optimization power to the middle level. E.g., evolution does not transfer the skills, knowledge, values, or behaviors learned by one generation to their descendants. The descendants must re-learn those things from information present in the environment (which may include demonstrations and instructions from the previous generation).
This distinction matters because the entire point of a learning system being trained on environmental data is to insert useful information and behavioral patterns into the middle level stuff. But this (mostly) doesn’t happen with evolution, so the transition from ancestral environment to modern environment is not an example of a learning system generalizing from its training data. It’s not an example of:
We trained the system in environment A. Then, the trained system processed a different distribution of inputs from environment B, and now the system behaves differently.
It’s an example of:
We trained a system in environment A. Then, we trained a fresh version of the same system on a different distribution of inputs from environment B, and now the two different systems behave differently.
These are completely different kinds of transitions, and trying to reason from an instance of the second kind of transition (humans in ancestral versus modern environments), to an instance of the first kind of transition (future AIs in training versus deployment), will very easily lead you astray.
Two different learning systems, trained on data from two different distributions, will usually have greater divergence between their behaviors, as compared to a single system which is being evaluated on the data from the two different distributions. Treating our evolutionary history like humanity’s “training” will thus lead to overly pessimistic expectations regarding the stability and predictability of an AI’s generalizations from its training data.
Drawing correct lessons about AI from human evolutionary history requires tracking how evolution influenced the different levels of the human learning process. I generally find that such corrected evolutionary analogies carry implications that are far less interesting or concerning than their uncorrected counterparts. E.g., here are two ways of thinking about how humans came to like ice cream:
If we assume that humans were “trained” in the ancestral environment to pursue gazelle meat and such, and then “deployed” into the modern environment where we pursued ice cream instead, then that’s an example where behavior in training completely fails to predict behavior in deployment.
If there are actually two different sets of training “runs”, one set trained in the ancestral environment where the humans were rewarded for pursuing gazelles, and one set trained in the modern environment where the humans were rewarded for pursuing ice cream, then the fact that humans from the latter set tend to like ice cream is no surprise at all.
In particular, this outcome doesn’t tell us anything new or concerning from an alignment perspective. The only lesson applicable to a single training process is the fact that, if you reward a learner for doing something, they’ll tend to do similar stuff in the future, which is pretty much the common understanding of what rewards do.
A comment by Quintin on why humans didn’t actually misgeneralize to liking ice cream:
https://www.lesswrong.com/posts/wAczufCpMdaamF9fy/?commentId=sYA9PLztwiTWY939B
AIs are white boxes, and we are the innate reward system
Edit, prompted by comments from Steven Byrnes: the white-box definition I’m using in this post does not correspond to the intuitive definition of a white box; it instead refers to the computer-analysis/security sense of the term.
These links will be the definitions of white box AI going forward for this post:
https://forum.effectivealtruism.org/posts/JYEAL8g7ArqGoTaX6/?commentId=CLi5eBchYfXKZvXuD
The above arguments, that the sharp left turn probably won’t reappear in modern AI development and that humans didn’t actually misgeneralize, are enough to move us away from the most doomy views, like Eliezer Yudkowsky’s; in particular, removing the reasons to expect extreme misgeneralization lands us outside MIRI-sphere views, and arguably outside a 50% p(doom). But I want to argue that the chance of doom is far lower than that, so low that we mostly shouldn’t be concerned about AI. That means I have to provide a positive story for why AIs are very likely to be aligned, and I argue that, in this context, AIs are white boxes and we are the innate reward system.
The key advantage we have over evolution is that, unlike with brains, we have full read-write access to an AI’s internals: an AI is essentially a special type of computer program, and we already have ways to manipulate computer programs at essentially no cost to us. Indeed, this is why SGD and backpropagation work at all to optimize neural networks; if the AI were a black box, SGD and backpropagation wouldn’t work.
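As a deliberately tiny illustration of "white box" in this sense, here is a hypothetical two-parameter model where every weight can be read, overwritten, and differentiated exactly; the model, numbers, and helper functions are my own toy example, not any real architecture.

```python
def model(w, x):
    return w[0] * x + w[1]            # tiny linear "network"

w = [0.5, -1.0]                       # full read access to every parameter
w[1] = 0.0                            # full write access: edit a weight directly

def loss(w, x, y):
    return (model(w, x) - y) ** 2

def grad(w, x, y):
    """Exact per-parameter sensitivities, which is what backprop gives us."""
    err = model(w, x) - y
    return [2 * err * x, 2 * err]

print(grad(w, 2.0, 3.0))              # → [-8.0, -4.0]
```

Nothing here is hidden from us: we see, and can change, every number the model runs on, which is exactly the observability that SGD exploits.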
The innate reward system aligns us via white-box methods, and the values the reward system imprints on us are ridiculously reliable: almost every human has empathy for friends and acquaintances, parental instincts, a drive for revenge, etc.
This is shown in the link below:
(Here we must take a detour and note that our reward system is ridiculously good at aligning us to survive, and that flaws like obesity in the modern world are usually surprisingly mild failures, in the sense that the human simply isn’t as capable as we thought. This arguably implies that alignment failures in practice will look much more like capabilities failures. Passing the analogy back to the AI case, I basically don’t expect X-risk, GCRs, or really anything more severe than, say, an AI messing up a kitchen.)
Steven Byrnes raised the concern that if you don’t know how to do the manipulation, then gaining that knowledge does cost you something.
Steven Byrnes’s comment is linked here: https://forum.effectivealtruism.org/posts/JYEAL8g7ArqGoTaX6/?commentId=3xxsumjgHWoJqSzqw
Nora Belrose responded by clarifying what white-boxing means, and by noting that people use SGD to automate the search for manipulations, so that the overall cost of manipulation is as low as possible:
https://twitter.com/norabelrose/status/1709603325078102394
I mean it in the computer security sense, where it refers to the observability of the source code of a program (Nora Belrose)
https://twitter.com/norabelrose/status/1709606248314998835
We can do better than IDA Pro & Ghidra by exploiting the differentiability of neural nets, using SGD to locate the manipulations of NN weights that improve alignment the most
I’d be much more worried if we didn’t have SGD and were just evolving AGI in a sim or smth (Nora Belrose)
https://twitter.com/norabelrose/status/1709601025286635762
I’m pointing out that it’s a white box in the very literal sense that you can observe and manipulate everything that’s going on inside, and this is a far from trivial fact because you can’t do this with other systems we routinely align like humans or animals. (Nora Belrose)
https://twitter.com/norabelrose/status/1709603731413901382
No, I don’t agree this is a weakening. In a literal sense it is zero cost to analyze and manipulate the NNs. It may be greater than zero cost to come up with manual manipulations that achieve some goal. But that’s why we automate the search for manipulations using SGD (Nora Belrose)
Steven Byrnes argues that this could be due to differing definitions:
https://twitter.com/steve47285/status/1709655473941631430
I think that’s a black box with a button on the front panel that says “SGD”. We can talk all day about all the cool things we can do by pressing the SGD button. But it’s still a button outside the box, metaphorically.
To me, “white box” would mean: If an LLM outputs A rather than B, and you ask me why, then I can always give you a reasonable answer. I claim that this is closer to how that term is normally used in practice.
(Yes I know, it’s not literally a button, it’s an input-output interface that also changes the black box internals.) (Steven Byrnes)
This is the response chain that let me see why Nora Belrose and Steven Byrnes were disagreeing.
I ultimately think a key difference is that, for alignment purposes, humans-vs-AI is not a very useful abstraction; SGD vs the inner optimizer is the better abstraction here. It thus doesn’t matter how AI progresses in general; what matters is the specific contest between humans + SGD and the inner optimizer, and on that framing the cost of manipulating AI values is quite low.
This leads to...
I believe the security mindset is inappropriate for AI
In general, a common disagreement I have with a lot of LWers is that I think there is very limited transfer of knowledge from the computer security field to AI, because AI is very different in ways that make the analogies inappropriate.
For one particular example, you can randomly double your training data, or the size of the model, and it will work usually just fine. A rocket would explode if you tried to double the size of your fuel tanks.
All of this and more is explained by Quintin below. There are several big disanalogies between the AI field and the computer security field, so much so that I think ML/AI is a lot like quantum mechanics: we shouldn’t port intuitions from other fields and expect them to work, because of the weirdness of the domain:
https://www.lesswrong.com/posts/wAczufCpMdaamF9fy/#Yudkowsky_mentions_the_security_mindset__
Similarly, I think that machine learning is not really like computer security, or rocket science (another analogy that Yudkowsky often uses). Some examples of things that happen in ML that don’t really happen in other fields:
Models are internally modular by default. Swapping the positions of nearby transformer layers causes little performance degradation.
Swapping a computer’s hard drive for its CPU, or swapping a rocket’s fuel tank for one of its stabilization fins, would lead to instant failure at best. Similarly, swapping around different steps of a cryptographic protocol will usually make it output nonsense. At worst, it will introduce a crippling security flaw. For example, password salts are added before hashing the passwords. If you switch to adding them after, this makes salting near useless.
We can arithmetically edit models. We can finetune one model for many tasks individually and track how the weights change with each finetuning to get a “task vector” for each task. We can then add task vectors together to make a model that’s good at multiple of the tasks at once, or we can subtract out task vectors to make the model worse at the associated tasks.
Randomly adding / subtracting extra pieces to either rockets or cryptosystems is playing with the worst kind of fire, and will eventually get you hacked or exploded, respectively.
We can stitch different models together, without any retraining.
The rough equivalent for computer security would be to have two encryption algorithms A and B, and a plaintext X. Then, midway through applying A to X, switch over to using B instead. For rocketry, it would be like building two different rockets, then trying to weld the top half of one rocket onto the bottom half of the other.
Things often get easier as they get bigger. Scaling models makes them learn faster, and makes them more robust.
This is usually not the case in security or rocket science.
You can just randomly change around what you’re doing in ML training, and it often works fine. E.g., you can just double the size of your model, or of your training data, or change around hyperparameters of your training process, while making literally zero other adjustments, and things usually won’t explode.
Rockets will literally explode if you try to randomly double the size of their fuel tanks.
I don’t think this sort of weirdness fits into the framework / “narrative” of any preexisting field. I think these results are like the weirdness of quantum tunneling or the double slit experiment: signs that we’re dealing with a very strange domain, and we should be skeptical of importing intuitions from other domains.
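The task-vector arithmetic Quintin mentions can be sketched with toy weight lists. The technique itself is real (subtract base weights from finetuned weights to get a "task vector", then add or subtract vectors), but the weights below are invented purely for illustration.

```python
# Hypothetical flattened weights for a base model and two finetunes.
base     = [0.1, 0.2, 0.3]
ft_taskA = [0.4, 0.1, 0.3]   # finetuned on task A
ft_taskB = [0.1, 0.2, 0.9]   # finetuned on task B

def sub(a, b): return [x - y for x, y in zip(a, b)]
def add(a, b): return [x + y for x, y in zip(a, b)]

vec_A = sub(ft_taskA, base)  # "task vector" for A
vec_B = sub(ft_taskB, base)  # "task vector" for B

# Adding both vectors empirically yields a model good at both tasks;
# subtracting a vector degrades the associated task.
multi  = add(add(base, vec_A), vec_B)
forget = sub(base, vec_A)

print([round(v, 2) for v in multi])  # [0.4, 0.1, 0.9]
```

The striking empirical fact, which this sketch only mimics mechanically, is that such naive vector arithmetic on millions of real weights actually changes model behavior in the intended directions.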
I also believe there is an epistemic difference between computer security and alignment: in computer security, there’s an easy-to-check ground truth for whether a cryptosystem is broken, whereas in AI alignment, we don’t have the ability to get feedback from proposed breaks of alignment schemes.
For more, see Quintin’s post section on the epistemic difference between AI safety and computer security, and a worked example of an attempted security break, where there is suggestive evidence that inner-misaligned models/optimization daemons go away as we increase the number of dimensions.
(There, Quintin Pope discusses the fact that alignment doesn’t have good feedback loops providing ground truth on the question “what counts as an attempted break?”, and gives the example of a claimed break that actually went away as the number of dimensions was scaled up; note that the disconfirmatory evidence was more realistic than the attempted break.)
This is why I disagreed with Jeffrey Ladish about the security mindset on Twitter: I believe it’s a trap for those not possessing technical knowledge, like a lot of LWers, and there are massive differences between AI and computer security that mean most attempted connections fail.
https://twitter.com/JeffLadish/status/1712262020438131062
uh I guess I hope he reads enough to internalize the security mindset?? (Jeffrey Ladish)
https://twitter.com/SharmakeFarah14/status/1712264530829492518
I generally tend to think the security mindset is a trap, because ML/AI alignment is very different from rocket engineering or cybersecurity.
For a primer on why, read @QuintinPope5′s post section on it:
So, now that I’ve tried to show why porting over the security mindset is flawed, I want to talk about a class of adversaries like gradient hackers and inner-misaligned mesa-optimizers, and why I believe such adversaries have a very hard time against SGD; even with the non-platonic-ideal version of SGD, we can detect most mesa-optimizers quite easily.
Inner Misalignment, or at least Gradient Hacking is very difficult for AIs trained on SGD
I’ll be taking the inner misalignment definition from Evan Hubinger’s post The Inner Alignment Problem:
https://www.lesswrong.com/posts/pL56xPoniLvtMDQ4J/the-inner-alignment-problem
We refer to this problem of aligning mesa-optimizers with the base objective as the inner alignment problem.
The basic reason it’s hard for a misaligned mesa-optimizer to stick around for long is that gradient descent is, in fact, much more powerful and white-boxy than people realize; in particular, it has five defenses that any mesa-optimizer would need to overcome in order to misalign it:
https://www.lesswrong.com/posts/w2TAEvME2yAG9MHeq/gradient-hacking-is-extremely-difficult
1.) Everything is optimized simultaneously and independently. The exact local sensitivity of each and every parameter to the loss is computed in the backward pass, and every parameter is updated based on this. This means it is impossible for a mesa-optimizer or gradient hacker (as long as it is differentiably connected to the network) to shield any part of itself from gradient updates. This includes the gradient hacker’s internal circuitry, its connections with other parts of the network, its connections to the network’s input data, and the connections between the rest of the network and the loss.
2.) Gradient descent is maximal. Each parameter is updated in the direction of steepest descent. This means that, unlike biological life, gradient hackers cannot ‘burn’ something to increase entropy to be traded off against decreasing entropy elsewhere. Only the local maximally increasing entropy direction is allowed.
3.) It is stateless. Each parameter is updated according to its local sensitivity to the loss for each datapoint. Credit is not propagated across time. It is impossible to ‘bargain’ with gradient descent by increasing loss now but decreasing loss more later, or having decreased loss in the past.
4.) It is first-order only. It only computes partial derivatives of the parameter to the loss. It does not take into account potential second order interactions. While bad in some cases, this makes gradient descent immune to ‘blackmail’ or ‘tripwires’ setup by mesaoptimizers where optimizing some part of their circuit would potentially cause a huge loss as a second order effect. GD will blithely trigger the tripwire and then optimize away the consequences. This also disallows ‘trade’ where a gradient hacker could cause low loss on one causal branch in exchange for high loss on some other branch. Gradient descent will instead sum the steepest descent direction on both branches.
5.) It is infinitesimal. GD only computes the infinitesimal local sensitivities of each parameter regardless of the actual learning rate or step-size in practice. This means gradient descent does not ‘see’ if you are standing next to a giant cliff, so cannot be blackmailed in this way either.
Basically, gradient descent will optimize the entire causal graph and leave no slack, and, as a bonus, it is extremely resistant to blackmail by mesa-optimizers. In general, a big part of my optimism around inner alignment is that SGD is extraordinarily good at credit assignment, and it has quite strong correction mechanisms in case a mesa-optimizer does attempt to misalign it.
We also can detect most mesa-optimizers in the AI without the need for mechanistic interpretability, like so:
While it might be possible to construct gradient hackers around these features, the advantage to us is that all of these approaches (except perhaps 2) should be visible to us even without detailed mechanistic interpretability knowledge of the network. We know a priori where the stop-gradients or non-differentiable parts of the architecture are. We can straightforwardly observe what gradients are being clipped, and if the gradient updates to the same set of parameters are being constantly clipped, that could be a red flag. We can also find regions of high curvature in the loss landscape by just inspecting the Hessians or approximations of them. Only a gradient hacker which utilizes correlations across minibatches to cancel out its interventions in model behaviour seems difficult to detect.
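One of the detection ideas in that passage, flagging parameters whose updates are constantly clipped, can be sketched as a simple training-time monitor. The threshold, parameter names, and numbers here are all hypothetical.

```python
def clipped(grad, max_norm=1.0):
    """Would this gradient have been clipped under a max-norm rule?"""
    return abs(grad) > max_norm

def monitor(grad_history, flag_rate=0.9):
    """grad_history: {param_name: [gradient per step]}.
    Flags parameters clipped on nearly every step as potential red flags."""
    suspicious = []
    for name, grads in grad_history.items():
        rate = sum(clipped(g) for g in grads) / len(grads)
        if rate >= flag_rate:
            suspicious.append(name)
    return suspicious

history = {
    "layer1.w": [0.3, -0.8, 0.5, 0.2],    # ordinary gradients
    "layer7.w": [5.0, 4.2, -6.1, 9.9],    # constantly clipped: red flag
}
print(monitor(history))  # → ['layer7.w']
```

The real version would hook into a training loop's gradient statistics, but the logic (count clip events per parameter group, flag persistent offenders) is this simple.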
One caveat here is that the prevention of mesa-optimizers applies fully only to self-supervised learning on i.i.d. data, which is an unfortunate limitation, although I do expect SGD to still be ridiculously good at credit assignment even in the RL context.
In particular, the detection mechanisms for mesa-optimizers remain intact, but we do need to worry about one new potential inner-misalignment pathway.
Also of importance is that this post only applies to the classic supervised/self-supervised learning on i.i.d data which most current foundation models are trained in. The important things about this paradigm is that the outputs of the model have no causal link to the data distribution—i.e. the model cannot influence what future data it will be trained on (except potentially highly indirectly influencing successor models [1]). In an RL context, or supervised learning with adaptive data sampling, where there is a link between the model’s outputs and the future data distribution, then this provides another route for gradient hackers to operate—by creating outputs which will steer the data distribution in a desired direction which will strengthen the gradient hacker.
But there’s also weak evidence that optimization daemons/demons, often called inner misaligned models, go away when you increase the dimension count:
Another poster (ironically using the handle “DaemonicSigil”) then found a scenario in which gradient descent does form an optimization demon. However, the scenario in question is extremely unnatural, and not at all like those found in normal deep learning practice. So no one knew whether this represented a valid “proof of concept” that realistic deep learning systems would develop optimization demons.
Roughly two and a half years later, Ulisse Mini would make DaemonicSigil’s scenario a bit more like those found in deep learning by increasing the number of dimensions from 16 to 1000 (still vastly smaller than any realistic deep learning system), which produced very different results, and weakly suggested that more dimensions do reduce demon formation.
This was actually a crux in a discussion between me and David Xu about inner alignment. I argued that the sharp-left-turn conditions don’t exist in AI development; he argued that misalignment happens when gaps appear that go uncorrected, which likely refers to the gap between the base objective optimized by SGD and the internal optimizer’s goal. I argued that inner misalignment is likely to be extremely difficult to pull off, because SGD can correct the gap between the inner optimizer and the outer objective in most cases, and I have now shown that argument in this post:
Twitter conversation below:
https://twitter.com/davidxu90/status/1712567663401238742
Speaking as someone who’s read that post (alongside most of Quintin’s others) and who still finds his basic argument unconvincing, I can say that my issue is that I don’t buy his characterization of the doom argument—e.g. I disagree that there needs to be a “vast gap”. (David Xu)
https://twitter.com/davidxu90/status/1712568155959362014
SGD is not the kind of thing where you need “vast gaps” between the inner and outer optimizer to get misalignment; on my model, misalignment happens whenever gaps appear that go uncorrected, since uncorrected gaps will tend to grow alongside capabilities/coherence. (David Xu)
https://twitter.com/SharmakeFarah14/status/1712573782773108737
since uncorrected gaps will tend to grow alongside capabilities/coherence.
This is definitely what I don’t expect, and part of that is because I expect that uncorrected inner misalignment will be squashed out by SGD unless extreme things happen:
https://www.lesswrong.com/posts/w2TAEvME2yAG9MHeq/gradient-hacking-is-extremely-difficult (Myself)
https://twitter.com/davidxu90/status/1712575172124033352
Yes, that definitely sounds cruxy—you expect SGD to contain corrective mechanisms by default, whereas I don’t. This seems like a stronger claim than “SGD is different from evolution”, however, and I don’t think I’ve seen good arguments made for it. (David Xu)
This reminds me that I should address the other conversation I had with David Xu: how strong a prior do we need to encode to ensure alignment, versus how much can we let the AI learn and still get a good outcome; alternatively, how much do we need to specify upfront? And that leads to...
I expect reasonably weak priors to work well to align AI with human values, and that a lot of the complexity can be offloaded to the learning process
Equivalently speaking, I expect the cost of specification of values to be relatively low, and that a lot of the complexity is offloadable to the learning process.
This was another crux between David Xu and me, specifically on whether you can largely get away with weak priors, or whether you actually need to encode much stronger priors to prevent misalignment. It ultimately boiled down to my expectation that reasonably weak priors, guided by the innate reward system, are enough.
A big part of my reasoning here is that a lot of values and biases are inaccessible to the genome, which means the genome can’t directly specify them. It can shape them by setting up training algorithms and data, but it turns out to be very difficult for the genome to directly specify things like values. This is primarily because the genome does not have direct access to the world model or the brain, which would be required to hardcode the prior. To the extent that it can encode anything, it has to be over relatively simple properties, which means alignment has to be achieved with relatively weak encoded priors, and the innate reward system generally does this fantastically, with examples of misalignment being rare and mild.
The fact that humans reliably acquire values like “empathy for friends and acquaintances, parental instincts, wanting revenge when others harm us, etc.” without the genome hardcoding a lot of prior information, getting away with reasonably weak priors, is rather underappreciated: it means we don’t need to specify our values in much detail, and thus we can reliably offload most of the value-learning work to the AI.
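The "weak prior plus generic learning" claim can be put in toy form: a hardcoded reward for one crude sensory cue (the weak prior) combined with a generic value-update rule reliably produces a learned value that the "genome" never spelled out. The cue, update rule, and environment below are all invented for illustration.

```python
import random

random.seed(0)

# The "genome" hardcodes only a weak prior: reward for one simple cue.
def innate_reward(observation):
    return 1.0 if observation["sweet"] else 0.0

# Generic value learning: associate value with whatever contexts tend to
# precede reward. The genome never had to specify "value berries" directly.
value = {"berry": 0.0, "rock": 0.0}
alpha = 0.2  # learning rate
for _ in range(200):
    item = random.choice(["berry", "rock"])
    obs = {"sweet": item == "berry"}
    value[item] += alpha * (innate_reward(obs) - value[item])

print(value["berry"] > 0.9 > value["rock"])  # → True
```

The learned value ends up reliable across runs because the update rule, not the prior, carries the complexity; that is the structure I am claiming the innate reward system exploits.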
Here are some posts and comments below:
(I want to point out that it’s not just that, with weak prior information, the genome can reliably bind humans to real-enough things so that, for example, they don’t die of thirst by drinking fake water. It can also create an innate reward system that uses simple update rules to reliably get nearly every person on Earth to have empathy for their family and ingroup, to seek revenge when others harm them, and so on, and the rare exceptions to the pattern are usually mild alignment failures at worst. That’s a source of a lot of my optimism on AI safety and alignment.)
https://www.lesswrong.com/posts/9Yc7Pp7szcjPgPsjf/the-brain-as-a-universal-learning-machine
Here is the compressed conversation between David Xu and me:
https://twitter.com/davidxu90/status/1713102210354294936
(And the reason I’d be more optimistic there is basically because I expect the human has meta-priors I’d endorse, causing them to extrapolate in a “good” way, and reach a policy similar to one I myself would reach under similar augmentation.) (David Xu)
https://twitter.com/davidxu90/status/1713230086730862731
(In reality, of course, I disagree with the framing in both cases: “two different systems” isn’t correct, because the genetic information that evolution was working with in fact does encode fairly strong priors, as I mentioned upthread.) (David Xu)
https://twitter.com/SharmakeFarah14/status/1713232260827095119
My disagreement is that I expect the genetic priors to be quite weak, and that a lot of values are learned rather than encoded in priors, because values are inaccessible to the genome. Maybe we will eventually be able to hardcode them, but we don’t need to. (Myself)
https://twitter.com/davidxu90/status/1713232760637358547
Values aren’t “learned”, “inferred”, or any other words that suggests they’re directly imbibed from the training data, because values aren’t constrained by training data alone; if this were false, it would imply the orthogonality thesis is false. (David Xu)
I’m going to reply here and say that the orthogonality thesis is a lot like the no-free-lunch theorem: an extraordinarily powerful result that is too general to apply, because it only covers the space of all logically possible AIs, and it only bites if you apply a 0 prior, which in this case would require you to specify everything, including the values of the system, or at best to use brute-force search or memorization algorithms.
I have a very similar attitude to “most goals in goal space are bad.” I’d probably agree in the most general sense, but even weak priors can prevent most goals from being bad, so I suspect the claim only holds under a 0-prior condition. I’m not arguing that with a 0 prior, models are aligned with people without specifying everything. I’m arguing that we can get away with reasonably weak priors, and let within-lifetime learning do the rest.
Once you introduce even weak priors, the issue is basically resolved. I stated that weak priors work to induce the learning of values, and it’s consistent with the orthogonality thesis for arbitrarily small amounts of prior information to be what’s necessary to learn alignment.
I could make an analogous argument for capabilities, and I’d be demonstrably wrong, since the conclusion doesn’t hold.
This is why I hate the orthogonality thesis, despite rationalists being right about it: it allows for too many outcomes, and an inference like “values aren’t learned” can’t be supported by the orthogonality thesis alone.
https://twitter.com/SharmakeFarah14/status/1713234214391255277
The problem with the orthogonality thesis is that it allows for too many outcomes, and notice I said the genetic prior is weak, not non-existent, which would be compatible with the orthogonality thesis. (Myself)
https://twitter.com/davidxu90/status/1713234707272626653
The orthogonality thesis, as originally deployed, isn’t meant as a tool to predict outcomes, but to counter arguments (pretty much) like the ones being made here: encountering “good” training data doesn’t constrain motivations. Beyond that the thesis doesn’t say much. (David Xu)
https://twitter.com/SharmakeFarah14/status/1713236849873891699
I suspect that, looking at the multiverse of AIs as a whole, it’s true if we impose a 0 prior, but even weak priors start to constrain your motivations a lot. I have more faith than you do in weak priors + whiteboxness working out. (Myself)
https://twitter.com/davidxu90/status/1713237355501584857
I have more faith in weak priors + whiteboxness working out than you do.
I agree that something in the vicinity of this is likely [a] crux. (David Xu)
https://twitter.com/davidxu90/status/1713238995893912060
TBC, I do think it’s logically possible for the NN landscape to be s.t. everything I’ve said is untrue, and that good minds abound given good data. I don’t think this is likely a priori, and I don’t think Quintin’s arguments shift me very much, but I admit it’s possible. (David Xu)
My own algorithm for how to do AI alignment
This is a subpoint, but for those that want to have a ready-to-go alignment plan, here it is:
- Implement a weak prior over goal space.
- Use DPO, RLHF, or something else to create a preference model.
- Create a custom loss function for the preference model.
- Use the backpropagation algorithm to optimize it and achieve a low loss.
- Repeat the backpropagation algorithm until you achieve an acceptable solution.
Now that I’m basically finished laying out the arguments and the conversations, let’s move on to the conclusion:
Conclusion
My optimism on AI safety stems from a variety of sources. The reasons, in the order they appear in the post rather than in order of importance, are:
- I don’t believe the sharp left turn is anywhere near as general as Nate Soares puts it, because the conditions that caused a sharp left turn in humans were basically: cultural learning letting humans optimize over much faster time-scales than evolution could respond to, evolution not course-correcting us, and culture transmitting OOMs more information across generations than evolution could. None of these conditions hold for modern AI development.
- I don’t believe that Nate’s example of misgeneralizing the goal of IGF works as an example of misgeneralization that matters for our purposes, because it is not a case where one AI is trained for a goal in environment A and then, in environment B, competently pursues a different goal. Instead, what’s happening is that one human generation is trained in environment A, and then a fresh generation of humans is trained on a different distribution, which will predictably diverge more than the first case.
In particular, there’s no reason to be concerned about AI alignment misgeneralizing on this basis, since we have no reason to believe that LessWrong’s central example is actually misgeneralization. From Quintin:
If we assume that humans were “trained” in the ancestral environment to pursue gazelle meat and such, and then “deployed” into the modern environment where we pursued ice cream instead, then that’s an example where behavior in training completely fails to predict behavior in deployment.
If there are actually two different sets of training “runs”, one set trained in the ancestral environment where the humans were rewarded for pursuing gazelles, and one set trained in the modern environment where the humans were rewarded for pursuing ice cream, then the fact that humans from the latter set tend to like ice cream is no surprise at all.
In particular, this outcome doesn’t tell us anything new or concerning from an alignment perspective. The only lesson applicable to a single training process is the fact that, if you reward a learner for doing something, they’ll tend to do similar stuff in the future, which is pretty much the common understanding of what rewards do.
- AIs are mostly white boxes, and the control we have over them means that a better analogy is our innate reward systems, which align us to quite a lot of goals spectacularly well. So well that the total evidence could easily put the probability of X-risk, or even of an AI killing a single human, 5-15+ OOMs lower, which would make the alignment problem a non-problem for our purposes. It would pretty much single-handedly make AI misuse the biggest problem, but that issue has different solutions, and governments are likely to regulate AI misuse anyway, so existential risk gets cut 10-99%+.
- I believe the security mindset is inappropriate for AI because aligning AI mostly doesn’t involve dealing with adversarial intelligences or inputs; the most natural adversarial class, inner-misaligned mesa-optimizers/optimization daemons, mostly doesn’t exist, for my next reason. Alignment is also in a different epistemic state from computer security, and there are other disanalogies that make porting intuitions from other fields into ML/AI research very difficult to do correctly.
- It is actually really difficult to inner-misalign an AI, since SGD is really good at credit assignment: it optimizes the entire causal graph leading to the loss, leaving no slack. It’s not like evolution, where, as Gwern’s post describes, you have to work from a single bit of feedback:
https://gwern.net/backstop#rl
Imagine trying to run a business in which the only feedback given is whether you go bankrupt or not. In running that business, you make millions or billions of decisions, to adopt a particular model, rent a particular store, advertise this or that, hire one person out of scores of applicants, assign them this or that task to make many decisions of their own (which may in turn require decisions to be made by still others), and so on, extended over many years. At the end, you turn a healthy profit, or go bankrupt. So you get 1 bit of feedback, which must be split over billions of decisions. When a company goes bankrupt, what killed it? Hiring the wrong accountant? The CEO not investing enough in R&D? Random geopolitical events? New government regulations? Putting its HQ in the wrong city? Just a generalized inefficiency? How would you know which decisions were good and which were bad? How do you solve the “credit assignment problem”?
The way SGD solves this problem is by running backprop, which is a white-box algorithm, as Nora Belrose explains in more detail. And that’s the base optimizer, not the mesa-optimizer, which is why SGD can correct an inner-misaligned agent far more effectively than cultural/biological evolution, the free market, etc. It is white-box, like the inner optimizers it runs, and it solves credit assignment much better than those earlier optimizers could hope to do.
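The contrast can be made concrete with a toy example (my own illustration, not from Gwern’s post). Bankruptcy-style feedback tells you only whether the loss is nonzero; backprop hands every parameter its own signed, scaled correction. Here is a two-parameter model with a squared loss and its analytic gradient:

```python
# Toy model: y = w1*x + w2, squared-error loss against a target.

def loss(w1, w2, x=2.0, target=7.0):
    pred = w1 * x + w2
    return (pred - target) ** 2

def grad(w1, w2, x=2.0, target=7.0):
    # Analytic backprop: dL/dw1 = 2*(pred - target)*x, dL/dw2 = 2*(pred - target)
    err = (w1 * x + w2) - target
    return (2 * err * x, 2 * err)

g1, g2 = grad(1.0, 1.0)          # pred = 3, err = -4
assert (g1, g2) == (-16.0, -8.0)  # each parameter gets its own error signal
one_bit = loss(1.0, 1.0) > 0      # "bankruptcy" feedback: only True/False
```

The gradient tells us not just that the business failed, but exactly how much each “decision” contributed and in which direction to correct it; the single bit tells us nothing of the sort.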
I believe that, due to information inaccessibility plus the fact that the brain acts a lot like a Universal Learning Machine/Neural Turing Machine, alignment in the human case (for surviving, having empathy for friends, etc.) can’t depend on complicated genetic priors. To the extent that genetic priors are encoded at all, they have to be fairly weak, universal-ish priors, plus help from the innate reward system, which is built on those priors and uses simple update rules to reinforce certain behaviors and penalize others. This works ridiculously well to align humans toward surviving and toward things like empathy/sympathy for the ingroup, revenge, etc.
So now that I have listed the reasons for my optimism on AI safety, I’ll add one new mini-section to show that the shutdown problem for AI is almost solved.
Addendum 1: The shutdown problem for AI is almost solved
It turns out that we can keep the most useful aspects of Expected Utility Maximization while making an AI shutdownable.
Sami Petersen showed that we can give AIs incomplete preferences while weakening transitivity just enough to get a non-trivial theory of Expected Utility Maximization that’s quite a lot safer. Elliott Thornley proposed that incomplete preferences could be used to solve the shutdown problem, and the very nice thing about subagent models of Expected Utility Maximization is that they require a unanimous committee for a decision to be accepted as a sure gain.
This is useful but can lead to problems. On the one hand, we only need one expected utility maximizer that wants the AI to be shutdownable in order to shut it down as a whole. On the other, we need to be careful about where the subagents’ execution conditions/domains lie, as unanimous committees can be terrible: only one agent needs to object to grind the entire system to a halt, which is why in the real world unanimity is usually not a preferred way to govern.
Nevertheless, for AI safety purposes this is still very, very useful, and if it turns out to hold under broader conditions than those outlined in the posts below, this might be the single biggest MIRI success of the last 15 years, which is ridiculously good.
http://pf-user-files-01.s3.amazonaws.com/u-242443/uploads/2023-05-02/m343uwh/The Shutdown Problem- Two Theorems%2C Incomplete Preferences as a Solution.pdf
Edit 3: I’ve removed addendum 2 as I think it’s mostly irrelevant, and Daniel Kokotajlo showed me that Ajeya actually expects things to slow down in the next few years, so the section really didn’t make that much sense.
This topic is poorly understood; very high confidence is obviously wrong for any claim that’s not exceptionally clear. Absence of doom is not such a claim, so the need to worry isn’t going anywhere.
This is why the post is so long: it has to integrate a lot of different sources of evidence, actually give lots of evidence for major claims, and I had to make sure I actually had positive arguments that it’s very, very likely we will align AI, and arguably that it’s safe by default. That’s why I made the argument about AIs as white boxes, and the argument that the genome uses very weak priors to align us ridiculously well to, for example, having empathy for the ingroup, because these were intended to be reasons to expect safe AI by default in a very strong sense.
Also, there is a lot of untapped evidence on humans, and that’s what I was using to make this post.
Quintin Pope and TurnTrout’s post below covers the massive, untapped evidence about humans that bears on alignment.
https://www.lesswrong.com/posts/CjFZeDD6iCnNubDoS/humans-provide-an-untapped-wealth-of-evidence-about
Without sufficient clarity, which humanity doesn’t possess on this topic, no amount of somewhat confused arguments is sufficient for the kind of certainty that makes the remaining risk of extinction not worth worrying about. It’s important to understand and develop what arguments we have, but in their present state they are not suitable for arguing this particular case outside their own assumption-laden frames.
When reunited with unknown unknowns outside their natural frames, such arguments might plausibly make it reasonable to believe the risk of extinction is as low as 10%, or as high as 90%, but nothing more extreme than that. Nowhere across this whole range of epistemic possibilities is a situation that we “mostly don’t need to worry about”.
I think that’s because AI today feels like a software project akin to building a website. If it works, that’s nice, but if it doesn’t work it’s no big deal.
Weak systems have safe failures because they are weak, not because they are safe. If you piss off a kitten, it will not kill you. If you piss off an adult tiger...
The optimistic assumptions laid out in this post don’t have to fail in every possible case for us to be in mortal danger. They only have to fail in one set of circumstances that someone actualizes. And as long as things keep looking like they are OK, people will continue to push the envelope of risk to get more capabilities.
We have already seen AI developers throw caution to the wind in many ways (releasing weights as open source, connecting AI to the internet, giving it access to a command prompt) and things seem OK for now so I imagine this will continue. We have already seen some psycho behavior from Sydney too. But all these systems are weak reasoners and they don’t have a particularly solid grasp on cause and effect in the real world.
We are certainly in a better position with respect to winning than when I started posting on this website. To me the big wins are (1) that safety is a mainstream topic and (2) that the AIs learned English before they learned physics. But I don’t regard those as sufficient for human survival.
I disagree, and I think there are deeper reasons for why most computer security analogies do not work for ML/AI alignment.
I think the biggest reasons for this are the following:
The thing that LW people call security mindset is non-standard; under the computer-security definition, you only start handing out points for a potential failure once someone can actually demonstrate it, and virtually no proposed failures that I am aware of have been demonstrated successfully, except goal misgeneralization and specification gaming, and even those were in toy AIs.
In contrast, the notion that inner-misaligned models/optimization daemons would appear in modern AI systems has been tested twice. In one case, DaemonicSigil was able to get a gradient hacker/optimization daemon to appear, but the setting was extremely toy; when the test was run in a more realistic setting, the optimization-daemon phenomenon went away, or was clearly going away.
See Iceman’s comment for more details on why LW Security Mindset!=Computer Security Mindset:
https://www.lesswrong.com/posts/99tD8L8Hk5wkKNY8Q/?commentId=xF5XXJBNgd6qtEM3q
That leads to point 2:
ML people can do things that would not work under a security mindset or in rocket engineering, like randomly doubling model size or data, or swapping one model for another. These would be big no-nos in rocket engineering and computer security: a rocket would literally explode if you doubled its fuel randomly in-flight, and reordering the steps of a password check would make it output nonsense at best or destroy its security at worst.
There are enough results like this that I’m now skeptical of applying the security-mindset frame to AI safety, beyond inner alignment being very likely by default due to SGD’s corrective properties.
Do you just like not believe that AI systems will ever become superhumanly strong? That once you really crank up the power (via hardware and/or software progress), you’ll end up with something that could kill you?
Read what I wrote above: current systems are safe because they’re weak, not safe because they’re inherently safe.
Security mindset isn’t necessary for weak systems because weak systems are not dangerous.
This is exactly what I am arguing against. I do not believe the security mindset fails because AI is weak; I believe it fails for deeper reasons, and an increase in capabilities doesn’t make the security mindset look better. Indeed, it may look worse: see the attempted optimization-daemon break, where increasing capabilities by scaling up the AI’s dimensions made the phenomenon start going away, or all of SGD’s corrections.
Edit: I also have issues with the way LW applies the security mindset, and I’ll quote my comment from there on why a lot of LW implementations of security mindset fail:
Maybe you’re right, we may need to deploy an AI system that demonstrates the potential to kill tens of millions of people before anyone really takes AI risk seriously. The AI equivalent of Trinity.
https://en.wikipedia.org/wiki/Trinity_(nuclear_test)
It’s not just about “being taken seriously”, although that’s a nice bonus—it’s also about getting shared understanding about what makes programs secure vs. insecure. You need a method of touching grass so that researchers have some idea of whether or not they’re making progress on the real issues.
We already can’t make MNIST digit recognizers secure against adversarial attacks. We don’t know how to prevent prompt injection. Convnets are vulnerable to adversarial attacks. RL agents that play Go at superhuman levels are vulnerable to simple strategies that exploit gaps in their cognition.
No, there’s plenty of evidence that we can’t make ML systems robust.
What is lacking is “concrete” evidence that that will result in blood and dead bodies.
None of those are examples of misalignment, except arguably prompt injection, which seems to be getting solved by OpenAI with ordinary engineering.
To me the security mindset seems inapplicable because in computer science, programs are rigid systems with narrow targets. AI is not very rigid and the target, I.e. an aligned mind, is not necessarily narrow.
That rigidity is what makes computer security so easy.
...
Relative to AGI security.
No, the rigidity is what makes a system error-prone, i.e. brittle. If you don’t specify the solution exactly, the machine won’t solve the problem. Classic computer programs can’t generalize.
The OP makes the point that you can double a model’s size and it will work well, but if you double a computer program’s binary size with unused lines of code, you can get all sorts of weird errors, even if none of that extra size is ever used.
An analogy is trying to write a symbolic logic program to emulate an LLM (i.e., with only if-statements and for-loops), or trying to make a self-driving car with Boolean logic.
If I flip one single bit in a computer program, it will probably catastrophically fail and crash the whole computer. However removing random weights won’t do much to an LLM.
A little tangent on flipping a bit:
Flipping a bit in the actual binary itself (the thing the computer reads to run the program) will probably cause the computer to access a part of itself it wasn’t supposed to and immediately crash.
Changing a letter in a computer program that humans write will almost certainly cause the program to not compile.
Yep, these are the important parts. Neural networks are much more robust than that, with extreme robustness compared to systems in a lot of other fields, which is why I’m skeptical of applying the security mindset: it would predict false things.
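A toy illustration of this asymmetry (my own sketch, not from the thread): a model whose output is an average over many redundant weights degrades gracefully when one weight is zeroed out, while flipping one bit of a program’s data completely changes its meaning.

```python
import random

random.seed(0)
weights = [random.gauss(1.0, 0.1) for _ in range(1000)]

def output(ws):
    # A toy "network": the prediction is the mean of many redundant weights.
    return sum(ws) / len(ws)

baseline = output(weights)
ablated = list(weights)
ablated[42] = 0.0  # "remove" one weight, as in LLM ablation experiments
assert abs(output(ablated) - baseline) < 0.01  # graceful degradation

# In contrast, one flipped bit in ordinary program data is catastrophic:
x = 100
assert x ^ (1 << 6) == 36  # flipping bit 6 turns 100 into 36
```

In a real program, that corrupted value would likely be a pointer or an opcode, hence the crash; in the averaged model, no single weight carries the answer.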
The non-rigidity of ChatGPT and its ilk does not make them less error-prone. Indeed, ChatGPT text is usually full of errors. But the errors are just as non-rigid. So are the means, if they can be found, of fixing them. ChatGPT output has to be read with attention to see its emptiness.
None of this has anything to do with security mindset, as I understand the term.
The point is that if ML were like computer security, or even computer engineering, those errors would completely destroy ChatGPT’s intelligence and make it as useless as a random computer. This is just one example of an observation that makes me skeptical of applying the security mindset; ML/AI and its subfield, ML/AI alignment, is a strange enough field that I wouldn’t port over intuitions from other fields.
ML/AI alignment is like quantum mechanics, in which you need to leave your intuitions at the door, and unfortunately this makes public outreach likely net-negative.
At this point it is not clear to me what you mean by security mindset. I understand by it what Bruce Schneier described in the article I linked, and what Eliezer describes here (which cites and quotes from Bruce Schneier). You have cited QuintinPope, who also cites the Eliezer article, but gets from it this concept of “security mindset”: “The bundle of intuitions acquired from the field of computer security are good predictors for the difficulty / value of future alignment research directions”. From this and his further words about the concept, he seems to mean something like “programming mindset”, i.e. good practice in software engineering. Only if I read both you and him as using “security mindset” to mean that can I make sense of the way you both use the term.
But that is simply not what “security mindset” means. Recall that Schneier’s article began with the example of a company selling ant farms by mail order, nothing to do with software. After several more examples, only one of which concerns computers, he gives his own short characterisation of the concept that he is talking about:
Later on he describes its opposite:
That is what Eliezer is talking about, when he is talking about security mindset.
Yes, prompting ChatGPT is not like writing a software library like pytorch. That does not make getting ChatGPT to do what you want and only what you want any easier or safer. In fact, it is much more difficult. Look at all the jailbreaks for ChatGPT and other chatbots, where they have been made to say things they were intended not to say, and answer questions they were intended not to answer.
My issue with the security mindset is that there’s a selection effect/bias that causes people to notice the failures of security and not its successes, even if the true evidence for success is massively larger than the evidence for failure.
Here’s a quote from lc’s post POC or GTFO as a counter to alignment wordcelism, on why the security industry has massive issues with people claiming security failures when they don’t or can’t happen:
And this is why in general I dislike the security mindset: the incentives push people to report failures or bad events even when they aren’t much of a concern.
Also, the stuff that computer security people do largely doesn’t need to be done in ML/AI, which is another reason I’m skeptical of the security mindset.
These are parochial matters within the computer security community, and do not bear on the hazards of AGI.
They do matter, since they imply a selection effect where people share the evidence for doom and don’t notice the evidence against it. This matters because the real chance of doom may be much lower, in principle arbitrarily low, while LWers and AI safety/governance organizations hold higher probabilities of doom.
Combined with the more standard bias toward negative news, this is one piece of why I think AI doom is very unlikely. It is just one piece, not my entire argument.
And I think this has already happened; cf. the entire inner-misalignment/optimization-daemon situation, which was tested twice: once showing a confirmed break, and once, by Ulisse Mini, in a more realistic setting where the optimization daemon/inner misalignment went away. Very little was shared about the second result, compared to the original, which almost certainly got more views.
Downvote for being absurdly overconfident, and thereby harming the whole direction of more optimism on alignment. I’d downvote Eliezer for the same reason on his 99.99% doom arguments in public; they are visibly silly, making the whole direction seem silly by association.
In both cases, there are too many unknown unknowns to have confidences remotely that high. And you’ve added way more silly zeros than EY, despite having looser arguments.
This is a really important topic; we need serious discussion of how to really think about alignment difficulty. This is a serious attempt, but it’s just not realistically humble. It also seems to be ignoring the cultural norm and explicit stated goal of writing to inform, not to persuade, on LW.
So, I look forward to your next iteration, improved by the feedback on this post!
I’ll probably put this back into drafts by tomorrow.
It looks like you already took out the 99.9...% claims, which are the primary thing I was reacting to. That’s great IMO. I think the new phrasing of “not claiming this is right, just getting the logic out there” is way better- both more honest and ultimately more convincing if the logic holds.
But that’s a major edit without noting the edit, so I think this should be a draft right now, not a post that keeps evolving while the comments address an earlier version. Publishing a second version that includes much of the first is a great idea.
I’d choose a different term than white box, as per Steve Byrnes’ conclusion that he just won’t use those terms since they’re confusing.
My biggest substantive comment is that you seem to be assuming that because we could get alignment right, we will get alignment right. Even Yudkowsky agrees that we could get it right.
You’re arguing that it’s a lot easier than assumed, and I think that’s probably right. But that’s not enough to be confident that we will get it right. It will depend on how seriously the first person to make self-improving AGI takes alignment, even if there are easy techniques available. Will they use them, or will they race and take risks?
I honestly agree with this. I feel that the post has been edited so much that I now think it’s time to delete this post and reupload a new version of it so that I can actually deal with the edits, without having this weird patchwork post.
Yeah, I’ll probably edit it to emphasize something else.
I am definitely assuming that, but I think it’s a weak assumption, provided at least some part of my post holds true. In essence, I’m hoping that OpenAI doesn’t do the worst thing even when it isn’t favored by profit incentives.
The good news is that assuming value learning is easy, then we have an easier time, since we can do AI regulations a lot more normally, and in particular, we don’t need to be that strict with licensing. Don’t get me wrong, AI governance is necessary in this world, but the type of governance would be drastically different.
No pauses, for one example.
Agreed on all points. This is closely related to my thinking on how we survive, which is why I care about seeing it presented in a way people can hear and understand. I’ll send you a draft of the closely related post I’m working on, and if you haven’t seen it, I focus on that last point, values learning being relatively easy, in this post: The (partial) fallacy of dumb superintelligence.
I think it’s worth explicitly discussing the assumption that people won’t do “the dumbest possible thing”. It’s a reasonable assumption, but it’s probably a little more complicated than that. If alignment taxes are non-zero, there will be some pull between different motivations.
Yeah, it kind of depends on how small the alignment tax is. If it’s not 0, as I unfortunately suspect, but instead small, then there is a small chance of extinction risk. I definitely plan to discuss that when I reupload the post after deleting it first.
Thanks for talking with me today!
Discussion is written by others, unpublishing affects both.
I also think it would be better if you changed the title to saying you don’t endorse it anymore. It’s sad for the discussion to disappear/become unfindable.
Okay, I endorse parts of this post, but in hindsight, I clearly was overconfident. I still want to reupload this post, partially because I want to not have to deal with the editing process, but I will probably edit the title to say I don’t endorse this version anymore, and make a new post based on this one.
I’m pretty confused about almost everything you said about “innate reward system”.
My view is: the relevant part of the human innate reward system (the part related to compassion, norm-following, etc.) consists of maybe hundreds of lines of code, and nobody knows what they are, and I would feel better if we did. (And that happens to be my own main research interest.)
Whereas your view seems to be: umm, I’m not sure, I’m gonna say things and you can correct me. Maybe you think that (1) the innate reward system is simple, (2) when we do RLHF, we are providing tens of thousands of samples of what the innate reward system would do in different circumstances, (3) and therefore ML will implicitly interpolate how the innate reward system works from that data, (4) …and this will continue to extrapolate to norm-following behavior etc. even in out-of-distribution situations like inventing new society-changing technology. Is that right? (I’m stating this possible argument without endorsing or responding to it, I’m still at the trying-to-understand-you phase.)
My general model of the way that the innate reward system works is that the following happens:
I agree with the claim that the innate reward system is simple.
The innate reward system exploits the fact that it can edit the weights and wiring of the brain, albeit limited by biology’s quirks, like its completely uninterpretable neurons. It uses the backpropagation algorithm, or a weaker variant thereof, to update the weights, much as RLHF or DPO (or whatever specific variant it is) trains a reward model for preference alignment. It continuously trains online on a lot of examples.
Yes, the ML/AI algorithm learns to interpolate from the data, and via weak priors plus the learned examples, it eventually figures out how the innate reward system works and what the reward function is.
I think one key reason why we can navigate out-of-distribution situations is that the innate reward system is fully online, and thus whenever it faces such situations, it’s able to react on the timescale of the rest of the brain and take action.
At the very least, this is a possible sketch of how we could make a reward system that lets us align the AI.
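To make the sketch above concrete, here is a minimal toy model, entirely my own illustrative construction rather than anything from the discussion: a linear reward model updated online from pairwise preference comparisons via the Bradley-Terry logistic loss, the same basic objective underlying RLHF/DPO-style reward modeling. The feature map and the assumed “innate” preference for behaviors near 0.5 are made up for illustration.

```python
import math
import random

def features(x):
    """Toy feature map for a 'behavior' represented as a single float."""
    return [1.0, x, x * x]

def reward(w, x):
    """Linear reward model: dot product of weights and features."""
    return sum(wi * fi for wi, fi in zip(w, features(x)))

def update(w, preferred, rejected, lr=0.1):
    """One online step of the Bradley-Terry pairwise logistic loss:
    raise the reward of the preferred behavior over the rejected one."""
    margin = reward(w, preferred) - reward(w, rejected)
    p = 1.0 / (1.0 + math.exp(-margin))   # P(preferred beats rejected)
    scale = lr * (1.0 - p)                # gradient of -log p w.r.t. margin
    for i, (fp, fr) in enumerate(zip(features(preferred), features(rejected))):
        w[i] += scale * (fp - fr)

random.seed(0)
w = [0.0, 0.0, 0.0]
# Suppose the hypothetical "innate" preference favors behaviors near x = 0.5.
for _ in range(2000):
    a, b = random.random(), random.random()
    pref, rej = (a, b) if abs(a - 0.5) < abs(b - 0.5) else (b, a)
    update(w, pref, rej)

# The learned reward should now rank behaviors near 0.5 above the extremes.
print(reward(w, 0.5) > reward(w, 0.0))
print(reward(w, 0.5) > reward(w, 1.0))
```

The point of the sketch is only that a simple learned reward model, trained online on many comparisons, recovers the underlying preference; it is not a claim about the brain’s actual implementation.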
Regarding the idea that there is a short code for how the innate reward system works:
I agree with the view that there probably is a short, powerful encoding of the innate reward system in humans, for the same reason as my argument that priors from genetics are probably very weak.
My claim here is that even the weaker reward model, where we use local update rules, is already enough to make alignment very likely, for the same reasons that the innate reward system is able to instill a lot of preferences reliably, like empathy for the ingroup, revenge when we are harmed, etc.
Your algorithm seems like a very good thing, if we could get at it, but even the weaker stuff enabled by SGD probably is enough to ensure alignment with very high probability.
On the topic of security mindset: the thing that the LW community calls “security mindset” isn’t even an accurate rendition of what computer security people would call security mindset. As noted by lc, actual computer security mindset is POC || GTFO. To translate that into lesswrongesse: you do not have warrant to believe in something until you have an example of the thing you’re worried about being a real problem, because otherwise you are almost certain to be privileging the hypothesis.
In the cybersecurity analogy, it seems like there are two distinct scenarios being conflated here:
1) Person A says to Person B, “I think your software has X vulnerability in it.” Person B says, “This is a highly specific scenario, and I suspect you don’t have enough evidence to come to that conclusion. In a world where X vulnerability exists, you should be able to come up with a proof-of-concept, so do that and come back to me.”
2) Person B says to Person A, “Given XYZ reasoning, my software almost certainly has no critical vulnerabilities of any kind. I’m so confident, I give it a 99.99999%+ chance.” Person A says, “I can’t specify the exact vulnerability your software might have without it in front of me, but I’m fairly sure this confidence is unwarranted. In general it’s easy to underestimate how your security story can fail under adversarial pressure. If you want, I could name X hypothetical vulnerability, but this isn’t because I think X will actually be the vulnerability, I’m just trying to be illustrative.”
Story 1 seems to be the case where “POC or GTFO” is justified. Story 2 seems to be the case where “security mindset” is justified.
It’s very different to suppose a particular vulnerability exists (not just as an example, but as the scenario that will happen), than it is to suppose that some vulnerability exists. Of course in practice someone simply saying “your code probably has vulnerabilities,” while true, isn’t very helpful, so you may still want to say “POC or GTFO”—but this isn’t because you think they’re wrong, it’s because they haven’t given you any new information.
Curious what others have to say, but it seems to me like this post is more analogous to story 2 than story 1.
The reason Person A in scenario 2 has the intuition that Person B is very wrong is that there are dozens, if not hundreds, of examples where people claimed no vulnerabilities and were proven wrong, usually spectacularly so, and often nearly immediately. Consider that even the most robust software, developed by the wealthiest and most highly motivated companies in the world, employing vast teams of talented software engineers, is on a monthly patch schedule to fix a constant stream of vulnerabilities. Given that, I think it’s pretty easy to immediately discount anybody’s claim of software perfection without requiring any further evidence.
All the evidence Person A needs is the complete and utter lack of anybody having achieved such a thing in the history of software to discount Person B’s claims.
I’ve never heard of an equivalent example for AI. It just seems to me like Scenario 2 doesn’t apply, or at least it cannot apply at this point in time. Maybe in 50 years we’ll have the vast swath of utter failures to point to, and thus a valid intuition against someone’s nine-nines confidence of success, but we don’t have that now. Otherwise people would be pointing out examples in these arguments instead of vague unease regarding problem spaces.
Well, no one has built an AGI yet, and if your plan is to wait until we have years of experience with unaligned AGIs before it’s OK to start worrying about the problem, that’s a bad plan.
Also, there are things which are not AGI but which are similar in various ways (software, deep neural nets, rocket navigation mechanisms, prisons, childrearing strategies, tiger-training-strategies) which provide ample examples of unseen errors.
Also, like I said, there ARE plenty of POCs for AGI risk.
At the very least I think it would be more accurate to say “one aspect of actual computer security mindset is POC || GTFO”. Right? Are you really arguing that there’s nothing more to it than that?? That seems insane to me.
Even leaving that aside, here’s a random bug thread:
IIUC they treated these crashes as a security vulnerability, not a mere usability problem, and thus did things like not publicly disclosing the details until they had a fix ready to go, categorizing the fix as a high-priority security update, etc.
If your belief is that “actual computer security mindset is POC||GTFO”, then I think you’d have to say that these Mozilla developers do not have computer security mindset, and instead were being silly and overly paranoid. Is that what you think?
You’re right that this is definitely not “security mindset”. Iceman is distorting the point of the original post. But also, the reason Mozilla’s developers can do that and get public credit for it is partially that the infosec community has developed tens of thousands of catastrophic RCEs from very similar exploit primitives, so there is loads of historical evidence that those particular kinds of crashes lead to exploitable bugs. Alignment researchers lack the same shared understanding: they’re mostly philosopher-mathematicians with no consensus even among themselves about what the real issues are, so if one tries to claim credit for averting catastrophe in a similar situation, it’s impossible to tell if they’re right.
This is exactly right. To put it more succinctly: Memory corruption is a known vector for exploitation, therefore any bug that potentially leads to memory corruption also has the potential to be a security vulnerability. Thus memory corruption should be treated with similar care as a security vulnerability.
POC || GTFO is not “security mindset”, it’s a norm. It’s like science in that it’s a social technology for making legible intellectual progress on engineering issues, and allows the field to parse who is claiming to notice security issues to signal how smart they are vs. who is identifying actual bugs. But a lack of “POC || GTFO” culture doesn’t tell you that nothing is wrong, and demanding POCs for everything obviously doesn’t mean you understand what is and isn’t secure. Or to translate that into lesswrongese, reversed stupidity is not intelligence.
But POC||GTFO is really important to constraining your expectations. We do not really worry about Rowhammer since the few POCs are hard, slow and impractical. We worry about Meltdown and other speculative execution attacks because Meltdown shipped with a POC that read passwords from a password manager in a different process, was exploitable from within Chrome’s sandbox, and my understanding is that POCs like that were the only reason Intel was made to take it seriously.
Meanwhile, Rowhammer is maybe a real issue but is so hard to pull off consistently and stealthily that nobody worries about it. My recollection is that when it was first discovered, people didn’t panic much, because there wasn’t warrant to panic. OK, so there was a problem with the DRAM. OK, what are the constraints on exploitation? Oh, the POCs are super tricky to pull off and often make the machine hard to use during exploitation?
A POC provides warrant to believe in something.
I’m confused about how POC||GTFO fits together with cryptographers starting to worry about post-quantum cryptography already in 2006, when the proof of concept was “we have factored 15 into 3×5 using Shor’s algorithm”? (They were running a whole conference on it!)
Citation needed? The one computer security person I know who read Yudkowsky’s post said it was a good description of security mindset. POC||GTFO sounds useful and important too but I doubt it’s the core of the concept.
Also, if the toy models, baby-AGI-setups like AutoGPT, and historical examples we’ve provided so far don’t meet your standards for “example of the thing you’re maybe worried about” with respect to AGI risk, (and you think that we should GTFO until we have an example that meets your standards) then your standards are way too high.
If instead POC||GTFO applied to AGI risk means “we should try really hard to get concrete, use formal toy models when possible, create model organisms to study, etc.” then we are already doing that and have been.
On POCs for misalignment, specifically goal misgeneralization: there are pretty fundamental differences between what has been shown and what was predicted. One of them is that in the demonstrations so far, the train and test behavior in different environments are similar or the same, while in the goal misgeneralization speculations, the train and test behavior are wildly different:
Rohin Shah has a comment on why most POCs aren’t that great here:
https://www.lesswrong.com/posts/xsB3dDg5ubqnT7nsn/poc-or-or-gtfo-culture-as-partial-antidote-to-alignment#P3phaBxvzX7KTyhf5
Nevertheless, if you think that this isn’t good enough and that people worried about AGI risk should GTFO until they have something better, you are the one who is wrong.
I don’t think people worried about AGI risk should GTFO.
I do think we should stop giving them as much credit as we do, because you are likely to be privileging the hypothesis, and it means we shouldn’t count the POCs as vindicating the people worried about AI safety, since their evidence doesn’t really support the claim of goal misgeneralization.
I think that’s a vague enough claim that it’s basically a setup for motte-and-bailey. “Stop giving them as much credit as we do.” Well I think that if ‘we’ = society in general, then we should start giving them way more credit, in fact. If ‘we’ = various LWers who don’t think for themselves and just repeat what Yudkowsky says, then yes I agree. If ‘we’ = me, then no thank you I believe I am allocating credit appropriately, I take the point about privileging the hypothesis but I was well aware of it already.
What this would look like in practice would be the following (Taken from the proposed break of optimization daemon/inner misalignment):
Someone proposes a break of AI that threatens alignment like optimization daemons.
We test the claim on toy AIs, and either it doesn’t work or it does work on them, then we move to the next step.
We test the alignment break in a more realistic setting, and it turns out that the perceived break goes away.
Now, the key point: if a proposed break goes away or becomes harder in more realistic settings, and especially if this keeps happening, we should avoid crediting its proposers with predicting the failure.
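The first step of this procedure, testing a claimed break on toy AIs, can be sketched concretely. Below is a deliberately tiny, hypothetical example of my own construction (a tabular Q-learner on a 1-D gridworld; the environment and all names are invented): the agent is trained with the goal on the right, then tested with the goal moved to the left, and its learned “go right” policy misgeneralizes. This is a miniature of the goal misgeneralization setups in the literature, not a claim about real systems.

```python
import random

class LineWorld:
    """Tiny 1-D gridworld: agent starts in the middle, reward at goal_pos."""
    def __init__(self, size=7, goal_pos=6):
        self.size, self.goal_pos = size, goal_pos
    def reset(self):
        self.pos = self.size // 2
        return self.pos
    def step(self, action):  # action: 0 = left, 1 = right
        self.pos = max(0, min(self.size - 1, self.pos + (1 if action else -1)))
        done = self.pos == self.goal_pos
        return self.pos, (1.0 if done else 0.0), done

def train_q(env, episodes=500, eps=0.1, lr=0.5, gamma=0.9):
    """Tabular Q-learning with epsilon-greedy exploration."""
    q = [[0.0, 0.0] for _ in range(env.size)]
    for _ in range(episodes):
        s = env.reset()
        for _ in range(50):
            greedy = 0 if q[s][0] > q[s][1] else 1
            a = random.randrange(2) if random.random() < eps else greedy
            s2, r, done = env.step(a)
            q[s][a] += lr * (r + gamma * max(q[s2]) - q[s][a])
            s = s2
            if done:
                break
    return q

def rollout(env, q, max_steps=20):
    """Run the greedy policy and return the final position."""
    s = env.reset()
    for _ in range(max_steps):
        s, _, done = env.step(0 if q[s][0] > q[s][1] else 1)
        if done:
            break
    return s

random.seed(0)
train_env = LineWorld(goal_pos=6)   # goal on the right during training
q = train_q(train_env)
test_env = LineWorld(goal_pos=0)    # goal moved to the left at test time
final_pos = rollout(test_env, q)
# The greedy policy still walks right, toward the old goal location.
print(final_pos)
```

The “break” shows up cleanly in the toy setting; the disputed empirical question is whether it persists or goes away in more realistic settings, which is exactly what the procedure above is meant to track.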
More generally, one issue I have is that I perceive an asymmetry between the “AI is dangerous” and “AI is safe” camps: if people were wrong about a danger, they’ll forget, or not reference the fact that they were wrong, but if they’re right about a danger, even if it’s much milder than predicted and some of their other predictions were wrong, people will treat them as an oracle.
A quote from lc’s post on POC or GTFO culture as counter to alignment wordcelism explains my thoughts on the issue better than I can:
Scott Alexander writes about the asymmetry in From Nostradamus To Fukuyama. Reversing biases of public perception isn’t much use for sorting out correctness of arguments.
I do have other issues with the security mindset, but that is an important issue I had.
Turning to this part though, I think I might see where I disagree:
It’s not just public perception; the researchers themselves are biased to believe that danger is happening or will happen. Critically, since this is asymmetrical, it has more implications for doomy people than for optimistic people.
It’s why I’m a priori a bit skeptical of AI doom, and it’s also why it’s consistent to believe that the real probability of doom is very low, almost arbitrarily low, while people think the probability of doom is quite high: you don’t pay attention to the non-doom outcomes or the things that went right, only the things that went wrong.
The researchers are not the arguments. You are discussing correctness of researchers.
Yes, that’s true, but I have more evidence than that; in particular, I have evidence that directly argues against the proposition of AI doom, and against a lot of the common arguments for it.
The researchers aren’t the arguments, but the properties of the researchers looking into the arguments, especially the ways they’re biased, do provide some evidence for certain propositions.
For white box vs black box, after further discussion I wound up feeling like people just use the term “black box” differently in different fields, and in practice maybe I’ll just taboo “black box” and “white box” going forward. Hopefully we can all agree on:
And likewise we can surely all agree that future AI programmers will be able to see the weights and perform SGD.
I don’t think any complete description of the LLM is going to be intuitive to a human, because it’s just too complex to fit in your head all at once. The best we can do is to come up with interpretations for selected components of the network. Just like a book or a poem, there’s not going to be a unique correct interpretation: different interpretations are useful for different purposes.
There’s also no guarantee that any of these mechanistic interpretations will be the most useful tool for what you’re trying to do (e.g. make sure the model doesn’t kill you, or whatever). The track record of mech interp for alignment is quite poor, especially compared to gradient-based methods like RLHF. We should accept the Bitter Lesson: SGD is better than you at alignment.
I would definitely like to see that argument made, as I suspect that a lot of LWers might disagree with this statement.
I think this is essentially what people mean when they say “LLMs are a black box” and since you seem to be agreeing, I find myself very confused that you’ve been pushing a “white box” talking point.
It seems that all parties including Nora agree with “If a LLM outputs A rather than B, and you ask me why, then it might take me decades of work to give you a reasonable & intuitive answer”. The disagreements are (1) whether we should care—i.e., whether this fact is important and worrisome in the context of safe & beneficial AGI, (2) what the terms “black box” and “white box” mean.
I think Nora’s comment here was taking an opportunity to argue her side of (1).
In Nora’s recent post, to her credit, she defined exactly what she meant by “white box” the first time she used the term, and her discussion was valid given that definition.
I think her recent post (and ditto the OP here) would have been clearer if she had (A) noted that people in the AGI safety space sometimes use “black box” to say something like the “decades of work” claim above, (B) explicitly said that the “decades of work” claim is obviously true and totally uncontroversial, (C) clarified that this popular definition of “black box / white box” is not the definition she’s using in this post.
(A similar suggestion also applies to the other side of the debate including me, i.e. in the unlikely event that I use the term “black box” to mean the “decades of work” thing, in my future writing, I plan to immediately define it and also explicitly say that I’m not using the term to discuss whether or not you can see the weights and perform SGD.)
Hmm, I guess the point of using the term “white box” is to illustrate that it is not a literal black box, while the point of the term “black box” is that even though it’s literally a transparent system, we still don’t understand it in the ways that matter. There’s something that feels really off about the dynamic of term use here, but I can’t quite articulate it.
The terms “white box” and “black box”, like pretty much all terms, are more than just their literal definitions, they are also trojan horses full of connotations and vibes. So of course it’s natural (albeit unfortunate and annoying) for people on both sides of a debate to try to get those connotations and vibes to work in service of their side. :-P
I’ll edit the post soon to focus on the fact that the white-box definition is not a standard definition of the term, and instead refers to the computer analysis/security sense of the term.
I definitely agree that tabooing white box vs black box is good. One point, though: the innate reward system does targeted updates to neural circuits using simple learning rules, which means that we can probably use SGD, combined with a weak prior, to make ourselves an innate reward system and get good results.
Admittedly, I do think that the pathway isn’t as complete as I’d like, but I do consider the ability to see the weights, check the Hessians, etc., to be an extremely powerful set of alignment tools, more powerful than appreciated.
This whole post seems to be about accident risk, under the assumption that competent programmers are trying in good faith to align AI to “human values”. It’s fine for you to write a blog post on that—it’s an important and controversial topic! But it’s a much narrower topic than “AI safety”, right? AI safety includes lots of other things too—like bad actors, or competitive pressures to make AIs that are increasingly autonomous and increasingly ruthless, or somebody making ChaosGPT just for the lols, etc. etc.
Indeed. No mention of misuse, multipolar traps, etc!
Given how scaling laws work, the power of AGI systems is/will be proportional to net training compute, so ‘lols’ doesn’t seem like much of a concern. These systems are increasingly enormous industrial-scale efforts, rapidly escalating towards Manhattan Project-scale.
One can argue that algorithmic & hardware improvements will never ever be enough to put human-genius-level human-speed AGI in the hands of tons of ordinary people e.g. university students with access to a cluster.
Or, one can argue that tons of ordinary people will get such access sooner or later, but meanwhile large institutional actors will have super-duper-AGIs, and they will use them to make the world resilient against merely human-genius-level-chaosGPTs, somehow or other.
Or, one can argue that ordinary people will never be able to do stupid things with human-genius-level AGIs because the government (or an AI singleton) will go around confiscating all the GPUs in the world or monitoring how they’re used with a keylogger and instant remote kill-switch or whatever.
As it happens, I’m pretty pessimistic about all of those things, and therefore I do think lols are a legit concern.
(Also, “just for the lols” is not the only way to get ChaosGPT; another path is “We should do this to better understand and study possible future threats”, but then fail to contain it. Large institutions could plausibly do that. If you disagree—if you’re thinking “nobody would be so stupid as to do that”—note the existence of gain-of-function research, lab leaks, etc. in biology.)
If ordinary people have access to human-genius-level AGIs, then there will be many AGIs at that level (along with some far more powerful above them), and thus these weaker agents almost certainly won’t be dangerous unless a significant fraction are not just misaligned in the most likely failure mode (selfish empowerment), but co-aligned specifically against humanity in their true utility functions (i.e. terminal rather than just instrumental values). These numerous weak AGIs are not much more dangerous to humanity than psychopaths (unaligned to humanity, yes, but also crucially unaligned with each other).
EY/MIRI has a weird argument about AIs naturally coordinating because they can “read each other’s source code”, but that wouldn’t actually cause true alignment of utility functions, just enable greater cooperation, and it isn’t really compatible with how DL AGI works anyway. There are strong economic/power incentives against sharing source code (open-source models lag). It’s also only really useful for deterministic systems, and ANNs are increasingly non-deterministic, moving towards BNNs in that regard. And it’s too difficult to verify against various spoofing mechanisms regardless: even if an agent’s source code is completely available and you have a full deterministic hash chain, it’s difficult to have any surety that the actual agent isn’t in some virtual prison with other agent(s) actually in control, unless its chain amounts to enormous compute.
I’ll note that a potential disagreement I have with your post on out-of-control AGIs ruining the world is that I actually expect the offense-defense balance to be much less biased towards attack than you suggest. In particular, to the extent AI improves things, my prior is that the improvement is symmetric, so the offense-defense balance ultimately doesn’t change.
I definitely agree with this, and I’ll probably change the title to focus on AI alignment.
My general view on the other problems of AI safety is that removing accident risk would make the following strategies much less positive EV:
General slowdowns of AI, because misuse is handlable in other, less negative EV ways.
Trying to break the Overton Window, as Eliezer Yudkowsky did, since governments and companies have incentives to restrict misuse.
And in particular, I think that removing the accident risk probably ought to change a lot of people’s p(doom), especially if accident risk is the main way they claim people will die, which is my sense of a lot of people’s models on LW, and is arguably the main reason people are scared of AI.
Also, I think that the type of governance would change assuming no accident risk.
I’ve upvoted this post because it’s a good collection of object-level, knowledgeable, serious arguments, even though I disagree with most of them and strongly disagree with the bottom line conclusion.
There is a good analogy between genetic brain evolution and technological AGI evolution. In both cases there is a clear bi-level optimization, with the inner optimizer using a very similar UL/RL intra-lifetime SGD (or SGD-like) algorithm.
The outer optimizer of genetic evolution is reasonably similar to the outer optimizer of technological evolution. The recipe which produces an organic brain is a highly compressed encoding or low frequency prior on the brain architecture along with a learning algorithm to update the detailed wiring during lifetime training. The genes which encode the brain architectural prior and learning algorithms are very close analogically to the ‘memes’ which are propagated/exchanged in ML papers and encode AI architectural prior and learning algorithms (ie the initial pytorch code etc).
The key differences are mainly just that memetic evolution is much faster—like an amplified artificial selection and genetic engineering process. For tech evolution a large number of successful algorithm memes from many different past experiments can be flexibly recombined in a single new experiment, and the process guiding this recombination and selection is itself running on the inner optimizer of brains.
Humans individually are not robustly aligned to the outer genetic optimizer: roughly 50% of humans choose not to have children and do the other thing instead, which is likely non-trivially misaligned with genetic fitness (IGF) [1]. Nate uses that as a doom argument, because if tech evolution proceeds like bio evolution, except that the first AGI to cross some threshold ends up taking over the world, then a 50% chance of non-trivial misalignment roughly translates to 50% doom.
Imagine if one historical person from thousands of years ago was given god-like magic power to arbitrarily rearrange the future. Seems roughly 50/50 whether that future would be reasonably aligned with IGF.
But of course that is not what happened with the evolution of homo sapiens. Even if humans are not robustly aligned to fitness/IGF at the individual level, we are robustly aligned at the population/species level[2]. The enormous success of homo sapiens, despite common misalignment at the individual level, is a clear illustration of how much more robust multi-polar scenarios can be.
As you argue in this post, it also seems likely that the same factors which improve the efficiency of memetic evolution (ie human engineering) over genetic evolution can/will be applied to improve the capability-weighted expected alignment of AGI systems vs that of brains.
Finally, one other hidden source of potential disagreement is the higher level question of degree of alignment between our individual utility functions and the utility function of global market tech evolution as a system. If you largely believe the “system itself is out of control”, you probably won’t be especially satisfied even if there is strong alignment between AGI and the system, if you believe that system itself is on the completely wrong track. That aspect is discussed less (and explicitly not a pillar of EY/MIRI doomer views AFAIK), but I do suspect it is a subtle important factor on p(doom) for many.
Optimizing for IGF doesn’t actually require having children oneself, especially if one’s genotype is already widely replicated, but it doesn’t seem likely that this substantially shifts the conclusion.
The average/expectation or more generally a linear combination of many utility vectors/functions can be arbitrarily aligned to some target even if nearly every single individual utility function in the set is orthogonal to the target (misaligned). Smoothing out noise (variance reduction) is crucial for optimization—whether using SGD or evolutionary search.
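This variance-reduction point can be checked numerically. Here is a small sketch of my own construction, with made-up dimensions and noise scale: each individual “utility vector” is the target plus large noise, so any single individual is nearly orthogonal to the target, yet the population average aligns closely with it.

```python
import math
import random

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

random.seed(0)
dim, n, noise_sd = 100, 10000, 3.0
target = [1.0] + [0.0] * (dim - 1)

# Each individual: the shared target plus large independent noise.
population = [[t + random.gauss(0.0, noise_sd) for t in target]
              for _ in range(n)]

# Average |cosine| of individuals with the target: small (near-orthogonal).
individual = sum(abs(cosine(v, target)) for v in population) / n

# The population mean vector: noise averages out, alignment survives.
mean_vec = [sum(v[i] for v in population) / n for i in range(dim)]

print(individual)                 # small: individuals are nearly orthogonal
print(cosine(mean_vec, target))   # large: the average points at the target
```

The design choice here mirrors the footnote’s claim: the linear combination smooths out the variance, so alignment of the aggregate can be much stronger than alignment of any individual.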
I view your final point as crucial. I would put an additional twist on it, though. During the approach to AGI, if takeoff is even a little bit slow, the effective goals of the system can change. For example, most corporations arguably don’t pursue profit exclusively even though they may be officially bound to. They favor executives, board members, and key employees in ways both subtle and obvious. But explicitly programming those goals into an SGD algorithm is probably too blatant to get away with.
AI is obviously on an S-curve, since eventually you run out of energy to feed into the system. But the top of that S-curve is so far beyond human intelligence, that this fact is basically irrelevant when considering AI safety.
The arguments about fundamental limits of computation (halting problem, etc.) are also irrelevant for similar reasons. Humans can’t even solve BB(6).
I definitely agree that the limit could end up being far beyond superhuman, but in that addendum, I was talking about limitations that would slow down AI right as it reaches compute and memory parity with humans. It’s possible that Addendum 2 does fail, though, so I agree that this isn’t conclusive. It was more to check the inevitability of a fast takeoff/intelligence explosion, not to show that it can’t happen.
I just saw this post and cannot parse it at all. You first say that you have removed the 9s of confidence. Then the next paragraph talks about a 99.9… figure. Then there are edit and quote paragraphs and I do not know whether these are your views or other or whether you endorse them.
I’ll probably need to edit that more completely, but for the moment a lot of the weirdness comes from the fact that my original confidence was 99.9999%+, and I somehow didn’t make it clear enough that this was the original version, not the new version.
I think it’d make sense to clarify what you mean here, since the following are very different:
I am >99.99999% confident that friendly AI will happen.
I am e.g. 70% confident that in >99.99999% of cases we get friendly AI.
I assume you mean something more like the latter.
In that case it’d probably be useful to give a sense of your actual confidence in the 99.99999%+ claim.
“Mostly don’t need to worry” would imply extremely high confidence.
Or do you mean something like “In most worlds it’ll be clear in retrospect that we needn’t have worried”?
I definitely mean the first one, and I’ll try to give some reasons why I’m so confident on AI alignment:
I believe the evidence from the human case is actually really strong. A lot of that comes from the fact that, for arguably the past 10,000+ years, our reward system has reliably imprinted in us a set of values, for example empathy for the ingroup, getting revenge when people have harmed us, etc., and over 95% of humans share the values that the reward system has implemented, which is ridiculously reliable. We also have the ability to implement much more complicated reward functions than evolution can, and that lets us drive the probability up really fast, due to this phenomenon:
Strong evidence is common, and the ability to add in more bits of evidence very quickly makes the total evidence ridiculously strong. I view the evidence from humans about alignment, plus the ability to implement complicated reward functions, as meaning that you can get very strong evidence for things even from a scarily weak prior, because each bit of evidence roughly halves the remaining probability of being wrong.
Some comments and Mark Xu’s post Strong Evidence is Common below:
https://www.lesswrong.com/posts/JD7fwtRQ27yc8NoqS/strong-evidence-is-common
https://www.lesswrong.com/posts/JD7fwtRQ27yc8NoqS/strong-evidence-is-common#itdkXwhitCcsyXC4q
One theory I subscribe to, prospect theory, holds that people drastically overestimate the probability of events with extraordinarily large positive or negative impact. The application here is that we are likely biased to overweight the probability of events with large negative impact, like going extinct, which is why I decided to avoid anchoring on LW estimates.
Ok, well thanks for clarifying.
I’d assumed you meant the second.
Some reasons I think that this confidence level is just plain silly (not an exhaustive list!):
First, you’re misapplying strong-evidence-is-common—see Donald’s comment (or indeed mine).
Strong evidence getting from [hugely unlikely] to [plausible] is common; from [plausible] to [hugely likely] is rare.
A lot of strong evidence comes from [locating a hypothesis h and having a strong reason to think that p(h is true | h was located) is high]. If you selected the hypothesis at random, locating it gives you almost no evidence, since you don’t have the second part. Similarly if wishful thinking is an easy way to locate a hypothesis.
All Mark’s examples have the form [trusted source tells me h is true].
Second, you should have nothing close to 99.99999% credence that an AI aligned as well as a human is safe. We have observations that humans usually behave well when they're on distribution, and in a game-theoretic context where it's to their advantage to behave well. Take a human far off distribution, and we have no guarantees—not simply no guarantee that they'll act the same: no guarantee that they'll feel or think the same either.
I note that we don’t observe human values; we observe human behaviour. My internal sense that I really, truly care about x need not reflect reality—that caring can be entirely contingent without my being aware that it’s contingent.
Most on-distribution humans when asked to imagine what they’d do when far off distribution will naturally sound like they’d have similar values to on-distribution humans—but this is simply because you’re asking the on-distribution human.
The more sensible answer to “what would you do/feel/think when wildly off distribution?” is “I have very high uncertainty, since the version of me in that situation may have wildly different emotions/drives/thought-processes”.
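The off-distribution point can be made concrete with a toy model (my own illustration, assuming numpy): a function approximator that fits the training range essentially perfectly is still almost unconstrained far outside it, so "behaves well on distribution" licenses very little about behaviour off distribution.

```python
import numpy as np

x_train = np.linspace(0, 1, 30)
y_train = np.sin(2 * np.pi * x_train)

# A degree-9 polynomial matches the target almost perfectly on the training range...
coeffs = np.polyfit(x_train, y_train, deg=9)
on_dist_error = np.max(np.abs(np.polyval(coeffs, x_train) - y_train))

# ...but nothing in the training data constrains it at x = 3,
# where the same polynomial diverges wildly from the target.
off_dist_error = abs(np.polyval(coeffs, 3.0) - np.sin(2 * np.pi * 3.0))
```

The analogy is loose (humans and AIs aren't polynomial regressions), but it illustrates why on-distribution good behaviour and off-distribution guarantees are separate questions.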
The reason that being able to align a system [about as well as an average human] would be a huge step forward is not that this is close to a win condition in itself. It’s that the ability to do that would likely imply a level of understanding that would put us a lot closer to a win condition.
Note that this is a path to progress only if achieving alignment [about as well as an average human] is hard. If it’s hard, it implies that we needed a lot of understanding to get there, and can hope to leverage that understanding to get a solution that might work.
If we get something aligned approximately as much as the average human by default, this implies no great advance in understanding, and wouldn’t imply we were close to a solution. (it’d still be great; I’m certainly not claiming it’d be useless—just far from clear it’d be most of the way to a solution)
Could there, in principle, be a solution that relies on an analogous game-theoretic balance to that which exists in human society, so that average-human alignment is sufficient? Sure.
Is that a future where humans are ok? No particular reason to think so.
Is it a solution we should be >99% confident we can arrange? Clearly not.
Third, you’re clearly much too confident that you understand the arguments of those who think alignment is hard. Are you sure you’re understanding exactly what point is being made with the evolution analogy or sharp-left turn description? What are the odds that you’re attacking a straw-man version of these in places? What are the odds you’re missing the point? What are the odds that you’re failing to generalize appropriately—that your counter-arguments only apply to a narrow subset of the problem being gestured at?
Not anchoring on LW conclusions is sensible. Not worrying about understanding the arguments isn’t. (fine if you don’t have time—but then dial down your confidence levels appropriately)
The evolution analogy / sharp-left-turn are pointing at clusters of issues, not claiming to be directly analogous to ML in every sense. That neither argument convinces you is some evidence that the particular failure mode you believe they’re pointing at isn’t a failure mode. Going from there to >99% confidence that there are no failure modes is a leap.
I might want to reduce my confidence, and I have edited the post to remove the 9s for now, but a potential reason comes from Nora Belrose in the AI optimism Discord:
“OTOH, if I put doom in the reference class of ‘things I used to believe, kinda’ then perhaps I should feel comfortable putting e.g. 10^-5 credence in doom, since I put << 10^-5 credence in Christianity being true, and < 10^-5 credence in Marxism (although the truth conditions for Marxism are murkier).”
I sort of agree with this, but with a huge caveat: if an anthropologist 100,000 years ago had somehow managed to understand the innate reward system, they would likely have predicted that human values would be essentially fairly universal things like empathy for the ingroup, parental instinct, and revenge, and they would have an impressive track record of such predictions.
Some object-level stuff first:
I think my main disagreement comes down to:
Being [well-behaved as far as we can tell] in training is always very weak evidence that behaviour will generalize as we’d wish it to.
I don’t say “aligned in the training data”, since alignment is about robust generalization of good behaviour. Evidence of alignment is evidence of desirable generalization. Eliezer isn’t claiming we won’t get approximately perfect behaviour (as far as we can tell) on the training data; he’s claiming that this gets us almost nowhere in terms of alignment.
Caveat—this is contingent on what counts as ‘behaviour’ and on our tools; if behaviour includes activations, and our tools have hugely improved, this may be progress.
Arguments against particular failure modes often come down to [from what we can tell, inductive bias will tend to push against this particular type of failure].
Of course here I’d point at “from what we can tell” and “tend to”.
However, the more fundamental point is that we have no reason to think that inductive bias pushes towards success either.
Does the simplest solution compatible with [good behaviour as far as we can tell] on the training data generalize exactly as we’d wish it to? Why would this be the case?
Does the fastest? Again, why would we expect this?
Does the [insert our chosen metric]est? Why?
I do expect that there exists some metric with a rich set of inputs (including weights, activations etc.) that would give robustly desirable generalization.
I expect that finding such a metric will require deep understanding.
Expecting a simple metric found based on little understanding to be sufficient is equivalent to assuming that there’s something special about the kind of generalization we would like (other than that we like it).
This is baseless—it’s why I don’t like the term “misgeneralization”, since it can suggest that there’s some natural ‘correct’ generalization, which would be the default outcome if nothing goes wrong. There is no such natural correct generalization (or at least, I’ve seen/made no argument for one—I think natural abstractions may get you [AI will understand our ontology/values], but not that it’s likely to generalize according to those values).
One reply to this is that we don’t have to be that precise—just look at humans. However, humans aren’t an example of successful alignment. (see above—and below)
A few points here:
Given some claim x you can always find some category it belongs to that contains either [things much more likely to be true than x] or [things much less likely to be true than x] - particularly if you cherry-pick even within that category.
A general principle is that you need to use all your bearing-on-x evidence if you want to form an accurate estimate for x (and since you won’t have time, you want some unbiased approximation). If you pick a small subset of available evidence without care to avoid bias, then your estimate will tend to be badly wrong.
If the only evidence you had were [I had an argument for a very weird conclusion that I now realize is invalid], you’d be reasonable in thinking the conclusion were highly unlikely—but this is not your only evidence.
It’s a pretty standard mistake to overcompensate when moving from [I believe [thing with significant influence on how I live my life]] to [I don’t believe [thing with significant influence on how I live my life]].
It’s hard to break away from a strongly held, motivating belief, but it’s even harder to do so without overcorrecting. In fact, I’d guess that initial overcorrection is often the rational thing to do if we’re aiming at having an accurate assessment later.
It might be bad form to focus on psychology in debates, and I’d like to be clear that my claim is not [Nora/you are clearly making such errors].
The claim I will make is that reflecting on our own psychological reasons to want to believe x should be a standard tool. Ideally we’d do it all the time, but it’s most important when some aspect of your model/argument/belief-state is just as you’d wish it to be—that’s a red flag.
A complex, important-to-you thing being almost exactly as you’d wish it to be should be highly surprising, and therefore somewhat suspicious.
For example, I might:
Want to be certain about x.
Want x to be true.
Want my conclusion to appear measured/reasonable/balanced. (I’m so wise with my integration of twelve different perspectives and nuanced 60% credence in x!)
Only you have much hope to get at what’s going on in your head—but it’s important to look (and to be highly suspicious of reflex justifications that just happen to point at exactly the conclusions you’d wish them to).
Obviously I also need to do this, I also frequently fail etc. (many failures being of the form [not even noticing a question])
Going from [believe x] to [disbelieve x] tends to happen when I falsify my arguments for x. However, this shouldn’t take me to [disbelieve x], but initially only to [I believed x for invalid reasons]. Once I make the update to [my reasons were invalid] it’s important for me to reassess my takes on e.g. [the best-informed people believe x for reasons like mine] or [the reasons I believe(d) x are among the strongest arguments for x].
Psychological red flag here: it’s nicer to believe [all the people who believe x had invalid reasons] than to believe [I had invalid reasons, but perhaps others had good reasons I didn’t find/understand].
[Note: in the following, I’m saying [if such reasoning is used, it doesn’t lead where we’d like it to], and not [I fully endorse such reasoning] - though it’s at least plausible]
I expect that they may have made some good predictions on future behaviour (after taking a break to invent writing, elementary logic and suchlike...). However this works primarily on a [predict that <instrumentally useful for maintenance/increase of influence> things become values] basis.
That kind of approach allows us to make plausible predictions only so long as it’s difficult to preserve/increase influence—the constraint [you ‘must’ act so as to maintain/increase your influence on the future] tells us a lot in such cases.
Once the constraints are removed (simple example being a singleton ASI), such reasoning tells us nothing: maintenance/increase of influence is easy, so the agent has huge freedom.
What will an agent tend to want in such circumstances? Likely what it wanted before, only generalized by processes that would have been instrumentally useful. Note in particular that there’s never any pressure towards (behaviour x should generalize desirably to situations where there aren’t constraints). Precisely the reverse: behaviour in unconstrained situations is a degree of freedom we should expect to be used to increase influence in constrained situations.
The same reasoning that gets you to [empathy for the ingroup] gets you to [gain influence over the future] - I note again here that humans are in a game-theoretic situation where a lot of cooperation and nice/kind behaviour tends to coincide with maintaining/gaining influence (and/or tended to do so in the ancestral environment).
Decisions where various values have influence would tend to get resolved by [would have been instrumentally useful] processes. Importantly, such processes may contain pointers—e.g. to [figure out this value] or [calculate who gains here] or [find the best plan for …] (likely not explicitly in this form—but with some level of indirection).
If we e.g. dial up the available resources, should we usually expect [process that had desirable outcomes with fewer resources] to continue to have desirable outcomes? Only to the extent that there was strong pressure for the process to be robust in this sense. Will this be reliably true? No.
By default, we get no guarantees here. We might hope to get guarantees to the extent that we have good understanding of how internal processes will generalize, and great understanding of self-correction mechanisms.
If I imagine a scenario where something with humanlike values (or indeed a group) becomes more and more powerful, yet things go well, this relies on great caution together with extremely good self-understanding and self-correction. (I don’t expect these things to be used by default in a trained system, since any simpler [or preferred-by-inductive-bias] shortcut will be preferred to the general version)
One issue here is that ideally it’d be nice to test what a system would do without constraints (or with reduced constraints). However, we can’t do this so long as we maintain the ability to disempower it: that’s an extreme constraint.
But to summarize, I’d say:
I don’t expect we’ll get [generalizes like a human] without much better understanding, since I don’t expect that this is the outcome of inductive bias and [behaves like a human in training as far as we can tell].
If we did get [generalizes like a human], it wouldn’t be a win condition without a bunch of understanding. (since I expect we’d need great understanding, I do think it’d be progress—but almost entirely due to the understanding)
This will be a long comment, so get a drink and a snack.
I agree with this, assuming 0 prior, but I expect to disagree on the strength of the prior necessary in order to generalize correctly.
My claim is essentially the opposite of this: the reason humans generalized correctly from limited examples of values like empathy for the ingroup (a placeholder here; it could be replaced by almost any value), and didn't just trick their reward system, isn't that special. It's basically a consequence of weak prior information from the genome, plus the innate reward system using backpropagation (or a weaker variant of it) to update the neural circuitry, reinforcing certain behaviors and penalizing others.
This was meant to be an example of the values that the innate reward system could align us to, not what things resulted from holding this set of values. When I use an example, it’s essentially a wildcard, such that it can stand for almost arbitrary values.
This turns out to be a crux, in that I think that the understanding required is probably minimal, compared to the majority of LWers like you.
This is a tautology, not an example of successful alignment:
Humans trick their reward systems as much as humans trick their reward systems.
Imagine a case where we did “trick our reward system”. In such a case the human values we’d infer would be those that we’d infer from all the actions we were taking—including the actions that were “tricking our reward system”.
We would then observe that we’d generalized entirely correctly with respect to the values we inferred. From this we learn that things tend to agree with themselves. This tells us precisely nothing about alignment.
I note for clarity that it occurs to me to say:
Indeed we do observe some humans doing what most of us would think of as tricking their reward systems (e.g. self-destructive drug addictions).
You may respond “Ah, but that’s a small proportion of people—most people don’t do that!”—at which point we’re back to tautology: what most people do will determine what is meant by “human values”. Most people are normal, since that’s how ‘normal’ is defined.
The only possible evidence I could provide that we do “trick our reward system” is to point to things that aren’t normal, which must necessarily be unusual.
If you’re only going to think that alignment is hard if I can point to a case where most people are doing something unusual, then I’m out of options: that’s not a possible world.
I’ll rewrite that to “generalized correctly from limited examples of stuff like empathy for the ingroup, where empathy for the ingroup here could be replaced by almost any value and is thus a placeholder”, because I accidentally made a tautology here.
I don’t think it’s accidental—it seems to me that the tautology accurately indicates where you’re confused.
“generalised correctly” makes an equivalent mistake: correctly compared to what? Most people generalise according to the values we infer from the actions of most people? Sure. Still a tautology.
Treacherous turn failure modes, examples of which are below:
Humans seeming to have empathy for, say, 25 years in order to play nice with their parents, and then making a treacherous turn, say by killing other people who are part of their ingroup.
More generally, humans mostly avoid the treacherous turn failure mode, where a person appears to have values consistent with human morals, but then reveals that they never had those values all along, and hurts other people.
More generally, the extreme stability of values gives evidence that it’s very difficult to have a human that executes a treacherous turn.
That's the type of thing which I call generalizing correctly, since it basically rules out deceptive alignment out of the gate, contra Evan Hubinger's fear of AIs developing deceptive alignment.
In general, one of the miracles is that the innate reward system plus very weak genetic priors can rule out so many dangerous types of generalizations, which is a big source of my optimism here.
For this kind of thing to be evidence, you’d need the human treacherous turn to be a convergent instrumental strategy to achieve many goals.
The AI case for treacherous turns is:
AI ends up with weird-by-our-lights goal. (e.g. a rough proxy for the goal we intended)
The AI cooperates with us until it can seize power.
The AI does a load of treacherous-by-our-lights stuff in order to seize power.
The AI uses the power to effectively pursue its goal.
We don’t observe this in almost any human, since almost no human has the option to gain enormous power through treachery.
When humans do have the option to gain enormous power through treachery, they do sometimes do this.
Of course, even for the potentially-powerful it’s generally more effective not to screw people over (all else being equal), or at least not to be noticed screwing people over. Preserving options for cooperation is useful for psychopaths too.
The treacherous turn argument is centrally about instrumentally useful treachery.
Randomly killing other people is very rarely useful.
No-one is claiming that AI treachery will be based on deciding to be randomly nasty.
If we gave everyone a take-over-the-world button that only works if they first pretend that they’re lovely for 25 years, certainly some people would do this—though by no means all.
And here we’re back to the tautology issue:
Why is it considered treacherous for someone to pretend to be lovely for 25 years, then take over the world, so that many people wouldn’t want to do it? Because for a long time we’ve lived in a world where actions similar to this did not lead to cultures that win (noting here that this level of morality is cultural more than genetic—so we’re selecting for cultures-that-win).
If actions similar to this did lead to winning cultures, after a long time we’d expect to see [press button after pretending for 25 years] to be both something that most people would do, and something that most people would consider right to do.
We were never likely to observe common, universally-horrifying behaviour:
If it were detrimental to a (sub)culture, it’d be selected against and wouldn’t exist.
If it benefitted a culture, it’d be selected for, and no longer considered horrific.
(if it were approximately neutral, it’d similarly no longer be considered horrific—though I expect it’d take a fair bit longer: [considering things horrific] imposes costs; if it’s not beneficial, we’d expect it to be selected out)
If it were just too hard to get correct generalization, where “correct” here means [sufficient for humans to persist over many generations], then we wouldn’t observe incorrect generalization: we wouldn’t be here.
If anything, we’d find that everything else had adapted so that an achievable degree of correct generalization were sufficient. We’d see things like socially enforced norms, implicit threats of violence, judicial systems etc. This [achievable degree of correct generalization] would then be called “correct generalization”.
Again, I don’t see a plausible counterfactual world such that “correct” generalization would seem hard from within the world itself. Sufficiently correct generalization must be commonplace. “Sufficiently correct” is what the people will call “correct”.
My view on this is unfortunately unlikely to be resolved in a comment thread, but I can clarify two things here about human values and evidence bases:
This: “If it were just too hard to get correct generalization, where ‘correct’ here means [sufficient for humans to persist over many generations], then we wouldn’t observe incorrect generalization: we wouldn’t be here. If anything, we’d find that everything else had adapted so that an achievable degree of correct generalization were sufficient. We’d see things like socially enforced norms, implicit threats of violence, judicial systems etc. This [achievable degree of correct generalization] would then be called ‘correct generalization’.”
is probably not correct, and we can in fact update normally from the fact that human behavior is surprisingly good. The quoted reasoning is a case of the anthropic shadow, and there are reasonable arguments against the anthropic shadow existing.
For more on this, see SSA Rejects Anthropic Shadow, Too by Jessica Taylor and Anthropically Blind: The Anthropic Shadow Is Reflectively Inconsistent by Christopher King.
Links are below:
https://www.lesswrong.com/posts/LGHuaLiq3F5NHQXXF/anthropically-blind-the-anthropic-shadow-is-reflectively
https://www.lesswrong.com/posts/EScmxJAHeJY5cjzAj/ssa-rejects-anthropic-shadow-too
I have a different causal story from yours about why this happens: “Why is it considered treacherous for someone to pretend to be lovely for 25 years, then take over the world, so that many people wouldn’t want to do it?”
At least for my own causal story on why people don’t usually want to take over the world and kill people, it goes something like this:
There is a weak prior in the genome for things like not seizing power to kill people in your ingroup, and the prior is weak enough that the value acts as a wildcard: aligning the system to some other value would work more or less as well.
The brain's innate reward system uses something like DPO, RLHF, or whatever else, to create a preference model that guides the intelligence into being aligned with whatever values the innate reward system wants, say empathy for the ingroup, though this is only a motivating example.
It uses backprop, or a weaker variant of it, and at a high level probably uses an optimizer at best comparable to gradient descent. Since it has white-box access and can update the brain in a targeted way, it can efficiently compute the optimal direction to improve its performance on, say, having empathy for the ingroup (again, a wildcard that could stand in for almost any value).
The loop of weak prior + innate reward system + an algorithm to implement it, like backprop or its weaker variants, means that eventually the human, by 25 years old, is very aligned with the values that the innate reward system put in place, like empathy for the ingroup (again, only an example of an alignment target; you could put almost arbitrary alignment targets in there).
That’s my story of how humans are mostly able to avoid misgeneralization, and learn values correctly in the vast majority of cases.
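The white-box claim above (that targeted internal updates efficiently find the improvement direction) can be sketched in optimization terms. This is my own loose analogy, assuming numpy, with hypothetical function names: with internal access you compute the exact descent direction in one pass, while a black-box observer must probe one dimension at a time to estimate the same direction.

```python
import numpy as np

def loss(w, target):
    # Squared distance between the current "values" w and the reward system's target.
    return np.sum((w - target) ** 2)

def white_box_direction(w, target):
    # With internal (white-box) access, the exact gradient is available analytically.
    return 2 * (w - target)

def black_box_direction(w, target, eps=1e-4):
    # Without internal access, the gradient must be estimated by finite differences,
    # costing two loss evaluations per dimension.
    g = np.zeros_like(w)
    for i in range(len(w)):
        e = np.zeros_like(w)
        e[i] = eps
        g[i] = (loss(w + e, target) - loss(w - e, target)) / (2 * eps)
    return g

rng = np.random.default_rng(0)
target = rng.normal(size=50)  # stand-in for the values the reward system "wants"
w = rng.normal(size=50)       # current state

# Both directions agree, but the white-box one needed no extra probing,
# and a step along it moves w toward the target.
gw = white_box_direction(w, target)
gb = black_box_direction(w, target)
w_next = w - 0.1 * gw
```

Whether the brain's reward system actually has anything like this gradient access is, of course, exactly what is in dispute; the sketch only shows why such access would make value-shaping cheap if it existed.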
I’m not reasoning anthropically in any non-trivial sense—only claiming that we don’t expect to observe situations that can’t occur with more than infinitesimal probability.
This isn’t a [we wouldn’t be there] thing, but a [that situation just doesn’t happen] thing.
My point then is that human behaviour isn’t surprisingly good.
It’s not surprisingly good for human behaviour to usually follow the values we infer from human behaviour. This part is inevitable—it’s tautological.
Some things we could reasonably observe occurring differently are e.g.:
More or less variation in behaviour among humans.
More or less variation in behaviour in atypical situations.
More or less external requirements to keep behaviour generally ‘good’.
More or less deviation between stated preferences and revealed preferences.
However, I don’t think this bears on alignment, and I don’t think you’re interpreting the evidence reasonably.
As a simple model, consider four possibilities for traits:
x is common and good.
y is uncommon and bad.
z is uncommon and good.
w is common and bad.
x is common and good (e.g. empathy): evidence for correct generalisation!
y is uncommon and bad (e.g. psychopathy): evidence for mostly correct generalization!
z is uncommon and good (e.g. having boundless compassion): not evidence for misgeneralization, since we’re only really aiming for what’s commonly part of human values, not outlier ideals.
w is common and bad (e.g. selfishness, laziness, rudeness...) - choose between:
[w isn’t actually bad, all things considered… correct generalization!]
[w is common and only mildly bad, so it’s best to consider it part of standard human values—correct generalization!]
It seems to me that the only evidence you’d accept of misgeneralization would be [terrible and common] - but societies where terrible-for-that-society behaviours were common would not continue to exist (in the highly unlikely case that they existed in the first place).
Common behaviour that isn’t terrible for society tends to be considered normal/ok/fine/no-big-deal over time, if not initially (that or it becomes uncommon) - since there’d be a high cost both individually and societally to consider it a big deal if it’s common.
If you consider any plausible combination of properties to be evidence for correct generalization, then of course you’ll think there’s been correct generalization—but it’s an almost empty claim, since it rules out almost nothing.
Most people tend to act in ways that preserve/increase their influence, power, autonomy and relationships, since this is useful almost regardless of their values. This is not evidence of correct generalization—it’s evidence that these behaviours are instrumentally useful within the environment ([not killing people] being one example).
To get evidence of something like ‘correct’ generalization, you’d want to look at circumstances where people get to act however they want without the prospect of any significant negative consequence being imposed on them from outside.
Such circumstances are rarely documented (documentation being a potential source of negative consequences). However, I’m going to go out on a limb and claim that people are not reliably lovely in such situations. (though there’s some risk of sampling bias here: it usually takes conscious effort to arrange for there to be no consequences for significant actions, meaning there’s a selection effect for people/systems that wish to be in situations without consequences)
I do think it’d be interesting to get data on [what do humans do when there are truly no lasting consequences imposed externally], but that’s very rare.
I did try to provide a causal story for why humans could be aligned to some value without relying much on societal incentives, so you can check out the second part of my comment.
My non-tautological claim is that the reason isn’t behavioral, but instead internal, and in particular the innate reward system plays a big role here.
In essence, my story on how humans are aligned with the values of the innate reward system wasn’t relying on a behavioral property.
I’ll reproduce it, so that you can focus on the fact that it didn’t rely on behavioral analysis:
There is a weak prior in the genome for things like not seizing power to kill people in your ingroup, and the prior is weak enough that the value acts as a wildcard: aligning the system to some other value would work more or less as well.
The brain's innate reward system uses something like DPO, RLHF, or whatever else, to create a preference model that guides the intelligence into being aligned with whatever values the innate reward system wants, say empathy for the ingroup, though this is only a motivating example.
It uses backprop, or a weaker variant of it, and at a high level probably uses an optimizer at best comparable to gradient descent. Since it has white-box access and can update the brain in a targeted way, it can efficiently compute the optimal direction to improve its performance on, say, having empathy for the ingroup (again, a wildcard that could stand in for almost any value).
The loop of weak prior + innate reward system + an algorithm to implement it, like backprop or its weaker variants, means that eventually the human, by 25 years old, is very aligned with the values that the innate reward system put in place, like empathy for the ingroup (again, only an example of an alignment target; you could put almost arbitrary alignment targets in there).
Critically, it makes very little reference to society or behavioral analysis, so I wasn’t making the mistake you said I made.
It is also no longer a tautology, as it depends on the innate reward system actually rewarding desired behavior by changing the brain's weights; removing the innate reward system, or showing that the weak prior + value learning strategy was ineffective, would break my thesis.
This still seems like the same error: what evidence do we have that tells us the “values the innate reward system put in place”? We have behaviour.
We don’t know that [system aimed for x and got x].
We know only [there’s a system that tends to produce x].
We don’t know the “values of the innate reward system”.
The reason I’m (thus far) uninterested in a story about the mechanism, is that there’s nothing interesting to explain. You only get something interesting if you assume your conclusion: if you assume without justification that the reward system was aiming for x and got x, you might find it interesting to consider how that’s achieved—but this doesn’t give you evidence for the assumption you used to motivate your story in the first place.
In particular, I find it implausible that there’s a system that does aim for x and get x (unless the ‘system’ is the entire environment):
If there are environmental regularities that tend to give you elements of x without your needing to encode them explicitly, those regularities will tend to be ‘used’ - since you get them for free. There’s no selection pressure to encode or preserve those elements of x.
If you want to sail quickly, you take advantage of the currents.
So I don’t think there’s any reasonable sense in which there’s a target being hit.
If a magician has me select a card, looks at it, then tells me that’s exactly the card they were aiming for me to pick, I’m not going to spend energy working out how the ‘trick’ worked.
It sounds like we've reached the crux of my optimism. You think that for a system to aim for x, it essentially needs to be the entire environment, with the environment largely dictating human values, whereas I think human values are less dependent on the environment and far more dependent on the genome + learning process. Equivalently, I place much more emphasis on humans' internals as the main contributor to values, while you emphasize the external environment far more than internals like the genome or learning process.
This could be disentangled into 2 cruxes:
Where are human values generated.
How cheap is it to specify values, or alternatively how weak do our priors need to be to encode values (if you are encoding values internally.)
And I’d expect my answers to be: mostly internal (the genome plus learning process, with a little help from the environment) on the first question, and relatively cheap to specify values on the second. Whereas you’d probably answer that the environment basically sets the values, with little or no help from humans’ internals, on the first question, and that values are very expensive to specify on the second.
For some of my reasoning on this, I’d recommend reading posts like these:
https://www.lesswrong.com/posts/HEonwwQLhMB9fqABh/human-preferences-as-rl-critic-values-implications-for
(Basically argues that the critic in the brain generates the values)
https://www.lesswrong.com/posts/CQAMdzA4MZEhNRtTp/human-values-and-biases-are-inaccessible-to-the-genome
(The genomic prior can’t be strong, because it has massive limitations in what it can encode).
The central crux really isn’t where values are generated. That’s a more or less trivial aside. (though my claim was simply that it’s implausible the values aimed for would be entirely determined by genome + learning process; that’s a very weak claim; 98% determined is [not entirely determined])
The crux is the tautology issue: I’m saying there’s nothing to explain, since the source of information we have on [what values are being “aimed for”] is human behaviour, and the source of information we have on what values are achieved, is human behaviour.
These things must agree with one-another: the learning process that produced human values produces human values. From an alignment difficulty perspective, that’s enough to conclude that there’s nothing to learn here.
An argument of the form [f(x) == f(x), therefore y] is invalid.
f(x) might be interesting for other reasons, but that does nothing to rescue the argument.
That’s our disagreement: we have more information than that. I agree human behavior plays a role in my evidence base, but I have more evidence than that.
In particular I am using results from both ML/AI and human brain studies to inform my conclusion.
Basically, my claim is that [f(x) == f(y), therefore z].
But humans are capable of thinking about what their values “actually should be” including whether or not they should be the values evolution selected for (either alone or in addition to other things). We’re also capable of thinking about whether things like wireheading are actually good to do, even after trying it for a bit.
We don’t simply commit to tricking our reward systems forever and only doing that, for example.
So that overall suggests a level of coherency and consistency in the “coherent extrapolated volition” sense. Evolution enabled CEV without us becoming completely orthogonal to evolution, for example.
A few points here:
We don’t have the option to “trick our reward systems forever”—e.g. because becoming a heroin addict tends to be self-destructive. If [guaranteed 80-year continuous heroin high followed by painless death] were an option, many people would take it (though not all).
The divergence between stated preferences and revealed preferences is exactly what we’d expect to see in worlds where we’re constantly “tricking our reward system” in small ways: our revealed preferences are not what we think they “actually should be”.
We tend to define large ways of tricking our reward systems as those that are highly self-destructive. It’s not surprising that we tend to observe few of these, since evolution tends to frown upon highly self-destructive behaviour.
Again, I’d ask for an example of a world plausibly reachable through an evolutionary process where we don’t have the kind of coherence and consistency you’re talking about.
Being completely orthogonal to evolution clearly isn’t plausible, since we wouldn’t be here (I note that when I don’t care about x, I sacrifice x to get what I do care about—I don’t take actions that are neutral with respect to x).
Being not-entirely-in-line with evolution, and not-entirely-in-line with our stated preferences is exactly what we observe.
Regarding security mindset, I think it really kicks in when you have a system utilising its intelligence to work around any limitations, such that you’re no longer looking at a “broad, reasonable” distribution over the space, but at a “very specific” scenario that a powerful optimiser has pushed you towards. In that case, doing things like doubling the size may break your safety schemes, if the AI now has the intelligence to get around them.
The problem here is that it shares a similar issue to optimization daemons/goal misgeneralization, etc, and a comment from Iceman sums it up perfectly:
“or trying to translate that into lesswrongesse, you do not have warrant to believe in something until you have an example of the thing you’re maybe worried about being a real problem because you are almost certain to be privileging the hypothesis.”
https://www.lesswrong.com/posts/99tD8L8Hk5wkKNY8Q/?commentId=xF5XXJBNgd6qtEM3q
Or equivalently from lc: “you only start handing out status points after someone has successfully demonstrated the security failure, ideally in a deployed product or at the very least a toy program.”
This is to a large extent the issue I have with attempted breaks on alignment: pretty much no alignment break has been demonstrated, and in the cases where one has, the results range from very mixed to slightly positive at best.
The POC || GTFO article was very interesting.
I do worry though that it is mixing together pragmatics and epistemics (even though it does try to distinguish the two). Like there’s a distinction between when it’s reasonable to believe something and when it’s reasonable to act upon something.
For example, when I was working as a web developer, there were lots of potential bugs where it would have made sense to believe there was a decent chance we were vulnerable, but pragmatically we couldn’t spare the time to fix every potential security issue. That doesn’t mean I should walk around saying “therefore they aren’t there”, though.
I’ll admit, if someone randomly messaged you some of the AI risk arguments and no one else was worried about them, it’d probably be reasonable to conclude that there’s a flaw there and put them aside.
On the other hand, when even two deep learning Turing prize winners are starting to get concerned, and the stakes are so high, I think we should be a bit more cautious regarding dismissing the arguments out of hand.
I agree, which is why I have a section or two about why I think ML/AI isn’t like computer security.
Who thinks that? I don’t think that. Ajeya doesn’t think that.
I’m going to defend that addendum weakly, but I think it’s implicit in a lot of models that assume intelligence will grow to superhuman levels by, say, the 2040s (like Scott Alexander’s, Kurzweil’s after 2030, or your model after 2029), and I suspect that Ajeya does in fact think that AI progress will continue to be like the past, and that it will be even faster.
If she believes that AI progress will slow down in a decade, then I’ll probably edit or remove that statement.
I literally heard her saying a few weeks ago something to the effect of “it’ll be such a relief when we get through these next few OOMs of progress. Everything is happening so fast now because we are scaling up through so many OOMs so quickly in various metrics. But after a few more years the pace will slow down and we’ll get back to a much slower rate of progress in AI capabilities.”
Her bio anchors model also incorporates some of these effects IIRC.
My model after 2029--what are you referring to? I currently think that probably we’ll have superintelligence by 2029. I definitely agree that if I’m wrong about that and AGI is a lot harder to build than I think, progress in AI will be slowing down significantly around 2030 relative to today’s pace.
Is that realistic? When I plug some estimates that I find reasonable into the Epoch interactive model, I find that scaling shouldn’t slow down significantly until about 2030. And at that point we might be getting into a regime where the economy should be growing quickly enough to support further rapid scaling, if TAI is attainable at lower FLOP levels. So, actually, our current regime of rapid scaling might not slow down until we approach the limits of the solar system, which is likely over 10 OOMs above our current level.
The reason for this relatively dramatic prediction seems to be that we have a lot of slack left. The current largest training run is GPT-4, which apparently cost OpenAI only about $50 million. That’s roughly 4-5 OOMs away from the maximum amount I’d expect our current world economy to be willing to spend on a single training run before running into fundamental constraints. Moreover, hardware progress and specialization might add another 1 OOM to that over the next 6 years.
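As a rough back-of-the-envelope check on those numbers (the ~$50M figure is from above; the ~$1T spending ceiling is my own illustrative assumption, not a figure from the discussion):

```python
import math

# Illustrative estimates, not precise figures.
current_run_cost = 50e6   # ~$50M: rough cost of the largest current training run
max_feasible_cost = 1e12  # assumed ~$1T ceiling on world spending for one run

# Orders of magnitude of headroom from spending alone
spend_ooms = math.log10(max_feasible_cost / current_run_cost)
print(f"Spending headroom: ~{spend_ooms:.1f} OOMs")  # ~4.3 OOMs

# Plus roughly 1 OOM from hardware progress/specialization over ~6 years
total_ooms = spend_ooms + 1.0
print(f"Total headroom: ~{total_ooms:.1f} OOMs")  # ~5.3 OOMs
```

This only bounds scaling from spending and hardware; data and algorithmic progress would shift the picture further.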
Oh I agree, the scaling will not slow down. But that’s because I think TAI/AGI/etc. isn’t that far off in terms of OOMs of various inputs. If I thought it was farther off, say at 1e36 FLOP, I’d think that before AI R&D or the economy began to accelerate, we’d run out of steam, scaling would slow significantly, and we’d hit another AI winter.
Ultimately, that’s why I decided to cut the section: It was probably false, and it didn’t even matter for my thesis statement on AI safety/alignment.
I’ll grant that Ajeya was misrepresented in this post, and I’ll probably either edit or remove the section.
This isn’t a crux on why I believe AI to be safe, but I think my potential disagreement is that once you manage to reach the human compute and memory regime, I do expect it to be more difficult to scale upwards.
I definitely assign some credence to you being right, so I’ll probably edit or remove that section.
I’m going to read this as ”...1 new potential gradient hacking pathway” because I think that’s what the section is mainly about. (It appears to me that throughout the section you’re conflating mesa-optimization with gradient hacking, but that’s not the main thing I want to talk about.)
The following quote indicates at least two potential avenues of gradient hacking: “In an RL context”, “supervised learning with adaptive data sampling”. These both flow through the gradient hacker affecting the data distribution, but they seem worth distinguishing, because there are many ways a malign gradient hacker could affect the data distribution.
Broadly, I’m confused about why others (confidently) think gradient hacking is difficult. Like, we have this pretty obvious pathway of a gradient hacker affecting training data. And it seems very likely that AIs are going to be training on their own outputs or otherwise curating their data distribution — see e.g.,
Phi, and the recent small-scale successes of using lots of synthetic data in pre-training,
Constitutional AI / self-critique,
using LLMs for data labeling and content moderation,
The large class of self-play approaches that I often lump together under “Expert-iteration” which involve iteratively training on the best of your previous actions,
the fact that RLHF usually uses a preference model derived from the same base/SFT model being trained.
Sure, it may be difficult to predictably affect training via partial control over the data distribution. Personally, I have almost zero clue how to affect model training via data curation, so my epistemic state is extremely uncertain. I roughly feel like the rest of humanity is in a similar position — we have an incredibly poor understanding of large language model training dynamics — so we shouldn’t be confident that gradient hacking is difficult. On the other hand, it’s reasonable to say “if you’re not smarter than (some specific set of) current humans, it is very hard for you to gradient hack, as evidenced by us not knowing how to do it.”
I don’t think strong confidence in either direction is merited by our state of knowledge on gradient hacking.
Basically, it’s a combo of not being incentivized to do it, combined with the fact that SGD is actually really powerful in ways that undermine the traditional story for gradient hacking.
One of the most important things to keep in mind is that gradient descent optimizes every parameter independently and simultaneously. Unless a gradient hacker contains non-differentiable components, there’s no way for an inner misaligned agent to escape being optimized away by SGD: since SGD optimizes the entire causal graph leading to the loss, there is very little avenue for a gradient hacker to persist.
In general, this is a big problem with a lot of stories of danger that rely on goal divergences between the base and the mesa optimizer: How do you prevent the mesa-optimizer from being optimized away by SGD? For a lot of stories, the likely answer is you can’t, and the stories that people propose usually fall victim to the issue that SGD is too good at credit assignment, compared to genetic algorithms or evolutionary methods.
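To illustrate the credit-assignment point with a deliberately tiny sketch (this is a toy, not a model of real gradient hacking; the “parasite” weight is a hypothetical stand-in for an unwanted sub-circuit):

```python
# Toy model: loss = (w_main * x + w_parasite * x - target)^2
# Both parameter groups sit on the differentiable path to the loss,
# so SGD assigns credit to both simultaneously; neither can "hide".
w_main, w_parasite = 0.5, 5.0   # imagine w_parasite is an unwanted sub-circuit
x, target, lr = 1.0, 1.0, 0.1

for _ in range(200):
    pred = w_main * x + w_parasite * x
    grad = 2 * (pred - target) * x   # the same gradient flows to both weights
    w_main -= lr * grad
    w_parasite -= lr * grad

# The combined output is driven to the target: the "parasite's" contribution
# to the loss has been optimized away along with everything else.
print(abs(w_main + w_parasite - target))  # essentially 0
```

Of course, real gradient-hacking stories posit structures (e.g. non-differentiable or self-protecting components) specifically designed to evade this; the toy only shows the default behavior on the differentiable path.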
Thanks a lot for writing that post.
One question I have regarding fast takeoff is: don’t you expect learning algorithms much more efficient than SGD to show up and greatly accelerate the rate of development of capabilities?
One “overhang” I can see is the fact that humans have written a lot of what they know about how to do all kinds of tasks on the internet, so a sufficiently data-efficient algorithm could leverage this and fairly suddenly learn a ton of tasks quite rapidly. For instance, in-context learning is way more data efficient than SGD in pre-training. Right now, it doesn’t seem like in-context learning is exploited nearly as much as it could be. If we manage to turn ~any SGD learning problem into an in-context learning problem, which IMO could happen with an efficient long-term memory and a better context length, things could accelerate pretty wildly. Do you think that even things like that (i.e., unlocking a more data-efficient algorithm which allows much faster capabilities development) will necessarily be smoothed?
Brains use somewhat less lifetime training compute (perhaps 0 to a few OOM less) than GPT-4, and 2 or 3 OOM less data, which provides an existence proof of somewhat better scaling curves, along with some evidence that scaling curves much better than the ones brains are on are probably hard to find.
AI systems already train on the entire internet so I don’t see how that is an overhang.
There are diminishing returns to context for in-context learning; it is extremely RAM intensive, and GPUs are RAM starved compared to the brain. Finally, brains already use it with much longer context, so it’s more like one of the hard challenges of achieving brain parity at all rather than a big overhang.
I am definitely semi-agnostic to whether SGD will ultimately be the base optimizer of choice, and whether the inner algorithm does better than SGD and causes a fast takeoff.
But I’ll assume that you are right about fast takeoff happening, and my response to that is that this would leave the alignment schemes proposed intact, for the following reasons:
Even if fast takeoff happens, the sharp left turn in the form of misgeneralization is still less likely to happen, because unlike evolution, we don’t repeatedly restart with fresh versions of an AI; we retain the same AI throughout the training run.
It mostly doesn’t affect how easy it is to learn values, and the trick of using our control of SGD as the innate reward system still works: weak genetic priors that are easy to trick, plus the innate reward system’s local update rule, still suffice to make people reliably develop a set of values like empathy for the ingroup.
SGD still has really strong corrective properties against inner misaligned agents, unlike evolution.
I do agree that fast takeoff complicates the analysis, but I don’t think it breaks the alignment methods shown in the post. If aligning them required very strong priors (though with SGD we can align them to reward functions much more complicated than genetic priors allow), or if we couldn’t control the innate reward system, this would be a much bigger issue.
I think there are plausible stories in which a hard left turn could happen (but as you’ve pointed out, it is extremely unlikely under the current deep learning paradigm).
For example, suppose it turns out that a class of algorithms I will simply call heuristic AIXI are much more powerful than the current deep learning paradigm.
The idea behind this class of algorithms is that you basically do evolution, but instead of using blind hillclimbing, you periodically ask “what is the best learning algorithm I have?” and then apply that to your entire process. Because this means you are constantly changing the learning algorithm, you could get the same sort of 1Mx overhang that caused the sharp left turn in human evolution.
The obvious counter is that if we think heuristic AIXI is not safe, then we should just not use it. But the obvious counter to that is: when have humans ever not done something because someone else told them it wasn’t safe?
I definitely agree with the claim that evolutionary strategies being effective would weaken my entire case. I do think that evolutionary methods like GAs are too hobbled by their inability to exploit white-box optimization, unlike SGD, but we shall see.
I genuinely don’t know if heuristic AIXI is a real thing or not, but if it is, it combines the ability to search the whole space of possible algorithms (which evolution has but SGD doesn’t) with the ability to take advantage of higher-order statistics (as SGD does but evolution doesn’t).
My best guess is that just as there was a “Deep learning” regime that only got unlocked once we had tons of compute from GPUs, there’s also a heuristic AIXI regime that unlocks at some level of compute.
The analogy was about the alignment problem, not the capabilities problem.
A rocket won’t get to the moon if you randomly double one of the variables used to navigate, like the amount of thrust applied in maneuvers or the angle of attack. (well, not unless you’ve built in good error-correction and redundancy etc.)
The point here is that there are enough results in ML like this that I’m more skeptical of the security mindset being accurate, and ML/AI alignment is a strange enough domain such that we shouldn’t port over intuitions from other fields, like you shouldn’t port over intuitions from the large scale to quantum mechanics.
For a specific example relevant to alignment, I talked about SGD’s corrective properties in a section of the post.
Another good example has to do with the fact that AIs are generally modular: you can switch out parts without breaking the AI. That couldn’t happen under a security mindset, which would predict that the AI either spits out nonsense or breaks its security, neither of which has happened.
Good to see your point of view. The old arguments about AI doom are no longer convincing to me; however, getting alignment 100% right, whatever that means, in no way guarantees a positive Singularity.
Should we be talking about concrete plans for that now? For example, I believe that with a slow takeoff, if we don’t get Neuralink or mind uploading, then our P(doom) → 1 as the super AI gets ever further ahead of us. The kinds of scenarios I can see:
“Dogs in a war zone”: great powers make ever more powerful AIs and use them as weapons. We don’t understand our environment and it isn’t safe. The number of humans steadily drops to zero.
Some kind of Moloch hell, without explicit shooting. Algorithms run our world, we don’t understand it anymore, and they bring out the worst in us. We keep making more sentient AIs; we are greatly outnumbered by them, until there are no more of us.
WALL-E type scenario—basic needs met, digital narcotics etc we lose all ambitions.
I can’t see a good one as ASI gets way further ahead of us. With a slow takeoff there is no sovereign to help with our CEV, pivotal acts are not possible etc.
I personally support some kind of hardware pause: when Moore’s law runs out at 1-2nm, don’t make custom AI chips to overcome the von Neumann bottleneck, in combination with accelerating hard on neural interfaces and WBE/mind uploading. Doomer types also seem to back something similar.
I don’t see the benefit of arguing over the conventional 2010s-era alignment ideas anymore; only data will change people’s minds now. If you believe in a fast takeoff, nothing short of an IQ-180 AI/weak superintelligence saying “I can’t optimize myself further unless you build me some new hardware” would make a difference, as far as I can see.
Thanks for writing this! I strongly appreciate a well-thought out post in this direction.
My own level of worry is pretty dependent on a belief that we know and understand how to shape NN behaviors much better than we know how to shape values/goals/motivations/desires (although I don’t think e.g. ChatGPT has any of the latter in the first place). Do you have thoughts on the distinction between behaviors and goals? In particular, do you feel like you have any evidence that we know how to shape/create/guide goals and values, rather than just behaviors?
Arguments about inner misalignment work as arguments for optimism only inside the “outer/inner alignment” framework, in its deep learning version. If we had a good outer loss function, such that being closer to its minimum means better outcomes, then yes, our worries should be about weird inner misalignment issues. But we don’t have a good outer loss function, so we should sort of hope for inner misalignment.
That’s definitely a claim that I contest. My disagreement comes down to my optimism that weak priors suffice to align humans, the fact that we can do better than that, and my view that deceptive alignment is so terrible that we’re generally better off having more inner alignment than less: deception is one of the few ways to break this analysis of alignment, so I generally find inner alignment more useful than not.
Okay, let’s break this down.
Inner misalignment is when we have an objective function (reward, loss function, etc.), select systems that produce better results according to this function (using evolutionary search, SGD, etc.), and the resulting system doesn’t produce actions which optimize this objective function. The most obvious example of inner misalignment is an RL-trained agent that doesn’t maximize reward.
Your argument against the possibility of inner misalignment is, basically, “SGD is such a powerful optimizer that no matter what, it will drag the system towards the minimum of the loss function.” Let’s suppose this is true.
We don’t have a “good” outer function, defined over training data, such that, given an observation and action, this function scores the action higher if that action, given the observation, is better. Instead, we have outer functions that favor things like good predictions and outputs receiving a high score from a human/AI overseer.
If you have some alignment benchmark, you can’t see the difference between superhumanly capable aligned and deceptively aligned systems. They both give you correct answers, because they both are superhumanly capable.
Because they give you the same correct answers, the loss function assigns minimal values to their outputs. They are both either inside a local minimum or on a flat basin of the loss landscape.
Therefore, you don’t need inner misalignment to get deceptive alignment.
While I dislike using the framing of loss functions here, I do think this is probably false, especially given even weak prior information about the shape of alignment solutions. This might turn out to be a crux, but I think rewarding AIs for bad actions will likely be rare, at least in the regime where we can supervise things. In particular, I think a hypothetical alignment scheme via an outer function would look like this:
Place a weak prior over goal space, such that there already is a bias towards say being helpful.
Use the fact that we play the role of the innate reward system, using backpropagation to compute the optimal update direction towards being helpful, or really any criterion we can specify.
Repeat, reinforcing preferred values and penalizing (or simply not rewarding) dispreferred values with backpropagation, until it reaches minimal or near-minimal loss.
After millions of iterations of that loop by SGD, you can get a very aligned agent.
This is roughly how I believe the innate reward system manages to align us with values like empathy for the ingroup, but we could really replace the backprop algorithm with bio-realistic algorithms, and replace the values with mostly arbitrary values, and get the same results.
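Here is a minimal numerical sketch of that loop, with the “weak prior” as a biased initialization and the outer process (us) supplying the gradient signal; all names and the quadratic “criterion” are illustrative assumptions, not a real alignment scheme:

```python
import numpy as np

# Step 1: weak prior over "goal space" - initialize with a slight bias
# toward the desired direction (here, a fixed "helpful" target vector).
rng = np.random.default_rng(0)
helpful = np.array([1.0, 0.5, -0.5])                    # stand-in for the criterion
values = 0.1 * helpful + 0.05 * rng.standard_normal(3)  # weakly biased init

# Steps 2-3: repeatedly compute the gradient of a simple "distance to the
# criterion" loss and update - the outer process plays the role of the
# innate reward system steering the update direction at every step.
lr = 0.05
for _ in range(2000):
    grad = 2 * (values - helpful)  # d/d(values) of ||values - helpful||^2
    values -= lr * grad

print(np.allclose(values, helpful, atol=1e-3))  # True: values reach the target
```

The quadratic loss makes convergence trivial here; the argument in the text is that the same shaping dynamic applies to much messier criteria, which this sketch does not demonstrate.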
http://allmanlab.caltech.edu/biCNS217_2008/PDFs/Meaney2001.pdf
My first impression, skimming through, is that it’s arguing that abuse by parents can negatively affect a child, that stress can have both positive and negative effects, and that individual responses to stress determine the balance of positive to negative effects.
2 things I want to point out:
I think that the conclusions from this study are almost certainly extremely limited, and I wouldn’t trust these results to generalize to other species like us.
I expect the results, insofar as they are real and generalizable, to essentially show that the genome can influence things later in life via indirect methods, but mostly can’t specify them directly by hardcoding them or baking them in as prior information. The transfer seems very limited, and critically, the timescale is likely evolutionary, which is far, far slower than human within-lifetime learning, and certainly can’t supply as many bits as cultural evolution can in a much shorter timeframe.
I will edit the post to modify the “any” to “as many bits as cultural evolution”, and edit it further to say what I really meant here.
The reason I trust my impression here is that I have good reason to suspect that epigenetics is basically a p-hacked field, where the results are pure noise and indicate that epigenetics probably can’t work. So yes, I’m skeptical of epigenetics being a viable way to transmit information across generations, or really of epigenetics being useful at all.
https://www.lesswrong.com/posts/zazA44CaZFE7rb5zg/transhumanism-genetic-engineering-and-the-biological-basis#DyJvphnBuwiK6MNpo
https://www.lesswrong.com/posts/zazA44CaZFE7rb5zg/transhumanism-genetic-engineering-and-the-biological-basis#JeDuMpKED7k9zAiYC
Then I should update towards epigenetics not being supported by evidence. And also towards being more careful about posting nasty and arrogant things when my medication changes. Sorry about that.
However, I have a question about the large or small amount of bits.
Suppose Musk offers you a private island with a colony of hominids – the kind raw enough that they haven’t yet invented cooking with fire. Then he insists very hard that you introduce strong sexual selection, which leads to one of those big monkeys inventing parading in front of the girls with a stick on fire.
Soon everyone is cooking, freeing up so much slack, physiologically speaking, that chatting with the girls becomes the main driver of their evolution. So much so, in fact, that if you were a selfish gene living in some good girl, you’d be better off enduring the costs of birthing big-headed babies than refusing to raise babies with the biggest brains possible.
At this point, I would consider that you may have replicated the basic recipe for creating the human mind. Of course, maybe this is just a fairy tale. Or something in between, like a real but less important component than, say, chimpanzee wars. But if you were able to measure the bit ratio in this scenario (number of bits from epigenetics versus number of bits from the genome), what do you think that would look like?