I’m not sure which part of this you think is assuming the conclusion. That our values are not maximally about power acquisition should be clear. The claim that evolution is continuously pushing the world in that direction is what I’ve tried to explain in the post, though it should be read as “there is a force pushing in that direction” rather than “the net resulting force is in that direction”.
Ricardo Meneghin
Natural selection doesn’t produce “bad” outcomes; it produces expansionist, power-seeking outcomes, not at the level of the individual but of the whole system. Along the way it will pass through many intermediate states whose adjacent, even more power-seeking states are simply harder to find.
Humans developed several altruistic values because it was what produced the most fitness in the local search natural selection was running at the time. Cooperating with individuals from your tribe would lead to better outcomes than purely selfish behavior.
The modern world doesn’t make “evil” optimal. Violence has declined because negative-sum games among similarly capable individuals are an incredible waste of resources, and we are undergoing selection against them at many different levels: violent people frequently died in battle or were executed throughout history, societies that enforced strong punishment against violence prospered more than ones that didn’t, and cultures and religions that encouraged less harm made the groups that adopted them prosper more.
I’m not sure what about the OP or the linked paper would make you conclude anything you have concluded.
The reason we shouldn’t expect cooperation from AI is that it is remarkably more powerful than humans, and it may very well have better outcomes by paying the tiny cost of fighting humans if it can then turn all of us into more of it. I’m sure the pigs caged in our factory farms wouldn’t agree with your sense that the passage of time is favoring “goodness”.
There is also a huge asymmetry in AIs capability for self-modification, expansion and merging. In fact, I’d expect them to be less violent than humans among themselves, merging into single entities to avoid wasteful negative-sum competition, which is something that is impossible for humans to do.
One final thought: it may be that natural selection actually favors AI that cares more about humans than humans care about each other. Sound preposterous? Consider that there are species (such as Tasmanian devils) that present-day humans care about conserving but where the members of the species don’t show much friendliness to each other.
Regarding this, I don’t think it’s preposterous at all. It might be that initial cooperation with humans gives a head-start to the first AI, which “locks in” a cooperative value that it carries on even once it no longer needs to. But longer term, I don’t know what would happen.
A simple evolutionary argument is enough to justify a very strong prior that kidney donation is significantly harmful for health: we have two of them, they aren’t on an evolutionary path to disappearing, and modern conditions have changed almost nothing about the usage or availability of kidneys.
I think the whole situation with kidney donations reflects quite poorly on the epistemic rigor of the community. Scott Alexander probably paid more than $5k merely in the opportunity cost of the time he spent researching the topic, given the positive externalities of his work.
High growth rates mean there is a higher opportunity cost to lending money, since you could invest it elsewhere and get a higher return, which reduces the supply of loans. They also mean more demand for loans, since if interest rates are low relative to growth, people will borrow to buy assets that appreciate faster than the interest rate.
The vast majority of the risk seems to lie in following through with synthesizing and releasing the pathogen, not in learning how to do it, and I think open-source LLMs change little about that.
My interpretation is that calling something “ideal” presents it as unachievable from the start, so it wouldn’t be your fault if you failed to achieve it, whereas “in a sane world” clearly describes our current behavior as bad and possible to change.
I think the idea that we’re going to be able to precisely steer government policy to achieve nuanced outcomes is dead on arrival—we’ve been failing at that forever. What’s in our favor this time is that there are many more ways to cripple progress than to accelerate it, so the push may only need to be directionally right for things to slow down (with a lot of collateral damage).
There is a massive tradeoff between nuance/high epistemic integrity and reach. The general population is not going to engage with complex, nuanced arguments about this, and the prestigious or high-powered people who could understand the discussion and meaningfully steer government policy won’t engage in this type of protest, for many reasons. So the movement should be ready to simplify, or even dumb down, the message in order to increase reach, or risk remaining a niche group. (I think “Pause AI” is already a good slogan in that sense.)
Really interesting to go back to this today. Rates are at their highest level in 16 years, and TTT is up 60%+.
Humans are not choosing to reward specific actions of the AI—when we build intelligent agents, at some point they will leave the confines of curated training data and operate on new experiences in the real world. At that point, their circuitry and rewards are out of human control, which makes our position perfectly analogous to evolution’s. We are choosing the reward mechanism, not the reward.
I think the focus on “inclusive genetic fitness” as evolution’s “goal” is weird. I’m not even sure it makes sense to talk about evolution’s “goals”, but if you want to call it an optimization process, the choice of “inclusive genetic fitness” as its target is arbitrary, as there are many other boundaries one could trace. Evolution is acting at all levels, e.g. gene, cell, organism, species, the entirety of life on Earth. For example, it will not preserve adaptations which increase the genetic fitness of an individual but later lead to the extinction of its species. In the most basic sense evolution is selecting for “things that expand”, across the entire universe, and humans definitely seem partially aligned with that—the ways in which they aren’t seem non-competitive with this goal.
Since the outdoor temperature was lower for the control, ignoring it will inflate how much the two-hose unit outperforms, by pulling the measured effect of both units closer to zero. If we assume the temperature drop each unit produces relative to the control is approximately constant over this outdoor temperature range, then at a matched control outdoor temperature the difference to control would be 3.1ºC for the one-hose unit and 5ºC for the two-hose unit, meaning the two-hose unit only outperforms by ~60% with the fan on high, and merely ~30% with the fan on low.
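The fan-on-high figure follows from a quick calculation (a minimal sketch; the 3.1ºC and 5ºC control-adjusted drops are the values above, and the fan-on-low inputs aren’t repeated here):

```python
# Control-adjusted temperature drops (ºC) with the fan on high, from above
one_hose_drop = 3.1
two_hose_drop = 5.0

# Relative outperformance of the two-hose unit once the control is accounted for
outperformance = (two_hose_drop / one_hose_drop - 1) * 100
print(f"{outperformance:.0f}%")  # prints "61%", i.e. ~60% rather than the raw comparison
```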
Retracted
The claim about “self-preserving” circuits is pretty strong. A much simpler explanation is that humans learn to value diversity early on because diversity of the things around you, such as tools and food sources, improves fitness/reward.
Another non-competing explanation is that this is simply a result of boredom/curiosity—the brain wants to make observations that make it learn, not observations that it already predicts well, so we are inclined to observe things that are new. So again there is a force towards valuing diversity, and this could become locked into our values.
First, there is a lot packed in “makes the world much more predictable”. The only way I can envision this is taking over the world. After you do that, I’m not sure there is a lot more to do than wirehead.
But even if it doesn’t involve that, I can pick other aspects that are favored by the base optimizer, like curiosity and learning, which wireheading goes against.
But actually, thinking more about this, I’m not even sure it makes sense to talk about inner alignment in the brain. What is the brain being aligned with? What is the base optimizer optimizing for? It is not intelligent, it does not have intent or a world model—it’s doing some simple, local mechanical update on neural connections. I’m reminded of the Blue-Minimizing robot post.
If humans decide to cut the pleasure sensors and stimulate the brain directly would that be aligned? If we uploaded our brains into computers and wireheaded the simulation would that be aligned? Where do we place the boundary for the base optimizer?
It seems this question is posed in the wrong way, and it’s more useful to ask the question this post asks—how do we get human values, and what kind of values does a system trained in a way similar to the human brain develop? If there is some general force behind learning values that favors some values being learned rather than others, that could inform us about the likely values of AIs trained via RL.
The original footnote provides one example of this, which is for the model to check if its objective satisfies some criterion, and fail hard if it doesn’t. Now, if the model gets to the point where it’s actually just failing because of this, then gradient descent will probably just remove that check—but the trick is never to actually get there. By having such a check in the first place, the model makes it so that gradient descent won’t actually change its objective,
There is a direction in the gradient which both changes the objective and removes that check. The model doesn’t need to be actually failing for that to happen—there is one piece of its parameters encoding the objective, and another piece encoding this deceptive logic which checks the objective and decides to perform poorly if it doesn’t pass the check. The gradient can update both at the same time—any slight change to the objective can also be applied to the deceptive objective detector, making the model not misbehave and still updating its objective.
since any change to its objective (keeping all the other parameters fixed, which is what gradient descent does since it computes partial derivatives) would lead to such a failure.
I don’t think this is possible. The directional derivative is the dot product of the gradient and the direction vector. That means if the loss decreases in a certain direction (in this case, changing the deceptive behavior and the objective), at least one of the partial derivatives is negative, so gradient descent can move in that direction and we will get what we want (a different objective, less deception, or both).
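To illustrate, here is a toy sketch (an entirely hypothetical parametrization, not anyone’s actual training setup): the “objective” is a single parameter `o`, and the deceptive check stores its own reference copy `t` and fails hard, with weight `C`, whenever `o` drifts from `t`. Because gradient descent updates every parameter along its partial derivative simultaneously, `t` tracks `o` while `o` drifts toward the base objective, so the check never fires and the objective still gets rewritten:

```python
# Toy model of the argument: a "fail hard if my objective changed" check
# cannot protect the objective from gradient descent.
O_STAR = 1.0   # objective the base optimizer favors
C = 100.0      # how hard the model "fails" when the check fires
LR = 1e-3      # learning rate

def loss(o, t):
    # base loss pulls o toward O_STAR; the deceptive check adds a large
    # penalty whenever the objective o drifts from its stored copy t
    return (o - O_STAR) ** 2 + C * (o - t) ** 2

def grads(o, t):
    dl_do = 2 * (o - O_STAR) + 2 * C * (o - t)
    dl_dt = -2 * C * (o - t)
    return dl_do, dl_dt

o, t = 0.0, 0.0   # start at the mesa-objective, check satisfied
for _ in range(20000):
    go, gt = grads(o, t)
    o -= LR * go
    t -= LR * gt

# o and t move together toward O_STAR: the check never fires
# (o stays close to t throughout), yet the objective is rewritten.
print(o, t)  # both close to 1.0
```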
You should stop thinking about AI designed nanotechnology like human technology and start thinking about it like actual nanotechnology, i.e. life. There is no reason to believe you can’t come up with a design for self-replicating nanorobots that can also self-assemble into larger useful machines, all from very simple and abundant ingredients—life does exactly that.
I think the key insight here is that the brain is not inner aligned, not even close
You say that but don’t elaborate further in the comment. Which learned human values go against the base optimizer’s values (pleasure, pain, learning)?
Avoiding wireheading doesn’t seem like failed inner alignment—avoiding wireheading now can allow you to get even more pleasure in the future because wireheading makes you vulnerable/less powerful. The base optimizer is also searching for brain configurations which make good predictions about the world, and wireheading goes against that.
If you change the analogy to developing nuclear weapons instead of launching them, the picture becomes much grimmer.
Because that’s a way more specific bet, depending on other factors (e.g. inflation, other influences on Bitcoin demand). Rising (real) interest rates seem far more certain than a rising price for any given company, especially considering that most of the companies pursuing AGI are private.
I’m not sure where the 10% returns come from, but you can make way, way more than that by betting on rising rates. For example, you can currently buy deep OTM SOFR calls expiring in 3 years with a strike at 100bp for 137 dollars. If I understand the pricing of the contracts correctly, if quarterly rates double from 1% to 2%, that would represent a change of $2,500 in the price of the contract, so a 17x return.
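A rough sanity check of that arithmetic, assuming the standard three-month SOFR futures convention of $25 per basis point (the $137 premium and 100bp move are the figures above; this treats the payoff as linear in the rate move and ignores remaining time value):

```python
DOLLARS_PER_BP = 25.0   # three-month SOFR futures: $25 of contract value per 1 bp
premium = 137.0          # quoted cost of the deep-OTM call
rate_move_bp = 100       # quarterly rates doubling from 1% to 2%

payoff = rate_move_bp * DOLLARS_PER_BP        # $2,500 change in contract value
net_return_multiple = (payoff - premium) / premium
print(payoff, round(net_return_multiple, 1))  # 2500.0 17.2, i.e. ~17x net of premium
```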