The AI system builders’ time horizon seems to be a reasonable starting point
David Johnston
Mechanistic Anomaly Detection Research Update
Nora and/or Quintin: you talk a lot about inductive biases of neural nets ruling scheming out, but I have a vague sense that scheming ought to happen in some circumstances—perhaps rather contrived, but not so contrived as to be deliberately inducing the ulterior motive. Do you expect this to be impossible? Can you propose a set of conditions you think sufficient to rule out scheming?
What in your view is the fundamental difference between world models and goals such that the former generalise well and the latter generalise poorly?
One can easily construct a model with a free parameter X and training data such that many choices of X will match the training data but results will diverge in situations not represented in the training data (for example, the model is a physical simulation and X tracks the state of some region in the simulation that will affect the learner’s environment later, but hasn’t done so during training). The simplest choice x_s could easily be wrong. We can even moralise the story: the model regards its job as predicting the output under x_s, and if the world happens to operate according to some other x′ then the model doesn’t care. However, it’s still going to be ineffective in the future where the value of X matters.
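A toy sketch of the kind of construction I have in mind (all names and numbers invented for illustration): every setting of the free parameter fits the training data exactly, but the settings diverge as soon as the hidden region starts to matter.

```python
import numpy as np

# Toy version of the construction: the output depends on the free
# parameter x only after time T_TRAIN, so every choice of x fits the
# training data equally well.
T_TRAIN = 10

def simulate(x, t):
    # Before T_TRAIN the hidden region (state x) hasn't affected the
    # environment; afterwards it dominates the output.
    return np.sin(t) if t < T_TRAIN else np.sin(t) + x

train_times = range(T_TRAIN)
train_data = [simulate(0.0, t) for t in train_times]  # "true" x = 0

for x_s in [0.0, 5.0, -3.0]:
    fits = all(np.isclose(simulate(x_s, t), y)
               for t, y in zip(train_times, train_data))
    print(f"x = {x_s:+.1f}: fits training data = {fits}, "
          f"prediction at t = {T_TRAIN}: {simulate(x_s, T_TRAIN):+.2f}")
```

All three candidates fit the training data perfectly; they only come apart at t = T_TRAIN, which is exactly when the choice matters.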
Another comment on timing updates: if you’re making a timing update for zoonosis vs DEFUSE, and you’re considering a long timing window w_z for zoonosis, then your prior for a DEFUSE leak needs to be adjusted for the short window w_d in which this work could conceivably cause a leak, so you end up with something like p(defuse_pandemic)/p(zoo_pandemic) = rr_d · w_d/w_z, where rr_d is the riskiness of DEFUSE vs zoonosis per unit time. Then you make the “timing update” p(now | defuse_pandemic)/p(now | zoo_pandemic) = w_z/w_d and you’re just left with rr_d.
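To make the cancellation concrete (numbers are arbitrary placeholders, not estimates of anything):

```python
import math

# Arbitrary illustrative numbers.
w_z = 50.0   # window (years) in which a zoonotic pandemic could start
w_d = 2.0    # window in which DEFUSE-derived work could cause a leak
rr_d = 3.0   # riskiness of DEFUSE vs zoonosis per unit time

# Prior odds, adjusted for the short window in which DEFUSE could act:
prior_odds = rr_d * w_d / w_z

# The "timing update" rewards the hypothesis with the shorter window:
timing_update = w_z / w_d

posterior_odds = prior_odds * timing_update
assert math.isclose(posterior_odds, rr_d)  # the windows cancel
print(posterior_odds)
```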
Sorry, I edited (was hoping to get in before you read it)
If your theory is: there is a lab leak from WIV while working on defuse derived work then I’ll buy that you can assign a high probability to time & place … but your prior will be waaaaaay below the prior on “lab leak, nonspecific” (which is how I was originally reading your piece).
You really think in 60% of cases where country A lifts a ban on funding gain of function research a pandemic starts in country B within 2 years? Same question for “warning published in Nature”.
If people now don’t have strong views about exactly what they want the world to look like in 1000 years but people in 1000 years do have strong views then I think we should defer to future people to evaluate the “human utility” of future states. You seem to be suggesting that we should take the views of people today, although I might be misunderstanding.
Edit: or maybe you’re saying that the AGI trajectory will be ~random from the point of view of the human trajectory due to a different ontology. Maybe, but different ontology → different conclusions is less obvious to me than different data → different conclusions. If there’s almost no mutual information between the different data then the conclusions have to be different, but sometimes you could come to the same conclusions under different ontologies with data from the same process.
Given this assumption, the human utility function(s) either do or don’t significantly depend on human evolutionary history. I’m just going to assume they do for now.
There seems to be a missing possibility here that I take fairly seriously, which is that human values depend on (collective) life history. That is: human values are substantially determined by collective life history, and rather than converging to some attractor this is a path dependent process. Maybe you can even trace the path taken back to evolutionary history, but it’s substantially mediated by life history.
Under this view, the utility of the future wrt human values depends substantially on whether, in the future, people learn to be very sensitive to outcome differences. But “people are sensitive to outcome differences and happy with the outcome” does not seem better to me than “people are insensitive to outcome differences and happy with the outcome” (this is a first impression; I could be persuaded otherwise), even though it’s higher utility, whereas “people are unhappy with the outcome” does seem worse than “people are happy with the outcome”.
Under this view, I don’t think this follows:
there is some dependence of human values on human evolutionary history, so that a default unaligned AGI would not converge to the same values
My reasoning is that a “default AGI” will have its values contingent on a process which overlaps with the collective life history that determines human values. This is a different situation to values directly determined by evolutionary history, where the process that determines human values is temporally distant and therefore perhaps more-or-less random from the point of view of the AGI. So there’s a compelling reason to believe in value differences in the “evolution history directly determines values” case that’s absent in the “life history determines values” case.
Different values are still totally plausible, of course—I’m objecting to the view that we know they’ll be different.
(Maybe you think this is all an example of humans not really having values, but that doesn’t seem right to me).
You’re changing the topic to “can you do X without wanting Y?”, when the original question was “can you do X without wanting anything at all?”.
A system that can, under normal circumstances, explain how to solve a problem won’t necessarily solve a problem that gets in the way of its explaining the solution. The notion of wanting that Nate proposes is “solving problems in order to achieve the objective”, and this need not apply to a system that explains solutions. In short: yes.
If we are to understand you as arguing for something trivial, then I think it only has trivial consequences. We must add nontrivial assumptions if we want to offer a substantive argument for risk.
Suppose we have a collection of systems of different ability that can all, under some conditions, solve a given task. For a task X, let’s say an “X-wrench” is an event that defeats systems of lower ability but not systems of higher ability (i.e. prevents them from solving X).

A system that achieves X with probability p must defeat all X-wrenches but those with a probability of at most 1 − p. If the set of events that are Y-wrenches but not X-wrenches (for some other task Y) has probability ε, then the system can defeat all Y-wrenches but a collection with probability of at most 1 − p + ε.

That is, if the challenges involved in achieving X are almost the same as the challenges involved in achieving Y, then something good at achieving X is almost as good at achieving Y (granting the somewhat vague assumptions about general capability baked into the definition of wrenches).
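Spelling the bound out, with the union-bound step made explicit (this step is my own filling-in of the argument above):

```latex
% S achieves X with probability p, so P(some X-wrench defeats S) <= 1 - p.
% Let E be the set of events that are Y-wrenches but not X-wrenches,
% with P(E) = \varepsilon.
\begin{align*}
P(\text{some $Y$-wrench defeats } S)
  &\le P(\text{some $X$-wrench defeats } S) + P(E) \\
  &\le (1 - p) + \varepsilon ,
\end{align*}
```

so the system achieves Y with probability at least p − ε, granting that defeating all Y-wrenches suffices for achieving Y.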
However, if X is something that people basically approve of and Y is something people do not approve of, then I do not think the challenges almost overlap. In particular, to do Y, with high probability you need to defeat a determined opposition, which is not likely to be necessary if you want X. That is: no need to kill everyone with nanotech if you’re doing what you were supposed to.
In order to sustain the argument for risk, we need to assume that the easiest way to defeat X-wrenches is to learn a much more general ability to defeat wrenches than necessary and apply it to solving X and, furthermore, that this ability is sufficient to also defeat Y-wrenches. This is plausible—we do actually find it helpful to build generally capable systems to solve very difficult problems—but also plausibly false. Even highly capable AI that achieves long-term objectives could end up substantially specialised for those objectives.

As an aside, if the set of X-wrenches includes the gradient updates received during training, then an argument that an X-solver generalises to a Y-solver may also imply that deceptive alignment is likely (alternatively, proving that X-solvers generalise to Y-solvers is at least as hard as proving deceptive alignment).
Two observations:

- If you think that people’s genes would be a lot fitter if people cared about fitness more, then surely there’s a good chance that a more efficient version of natural selection would lead to people caring more about fitness.

- You might, on the other hand, think that the problem is more related to feedbacks. I.e. if you’re the smartest monkey, you can spend your time scheming to have all the babies. If there are many smart monkeys, you have to spend a lot of time worrying about what the other monkeys think of you. If this is how you’re worried misalignment will arise, then I think “how do deep learning models generalise?” is the wrong tree to bark up.

- If people did care about fitness, would Yudkowsky not say “instrumental convergence! Reward hacking!”? I’d even be inclined to grant he had a point.
I can’t speak for janus, but my interpretation was that this is due to a capacity budget, meaning it can be favourable to lose a bit of accuracy on token n if you gain more on token n+m. I agree some examples would be great.
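A minimal numeric sketch of the trade-off I have in mind (all numbers invented):

```python
# Invented per-token losses for two models under the same capacity budget.
# Model B gives up a little accuracy on token n to gain more on token n+m.
loss_a = {"token_n": 0.10, "token_n_plus_m": 0.90}  # model A
loss_b = {"token_n": 0.20, "token_n_plus_m": 0.50}  # model B

total_a = sum(loss_a.values())  # 1.00
total_b = sum(loss_b.values())  # 0.70

# B is strictly worse at predicting token n, but training favours B
# because its total loss over the sequence is lower.
assert loss_b["token_n"] > loss_a["token_n"] and total_b < total_a
print(f"total loss A = {total_a:.2f}, total loss B = {total_b:.2f}")
```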
there are strong arguments that control of strongly superhuman AI systems will not be amenable to prosaic alignment
In which section of the linked paper is the strong argument for this conclusion to be found? I had a quick read of it but could not see it—I skipped the long sections of quotes, as the few I read were claims rather than arguments.
I don’t disagree with any of what you say here—I just read Anton as assuming we have a program on that frontier.
The mistake here is the assumption that a program that models the world better necessarily has a higher Kolmogorov complexity.
I think Anton assumes that we have the simplest program that predicts the world to a given standard, in which case this is not a mistake. He doesn’t explicitly say so, though, so I think we should wait for clarification.
But it’s a strange assumption; I don’t see why the minimum complexity predictor couldn’t carry out what we would interpret as RSI in the process of arriving at its prediction.
I think he’s saying “suppose p1 is the shortest program that gets at most loss l. If p2 gets loss l′ < l, then we must require a longer string than p1 to express p2, and p1 therefore cannot express p2”.
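One way to formalise that reading (my notation and my filling-in, not Anton’s):

```latex
\begin{align*}
p_1 &\in \operatorname*{arg\,min}_{p}\,\{\, |p| : L(p) \le l \,\}, \\
L(p_2) \le l' < l \;&\Rightarrow\; L(p_2) \le l \;\Rightarrow\; |p_2| \ge |p_1|.
\end{align*}
```

Since |p2| ≥ |p1|, and p1 would have to be longer than p2 (plus overhead) to contain it, p1 cannot express p2.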
This seems true, but I don’t understand its relevance to recursive self improvement.
I think it means that whatever you get is conservative in cases where it’s unsure of whether it’s in training, which may translate to being conservative where it’s unsure of success in general.
I agree it doesn’t rule out an AI that takes a long shot at takeover! But whatever cognition we posit that the AI executes, it has to yield very high training performance. So AIs that think they have a very short window for influence or are less-than-perfect at detecting training environments are ruled out.
When do you think is the right time to work on these issues? Monitoring, trust displacement and fine-grained permission management all look liable to raise issues that weren’t anticipated and haven’t already been solved, because they’re not the way things have been done historically. My gut sense is that GPT-4’s performance is much lower when you’re asking it to do novel things. Maybe it’s also possible to make substantial gains with engineering and experimentation, but you’ll need a certain level of performance in order to experiment.

Some wild guesses: maybe the right time to start work is one generation before it’s feasible, and that might mean starting now for fine-grained permissions, GPT-4.5 for monitoring, and GPT-5 for trust displacement.