What, in your view, is the fundamental difference between world models and goals such that the former generalise well and the latter generalise poorly?
One can easily construct a model with a free parameter X, together with training data such that many values of X fit the training data equally well but yield divergent predictions in situations the training data doesn't cover. For example, the model might be a physical simulation in which X tracks the state of some region that will affect the learner's environment later, but hasn't done so during training. The simplest value consistent with the data, call it x_s, could easily be wrong. We can even moralise the story: the model regards its job as predicting the output under x_s, and if the world happens to operate according to some other value x', the model doesn't care. It will nonetheless be ineffective in the future, once the value of X matters.
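To make the construction concrete, here is a minimal toy sketch (the `simulate` function, the `threshold` parameter, and the specific values are all illustrative, not drawn from any real system): a "simulation" whose output depends on the free parameter only after some time step, so training data gathered before that step cannot pin the parameter down.

```python
def simulate(t: int, x: float, threshold: int = 10) -> float:
    """Toy world model: x is the state of a region that only
    influences the observable output after `threshold` steps."""
    if t < threshold:
        return float(t)          # x has no effect during training
    return float(t) + x          # x starts to matter later on

# Training data covers only t < threshold, so it cannot constrain x:
# every choice of x matches the training observations exactly.
train_ts = range(10)
x_simple, x_true = 0.0, 5.0     # simplest value vs. the world's actual value
assert all(simulate(t, x_simple) == simulate(t, x_true) for t in train_ts)

# Off-distribution, the two hypotheses diverge: a learner that picked
# the simplest x is confidently wrong exactly when x begins to matter.
print(simulate(12, x_simple))   # 12.0
print(simulate(12, x_true))     # 17.0
```

The point of the sketch is just that nothing in the training signal distinguishes x_simple from x_true, so a simplicity prior settles the choice, and the resulting model fails precisely in the regime where the parameter has consequences.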
Nora and/or Quentin: you talk a lot about the inductive biases of neural nets ruling out scheming, but I have a vague sense that scheming ought to happen in some circumstances, perhaps rather contrived ones, though not so contrived as to amount to deliberately inducing the ulterior motive. Do you expect this to be impossible? Can you propose a set of conditions you think sufficient to rule out scheming?