David Johnston

Karma: 499

David Johnston Dec 9, 2024, 5:00 AM
3 points
0
in reply to: Charlie Steiner’s comment on: RL, but don’t do anything I wouldn’t do
If you’re in a situation where you can reasonably extrapolate from past rewards to future reward, you can probably extrapolate previously seen “normal behaviour” to normal behaviour in your situation. Reinforcement learning is limited—you can’t always extrapolate past reward—but it’s not obvious that imitative regularisation is fundamentally more limited.

(normal does not imply safe, of course)

David Johnston Dec 9, 2024, 12:26 AM
5 points
0
on: RL, but don’t do anything I wouldn’t do
Their empirical result rhymes with adversarial robustness issues—we can train adversaries to maximise ~arbitrary functions subject to small perturbation from ground truth constraints. Here the maximised function is a faulty reward model and the constraint is KL to a base model instead of distance to a ground truth image.

I wonder if multiscale aggregation could help here too as it does with image adversarial robustness. We want the KL penalty to ensure that the generations should look normal at any “scale”, whether we look at them token by token or read a high-level summary of them. However, I suspect their “weird, low-KL” generations will have weird high-level summaries, whereas more desired policies would look more normal in summary (though it’s not immediately obvious if this translates to low and high probability summaries respectively—one would need to test). I think a KL penalty to the “true base policy” should operate this way automatically, but as the authors note we can’t actually implement that.

David Johnston Dec 5, 2024, 10:59 PM
3 points
0
in reply to: Seth Herd’s comment on: Model Integrity: MAI on Value Alignment
Is your view closer to:
- there’s two hard steps (instruction following, value alignment), and of the two instruction following is much more pressing
- instruction following is the only hard step; if you get that, value alignment is almost certain to follow

David Johnston Nov 14, 2024, 6:36 AM
1 point
0
in reply to: sanxiyn’s comment on: o1 is a bad idea
Mathematical reasoning might be specifically conducive to language invention because our ability to automatically verify reasoning means that we can potentially get lots of training data. The reason I expect the invented language to be “intelligible” is that it is coupled (albeit with some slack) to automatic verification.

David Johnston Nov 14, 2024, 1:48 AM
1 point
0
in reply to: Daniel Kokotajlo’s comment on: o1 is a bad idea
There’s a regularization problem to solve for 3.9 and 4, and it’s not obvious to me that glee will be enough to solve it (3.9 = “unintelligible CoT”).

I’m not sure how o1 works in detail, but for example, backtracking (which o1 seems to use) makes heavy use of the pretrained distribution to decide on best next moves. So, at the very least, it’s not easy to do away with the native understanding of language. While it’s true that there is some amount of data that will enable large divergences from the pretrained distribution—and I could imagine mathematical proof generation eventually reaching this point, for example—more ambitious goals inherently come with less data, and it’s not obvious to me that there will be enough data in alignment-critical applications to cause such a large divergence.

There’s an alternative version of language invention where the model invents a better language for (e.g.) maths then uses that for more ambitious projects, but that language is probably quite intelligible!

David Johnston Oct 22, 2024, 1:46 AM
1 point
0
in reply to: cubefox’s comment on: A brief theory of why we think things are good or bad
For what it’s worth, one idea I had as a result of our discussion was this:
- We form lots of beliefs as a result of motivated reasoning
- These beliefs are amenable to revision due to evidence, reason or (maybe) social pressure
- Those beliefs that are largely resilient to these challenges are “moral foundations”
So philosophers like “pain is bad” as a moral foundation because we want to believe it + it is hard to challenge with evidence or reason. Laypeople probably have lots of foundational moral beliefs that don’t stand up as well to evidence or reason, but (perhaps) are equally attributable to motivated reasoning.

Social pressure is a bit iffy to include because I think lots of people relate to beliefs that they adopted because of social pressure as moral foundations, and believing something because you’re under pressure to do so is an instance of motivated reasoning.

I don’t think this is a response to your objections, but I’m leaving it here in case it interests you.

David Johnston Oct 22, 2024, 1:09 AM
1 point
0
in reply to: cubefox’s comment on: A brief theory of why we think things are good or bad
I can explain why I believe bachelors are unmarried: I learned that this is what the word bachelor means, I learned this because it is what bachelor means, and the fact that there’s a word “bachelor” that means “unmarried man” is contingent on some unimportant accidents in the evolution of language. A) it is certainly not the result of an axiomatic game and B) if moral beliefs were also contingent on accidents in the evolution of language (I think most are not), that would have profound implications for metaethics.

Motivated belief can explain non-purely-selfish beliefs. I might believe pain is bad because I am motivated to believe it, but the belief still concerns other people. This is even more true when we go about constructing higher order beliefs and trying to enforce consistency among beliefs. Undesirable moral beliefs could be a mark against this theory, but you need more than not-purely-selfish moral beliefs.

I’m going to bow out at this point because I think we’re getting stuck covering the same ground.

David Johnston Oct 21, 2024, 9:49 PM
1 point
0
in reply to: cubefox’s comment on: A brief theory of why we think things are good or bad
Thanks for your continued engagement.

I’m interested in explaining foundational moral beliefs like suffering is bad, not beliefs like “animals do/don’t suffer”, which is about badness only because we accept the foundational assumption that suffering is bad. Is that clear in the updated text?

Now, I don’t think these beliefs come from playing axiomatic games like “define good as that which increases welfare”. There are many lines of evidence for this. First: “define bad as that which increases suffering” is not equally as plausible as “define good as that which increases suffering”. We have pre-existing beliefs about this.

Second: you talk about philosophers analysing welfare. However, the method that philosophers use to do this usually involves analysing a bunch of fundamental moral assumptions. For example, from the Stanford encyclopaedia of philosophy:

Correspondingly, no amount of empirical investigation seems by itself, without some moral assumption(s) in play, sufficient to settle a moral question https://plato.stanford.edu/entries/metaethics/

I am suggesting that the source of these fundamental moral assumptions may not be mysterious—we have a known ability to form beliefs based on what we want, and fundamental moral beliefs often align with what we want.

David Johnston Oct 21, 2024, 3:47 AM
1 point
0
in reply to: Seth Herd’s comment on: A brief theory of why we think things are good or bad
I think precisely defining “good” and “bad” is a bit beside the point—it’s a theory about how people come to believe things are good and bad, and we’re perfectly capable of having vague beliefs about goodness and badness. That said, the theory is lacking a precise account of what kind of beliefs it is meant to explain.

The LLM section isn’t meant as support for the theory, but speculation about what it would say about the status of “experiences” that language models can have. Compared to my pre-existing notions, the theory seems quite willing to accommodate LLMs having good and bad experiences on par with those that people have.

David Johnston Oct 21, 2024, 12:04 AM
1 point
0
in reply to: cubefox’s comment on: A brief theory of why we think things are good or bad
I have a pedantic and a non-pedantic answer to this. Pedantic: you say X is “usually considered good” if it increases welfare. Perhaps you mean to imply that if X is usually considered good then it is good. In this case, I refer you to the rest of the paragraph you quote.

Non-pedantic: yes, it’s true that once you accept some fundamental assumptions about goodness and badness you can go about theorising and looking for evidence. I’m suggesting that motivated reasoning is the mechanism that makes those fundamental assumptions believable.

I added a paragraph mentioning this, because I think your reaction is probably common.

A brief theory of why we think things are good or bad

David JohnstonOct 20, 2024, 8:31 PM

7 points

10 comments LW link

David Johnston Oct 20, 2024, 2:25 AM
1 point
0
in reply to: Noosphere89’s comment on: The Hidden Complexity of Wishes
Here’s a basic model of policy collapse: suppose there exist pathological policies of low prior probability (/high algorithmic complexity) such that they play the training game when it is strategically wise to do so, and when they get a good opportunity they defect in order to pursue some unknown aim.

Because they play the training game, a wide variety of training objectives will collapse to one of these policies if the system in training starts exploring policies of sufficiently high algorithmic complexity. So, according to this crude model, there’s a complexity bound: stay under it and you’re fine, go over it and you get pathological behaviour. Roughly, whatever desired behaviour requires the most algorithmically complex policy is the one that is most pertinent for assessing policy collapse risk (because that’s the one that contributes most of the algorithmic complexity, and so it give your first order estimate of whether or not you’re crossing the collapse threshold). So, which desired behaviour requires the most complex policy: is it, for example, respecting commonsense moral constraints, or is it inventing molecular nanotechnology?

Tangentially, the policy collapse theory does not predict outcomes that look anything like malicious compliance. It predicts that, if you’re in a position of power over the AI system, your mother is saved exactly as you want her to be. If you are not in such a position then your mother is not saved at all and you get a nanobot war instead or something. That is, if you do run afoul of policy collapse, it doesn’t matter if you want your system to pursue simple or complex goals, you’re up shit creek either way.

David Johnston Oct 19, 2024, 12:11 PM
1 point
0
in reply to: Eliezer Yudkowsky’s comment on: The Hidden Complexity of Wishes
Algorithmic complexity is precisely analogous to difficulty-of-learning-to-predict, so saying “it’s not about learning to predict, it’s about algorithmic complexity” doesn’t make sense. One read of the original is: learning to respect common sense moral side constraints is tricky^[1], but AI systems will learn how to do it in the end. I’d be happy to call this read correct, and is consistent with the observation that today’s AI systems do respect common sense moral side constraints given straightforward requests, and that it took a few years to figure out how to do it. That read doesn’t really jive with your commentary.

Your commentary seems to situate this post within a larger argument: teaching a system to “act” is different to teaching it to “predict” because in the former case a sufficiently capable learner’s behaviour can collapse to a pathological policy, whereas teaching a capable learner to predict does not risk such collapse. Thus “prediction” is distinguished from “algorithmic complexity”. Furthermore, commonsense moral side constraints are complex enough to risk such collapse when we train an “actor” but not a “predictor”. This seems confused.

First, all we need to turn a language model prediction into an action is a means of turning text into action, and we have many such means. So the distinction between text predictor and actor is suspect. We could consider an alternative knows/cares distinction: does a system act properly when properly incentivised (“knows”) vs does it act properly when presented with whatever context we are practically able to give it (“”“cares”””)? Language models usually act properly given simple prompts, so in this sense they “care”. So rejecting evidence from language models does not seem well justified.

Second, there’s no need to claim that commonsense moral side constraints in particular are so hard that trying to develop AI systems that respect them leads to policy collapse. It need only be the case that one of the things we try to teach them to do leads to policy collapse. Teaching values is not particularly notable among all the things we might want AI systems to do; it certainly does not seem to be among the hardest. Focussing on values makes the argument unnecessarily weak.

Third, algorithmic complexity is measured with respect to a prior. The post invokes (but does not justify) an “English speaking evil genie” prior. I don’t think anyone thinks this is a serious prior for reasoning about advanced AI system behaviour. But the post is (according to your commentary, if not the post itself) making a quantitative point—values are sufficiently complex to induce policy collapse—but it’s measuring this quantity using a nonsense prior. If the quantitative argument was indeed the original point, it is mystifying why a nonsense prior was chosen to make it, and also why no effort was made to justify the prior.
1. ↩︎
  the text proposes full value alignment as a solution to the commonsense side constraints problem, but this turned out to be stronger than necessary.

Mechanistic Anomaly Detection Research Update

Nora Belrose and David Johnston

Aug 6, 2024, 10:33 AM

11 points

0 comments1 min readLW link

(blog.eleuther.ai)

David Johnston Jun 9, 2024, 12:30 AM
2 points
1
on: Access to powerful AI might make computer security radically easier
When do you think is the right time to work in these issues? Monitoring, trust displacement and fine grained permission management all look liable to raise issues that weren’t anticipated and haven’t already been solved, because they’re not the way things have been done historically. My gut sense is that GPT4 performance is much lower when you’re asking it to do novel things. Maybe it’s also possible to make substantial gains with engineering and experimentation, but you’ll need a certain level of performance in order to experiment.

Some wild guesses: maybe the right time to start work is one generation before it’s feasible, and that might mean start now for fine grained permissions, gpt 4.5 for monitoring, gpt 5 for trust displacement.

David Johnston Feb 29, 2024, 7:06 AM
5 points
4
in reply to: Wei Dai’s comment on: Counting arguments provide no evidence for AI doom
The AI system builders’ time horizon seems to be a reasonable starting point

David Johnston Feb 29, 2024, 6:59 AM
5 points
2
on: Counting arguments provide no evidence for AI doom
Nora and/or Quentin: you talk a lot about inductive biases of neural nets ruling scheming out, but I have a vague sense that scheming ought to happen in some circumstances—perhaps rather contrived, but not so contrived as to be deliberately inducing the ulterior motive. Do you expect this to be impossible? Can you propose a set of conditions you think sufficient to rule out scheming?

David Johnston Feb 29, 2024, 6:49 AM
1 point
0
in reply to: evhub’s comment on: Counting arguments provide no evidence for AI doom
What in your view is the fundamental difference between world models and goals such that the former generalise well and the latter generalise poorly?

One can easily construct a model with a free parameter X and training data such that many choices of X will match the training data but results will diverge in situations not represented in the training data (for example, the model is a physical simulation and X tracks the state of some region in the simulation that will affect the learner’s environment later, but hasn’t done so during training). The simplest x_s could easily be wrong. We can even moralise the story: the model regards its job as predicting the output under x_s and if the world happens to operate according to some other x’ then the model doesn’t care. However it’s still going to be ineffective in the future where the value of X matters.

David Johnston Feb 8, 2024, 10:56 AM
1 point
0
in reply to: Roko’s comment on: A Back-Of-The-Envelope Calculation On How Unlikely The Circumstantial Evidence Around Covid-19 Is
Another comment on timing updates: if you’re making a timing update for zoonosis vs DEFUSE, and you’re considering a long timing window w_z for zoonosis, then your prior for a DEFUSE leak needs to be adjusted for the short window w_d in which this work could conceivably cause a leak, so you end up with something like p(defuse_pandemic)/p(zoo_pandemic)= rr_d w_d/w_z, where rr_d is the riskiness of DEFUSE vs zoonosis per unit time. Then you make the “timing update” p(now |defuse_pandemic)/p(now |zoo_pandemic) = w_z/w_d and you’re just left with rr_d.

David Johnston Feb 8, 2024, 10:24 AM
1 point
0
in reply to: Roko’s comment on: A Back-Of-The-Envelope Calculation On How Unlikely The Circumstantial Evidence Around Covid-19 Is
Sorry, I edited (was hoping to get in before you read it)

David Johnston

A brief the­ory of why we think things are good or bad

Mechanis­tic Ano­maly De­tec­tion Re­search Update

A brief theory of why we think things are good or bad

Mechanistic Anomaly Detection Research Update