What is Mathematics? by Courant and Robbins is a classic exploration that goes reasonably deep into most areas of math.
This makes me think of two very different things.
One is informational containment, ie how to run an AGI in a simulated environment that reveals nothing about the system it’s simulated on; this is a technical challenge, and if interpreted very strictly (via algorithmic complexity arguments about how improbable our universe is likely to be in something like a Solomonoff prior), is very constraining.
The other is futurological simulation; here I think the notion of simulation is pointing at a tool, but the idea of using this tool is a very small part of the approach relative to formulating a model with the right sort of moving parts. The latter has been tried with various simple models (eg the thing in Ch 4); more work can be done, but justifying the models & priors will be difficult.
Certainly, interventions may be available, just as for anything else; but it’s not fundamentally more accessible or malleable than other things.
I’m arguing that the fuzzy-ish definition that corresponds to our everyday experience/usage is better than the crisp one that doesn’t.
Re IQ and “way of thinking”, I’m arguing they both affect each other, but neither is entirely under conscious control, so it’s a bit of a moot point.
Apropos the original point, under my usual circumstances (not malnourished, hanging out with smart people, reading and thinking about engaging, complex things that can be analyzed and have reasonable success measures, etc), my IQ is mostly not under my control. (Perhaps if I was more focused on measurements, nootropics, and getting enough sleep, I could increase my IQ a bit; but not very much, I think.) YMMV.
I think what you’re saying is that if we want a coherent, nontrivial definition of “under our control” then the most natural one is “everything that depends on the neural signals from your brain”. But this definition, while relatively clean from the outside, doesn’t correspond to what we ordinarily mean; for example, if you have a mental illness, this would suggest that “stop having that illness!!” is reasonable advice, because your illness is “under your control”.
I don’t know enough neuroscience to give this a physical backing, but there are certain conscious decisions or mental moves that feel like they’re very much under my control, and I’d say the things under my control are just those, plus the things I can reliably affect using them. I think the correct intuitive definition of “locus of control” is “those things you can do if you want to”.
Regarding causal arrows between your IQ and your thoughts, I don’t think this is a well-defined query. Causality is entirely about hypothetical interventions; to say “your way of thinking affects your IQ” is just to say that if I was to change your way of thinking, I could change your IQ.
But how would I change your way of thinking? There has to be an understanding of what is being held constant, or of what range of changes we’re talking about. For instance we could change your way of thinking to any that you’d likely reach from different future influences, or to any that people similar to you have had, etc. Normally what we care about is the sort of intervention that we could actually do or draw predictions from, so the first one here is what we mean. And to some degree it’s true, your IQ would be changed.
From the other end, what does it mean to say your way of thinking is affected by your IQ? It means if we were to “modify your IQ” without doing anything else to affect your thinking, then your way of thinking would be altered. This seems true, though hard to pin down, since IQ is normally thought of as a scalar, rather than a whole range of phenomena like your “way of thinking”. IQ is sort of an amalgam of different abilities and qualities, so if we look closely enough we’ll find that IQ can’t directly affect anything at all, similarly to how g can’t (“it wasn’t your IQ that helped you come up with those ideas, it was your working memory, and creativity, and visualization ability!”); but on the other hand if most things that increase IQ make the same sort of difference (eg to academic success) then it’s fairly compact and useful to say that IQ affects those things.
Causality with fuzzy concepts is tricky.
March 2nd isn’t a Tuesday; is it Monday night or Tuesday night?
If you want to discuss the nature of reality using a similar lexicon to what philosophers use, I recommend consulting the Stanford Encyclopedia of Philosophy: http://plato.stanford.edu/
Musk has joined the advisory board of FLI and CSER, which are younger sibling orgs of FHI and MIRI. He’s aware of the AI xrisk community.
Cool. Regarding bounded utility functions, I didn’t mean you personally, I meant the generic you; as you can see elsewhere in the thread, some people do find it rather strange to think of modelling what you actually want as a bounded utility function.
This is where I thought you were missing the point:
Or you might say it’s a suboptimal outcome because you just know that this allocation is bad, or something. Which amounts to saying that actually you know what the utility function should be and it isn’t the one the analysis assumes.
Sometimes we (seem to) have stronger intuitions about allocations than about the utility function itself, and parlaying that to identify what the utility function should be is what this post is about. This may seem like a non-step to you; in that case you’ve already got it. Cheers! I admit it’s not a difficult point. Or if you always have stronger intuitions about the utility function than about resource allocation, then maybe this is useless to you.
I agree with you that there are some situations where the sublinear allocation (and exponentially-converging utility function) seems wrong and some where it seems fine; perhaps the post should initially have said “person-enjoying-chocolate-tronium” rather than chocolate.
Certainly given a utility function and a model, the best thing to do is what it is. The point was to show that some utility functions (eg using the exponential-decay sigmoid) have counterintuitive properties that don’t match what we’d actually want.
Every response to this post that takes the utility function for granted and remarks that the optimum is the optimum is missing the point: we don’t know what kind of utility function is reasonable, and we’re showing evidence that some of them give optima that aren’t what we’d actually want if we were turning the world into chocolate/hedonium.
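For concreteness, here is a minimal numerical sketch of the sort of counterintuitive optimum I have in mind, under assumptions of my own choosing (a fixed resource split across scenarios with given probabilities, maximizing expected utility under the exponential-decay sigmoid u(x) = 1 - e^-x); the setup in the post may differ, but the feature I mean is the same: the optimal allocation depends only logarithmically on probability.

```python
# A sketch, not the post's exact model: allocate a fixed resource across
# scenarios with probabilities p_i, maximizing expected utility under the
# exponential-decay sigmoid u(x) = 1 - exp(-x). The optimum is
# x_i = max(0, ln(p_i) - ln(lam)), i.e. only logarithmic in p_i.
import numpy as np
from scipy.optimize import brentq

p = np.array([0.9, 0.09, 0.009, 1e-6])   # made-up scenario probabilities
total = 100.0                            # total resource to allocate

def allocation(lam):
    return np.maximum(0.0, np.log(p / lam))

# pick the Lagrange multiplier lam so the allocation spends exactly `total`
lam = brentq(lambda l: allocation(l).sum() - total, 1e-300, p.max())
for pi, xi in zip(p, allocation(lam)):
    print(f"p = {pi:<8g} gets {xi:6.2f} units")
```

With these made-up numbers the p = 1e-6 scenario gets about 16 units while the p = 0.9 scenario gets about 30, despite being nearly a million times less probable; whether that seems fine or monstrous is exactly the kind of intuition the post is trying to use.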
If it seems strange to you to consider representing what you want by a bounded utility function, a post about that will be forthcoming.
One nonconstructive (and wildly uncomputable) approach to the problem is this one: http://www.hutter1.net/publ/problogics.pdf
I think you’re making the wrong comparisons. If you buy $1 worth, you get p(win) U(jackpot) + (1-p(win)) U(-$1), which is more-or-less p(win) U(jackpot) + U(-$1) since p(win) is tiny; this is a good idea if p(win) U(jackpot) > -U(-$1). But under usual assumptions -U(-$2) > -2U(-$1): the disutility of losses grows faster than linearly. This adds up to normality; you shouldn’t actually spend all your money. :)
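To put made-up numbers on this (the lottery parameters and the particular concave utility are my own illustrative assumptions, and I use the same additive approximation as above, treating the jackpot term and the ticket-cost term separately):

```python
# A back-of-envelope version of the comparison above, with made-up numbers.
# U is a concave utility of changes in money (log of wealth ratio here), so
# -U(-$n) grows faster than linearly; the gain from n tickets is approximated
# as n * p(win) * U(jackpot), matching the additive approximation above.
import math

wealth, jackpot, p_win = 1000.0, 1_000_000.0, 2e-4   # hypothetical lottery

def U(x):  # utility of a change x in wealth, with U(0) = 0
    return math.log(1 + x / wealth)

for n in (1, 100, 300, 500, 700, 999):
    gain = n * p_win * U(jackpot)   # ~ p(win with n tickets) * U(jackpot)
    cost = -U(-n)                   # disutility of spending $n on tickets
    print(f"${n:4d} of tickets: gain ~ {gain:.4f}, cost ~ {cost:.4f} -> "
          f"{'buy' if gain > cost else 'stop'}")
```

With these numbers the first ticket clears the bar but spending more than about half your wealth doesn’t, which is the point: the first inequality can hold while the “spend everything” version fails.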
One good negation is “the value/intrinsic utility of a life is the sum of the values/intrinsic utilities of all the moments/experiences in it, evaluated without reference to their place/context in the life story, except inasmuch as it is actually part of that moment/experience”.
The “actually” gets traction if people’s lives follow narratives that they don’t realize as they’re happening, but such that certain narratives are more valuable than others; this seems true.
If your prior distribution for “yes” conditional on the number of papers is still uniform, i.e. if the number of papers has nothing to do with whether they’re “yes” or not, then the rule still applies.
You can comfortably do Bayesian model comparison here; have priors for µ_con, µ_amn, and µ_sim, and let µ_pat be either µ_amn (under hypothesis H_amn) or µ_sim (under hypothesis H_sim), and let H_amn and H_sim be mutually exclusive. Then integrating out µ_con, µ_amn, and µ_sim, you get a marginal odds-ratio for H_amn vs H_sim, which tells you how to update.
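A minimal sketch of how that integration could go, under assumptions of mine (Normal likelihoods with known noise, broad Normal priors on the group means, made-up data), approximating each marginal likelihood by averaging the likelihood over prior draws:

```python
# Sketch only: Normal likelihoods, broad Normal priors, made-up data.
import numpy as np

rng = np.random.default_rng(0)
sigma = 1.0  # assumed-known observation noise

y_con = np.array([2.1, 1.9, 2.3])   # hypothetical control data
y_amn = np.array([1.2, 1.0, 1.4])
y_sim = np.array([0.3, 0.5, 0.2])
y_pat = np.array([1.1, 0.9, 1.3])   # the group whose mean is in question

def log_lik(y, mu):
    # log-likelihood of data y for each prior draw of the mean mu
    return (-0.5 * ((y[None, :] - mu[:, None]) ** 2).sum(axis=1) / sigma**2
            - len(y) * np.log(sigma * np.sqrt(2 * np.pi)))

def log_marginal(hypothesis, n=200_000):
    # integrate out mu_con, mu_amn, mu_sim by averaging over N(0, 10^2) prior draws
    mu_con, mu_amn, mu_sim = rng.normal(0, 10, size=(3, n))
    mu_pat = mu_amn if hypothesis == "H_amn" else mu_sim
    ll = (log_lik(y_con, mu_con) + log_lik(y_amn, mu_amn)
          + log_lik(y_sim, mu_sim) + log_lik(y_pat, mu_pat))
    return np.log(np.mean(np.exp(ll - ll.max()))) + ll.max()  # log-mean-exp

log_odds = log_marginal("H_amn") - log_marginal("H_sim")
print(f"marginal log odds-ratio, H_amn vs H_sim: {log_odds:.2f}")
```

This sketch ignores any ordering constraints on the means (like those in the frequentist setup below); imposing them would just mean restricting the priors to the ordered region.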
The standard frequentist method being discussed is nested hypothesis testing, where you want to test null hypothesis H_0 with alternative hypothesis H_1, and H_0 is supposed to be nested inside H_1. For instance you could easily test null hypothesis µ_con >= µ_amn >= µ_pat = µ_sim against µ_con >= µ_amn >= µ_pat >= µ_sim. However, for testing non-nested hypotheses, the methodology is weaker, or at least less standard.
“Alice is a banker” is a simpler statement than “Alice is a feminist banker who plays the piano”. That’s why the former must be assigned greater probability than the latter.
Complexity weights apply to worlds/models, not propositions. Otherwise you might as well say:
“Alice is a banker” is a simpler statement than “Alice is a feminist, a banker, or a pianist”. That’s why the former must be assigned greater probability than the latter.
tl;dr: miscalibration means mentally interpreting loglikelihood of data as being more or less than its actual loglikelihood; to infer it you need to assume/infer the Bayesian calculation that’s being made/approximated. Easiest with distributions over finite sets (i.e. T/F or multiple-choice questions). Also, likelihood should be called evidence.
I wonder why I didn’t respond to this when it was fresh. Anyway, I was running into this same difficulty last summer when attempting to write software to give friendly outputs (like “calibration”) to a bunch of people playing the Aumann game with trivia questions.
My understanding was that evidence needs to be measured on the logscale (as the difference between prior and posterior), and miscalibration is when your mental conversion from gut feeling of evidence to the actual evidence has a multiplicative error in it. (We can pronounce this as: “the true evidence is some multiplicative factor (called the calibration parameter) times the felt evidence”.) This still seems like a reasonable model, though of course different kinds of evidence are likely to have different error magnitudes, and different questions are likely to get different kinds of evidence, so if you have lots of data you can probably do better by building a model that will estimate your calibration for particular questions.
But sticking to the constant-calibration model, it’s still not possible to estimate your calibration from your given confidence intervals, because for that we need an idea of what your internal prior is (your “prior” prior, before you’ve taken the felt evidence into account), which is hard to get any decent sense of. You can work off of iffy assumptions, such as assuming that your prior for percentage answers in a trivia game is fitted to the set of all the percentage answers from that game and has some simple form (e.g. Beta). The Aumann game gave an advantage in this respect: rather than comparing your probability distribution before and after thinking about the question, it makes it possible to compare the distribution before and after hearing other people’s arguments and evidence, and if you always speak in terms of standard probability distributions, it’s not too hard to infer your calibration there.
Further “funny” issues can arise when you get down to work; for instance if your prior was a Student-t with df n1 and scale s1, and your posterior was a Student-t with df n2 and scale s2 > s1, then your calibration cannot be more than 1/(1 - s1^2/s2^2) without having your posterior explode. It’s tempting to say the lesson is that things break if you’re becoming asymptotically less certain, which makes some intuitive sense: if your distributions are actually mixtures of finitely many different hypotheses whose weights you’re updating Bayesianly, then you will never become asymptotically less certain; in particular the Student-t scenario I described can’t happen. However this is not a satisfactory conclusion, because the Normal scenario (where you increase your variance by upweighting a hypothesis that gives higher variance) can easily happen.
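A small check of that bound, using Normals in place of Student-t’s (my simplification, so the algebra is exact): under the evidence = calibration × felt-evidence model, the corrected posterior is proportional to prior^(1-c) × felt-posterior^c, and for a Normal prior N(m1, s1^2) and felt posterior N(m2, s2^2) the corrected precision is (1-c)/s1^2 + c/s2^2, which goes negative exactly when c exceeds 1/(1 - s1^2/s2^2) (for s2 > s1).

```python
# Sketch with made-up scales; Normals stand in for the Student-t case above.
s1, s2 = 1.0, 2.0                       # prior scale, felt-posterior scale
c_max = 1.0 / (1.0 - s1**2 / s2**2)     # the bound from the comment above
for c in (0.5, 1.0, c_max - 0.01, c_max + 0.01):
    precision = (1 - c) / s1**2 + c / s2**2   # precision of the corrected posterior
    status = "ok" if precision > 0 else "explodes"
    print(f"c = {c:5.2f}: corrected precision = {precision:+.4f} ({status})")
```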
A different resolution to the above is that the model of evidence=calibration*felt evidence is wrong, and needs an error term or two; that can give a workable result, or at least not catch fire and die.
Another thought: if your mental process is like the one two paragraphs up, where you’re working with a mixture of several fixed (e.g. normal) hypotheses, and the calibration concept is applied to how you update the weights of the hypotheses, then the change in the mixture distribution (i.e. the marginal) will not follow anything like the calibration model.
So the concept is pretty tricky unless you carefully choose problems where you can reasonably model the mental inference, and in particular try to avoid “mixture-of-hypotheses”-type scenarios (unless you know in advance precisely what the hypotheses imply, which is unusual unless you construct the questions that way... but then I can’t think of why you’d ask about the mixture instead of about the probabilities of the hypotheses themselves).
You might be okay when looking at typical multiple-choice questions; certainly you won’t run into the issues with broken posteriors and invalid calibrations. Another advantage is that “the” prior (i.e. uniform) is uncontroversial, though whether the prior to use for computing calibration should be “the” prior is not obvious; but if you don’t have before-and-after results from people then I guess it’s the best you can do.
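For the multiple-choice case, here is a sketch of what estimating a single calibration parameter could look like under the constant-calibration model with a uniform prior over the k options (the data and the maximum-likelihood fit are my own illustrative choices): the calibrated probabilities come out proportional to p_j^c, and c is fit to how often the answers you rated probable were actually right.

```python
# Sketch: estimate calibration c by maximum likelihood from multiple-choice
# answers, assuming a uniform prior over options, so the calibrated
# probabilities are proportional to (stated probability)^c. Data are made up.
import numpy as np
from scipy.optimize import minimize_scalar

stated = np.array([          # stated probabilities over the options, one row per question
    [0.70, 0.20, 0.10],
    [0.50, 0.30, 0.20],
    [0.90, 0.05, 0.05],
    [0.40, 0.40, 0.20],
    [0.80, 0.10, 0.10],
])
correct = np.array([0, 1, 0, 2, 0])   # index of the true option for each question

def neg_log_lik(c):
    cal = stated ** c
    cal /= cal.sum(axis=1, keepdims=True)          # renormalize p_j^c
    return -np.log(cal[np.arange(len(correct)), correct]).sum()

res = minimize_scalar(neg_log_lik, bounds=(0.01, 10.0), method="bounded")
print(f"estimated calibration c = {res.x:.2f}")    # c < 1: overconfident, c > 1: underconfident
```

In a before-and-after setup (as in the Aumann game) you’d replace the uniform prior with the stated “before” distribution rather than assuming one.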
I just noticed that what’s usually called the “likelihood” I was calling “evidence” here. This has probably been suggested by someone before, but: I’ve never liked the term “likelihood”, and this is the best replacement for it that I know of.
Is there a reason to think this problem is less amenable to being solved by complexity priors than other learning problems? / Might we build an unaligned agent competent enough to be problematic without solving problems similar to this one?