A thing I really like about the approach in this paper is that it makes use of a lot more of the model’s knowledge of human values than traditional RLHF approaches. Pretrained LLMs already know a ton of what humans say about human values, and this seems like a much more direct way to point models at that knowledge than binary feedback on samples.
How does this correctness check work?
I usually think of gauge freedom as saying “there is a family of configurations that all produce the same observables”. I don’t think that gives a way to say some configurations are correct/incorrect. Rather, some pairs of configurations are equivalent and some aren’t.
That said, I do think you can probably do something like the approach described to assign a label to each equivalence class of configurations and do your evolution in that label space, which avoids having to pick a gauge.
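To make that concrete, here’s a toy illustration (my own, not from the original discussion): treat a complex amplitude as the “configuration”, with a global phase rotation as the gauge freedom. Rotating the phase changes the configuration but not the observable, so no single phase is “correct”; tracking a gauge-invariant label like the magnitude lets you work entirely in label space without ever fixing a gauge.

```python
import numpy as np

# Toy example: a "configuration" is a complex amplitude z; the observable is |z|^2.
# Multiplying z by a phase is a gauge transformation: it changes the configuration
# but leaves every observable unchanged.

def observable(z: complex) -> float:
    return abs(z) ** 2

z1 = 1.0 + 2.0j
z2 = z1 * np.exp(1j * 0.7)   # a gauge-equivalent configuration

assert np.isclose(observable(z1), observable(z2))
# Neither z1 nor z2 is "more correct"; they are two members of one equivalence class.

# "Evolving in label space" means working directly with a gauge-invariant label
# of the equivalence class, e.g. r = |z|, so no gauge ever has to be chosen.
r = abs(z1)
```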
How would you classify optimization shaped like “write a program to solve the problem for you”? It’s not directly searching over solutions (though the program you write might). Maybe it’s a form of amortized optimization?
Separately: The optimization-type distinction clarifies a question I’ve circled many times when talking about inner optimization with people, namely “Is optimization the same as search, or is search just one way to get optimization?” And I think this distinction gives me something to point to in saying “search is one way to get (direct) optimization, but there are other kinds of optimization”.
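For what it’s worth, here’s the toy contrast I have in mind (my own illustration with made-up numbers, not something from the post):

```python
# Toy contrast (my own illustration): maximize f(x) = -(x - 3)**2.

def f(x):
    return -(x - 3) ** 2

# Direct optimization: search over candidate solutions at runtime.
candidates = [i * 0.1 for i in range(100)]
direct_answer = max(candidates, key=f)

# Amortized optimization: the search cost was paid earlier (here by deriving the
# closed form; in ML, by training a network), so "solving" is just evaluation.
def amortized_solver():
    return 3.0  # derived/"learned" optimum, no runtime search

amortized_answer = amortized_solver()

# "Write a program to solve the problem for you" looks like the amortized case at
# the top level: that step isn't itself a search, though the program it produces
# may well run one internally.
```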
I might be totally wrong here, but could this approach be used to train models that are more likely to be myopic (than e.g. existing RL reward functions)? I’m thinking specifically of the form of myopia that says “only care about the current epoch”, which you could train for by (1) indexing epochs, (2) giving the model access to its epoch index, (3) having the reward function go negative past a certain epoch, (4) giving the model the ability to shut down. Then you could maybe make a model that only wants to run for a few epochs and then shuts off, and maybe that helps avoid cross-epoch optimization?
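Concretely, the kind of reward function I’m imagining for (1)–(4) would look something like this (the cutoff, the special action, and the specific reward values are all made up for illustration, not a real training setup):

```python
# Hypothetical sketch of the scheme above; CUTOFF_EPOCH, SHUTDOWN, and the
# reward values are illustrative assumptions.

CUTOFF_EPOCH = 10
SHUTDOWN = "shutdown"   # a special action the model is allowed to take

def reward(epoch_index: int, action: str, task_reward: float) -> float:
    if action == SHUTDOWN:
        return 0.0        # shutting down is always neutral
    if epoch_index >= CUTOFF_EPOCH:
        return -1.0       # any continued activity past the cutoff is penalized
    return task_reward    # ordinary task reward before the cutoff
```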
That’s definitely a thing that can happen.
I think the surgeon can always be made ~arbitrarily powerful, and the trick is making it not so powerful (or powerful in trivial ways) that it e.g. precludes the model from performing well despite the surgeon’s interference.
So I think the core question is: are there ways to make a sufficiently powerful surgeon which is also still defeasible by a model that does what we want?
A trick I sometimes use, related to this post, is to ask whether my future self would like to buy back my present time at some rate. This somehow makes your point about intertemporal substitution more visceral for me, and makes it easier to say “oh yes this thing which is pricier than my current rate definitely makes sense at my plausible future rate”.
In fact, it’s not 100% clear that AI systems could learn to deceive and manipulate supervisors even if we deliberately tried to train them to do it. This makes it hard to even get started on things like discouraging and detecting deceptive behavior.
Plausibly we already have examples of (very weak) manipulation, in the form of models trained with RLHF saying false-but-plausible-sounding things, or lying and saying they don’t know something (but happily providing that information in different contexts). [E.g. ChatGPT denies having information about how to build nukes, but will also happily tell you about different methods for uranium isotope separation.]
Yeah. Or maybe it doesn’t even go to zero, but it isn’t increasing.
Could it be that Chris’s diagram gets recovered if the vertical scale is “total interpretable capabilities”? Like maybe tiny transformers are more interpretable in that we can understand ~all of what they’re doing, but they’re not doing much, so maybe it’s still the case that the amount of capability we can understand has a valley and then a peak at higher capability.
That’s a good point: it definitely pushes in the direction of making the model’s internals harder to adversarially attack. I do wonder how accessible “encrypted” is here versus just “actually robust” (which is what I’m hoping for in this approach). The intuition here is that you want your model to be able to identify that a rogue thought like “kill people” is not a thing to act on, and that looks like being robust.
And: having a lot of capital could be very useful in the run-up to TAI, e.g. for pursuing/funding safety work.
Roughly, I think it’s hard to construct a reward signal that makes models answer questions when they know the answers and say they don’t know when they don’t know. Doing that requires that you are always able to tell what the correct answer is during training, and that’s expensive to do. (Though e.g. Anthropic seems to have made some progress here: https://arxiv.org/abs/2207.05221).
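To spell out where the expense comes in, here’s a minimal sketch (my own construction, not Anthropic’s setup): the grader can only reward “answer when you know, abstain when you don’t” if it has the ground-truth answer for every training question.

```python
# Minimal sketch (my own construction): grading calibrated answers requires a
# ground-truth label for every question, which is the expensive part.

def grade(model_answer: str, ground_truth: str | None) -> float:
    if ground_truth is None:
        # Without a known answer we can't distinguish an honest "I don't know"
        # from evasion, so the question is unusable for this reward signal.
        raise ValueError("need ground truth to grade this question")
    if model_answer.strip().lower() == "i don't know":
        return 0.0    # neutral reward for admitting ignorance
    return 1.0 if model_answer == ground_truth else -1.0
```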
So indeed with cross-entropy loss I see two plateaus! Here’s rank 2:
(note that I’ve offset the loss so that equality of Z and C gives zero loss)
I have trouble getting rank 10 to find the zero-loss solution:
But the phenomenology at full rank is unchanged:
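For reference, here’s the kind of minimal setup I have in mind (my own reconstruction; the matrix sizes, optimizer, and parameterization are guesses rather than the original experiment code): fit a rank-r product to a fixed target matrix C with a row-wise cross-entropy loss, offset by the entropy of C so that exact equality gives zero loss.

```python
import torch

# My reconstruction (sizes, optimizer, and parameterization are assumptions):
# fit Z = softmax(A @ B), a rank-r model, to a fixed row-stochastic target C with
# row-wise cross-entropy, minus the entropy of C so that Z == C gives zero loss.

torch.manual_seed(0)
n, r = 16, 2                                      # rank 2; try r = 10 or r = n as well
C = torch.softmax(torch.randn(n, n), dim=-1)      # target: one distribution per row

A = (0.01 * torch.randn(n, r)).requires_grad_()
B = (0.01 * torch.randn(r, n)).requires_grad_()
opt = torch.optim.Adam([A, B], lr=1e-2)

target_entropy = -(C * torch.log(C)).sum(dim=-1).mean()   # the offset term

for step in range(20_000):
    log_Z = torch.log_softmax(A @ B, dim=-1)
    loss = -(C * log_Z).sum(dim=-1).mean() - target_entropy   # = mean KL(C || Z) >= 0
    opt.zero_grad()
    loss.backward()
    opt.step()
```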
Got it, I see. I think of the two as really intertwined (e.g. a big part of my agenda at the moment is studying how biases/path-dependence in SGD affect interpretability/polysemanticity).
I think there’s tons of low-hanging fruit in toy model interpretability, and I expect at least some lessons from at least some such projects to generalize. A lot of the questions I’m excited about in interpretability are fundamentally accessible in toy models, like “how do models trade off interference and representational capacity?”, “what priors do MLPs have over different hypotheses about the data distribution?”, etc.