Do you have a writeup of the other ways of performing these edits that you tried and why you chose the one you did?
In particular, I’m surprised by the method of adding the activations that was chosen, because the tokens of the different prompts don’t line up with each other in the way I would have thought was necessary for this approach to work. It’s super interesting to me that it does.
If I were to try to reinvent the system after just reading the first paragraph or two, I would have done something like the following (rough code sketch below the list):
1. Take multiple pairs of prompts that differ primarily in the property we’re trying to capture.
2. Take the difference in the residual stream at the next-token position.
3. Take the average difference vector, and add that to every position in the newly generated text.
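For concreteness, here’s a minimal sketch of that variant (not the method from the post): it assumes a HuggingFace GPT-2, and the layer index, scale, function names, and example prompt pairs are all arbitrary choices of mine.

```python
# Sketch of the list above: average the residual-stream difference at the last
# token over contrastive prompt pairs, then add that vector at every position
# while generating. All names and hyperparameters here are my own guesses.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

model = GPT2LMHeadModel.from_pretrained("gpt2")
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model.eval()

LAYER = 6  # which residual stream to read/steer; arbitrary choice

def resid_at_last_token(prompt: str) -> torch.Tensor:
    """Residual stream at the final token of `prompt`, after block LAYER-1."""
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        hidden = model(ids, output_hidden_states=True).hidden_states[LAYER]
    return hidden[0, -1, :]  # shape: (d_model,)

def steering_vector_from_pairs(pairs: list[tuple[str, str]]) -> torch.Tensor:
    """Average next-token residual difference over contrastive prompt pairs."""
    diffs = [resid_at_last_token(pos) - resid_at_last_token(neg) for pos, neg in pairs]
    return torch.stack(diffs).mean(dim=0)

def generate_with_steering(prompt: str, vec: torch.Tensor, scale: float = 4.0) -> str:
    """Add `vec` at every position of the chosen block's output during generation."""
    def hook(_module, _inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + scale * vec  # broadcasts over all positions
        return (hidden, *output[1:]) if isinstance(output, tuple) else hidden

    # hidden_states[LAYER] is the output of block LAYER-1, so hook that block.
    handle = model.transformer.h[LAYER - 1].register_forward_hook(hook)
    try:
        ids = tokenizer(prompt, return_tensors="pt").input_ids
        out = model.generate(ids, max_new_tokens=40, do_sample=True, top_p=0.9,
                             pad_token_id=tokenizer.eos_token_id)
    finally:
        handle.remove()
    return tokenizer.decode(out[0], skip_special_tokens=True)

pairs = [("I love talking about weddings", "I hate talking about weddings"),
         ("Weddings make me so happy", "Weddings make me so sad")]
vec = steering_vector_from_pairs(pairs)
print(generate_with_steering("I went to the park and", vec))
```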
I’d love to know which parts were chosen from among many alternatives as the ones that worked best, and which were just the first/only things tried.
At first glance I thought this was too abstract to be a useful plan, but coming back to it I think it is promising as a form of automated training for an aligned agent, given an agent that is excellent at evaluating small logic chains, along the lines of Constitutional AI or training for consistency. You would have training loops over synthetic data that can train for all of these forms of consistency, probably implementable as an MVP with current systems.
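To make that concrete, here’s a toy sketch of the loop structure I’m imagining; the statement sampler and the consistency judge are stand-in stubs for real model calls, and none of the names come from an existing system.

```python
# Toy sketch of a consistency-training loop over synthetic data.
# In a real MVP, sample_statements and judge_consistency would be LLM calls;
# here they are stubs so the loop structure is runnable end to end.
import random

def sample_statements(topic: str, n: int = 2) -> list[str]:
    """Stand-in for sampling the agent's statements about a topic (synthetic data)."""
    templates = [f"I value {topic}.", f"I don't care about {topic}."]
    return [random.choice(templates) for _ in range(n)]

def judge_consistency(a: str, b: str) -> float:
    """Stand-in for an agent that is excellent at evaluating small logic chains."""
    return 1.0 if a == b else 0.0

def consistency_training_step(topics: list[str], threshold: float = 0.5):
    """Generate synthetic pairs and score them; in the real loop, the inconsistent
    pairs would become fine-tuning targets (penalize or revise, Constitutional-AI style)."""
    keep, discard = [], []
    for topic in topics:
        a, b = sample_statements(topic)
        score = judge_consistency(a, b)
        (keep if score >= threshold else discard).append((topic, a, b, score))
    return keep, discard

keep, discard = consistency_training_step(["honesty", "privacy", "user autonomy"])
print(f"consistent pairs: {len(keep)}, inconsistent pairs to train against: {len(discard)}")
```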
The main unknown would be how to detect when you are confident enough that its stated values align with human values to start moving down the causal chain toward fitting actions to values, since that step is clearly strongly capabilities-enhancing.
Perhaps you could at least get a measure by looking at comparisons that require multiple steps (human value → stated value → belief, etc.) and asking which link is the bottleneck to reaching the conclusion a human would want. Positing that the agent is capable of this might be assuming away a lot of the problem, though.
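As a sketch of what that measurement could look like (entirely hypothetical, with a stub judge standing in for a real evaluator): score each link of a value → belief → action chain and report the weakest one.

```python
# Hypothetical "which link is the bottleneck?" probe over a multi-step chain.
# judge_link is a toy stub for a model that scores whether the conclusion
# follows from the premise; the chain below is just an illustrative example.
def judge_link(premise: str, conclusion: str) -> float:
    """Stand-in judge: crude word-overlap heuristic instead of a real model call."""
    return 1.0 if premise.split()[-1] in conclusion else 0.2

def bottleneck(chain: list[str]) -> tuple[int, float]:
    """Return the index and score of the weakest link in a reasoning chain."""
    scores = [judge_link(chain[i], chain[i + 1]) for i in range(len(chain) - 1)]
    weakest = min(range(len(scores)), key=scores.__getitem__)
    return weakest, scores[weakest]

chain = [
    "Humans value informed consent",                        # human value
    "I should value informed consent",                      # the agent's stated value
    "Users must be told how their data is used",            # belief derived from the value
    "Add a clear disclosure step before collecting data",   # action
]
idx, score = bottleneck(chain)
print(f"weakest link: step {idx} -> {idx + 1} (score {score:.2f})")
```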