Hoagy

Karma: 1,068

Hoagy Sep 29, 2023, 10:12 AM
1 point
0
in reply to: JavierCC’s comment on: how do short-timeliners reason about the differences between brain and AI?
See e.g. “So I think backpropagation is probably much more efficient than what we have in the brain.” from https://www.therobotbrains.ai/geoff-hinton-transcript-part-one

More generally, I think the belief that there’s some kind of important advantage that cutting edge AI systems have over humans comes more from human-AI performance comparisons, e.g. GPT-4 way outstrips any individual human in factual knowledge about the world (though obv deficient in other ways) with probably 100x fewer params. A bioanchors-based model of AI development would imo predict that this is very unlikely. Whether the core of this advantage is in the form, volume or information density of the data, the architecture, or something about the underlying hardware, I am less confident.

Hoagy Sep 29, 2023, 9:13 AM
1 point
0
in reply to: JavierCC’s comment on: how do short-timeliners reason about the differences between brain and AI?
Not totally sure but I think it’s pretty likely that scaling gets us to AGI, yeah. Or more particularly, gets us to the point of AIs being able to act as autonomous researchers or as high (>10x) multipliers on the productivity of human researchers, which seems like the key moment of leverage for deciding how the development of AI will go.

Don’t have a super clean idea of what self-reflective thought means. I see that e.g. GPT-4 can often say something, think further about it, and then revise its opinion. I would expect a little bit of extra reasoning quality and general competence to push this ability a lot further.

Hoagy Sep 27, 2023, 1:13 PM
9 points
0
on: how do short-timeliners reason about the differences between brain and AI?
1 line summary is that NNs can transmit signals directly from any part of the network to any other, while the brain has to work only locally.

More broadly I get the sense that there’s been a bit of a shift in at least some parts of theoretical neuroscience, from understanding how we might be able to implement brain-like algorithms to understanding how the local algorithms that the brain uses might be able to approximate backprop. That suggests artificial networks might have an easier time than the brain, so it would make sense that we could make something which outcompetes the brain without a similar diversity of neural structures.

This is way outside my area tbh, working off just a couple of things like this paper by Beren Millidge https://arxiv.org/pdf/2006.04182.pdf and some comments by Geoffrey Hinton that I can’t source.

Hoagy Sep 26, 2023, 1:40 PM
LW: 4 AF: 3
0
AF
in reply to: Scott Emmons’s comment on: Sparse Autoencoders Find Highly Interpretable Directions in Language Models
Hi Scott, thanks for this!

Yes, I did do a fair bit of literature searching (though maybe not enough tbf), but it was very focused on sparse coding and approaches to learning decompositions of model activation spaces, rather than approaches to learning models which are monosemantic by default, which I’ve never had much confidence in. It seems that there’s not a huge amount beyond Yun et al’s work, at least as far as I’ve seen.

Still though, I’ve seen almost none of these, which suggests a big hole in my knowledge, and in the paper I’ll go through and add a lot more background on attempts to make more interpretable models.

Hoagy Sep 25, 2023, 10:29 PM
8 points
1
in reply to: evhub’s comment on: Amazon to invest up to $4 billion in Anthropic
Cheers, I did see that and wondered whether still to post the comment, but I do think that having a gigantic company owning a large chunk of the company, and presumably having a lot of leverage over it, is a new form of pressure, so it’d be reassuring to have some discussion of how to manage that relationship.

Hoagy Sep 25, 2023, 4:47 PM
10 points
7
on: Amazon to invest up to $4 billion in Anthropic
Would be interested to hear from Anthropic leadership about how this is expected to interact with previous commitments about putting decision making power in the hands of their Long-Term Benefit Trust.

I get that they’re in some sense just another minority investor but a trillion-dollar company having Anthropic be a central plank in their AI strategy with a multi-billion investment and a load of levers to make things difficult for the company (via AWS) is a step up in the level of pressure to aggressively commercialise.

Hoagy Sep 22, 2023, 3:01 PM
LW: 3 AF: 2
0
AF
in reply to: Charlie Steiner’s comment on: Sparse Autoencoders Find Highly Interpretable Directions in Language Models
Hi Charlie, yep it’s in the paper—but I should say that we did not find a working CUDA-compatible version and used the scikit version you mention. This meant that the data volumes used are somewhat limited—still on the order of a million examples but 10-50x less than went into the autoencoders.

It’s not clear whether the extra data would provide much signal since it can’t learn an overcomplete basis and so has no way of learning rare features but it might be able to outperform our ICA baseline presented here, so if you wanted to give someone a project of making that available, I’d be interested to see it!

Sparse Autoencoders Find Highly Interpretable Directions in Language Models

Logan Riggs, Hoagy, Aidan Ewart and Robert_AIZI

Sep 21, 2023, 3:30 PM

159 points

8 comments · 5 min read · LW link

Hoagy Sep 17, 2023, 12:01 AM
1 point
1
in reply to: Gurkenglas’s comment on: Influence functions—why, what and how
How do you know?

Hoagy Aug 24, 2023, 11:23 PM
1 point
0
in reply to: Nathan Young’s comment on: Simple AI Forecasting—Katja Grace
seems like it’d be better formatted as a nested list given the volume of text

Hoagy Aug 23, 2023, 8:20 PM
1 point
0
in reply to: veered’s comment on: Which possible AI systems are relatively safe?
Why would we expect the expected level of danger from a model of a certain size to rise as the set of potential solutions grows?

Hoagy Aug 23, 2023, 7:11 PM
7 points
10
on: Why Is No One Trying To Align Profit Incentives With Alignment Research?
I think both Leap Labs and Apollo Research (both fairly new orgs) are trying to position themselves as offering model auditing services in the way you suggest.

Hoagy Aug 1, 2023, 1:57 AM
7 points
2
on: The “public debate” about AI is confusing for the general public and for policymakers because it is a three-sided debate
A useful model for why it’s both appealing and difficult to say ‘Doomers and Realists are both against dangerous AI and for safety—let’s work together!’.

Hoagy Jul 28, 2023, 3:08 AM
LW: 5 AF: 4
2
AF
on: Open problems in activation engineering
Try decomposing the residual stream activations over a batch of inputs somehow (e.g. PCA). Using the principal directions as activation addition directions, do they seem to capture something meaningful?
It’s not PCA but we’ve been using sparse coding to find important directions in activation space (see original sparse coding post, quantitative results, qualitative results).
We’ve found that they’re on average more interpretable than neurons and I understand that @Logan Riggs and Julie Steele have found some effect using them as directions for activation patching, e.g. using a “this direction activates on curse words” direction to make text more aggressive. If people are interested in exploring this further let me know, say hi at our EleutherAI channel or check out the repo :)
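For anyone wanting to try the PCA version of the suggestion quoted above, here is a minimal sketch, assuming NumPy and using synthetic data as a stand-in for real residual stream activations. The function names `principal_directions` and `add_direction` are hypothetical helpers made up for illustration, not anything from our repo, and this is plain PCA rather than the sparse coding approach we actually use.

```python
import numpy as np


def principal_directions(acts: np.ndarray, k: int) -> np.ndarray:
    """Return the top-k unit-norm principal directions of acts (n_samples, d_model)."""
    centered = acts - acts.mean(axis=0, keepdims=True)
    # Rows of vt are orthonormal directions, sorted by decreasing singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[:k]


def add_direction(acts: np.ndarray, direction: np.ndarray, scale: float) -> np.ndarray:
    """Activation addition: shift every activation vector along one direction."""
    return acts + scale * direction


rng = np.random.default_rng(0)

# Synthetic stand-in for activations (an assumption, not real model data):
# isotropic noise plus one high-variance direction along coordinate 3.
true_dir = np.zeros(16)
true_dir[3] = 1.0
acts = rng.normal(size=(512, 16)) + 5.0 * rng.normal(size=(512, 1)) * true_dir

dirs = principal_directions(acts, k=2)
steered = add_direction(acts, dirs[0], scale=4.0)
```

With a real model you would collect `acts` from a hook on the residual stream and add the scaled direction back in during the forward pass; whether the resulting directions mean anything is exactly the open question.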

Hoagy 27 Jul 2023 17:25 UTC
4 points
on: Neuronpedia—AI Safety Game
Hi, nice work! You mentioned the possibility of neurons being the wrong unit. I think that this is the case and that our current best guess for the right unit is directions in the output space, ie linear combinations of neurons.
We’ve done some work using dictionary learning to find these directions (see original post, recent results) and find that with sparse coding we can find dictionaries of features that are more interpretable than the neuron basis (though they don’t explain 100% of the variance).
We’d be really interested to see how this compares to neurons in a test like this and could get a sparse-coded breakdown of gpt2-small layer 6 if you’re interested.
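To make "directions as linear combinations of neurons" and "explaining the variance" concrete, here is a rough sketch assuming NumPy. It uses a random, untrained dictionary with tied encoder/decoder weights and a fixed bias, all of which are illustrative assumptions rather than our actual training setup, so the numbers it produces are not meaningful results.

```python
import numpy as np


def encode(x: np.ndarray, dictionary: np.ndarray, bias: np.ndarray) -> np.ndarray:
    """Non-negative feature activations; tied weights are an assumption here."""
    return np.maximum(x @ dictionary.T + bias, 0.0)


def decode(codes: np.ndarray, dictionary: np.ndarray) -> np.ndarray:
    """Reconstruct activations as a linear combination of dictionary directions."""
    return codes @ dictionary


def fraction_variance_explained(x: np.ndarray, x_hat: np.ndarray) -> float:
    """1 - (residual sum of squares) / (total variance around the mean)."""
    residual = ((x - x_hat) ** 2).sum()
    total = ((x - x.mean(axis=0)) ** 2).sum()
    return float(1.0 - residual / total)


rng = np.random.default_rng(0)
d_model, n_features = 8, 32  # overcomplete: more features than dimensions

# Random unit-norm directions; a trained dictionary would be learned instead.
dictionary = rng.normal(size=(n_features, d_model))
dictionary /= np.linalg.norm(dictionary, axis=1, keepdims=True)
bias = np.full(n_features, -0.5)  # negative bias keeps most features inactive

x = rng.normal(size=(64, d_model))  # synthetic stand-in for activations
codes = encode(x, dictionary, bias)
x_hat = decode(codes, dictionary)

sparsity = float((codes > 0).mean())  # fraction of active feature entries
fve = fraction_variance_explained(x, x_hat)
```

Each row of `dictionary` is one candidate feature direction, and `fve` is the quantity that falls short of 1 when the features don't explain all of the variance.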

Hoagy 27 Jul 2023 16:56 UTC
1 point
0
on: AI-Plans.com 10-day Critique-a-Thon
Link at the top doesn’t work for me

Hoagy 24 Jul 2023 21:12 UTC
2 points
1
in reply to: Bogdan Ionut Cirstea’s comment on: Going Beyond Linear Mode Connectivity: The Layerwise Linear Feature Connectivity
I still don’t quite see the connection—if it turns out that LLFC holds between different fine-tuned models to some degree, how will this help us interpolate between different simulacra?

Is the idea that we could fine-tune models to only instantiate certain kinds of behaviour and then use LLFC to interpolate between (and maybe even extrapolate between?) different kinds of behaviour?

Hoagy 22 Jul 2023 21:42 UTC
LW: 7 AF: 3
1
AF
on: Compute Thresholds: proposed rules to mitigate risk of a “lab leak” accident during AI training runs
For the avoidance of doubt, this accounting should recursively aggregate transitive inputs.
What does this mean?

Hoagy 19 Jul 2023 19:06 UTC
5 points
0
on: Hedonic Loops and Taming RL
Importantly, this policy would naturally be highly specialized to a specific reward function. Naively, you can’t change the reward function and expect the policy to instantly adapt; instead you would have to retrain the network from scratch.
I don’t understand why standard RL algorithms in the basal ganglia wouldn’t work. Like, most RL problems have elements that can be viewed as homeostatic—if you’re playing boxcart then you need to go left/right depending on position. Why can’t that generalise to seeking food iff stomach is empty? Optimizing for a specific reward function doesn’t seem to preclude that function itself being a function of other things (which just makes it a more complex function).
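A toy sketch of what I mean, in plain Python: the reward function is fixed, but it takes an internal state variable as input, and a standard tabular learner picks up a state-dependent policy. Everything here (the `stomach_full` flag, the two actions, the reward values, the one-step setup) is made up for illustration and isn't a claim about how the basal ganglia work.

```python
import random

ACTIONS = ("eat", "rest")


def reward(action: str, stomach_full: bool) -> float:
    """One fixed reward function whose value depends on an internal state."""
    if action == "eat":
        return -1.0 if stomach_full else 1.0
    return 0.0  # resting is neutral either way


def greedy_policy(q: dict, state: bool) -> str:
    """Pick the action with the highest learned value in this state."""
    return max(q[state], key=q[state].get)


def train(episodes: int = 2000, lr: float = 0.1, eps: float = 0.1, seed: int = 0) -> dict:
    """Epsilon-greedy value learning on one-step episodes (a contextual bandit)."""
    rng = random.Random(seed)
    q = {state: {a: 0.0 for a in ACTIONS} for state in (False, True)}
    for _ in range(episodes):
        stomach_full = rng.random() < 0.5  # internal state is part of the observation
        if rng.random() < eps:
            action = rng.choice(ACTIONS)
        else:
            action = greedy_policy(q, stomach_full)
        # Move the value estimate toward the observed reward.
        q[stomach_full][action] += lr * (reward(action, stomach_full) - q[stomach_full][action])
    return q


q = train()
```

Here `greedy_policy(q, False)` ends up choosing to eat and `greedy_policy(q, True)` ends up choosing to rest, without retraining when the stomach state changes, because the state is just another input.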
What am I missing?

AutoInterpretation Finds Sparse Coding Beats Alternatives

Hoagy 17 Jul 2023 1:41 UTC

57 points

1 comment · 7 min read · LW link
