I’m not sure if this is intentional but this explanation implies that edge patching can only be done between nodes in adjacent layers, which is not the case.
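To illustrate (with a toy setup and invented tensor names, not anything from the post): because each downstream node just reads the residual stream, which is a sum of every upstream node's output, you can patch an edge from any earlier node to any later node by swapping only the source node's contribution to the destination's input. Adjacency never comes into it.

```python
import torch

# Toy setup: the destination node's input is the sum of all upstream outputs.
# Shapes and names here are illustrative only.
d_model = 16
n_upstream = 5   # upstream nodes; the source need not be adjacent to the destination
src = 1          # source node of the edge we want to patch

clean_outs = torch.randn(n_upstream, d_model)    # upstream outputs on the clean prompt
corrupt_outs = torch.randn(n_upstream, d_model)  # upstream outputs on the corrupt prompt

# Ordinary corrupt-run input to the destination node.
corrupt_resid = corrupt_outs.sum(dim=0)

# Patch only the edge src -> dst: swap in the clean output of the source node
# while every other upstream contribution stays corrupt. The source can sit
# any number of layers below the destination.
patched_dst_input = corrupt_resid + (clean_outs[src] - corrupt_outs[src])
```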
Joseph Miller
Yes you’re correct that it does not work with LayerNorm between layers. I’m not aware of any models that do this. Are you?
Did you try how this works in practice? I could imagine an SGD-based circuit finder could be pretty efficient (compared to brute-force algorithms like ACDC), I’d love to see that comparison some day!
Yes it does work well! I did a kind of write up here but decided not to publish for various reasons.
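As a rough illustration of the idea (not the exact method from that write-up, and with all names and the tiny frozen readout invented for the example): learn a sigmoid mask over edges that interpolates each edge between its clean and corrupt activation, then optimize the task loss plus a sparsity penalty so gradient descent finds a small set of edges that must stay clean.

```python
import torch

# Hypothetical toy setup: each "edge" contributes a vector to the residual
# stream, and a frozen linear readout maps the summed residual to logits.
torch.manual_seed(0)
n_edges, d_model, n_classes = 20, 16, 4
clean_acts = torch.randn(n_edges, d_model)    # edge contributions on the clean input
corrupt_acts = torch.randn(n_edges, d_model)  # edge contributions on the corrupt input
readout = torch.nn.Linear(d_model, n_classes).requires_grad_(False)
clean_label = torch.tensor(2)

# Learnable logits for a soft mask over edges: mask ~ 1 keeps the clean edge,
# mask ~ 0 replaces it with its corrupt (patched) value.
mask_logits = torch.zeros(n_edges, requires_grad=True)
opt = torch.optim.Adam([mask_logits], lr=0.1)
sparsity_coef = 0.05

for step in range(200):
    mask = torch.sigmoid(mask_logits)                                    # [n_edges]
    acts = mask[:, None] * clean_acts + (1 - mask[:, None]) * corrupt_acts
    logits = readout(acts.sum(dim=0))                                    # aggregate and read out
    task_loss = torch.nn.functional.cross_entropy(logits[None], clean_label[None])
    loss = task_loss + sparsity_coef * mask.mean()   # keep as few clean edges as possible
    opt.zero_grad()
    loss.backward()
    opt.step()

circuit_edges = (torch.sigmoid(mask_logits) > 0.5).nonzero().squeeze(-1)
```

One forward and backward pass per step over the whole edge mask is what makes this cheap compared to ablating edges one at a time.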
Do you have a link to a writeup of Li et al. (2023) beyond the git repo?
How To Do Patching Fast
I quit YouTube a few years ago and it was probably the single best decision I’ve ever made.
However I also found that I naturally substitute it with something else. For example, I subsequently became addicted to Reddit. When I quit Reddit, I substituted it with Hacker News and LessWrong. When I quit those, I substituted them with checking Slack, email and Discord.
Thankfully being addicted to Slack does seem to be substantially less harmful than YouTube.
I’ve found the app OneSec very useful for reducing addictions. It’s an app blocker that doesn’t actually block; it just delays you opening the page, so you’re much less likely to delete it in a moment of weakness.
Or is that sentence meant to indicate that an instance running after training might figure out how to hack the computer running it so it can actually change its own weights?
I was thinking of a scenario where OpenAI deliberately gives it access to its own weights to see if it can self improve.
I agree that it would be more likely to just speed up normal ML research.
While I want people to support PauseAI,
the small movement that PauseAI builds now will be the foundation which bootstraps this larger movement in the future
is one of the main points of my post. If you support PauseAI today, you may unleash a force which you cannot control tomorrow.
Why I’m doing PauseAI
If you want to be healthier, we know ways you can change your diet that will help: Increase your overall diet “quality”. Eat lots of fruits and vegetables. Avoid processed food. Especially avoid processed meats. Eat food with low caloric density. Avoid added sugar. Avoid alcohol. Avoid processed food.
I’m confused—why are you so confident that we should avoid processed food? Isn’t the whole point of your post that we don’t know whether processed oil is bad for you? Where’s the overwhelming evidence that processed food in general is bad?
Reconstruction loss is the CE loss of the patched model
If this is accurate then I agree that this is not the same as “the KL Divergence between the normal model and the model when you patch in the reconstructed activations”. But Fengyuan described reconstruction score as:
measures how replacing activations changes the total loss of the model
which I still claim is equivalent.
I think just showing would be better than reconstruction score metric because is very noisy.
there is a validation metric called reconstruction score that measures how replacing activations change the total loss of the model
That’s equivalent to the KL metric. Would be good to include as I think it’s the most important metric of performance.
Patch loss is different to L2. It’s the KL Divergence between the normal model and the model when you patch in the reconstructed activations at some layer.
It would be good to benchmark the normalized and baseline SAEs using the standard metrics of patch loss and L0.
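To make that concrete, here is a minimal sketch of the two metrics as I understand them (invented tensor names, not tied to any particular SAE codebase): patch loss is the KL divergence between the normal model's next-token distribution and the distribution after splicing the SAE reconstruction into the residual stream, and L0 is the average number of active SAE features per token.

```python
import torch
import torch.nn.functional as F

# Hypothetical shapes and tensor names, for illustration only.
batch, seq, vocab, d_sae = 2, 10, 100, 512

# Logits from a normal forward pass, and from a forward pass where the SAE's
# reconstruction replaces the residual stream at the chosen layer.
clean_logits = torch.randn(batch, seq, vocab)
patched_logits = torch.randn(batch, seq, vocab)

# SAE feature activations at the same token positions.
feature_acts = torch.relu(torch.randn(batch, seq, d_sae))

# Patch loss: KL(normal model || patched model) over next-token distributions,
# averaged across token positions. F.kl_div(input, target) computes
# KL(target || input) when both arguments are log-probabilities.
kl_per_token = F.kl_div(
    patched_logits.log_softmax(dim=-1),   # log-probs of the patched model
    clean_logits.log_softmax(dim=-1),     # log-probs of the normal model
    log_target=True,
    reduction="none",
).sum(dim=-1)
patch_loss = kl_per_token.mean()

# L0: average number of active (nonzero) SAE features per token.
l0 = (feature_acts > 0).float().sum(dim=-1).mean()
```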
What is Neuron Activity?
Does anyone think this could actually be done in <20 years?
materialism where we haven’t discovered all the laws of physics yet — specifically, those that constitute the sought-for materialist explanation of consciousness
It seems unlikely that new laws of physics are required to understand consciousness? My claim is that understanding consciousness just requires us to understand the algorithms in the brain.
Without that real explanation, “atoms!” or “materialism!”, is just a label plastered over our ignorance.
Agreed. I don’t think this contradicts what I wrote (not sure if that was the implication).
The Type II error of behaving as if these and future systems are not conscious in a world where they are in fact conscious.
Consciousness does not have a commonly agreed upon definition. The question of whether an AI is conscious cannot be answered until you choose a precise definition of consciousness, at which point the question falls out of the realm of philosophy into standard science.
This might seem like mere pedantry or missing the point, because the whole challenge is to figure out the definition of consciousness, but I think it is actually the central issue. People are grasping for some solution to the “hard problem” of capturing the je ne sais quoi of what it is like to be a thing, but they will not succeed until they deconfuse themselves about the intangible nature of sentience.
You cannot know about something unless it is somehow connected to the causal chain that led to the current state of your brain. If we know about a thing called “consciousness” then it is part of this causal chain. Therefore “consciousness”, whatever it is, is a part of physics. There is no evidence for, and there cannot ever be evidence for, any kind of dualism or epiphenomenal consciousness. This leaves us to conclude that either panpsychism or materialism is correct. And causally-connected panpsychism is just materialism where we haven’t discovered all the laws of physics yet. This is basically the argument for illusionism.
So “consciousness” is the algorithm that causes brains to say “I think therefore I am”. Is there some secret sauce that makes this algorithm special and different from all currently known algorithms, such that if we understood it we would suddenly feel enlightened? I doubt it. I expect we will just find a big pile of heuristics and optimization procedures that are fundamentally familiar to computer science. Maybe you disagree, that’s fine! But let’s just be clear that that is what we’re looking for, not some other magisterium.
If consciousness is indeed sufficient for moral patienthood, then the stakes seem remarkably high from a utilitarian perspective.
Agreed, if your utility function is that you like computations similar to the human experience of pleasure and dislike computations similar to the human experience of pain (mine is!). But again, let’s not confuse ourselves by thinking there’s some deep secret about the nature of reality to uncover. Your choice of meta-ethical system is of the same type signature as your choice of favorite food.
The subfaction veto only applies to faction level policy. The faction veto is decided by pure democracy within the faction.
I would guess in most scenarios most subfactions would agree on when to use the faction veto. E.g. all the Southern states didn’t want to end slavery.
Thanks, this is really useful.
Do you have any particular examples as evidence of this? This is something I’ve been thinking a lot about for AI and I’m quite uncertain. It seems that ~0% of advocacy campaigns have good epistemics, so it’s hard to have evidence about this. Emotional appeals are important and often hard to reconcile with intellectual honesty.
Of course there are different standards for good epistemics, and it’s probably bad to outright lie or be highly misleading. But by EA standards of “good epistemics” it seems less clear whether the benefits are worth the costs.
As one example, the AI Safety movement may want to partner with advocacy groups who care about AI using copyrighted data or unions concerned about jobs. But these groups basically always have terrible epistemics and partnering usually requires some level of endorsement of their positions.
As an even more extreme example, as far as I can tell about 99.9% of people have terrible epistemics by LessWrong standards so to even expand to a decently sized movement you will have to fill the ranks with people who will constantly say and think things that you think are wrong.