Personally I feel like the value from doing more non-Sequence LW posts is probably highest, since the Sequences already exist on Audible (you can get all books for a single credit), and my impression is that wiki tags wouldn’t generalise to audio format particularly well. One idea might be to have some kind of system where you can submit particular posts for consideration and/or vote on them, which could be (1) recent ones that weren’t otherwise going to be recorded, or (2) old non-Sequence classics like “ugh fields”.
I think the key point here is that we’re applying a linear transformation to move from neuron space into feature space. Sometimes neurons and features do coincide and you can actually attribute particular concepts to individual neurons, but unless the neurons form a privileged basis there’s no reason to expect this in general. We’re using “feature” here to mean a linear combination of neurons which represents some particular important and meaningful (and hopefully human-comprehensible) concept.
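To make the “linear combination of neurons” picture a bit more concrete, here’s a minimal numpy sketch (the dimensions and the random directions are purely illustrative, not taken from any real model):

```python
# Minimal sketch: a "feature" as a direction in neuron (activation) space.
import numpy as np

rng = np.random.default_rng(0)
neuron_acts = rng.normal(size=(100, 512))        # activations: 100 examples x 512 neurons

# A single feature is a linear combination of neurons, i.e. a direction in activation space.
feature_direction = rng.normal(size=512)
feature_direction /= np.linalg.norm(feature_direction)

# "Reading off" the feature = projecting the activations onto that direction.
feature_acts = neuron_acts @ feature_direction   # shape (100,)

# More generally, a linear map W takes us from neuron space into a feature space.
W = rng.normal(size=(512, 64))                   # 64 hypothetical feature directions
feature_space = neuron_acts @ W                  # shape (100, 64)
```

The point is just that the feature basis needn’t line up with the neuron basis: any direction (or linear map) over the activations is a candidate feature (or feature space).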
Probably the best explanation of this comes from John Wentworth’s recent AXRP podcast, and a few of his LW posts. To put it simply, modularity is important because modular systems are usually much more interpretable (case in point: evolution has produced highly modular designs, e.g. organs and organ systems, whereas genetic algorithms for electronic circuit design frequently fail to find modular designs, which makes them really hard for humans to interpret and to verify that they’ll work as expected). If we understood a bit more about the factors that select for modularity across a wide range of situations (e.g. evolutionary selection, or the selection done by standard ML training), then we might be able to use those factors to encourage more modular designs. On a more abstract level, it might help us break down fuzzy statements like “certain types of inner optimisers have separate world models and models of the objective”, which are really statements about modules within a system. But in order to do any of this, we need a robust measure of modularity, and basically there isn’t one at present.
This may not exactly answer the question, but I’m in a research group which is studying selection for modularity, and yesterday we published our fourth post, which discusses the importance of causality in developing a modularity metric.
TL;DR: if you want to measure the information exchanged within a network, you can’t just observe activations, because two completely separate branches of the network measuring the same thing will still have high mutual information even though they’re not communicating with each other (the input is a confounder for both of them). Instead, it seems like you’ll need to use do-calculus and counterfactuals.
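Here’s a toy illustration of that confounding point (this is not our actual metric, just a minimal sketch of why observation alone misleads):

```python
# Two "branches" that never communicate still have high mutual information,
# because the shared input confounds both of them.
import numpy as np

rng = np.random.default_rng(0)
x = rng.integers(0, 2, size=100_000)   # shared input (the confounder)
a = x.copy()                           # branch A just copies the input
b = x.copy()                           # branch B independently copies the input

# Observationally, A and B are perfectly correlated (high mutual information),
# even though no information flows between them.
print(np.corrcoef(a, b)[0, 1])         # ~1.0

# Interventionally, do(A := random noise): B doesn't budge, revealing that
# A and B aren't actually communicating.
a_do = rng.integers(0, 2, size=100_000)
print(np.corrcoef(a_do, b)[0, 1])      # ~0.0
```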
We haven’t actually started testing out our measure yet, so this is currently only at the theorising stage, and hence may not be a very satisfying answer to the question.
I guess another point here is that we don’t yet know how different our results would be when sampling from the training distribution versus just running the network on random noise and then intervening on neurons; this would be an interesting thing to test experimentally. If they’re very similar, that neatly sidesteps the problem of deciding which one is more “natural”, and if they’re very different then that’s also interesting.
Yeah, I think the key point here more generally (I might be getting this wrong) is that C represents some partial state of knowledge about X, i.e. macro-state rather than micro-state knowledge. In other words, it’s a (non-bijective) function of X. That’s why (b) is true and the equation holds.
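To spell that out a little (this is just my own minimal formalisation of the “function of X” point, not necessarily the exact equation from the post): if C is a deterministic function of X, then

$$C = f(X) \;\Rightarrow\; H(C \mid X) = 0 \;\Rightarrow\; I(X; C) = H(C) \le H(X),$$

with the inequality strict whenever f maps distinct values of X (each occurring with positive probability) to the same value of C, i.e. C captures some but not all of the information in X.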
A few of Scott Alexander’s blog posts (made into podcast episodes) are really good (he’s got a sequence summarising the late 2021 MIRI conversations; the Bio Anchors and Takeoff Speeds ones I found especially informative & comprehensible). These don’t make up the bulk of the content and aren’t super technical, but I thought I’d mention them anyway.
Yeah, I think this is Evan’s view. This is from his research agenda (I’m guessing you might have already seen this given your comment, but I’ll add it here for reference anyway in case others are interested):
I suspect we can in fact design transparency metrics that are robust to Goodharting when the only optimization pressure being applied to them is coming from SGD, but cease to be robust if the model itself starts actively trying to trick them.
And I think his view on deception arising through inner optimisation pressure is that this is something we’ll basically be powerless to deal with once it happens, so the only way to make sure it doesn’t happen is to chart a safe path through model space which never enters the deceptive region in the first place.
Okay I see, yep that makes sense to me (-:
Source: original, but motivated by trying to ground WFLL1-type scenarios in what we already experience in the modern world, so heavily based on this. Also the original idea came from reading Neel Nanda’s “Bird’s Eye View of AI Alignment—Threat Models”
Intended audience: mainly policymakers
A common problem in the modern world is when incentives don’t match up with value being produced for society. For instance, corporations have an incentive to profit-maximise, which can lead to producing value for consumers, but can also involve less ethical strategies such as underpaying workers, regulatory capture, or tax avoidance. Laws & regulations are designed to keep behaviour like this in check, and this works fairly well most of the time. Some reasons for this are: (1) people have limited time/intelligence/resources to find and exploit loopholes in the law, (2) people usually follow societal and moral norms even if they’re not explicitly represented in law, and (3) the pace of social and technological change has historically been slow enough for policymakers to adapt laws & regulations to new circumstances. However, advancements in artificial intelligence might destabilise this balance. To return to the previous example, an AI tasked with maximising profit might be able to find loopholes in laws that humans would miss, it would have no particular reason to pay attention to societal norms, and it might be improving and becoming integrated with society at a rate which makes it difficult for policy to keep pace. The more entrenched AI becomes in our society, the worse these problems will get.
Thanks for the post! I just wanted to clarify what concept you’re pointing to with use of the word “deception”.
From Evan’s definition in RFLO, deception needs to involve some internal modelling of the base objective & training process, and instrumentally optimising for the base objective. He’s clarified in other comments that he sees “deception” as only referring to inner alignment failures, not outer ones (because deception is defined in terms of the interaction between the model and the training process, without introducing humans into the picture). This doesn’t include situations like the first one, where the reward function is too underspecified to produce the behaviour we actually want (although it does produce behaviour that looks like what we want, unless we peer under the hood).
To put it another way, it seems like the way deception is used here refers to the general situation where “the AI has learnt to do something that humans will misunderstand / misinterpret, regardless of whether the AI actually has an internal representation of the base objective it’s being trained on and of the humans doing the training.”
In this situation, I don’t really know what the benefit is of putting these two scenarios into the same class, because they seem pretty different. My intuitions about this might be wrong though. Also I guess this is getting into the inner/outer alignment distinction which opens up quite a large can of worms!
Oh wow, I wish I’d come across that plugin previously, that’s awesome! Thanks a bunch (-:
Sorry for forgetting to reply to this at first!
There are two different ways I create code cards: one is in Jupyter notebooks, and one is the “normal way”, i.e. using the Anki editor. I’ve just created a GitHub repo describing the second one:
https://github.com/callummcdougall/anki_templates
Please let me know if there’s anything unclear here!
Thanks! Yeah, there is one add-on I use for tag management. It’s called Search and Replace Tags: basically, you can select a bunch of cards in the browser and press Ctrl+Alt+Shift+T to change their tags. When you press that, you get to choose any tag that’s possessed by at least one of the cards you’ve selected, and replace it with any other tag.
There are also built-in Anki features to add, delete, and clear unused tags (to find those, right-click on selected cards in the browser, and hover over “Notes”). I didn’t realise those existed for a long time, was pretty annoyed when I found them! XD
Hope this helps!
It seems like an environment that changes might cause modularity. Though, aside from trying to make something modular, it seems like it could potentially fall out of stuff like ‘we want something that’s easier to train’.
This seems really interesting in the biological context, and not something we discussed much in the other post. For instance, if you had two organisms, one modular and one not, then even if there’s currently no selection advantage for the modular one, it might just train much faster and hence be more likely to hit on a good solution before the non-modular one (i.e. just because it’s searching over parameter space at a faster rate).
Reasoning: Training independent parts to each perform some specific sub-calculation should be easier than training the whole system at once.
Since I’ve not been involved in this discussion for as long, I’ll probably miss some subtlety here, but my immediate reaction is that “easier” might depend on your perspective: if you’re explicitly enforcing modularity in the architecture (e.g. see the “Direct selection for modularity” section of our other post) then I agree it would be a lot easier, but whether modular systems are selected for when they’re being trained on factorisable tasks is kinda the whole question. Since sections of biological networks do sometimes evolve completely in isolation from each other (because they’re literally physically separate), it does seem plausible that something like this is happening, but it doesn’t really move us closer to a gears-level model for what’s causing modularity to be selected for in the first place. I imagine I’m misunderstanding something here though.
So if module three is doing great, but module five is doing abysmally, and the answer depends on both being right, your loss is really bad. So the optimiser is going to happily modify three away from the optimum it doesn’t know it’s in.
Maybe one way to get around it is that the loss function might not just be a function of the final outputs of each subnetwork combined; it might also reward bits of subcomputation. E.g. to take a deep learning example which we’ve discussed, suppose you were training a CNN to calculate the sum of two MNIST digits, and you were hoping the CNN would develop a modular representation of these two digits plus an “adding function”: maybe the network could also be rewarded for the subtask of recognising the individual digits (I’ve sketched roughly what this might look like below)? It seems somewhat plausible to me that this kind of thing happens in biology, since otherwise there would be too many evolutionary hurdles to jump before you get a minimum viable product. As an example, the eye is a highly complex and modular structure, but the very first eyes were basically just photoreceptors that detected areas of bright light (making it easier to navigate in the water, and to hide from predators, I think). So at first the loss function wasn’t so picky as to only tolerate perfect image reconstructions of the organism’s surroundings; instead it simply rewarded good brightness-detection, which I think could today be regarded as one of the “factorised tasks” of vision (although I’m not sure about this).
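To make the MNIST-sum example concrete, here’s a rough sketch of what “also rewarding the subtask” could look like in code (purely illustrative; the architecture, head names and loss weighting are made up for this sketch rather than anything we’ve actually run):

```python
# Sketch: main loss on the digit sum, plus an auxiliary loss on recognising each digit.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoDigitSum(nn.Module):
    def __init__(self):
        super().__init__()
        # Shared encoder applied to each 28x28 digit image separately
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Flatten(), nn.Linear(32 * 7 * 7, 64), nn.ReLU(),
        )
        self.digit_head = nn.Linear(64, 10)   # auxiliary head: classify each digit
        self.sum_head = nn.Linear(128, 19)    # main head: predict the sum (0..18)

    def forward(self, x1, x2):
        h1, h2 = self.encoder(x1), self.encoder(x2)
        return self.digit_head(h1), self.digit_head(h2), self.sum_head(torch.cat([h1, h2], dim=-1))

def loss_fn(model, x1, x2, d1, d2, aux_weight=0.5):
    logits1, logits2, sum_logits = model(x1, x2)
    main = F.cross_entropy(sum_logits, d1 + d2)                          # reward the final task
    aux = F.cross_entropy(logits1, d1) + F.cross_entropy(logits2, d2)    # reward the subtask
    return main + aux_weight * aux
```

The hope would be that the auxiliary term plays the role of “grading good brightness-detection”: it rewards the factorised sub-computation directly, rather than only rewarding the fully assembled answer.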
Yep, thanks! I would imagine that if progress goes well on describing modularity in an information-theoretic sense, this might help with (2), because information entanglement between a single module and the output would be a good measure of “relevance” in some sense.
Thanks for the comment!
Subtle point: I believe the claim you’re drawing from was that it’s highly likely that the inputs to human values (i.e. the “things humans care about”) are natural abstractions.
To check that I understand the distinction between those two: inputs to human values are features of the environment on which our values are based. For example, the concept of liberty might be an important input to human values because the freedom to exercise your own will is a natural thing we would expect humans to want, whereas humans can differ greatly in things like (1) the metaethics of why liberty matters, and (2) the extent to which liberty should be traded off against other values, if indeed it can be traded off at all. People might disagree about interpretations of these concepts (especially across different cultures), but in a world where these weren’t natural abstractions, we might expect disagreement in the first place to be extremely hard, because the discussants wouldn’t even be operating on the same wavelength, i.e. they wouldn’t really have a set of shared concepts to structure their disagreements around.
One theme I notice throughout the “evidence” section is that it’s mostly starting from arguments that the NAH might not be true, then counterarguments, and sometimes counter-counterarguments.
Yeah, that’s a good point. I think partly that’s because my thinking about the NAH basically starts with “the inside view seems to support it, in the sense that the abstractions that I use seem natural to me”, and so from there I start thinking about whether this is a situation in which the inside view should be trusted, which leads to considering the validity of arguments against it (i.e. “am I just anthropomorphising?”).
However, to give a few specific reasons why I think it seems plausible, ones that don’t just rely on the inside view:
Humans were partly selected for their ability to act in the world to improve their situations. Since abstractions are all about finding good high-level models that describe things you might care about and how they interact with the rest of the world, it seems like there should have been competitive pressure for humans to find good abstractions. This argument doesn’t feel very contingent on the specifics of human cognition or on what our simplicity priors are; rather, the abstractions should be a function of the environment (hence convergence to the same abstractions by other cognitive systems which are also under competitive pressure, e.g. in the form of computational efficiency requirements, seems intuitive).
There’s lots of empirical evidence that seems to support it, at least at a weak level (e.g. CLIP as discussed in my post, or GPT-3 as mentioned by Rohin in his summary for the newsletter).
Returning to the clarification you made about the inputs to human values being the natural abstractions rather than the values themselves: it seems like the fact that different cultures can have a shared basis for disagreement might support some form of the NAH rather than arguing against it? I guess that point has a few caveats though, e.g. (1) all cultures have been shaped significantly by global factors like European imperialism, and (2) humans are all very close together in mind design space, so we’d expect something like this anyway, natural abstraction or not.
Yep those were both typos, fixed now, thanks!