Also, it appears that the two diagrams in the Frankenstein Rule section differ in their d-separation of (x_1 \indep x_4 | x_5) (which doesn’t hold in the left one), so these are not actually equivalent (we can’t have an underlying distribution satisfy both of these diagrams).
David Reber
The theorems in this post all say something like “if the distribution (approximately) factors according to <some DAGs>, then it also (approximately) factors according to <some other DAGs>”
So one motivating research question might be phrased as “Probability distributions have an equivalence class of Bayes nets / causal diagrams which are all compatible. But what is the structure within a given equivalence class? In particular, if we have a representative Bayes net of an equivalence class, how might we algorithmically generate other Bayes nets in that equivalence class?”
Could you clarify how this relates to e.g. the PC (Peter-Clark) or FCI (Fast Causal Inference) algorithms for causal structure learning?
Like, are you making different assumptions (than e.g. minimality, faithfulness, etc)?
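For concreteness, here is a minimal sketch of the classical notion of an equivalence class that the question above has in mind: two DAGs are Markov equivalent exactly when they share the same skeleton and the same v-structures (the Verma and Pearl criterion). This may not be the exact notion the post's theorems use, and the dictionary encoding and the example graphs are assumptions made purely for illustration.

```python
from itertools import combinations

# A DAG is encoded as {node: set_of_parents}; every node must appear as a key.

def skeleton(dag):
    """Undirected edge set of the DAG."""
    return {frozenset((p, c)) for c, parents in dag.items() for p in parents}

def v_structures(dag):
    """Colliders a -> c <- b whose endpoints a and b are not adjacent."""
    skel = skeleton(dag)
    found = set()
    for c, parents in dag.items():
        for a, b in combinations(sorted(parents), 2):
            if frozenset((a, b)) not in skel:
                found.add((a, c, b))
    return found

def markov_equivalent(d1, d2):
    """Same nodes, same skeleton, same v-structures."""
    return (set(d1) == set(d2)
            and skeleton(d1) == skeleton(d2)
            and v_structures(d1) == v_structures(d2))

# Hypothetical three-node examples.
chain = {"x1": set(), "x2": {"x1"}, "x3": {"x2"}}           # x1 -> x2 -> x3
reversed_chain = {"x3": set(), "x2": {"x3"}, "x1": {"x2"}}  # x1 <- x2 <- x3
collider = {"x1": set(), "x3": set(), "x2": {"x1", "x3"}}   # x1 -> x2 <- x3

print(markov_equivalent(chain, reversed_chain))  # same class
print(markov_equivalent(chain, collider))        # different class
```

Under this criterion, reversing any edge that neither creates nor destroys a v-structure (and keeps the graph acyclic) stays inside the class, which is one simple way to generate other members from a representative.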
So the contributions of vnm theory are shrunken down into “intention”?
(Background: I consider myself fairly well-read w.r.t. causal incentives, not very familiar with VNM theory, and well-versed in Pearlian causality. I have gotten a sneak peek at this sequence so have a good sense of what’s coming)
I’m not sure I understand VNM theory, but I would suspect the relationship is more like “VNM theory and <this agenda> are two takes on how to reason about the behavior of agents, and they both refer to utilities and Bayesian networks, but have important differences in their problem statements (and hence, in their motivations, methodologies, exact assumptions they make, etc)”.
I’m not terribly confident in that appraisal at the moment, but perhaps it helps explain my guess for the next question:
Will you recapitulate that sort of framing (such as involving the interplay between total orders and real numbers)
Based on my (decent?) level of familiarity with the causal incentives research, I don’t think there will be anything like this. Just because two research agendas use a few of the same tools doesn’t mean they’re answering the same research questions, let alone sharing methodologies.
...or are you feeling more like it’s totally wrong and should be thrown out?
When two different research agendas are distinct enough (as I suspect VNM and this causal-framing-of-AGI-safety are), their respective success/failures are quite independent. In particular, I don’t think the authors’ choice to pursue this research direction over the last few years should be taken by itself as a strong commentary on VNM.
But maybe I didn’t fully understand your comment, since I haven’t read up on VNM.
Distinguish two types of shutdown goals: temporary and permanent. These types of goals may differ with respect to entrenchment. AGIs that seek temporary shutdown may be incentivized to protect themselves during their temporary shutdown. Before shutting down, the AGI might set up cyber defenses that prevent humans from permanently disabling it while ‘asleep’. This is especially pressing if the AGI has a secondary goal, like paperclip manufacturing. In that case, protection from permanent disablement increases its expected goal satisfaction. On the other hand, AGIs that desire permanent shutdown may be less incentivized to entrench.
It seems like an AGI built to desire permanent shutdown may have an incentive to permanently disempower humanity, then shut down. Otherwise, there’s a small chance that humanity may revive the AGI, right?
Another related work: Concept Algebra for Text-Controlled Vision Models (Disclosure: while I did not author this paper, I am in the PhD lab that did, under Victor Veitch at UChicago. Any mistakes made in this comment are my own). We haven’t prioritized a blog post about the paper so it makes sense that this community isn’t familiar with it.
The concept algebra paper demonstrates that for text-to-image models like Stable Diffusion, there exist linear subspaces in the score embedding space, on which you can do the same manner of concept editing/control as with word2vec.
Importantly, the paper comes with some theoretical investigation into why this might be the case, including articulating necessary assumptions/conditions (which this purely-empirical post does not).
I conjecture that the reason that <some activation additions in this post fail to have the desired effect> may be because they violate some conditions analogous to those in Concept Algebra: it feels a bit like déjà vu to look at section E.1 in the appendix, which shows some empirical results that fail to act as expected when the conditions of completeness and causal separability don’t hold.
Also, just to make sure we share a common understanding of Schölkopf 2021: Wouldn’t you agree that asking “how do we do causality when we don’t even know at what level of abstraction to define causal variables?” is beyond the “usual pearl causality story” as usually summarized in FFS posts? It certainly goes beyond Pearl’s well-known works.
I don’t think my claim is that “FFS is already subsumed by work in academia”: as I acknowledge, FFS is a different theoretical framework than Pearl-based causality. I view them as two distinct approaches, but my claim is that they are motivated by the same question (that is, how to do causal representation learning).
It was intentional that the linked paper is an intro survey paper to the Pearl-ish approach to causal rep. learning: I mean to indicate that there are already lots of academic researchers studying the question “what does it mean to study causality if we don’t have pre-defined variables?”
It may be that FFS ends up contributing novel insights above and beyond <Pearl-based causal rep. learning>, but a priori I expect this to occur only if FFS researchers are familiar with the existing literature, which I haven’t seen mentioned in any FFS posts.
My line of thinking is: It’s hard to improve on a field you aren’t familiar with. If you’re ignorant of the work of hundreds of other researchers who are trying to answer the same underlying question you are, odds are against your insights being novel / neglected.
Tho as a counterpoint, maybe Auto-GPT presents some opportunities to empirically test the IDA proposal? To have a decent experiment, you would need a good metric for alignment (does that exist?) and demonstrate that as you implement IDA using Auto-GPT, your metric is at least maintained, even as capabilities improve on the newer models.
I’m overall skeptical of my particular proposal however, because 1. I’m not aware of any well-rounded “alignment” metrics, and 2. you’d need to be confident that you can scale it up without losing control (because if the experiment fails, then by definition you’ve developed a more powerful AI which is less aligned).
But it’s plausible to me that someone could find some good use for Auto-GPT for alignment research, now that it has been developed. It’s just not clear to me how you would do so in a net-positive way.
To clarify, here I’m not taking a stance on whether IDA should be central to alignment or not, simply claiming that unless you have a crux of “whether or not recursive improvement is easy to do” as the limiting factor for IDA being a good alignment strategy, your assessment of IDA should probably stay largely unchanged.
My understanding of Auto-GPT is that it strings together many GPT-4 requests, while notably also giving it access to memory and the internet. Empirically, this allocation of resources and looping seems promising for solving complex tasks, such as debugging the code of Auto-GPT itself. (For those interested, this paper discusses how looped transformers can serve as general-purpose computers).
But to my ears, that just sounds like an update of the form “GPT can do many tasks well”, not in the form of “Aligned oversight is tractable”. Put another way, Auto-GPT sounds like evidence for capabilities, not evidence for the ease of scalable oversight. The question of whether human values can be propagated up through increasingly amplified models seems separate from the ability to improve self-recursively, in the same way that capabilities-progress is distinct from alignment-progress.
Strongly upvoting this for being a thorough and carefully cited explanation of how the safety/alignment community doesn’t engage enough with relevant literature from the broader field, likely at the cost of reduplicated work, suboptimal research directions, and less exchange and diffusion of important safety-relevant ideas
Ditto. I’ve recently started moving into interpretability / explainability and spent the past week skimming the broader literature on XAI, so the timing of this carefully cited post is quite impactful for me.
I see similar things happening with causality generally, where it seems to me that (as a 1st order heuristic) much of alignment forum’s reference for causality is frozen at Pearl’s 2008 textbook, missing what I consider to be most of the valuable recent contributions and expansions in the field.
Example: Finite Factored Sets seems to be reinventing causal representation learning [for a good intro, see Schölkopf 2021], where it seems to me that the broader field is outpacing FFS on its own goals. FFS promises some theoretical gains (apparently to infer causality where Pearl-esque frameworks can’t) but I’m no longer as sure about the validity of this.
Counterexample(s): the Causal Incentives Working Group, and David Krueger’s lab, for instance. Notably these are embedded in academia, where there’s more culture (incentive) to thoroughly relate to previous work. (These aren’t the only ones, just 2 that came to mind.)
A few thoughts:
This seems like a good angle for how to bridge AI safety with a number of disciplines
I appreciated the effort to cite peer-reviewed sources and provide search terms that can be looked into further
While I’m still parsing the full validity/relevance of the concrete agendas suggested, they do seem to fit the form of “what relevance is there from established fields” without diluting the original AI safety motivations too much
Overall, it’s quite long, and I would very much like to see a distilled version (say, 1⁄5 the length).
(but that’s just a moderate signal from someone who was already interested, yet still nearly bounced off)
Under the “reward as selection” framing, I find the behaviour much less confusing:
We use reward to select for actions that led to the agent reaching the coin.
This selects for models implementing the algorithm “move towards the coin”.
However, it also selects for models implementing the algorithm “always move to the right”.
It should therefore not be surprising you can end up with an agent that always moves to the right and not necessarily towards the coin.
I’ve been reconsidering the coin run example as well recently from a causal perspective, and your articulation helped me crystallize my thoughts. Building on these points above, it seems clear that the core issue is one of causal confusion: that is, the true causal model M is “move right” → “get the coin” → “get reward”. However, if the variable of “did you get the coin” is effectively latent (because the model selection doesn’t discriminate on this variable) then the causal model M is indistinguishable from M’ which is “move right” → “get reward” (which, though it is not the true causal model governing the system, generates the same observational distribution).
In fact, the incorrect model M’ actually has shorter description length, so it may be that here there is a bias against learning the true causal model. If so, I believe we have a compelling explanation for the coin runner phenomenon which does not require the existence of a mesa optimizer, and which does indicate we should be more concerned about causal confusion.
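To make the indistinguishability claim concrete, here is a small sketch with made-up numbers. It assumes binary variables A (“moved right”), C (“got the coin”) and R (“got reward”), a training environment where the coin always sits at the right end, and a hypothetical shifted environment where the coin is collected half the time regardless of direction. None of these probabilities come from the original experiment.

```python
from fractions import Fraction as F

# Assumed behaviour distribution over A (1 = moved right).
P_A = {1: F(1, 2), 0: F(1, 2)}

def joint_M(p_coin_given_a):
    """Observable joint P(a, r) under M: A -> C -> R, with C summed out."""
    joint = {}
    for a in (0, 1):
        for r in (0, 1):
            total = F(0)
            for c in (0, 1):
                p_c = p_coin_given_a[a] if c == 1 else 1 - p_coin_given_a[a]
                p_r = F(1) if r == c else F(0)  # reward iff the coin was collected
                total += P_A[a] * p_c * p_r
            joint[(a, r)] = total
    return joint

def joint_M_prime(p_reward_given_a):
    """Observable joint P(a, r) under M': A -> R, with no coin variable."""
    return {
        (a, r): P_A[a] * (p_reward_given_a[a] if r == 1 else 1 - p_reward_given_a[a])
        for a in (0, 1) for r in (0, 1)
    }

train_coin = {1: F(1), 0: F(0)}          # training: moving right always gets the coin
m_prime = {1: F(1), 0: F(0)}             # M' fitted to the training data
shift_coin = {1: F(1, 2), 0: F(1, 2)}    # hypothetical shift: coin placed at random

print(joint_M(train_coin) == joint_M_prime(m_prime))  # identical on training data
print(joint_M(shift_coin) == joint_M_prime(m_prime))  # they come apart after the shift
```

Note that M’ needs only one conditional table, P(R | A), while M needs two, P(C | A) and P(R | C), which is one crude way to read the shorter-description-length remark.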
I’m also working on extending the framework to the infinite setting and am almost finished except for conditional orthogonality for uncountable sets.
Hmm, what would be the intuition/application behind the uncountable setting? Like, when would one want that (I don’t mind if it’s niche, I’m just struggling to come up with anything)?
[Question] Using Finite Factored Sets for Causal Representation Learning?
I’d be interested in seeing other matrix factorizations explored as well. Specifically, I would recommend trying nonnegative matrix factorizations: to quote the Wikipedia article:
This non-negativity makes the resulting matrices easier to inspect. Also, in applications such as processing of audio spectrograms or muscular activity, non-negativity is inherent to the data being considered.
The added constraint may help eliminate spurious patterns: for instance, I suspect the positive/negative singular value distinction might be a red herring (based on past projects I’ve worked on).
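As a rough sketch of what this could look like, here is a dependency-free NMF using the standard Lee and Seung multiplicative updates for the squared Frobenius error. It assumes the matrix being factorized is entry-wise nonnegative (which is not automatic for, say, raw weight matrices); the toy matrix `V`, the rank, and the step count are all placeholders.

```python
import random

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def frob_error(V, W, H):
    """Squared Frobenius norm of V - WH."""
    WH = matmul(W, H)
    return sum((v - wh) ** 2 for rv, rwh in zip(V, WH) for v, wh in zip(rv, rwh))

def nmf(V, rank, steps=200, eps=1e-9, seed=0):
    """Factor nonnegative V (n x m) into nonnegative W (n x rank) and H (rank x m)."""
    rng = random.Random(seed)
    n, m = len(V), len(V[0])
    W = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(n)]
    H = [[rng.random() + 0.1 for _ in range(m)] for _ in range(rank)]
    for _ in range(steps):
        # H <- H * (W^T V) / (W^T W H)
        Wt = transpose(W)
        num, den = matmul(Wt, V), matmul(matmul(Wt, W), H)
        H = [[h * a / (b + eps) for h, a, b in zip(rh, ra, rb)]
             for rh, ra, rb in zip(H, num, den)]
        # W <- W * (V H^T) / (W H H^T)
        Ht = transpose(H)
        num, den = matmul(V, Ht), matmul(W, matmul(H, Ht))
        W = [[w * a / (b + eps) for w, a, b in zip(rw, ra, rb)]
             for rw, ra, rb in zip(W, num, den)]
    return W, H

# Toy nonnegative matrix (placeholder data).
V = [[1.0, 2.0, 3.0],
     [2.0, 4.0, 6.0],
     [1.0, 0.0, 1.0],
     [3.0, 2.0, 5.0]]

W0, H0 = nmf(V, rank=2, steps=0)   # the random initialisation
W, H = nmf(V, rank=2)              # after the multiplicative updates
print(frob_error(V, W0, H0), frob_error(V, W, H))
```

Because the updates only ever multiply by nonnegative ratios, every entry of `W` and `H` stays nonnegative, so each factor can be read as an additive part rather than a signed direction.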
I second this, that it’s difficult to summarize AI-safety-relevant academic work for LW audiences. I want to highlight the symmetric difficulty of trying to summarize the mountain of blog-post-style work on the AF for academics.
In short, both groups have steep reading/learning curves that are under-appreciated when you’re already familiar with it all.
Anecdotally, I’ve found the same said of Less Wrong / Alignment Forum posts among AI safety / EA academics: that it amounts to an echo chamber that no one else reads.
I suspect both communities are taking their collective lack of familiarity with the other as evidence that the other community isn’t doing their part to disseminate their ideas properly. Of course, neither community seems particularly interested in taking the time to read up on the other, and seems to think that the other community should simply mimic their example (LWers want more LW synopses of academic papers, academics want AF work to be published in journals).
Personally I think this is symptomatic of a larger camp-ish divide between the two, which is worth trying to bridge.
Ah, that’s right. Thanks, that example is quite clarifying!