I wish you would realize that whatever we’re looking at, it isn’t people not realizing this.
TsviBT (Tsvi Benson-Tilsen)
Look… Consider the hypothetically possible situation that in fact everyone is very far from being on the right track, and everything everyone is doing doesn’t help with the right track and isn’t on track to get on the right track or to help with the right track.
Ok, so I’m telling you that this hypothetically possible situation seems to me like the reality. And then you’re, I don’t know, trying to retreat to some sort of agreeable live-and-let-live stance, or something, where we all just agree that due to model uncertainty and the fact that people have vaguely plausible stories for how their thing might possibly be helpful, everyone should do their own thing and it’s not helpful to try to say that some big swath of research is doomed? If this is what’s happening, then I think that what you in particular are doing here is a bad thing to do here.
Maybe we can have a phone call if you’d like to discuss further.
Doomed to irrelevance, or doomed to not being a complete solution in and of itself?
Doomed to not be trying to go to and then climb the mountain.
my brain is a dirty lying liar that lies to me at every opportunity
So then it isn’t easy. But it’s feedback. Also there’s not that much distinction between making a philosophically rigorous argument and “doing introspection” in the sense I mean, so if you think the former is feasible, work from there.
Is there a particular reason you expect there to be exactly one hard part of the problem,
Have you stopped beating your wife? I say “the” here in the sense of like “the problem of climbing that mountain over there”. If you’re far away, it makes sense to talk about “the (thing over there)”, even if, when you’re up close, there’s multiple routes, multiple summits, multiple sorts of needed equipment, multiple sources of risk, etc.
and for the part that ends up being hardest in the end to be the part that looks hardest to us now?
We make an argument like “any solution would have to address X” or “anything with feature Y does not do Z” or “property W is impossible”, and then we can see what a given piece of research is and is not doing / how it is doomed to irrelevance. It’s not like pointing to a little ball in ideaspace and being like “the answer is somewhere in here”. Rather it’s like cutting out a halfspace and saying “everything on this side of this plane is doomed, we’d have to be somewhere in the other half”, or like pointing out a manifold that all research is on and saying “anything on this manifold is doomed, we’d have to figure out how to move somewhat orthogonalward”.
research that stemmed from someone trying something extremely simple and getting an unexpected result
I agree IF we are looking at the objects in question. If LLMs were minds, the research would be much more relevant. (I don’t care if you have an army of people who all agree on taking a stance that seems to imply that there’s not much relevant difference between LLMs and future AGI systems that might kill everyone.)
What is your preferred method for getting feedback from reality on whether your theory describes the world as it is?
I think you (and everyone else) don’t know how to ask this question properly. For example, “on whether your theory describes the world as it is” is a too-narrow idea of what our thoughts about minds are supposed to be. Sub-example: our thoughts about mind are supposed to also produce design ideas.
To answer your question: by looking at and thinking about minds. The only minds that currently exist are humans, and the best access you have to minds is introspection. (I don’t mean meditation, I mean thinking and also thinking about thinking/wanting/acting—aka some kinds of philosophy and math.)
I broadly agree with this. (And David was like .7 out of the 1.5 profs on the list who I guessed might genuinely want to grant the needed freedom.)
I do think that people might do good related work in math (specifically, probability/information theory, logic, etc.--stuff about formalized reasoning), philosophy (of mind), and possibly in other places such as theoretical linguistics. But this would require that the academic context is conducive to good novel work in the field (a lower bar, but probably still far from universally met); and would require the researcher to have good taste. And this is “related” in the sense of “might write a paper which leads to another paper which would be cited by [the alignment textbook from the future] for proofs/analogies/evidence about minds”.
I don’t speak for MIRI, but broadly I think MIRI thinks that roughly no existing research is hopeworthy, and that this isn’t likely to change soon. I think that, anyway.
In discussions like this one, I’m conditioning on something like “it’s worth it, these days, to directly try to solve AGI alignment”. That seems assumed in the post, seems assumed in lots of these discussions, seems assumed by lots of funders, and it’s why above I wrote “the main direct help we can give to AGI alignment” rather than something stronger like “the main help (simpliciter) we can give to AGI alignment” or “the main way we can decrease X-risk”.
If, hypothetically, we were doing MI on minds, then I would predict that MI will pick some low-hanging fruit and then hit walls where its methods stop working, and it will be more difficult to develop new methods that work. The new methods that work will look more and more like reflecting on one’s own thinking, discovering new ways of understanding one’s own thinking, and then going and looking for something like that in the in-vitro mind. IDK how far that could go. But then this will completely grind to a halt when the IVM is coming up with concepts and ways of thinking that are novel to humanity. Some other approach would be needed to learn new ideas from a mind via MI.
However, another dealbreaker problem with current and current-trajectory MI is that it isn’t studying minds.
From the section you linked:
Moreover, the program guarantees at least some mentorship from your supervisor. Your advisor’s incentives are reasonably aligned with yours: they get judged by your success in general, so want to see you publish well-recognized first-author research, land a top research job after graduation and generally make a name for yourself (and by extension, them).
Doing a PhD also pushes you to learn how to communicate with the broader ML research community. The “publish or perish” imperative means you’ll get good at writing conference papers and defending your work.
These would be exactly the “anyone around them” about whose opinion they would have to not give a fuck.
I don’t know a good way to do this, but maybe a pointer would be: funders should explicitly state something to the effect of:
“The purpose of this PhD funding is to find new approaches to core problems in AGI alignment. Success in this goal can’t be judged by an existing academic structure (journals, conferences, peer-review, professors) because there does not exist such a structure aimed at the core problems in AGI alignment. You may if you wish make it a major goal of yours to produce output that is well-received by some group in academia, but be aware that this goal would be non-overlapping with the purpose of this PhD funding.”
The Vitalik fellowship says:
To be eligible, applicants should either be graduate students or be applying to PhD programs. Funding is conditional on being accepted to a PhD program, working on AI existential safety research, and having an advisor who can confirm to us that they will support the student’s work on AI existential safety research.
Despite being an extremely reasonable (even necessary) requirement, this is already a major problem according to me. The problem is that (IIUC—not sure) academics are incentivized to, basically, be dishonest, if it gets them funding for projects / students. Of the ~dozen professors here (https://futureoflife.org/about-us/our-people/ai-existential-safety-community/) who I’m at least a tiny bit familiar with, I think maybe 1.5ish are actually going to happily support actually-exploratory PhD students. I could be wrong about this though—curious for more data either way. And how many will successfully communicate, to the sort of person who would take a real shot at exploratory conceptual research if given the opportunity, that they would in fact support such research? I don’t know. Zero? One? And how would someone sent to the FLI page know of the existence of that professor?
Fellows are expected to participate in annual workshops and other activities that will be organized to help them interact and network with other researchers in the field.
Continued funding is contingent on continued eligibility, demonstrated by submitting a brief (~1 page) progress report by July 1st of each year.
Again, reasonable, but… Needs more clarity on what is expected, and what is not expected.
a technical specification of the proposed research
What does this even mean? This webpage doesn’t get it. We’re trying to buy something that isn’t something someone can already write a technical specification of.
That would only work for people with the capacity to not give a fuck what anyone around them thinks, especially including the person funding and advising them. And that’s arguably unethical depending on context.
the doppelganger problem is a fairly standard criticism of the sparse autoencoder work,
And what’s the response to the criticism, or a/the hoped-for approach?
diasystemic novelty seems the kind of thing you’d encounter when doing developmental interpretability, interp-through-time
Yeah, this makes sense. And hey, maybe it will lead to good stuff. Any results so far, that I might consider approaching some core alignment difficulties?
it seems the kind of thing which would come from the study of in-context learning, a goal that I believe mainstream MI has, even if it doesn’t focus on it now (likely because it believes it’s unable to at this moment), and which I think it will care more about as the power of such in-context learning becomes more and more apparent.
Also makes some sense (though the ex quo, insofar as we even want to attribute this to current systems, is distributed across the training algorithms and the architecture sources, as well as inference-time stuff).
Generally what you’re bringing up sounds like “yes these are problems and MI would like to think about them… later”. Which is understandable, but yeah, that’s what streetlighting looks like.
Maybe an implicit justification of current work is like:
There’s these more important, more difficult problems. We want to deal with them, but they are too hard right now, so we will try in the future. Right now we’ll deal with simpler things. By dealing with simpler things, we’ll build up knowledge, skills, tools, and surrounding/supporting orientation (e.g. explaining weird phenomena that are actually due to already-understandable stuff, so that later we don’t get distracted). This will make it easier to deal with the hard stuff in the future.
This makes a lot of sense—it’s both empathizandable, and seems probably somewhat true. However:
Again, it still isn’t in fact currently addressing the hard parts. We want to keep straight the difference between [currently addressing] vs. [arguably might address in the future].
We gotta think about what sort of thing would possibly ever work. We gotta think about this now, as much as possible.
A core motivating intuition behind the MI program is (I think) “the stuff is all there, perfectly accessible programmatically, we just have to learn to read it”. This intuition is deeply flawed; see “Koan: divining alien datastructures from RAM activations”.
Ah thanks!
In particular, I think it’s a bit harsh to solely apply the [genuinely new directions discovered] metric. Even when doing everything right, I expect the hit rate to be very low there, with [variation on current framing/approach] being the most common type of success.
Mhm. In fact I’d want to apply a bar that’s even lower, or at least different: [the extent to which the participants (as judged by more established alignment thinkers) seem to be well on the way to developing new promising directions—e.g. being relentlessly resourceful including at the meta-level; having both appropriate Babble and appropriate Prune; not shying away from the hard parts].
the setup were flexible enough that it didn’t seem likely to limit the potential of participants (by being less productive-in-the-sense-desired than counterfactual environments).
Agree that this is an issue, but I think it can be addressed—certainly at least well enough that there’d be worthwhile value-of-info in running such a thing.
I’d be happy to contribute a bit of effort, if someone else is taking the lead. I think most of my efforts will be directed elsewhere, but for example I’d be happy to think through what such a program should look like; help write justificatory parts of grant applications; and maybe mentor / similar.
Here’s the convo according to me:
Bloom:
I think there’s scope within a field like interp to focus on things that are closer to the hard part of the problem or at least touch on robust bottlenecks for alignment agendas
BT:
Object level: ontology identification, in the sense that is studied empirically, is pretty useless.
sname:
I haven’t seen anyone do such interpretability research yet but I see no particular reason to think this is the sort of thing that can’t be studied empirically rather than the sort of thing that hasn’t been studied empirically.
BT:
Well, empirically, when people try to study it empirically, instead they do something else
sname:
I don’t know that we have any empirical data on what happens when people try to study that particular empirical question (the specific relationship between the features learned by two models of different modalities) because I don’t know that anyone has set out to study that particular question in any serious way.
BT:
ah, sname is talking about conceptual Doppelgängers specifically, as ze indicated in a previous comment that I now understand
When I said “when people try to study it empirically”, what I meant was “when people try to do interpretability research (presumably, that is relevant to the hard part of the problem?)”.
“prosaic alignment is unlikely to be helpful, look at all of these empirical researchers who have not even answered these basic questions”
Right, I’m not saying exactly this. But I am saying:
Prosaic alignment is unlikely to be helpful, look at how they are starting in an extremely streetlighty way(*) and then, empirically, not pushing out into the dark quickly—and furthermore, AFAIK, not very concerned with how they aren’t pushing out into the dark quickly enough, or successfully addressing this at the meta level, though plausibly they’re doing that and I’m just not aware.
(*): studying LLMs, which are not minds; trying to recognize [stuff we mostly conceptually understand] within systems rather than trying to come to conceptually understand [the stuff we’d need to be able to recognize/design in a mind, in order to determine the mind’s effects].
(I think this would be worth doing if resources were unlimited, not sure as things actually stand).
Well, you’ve agreed with a defanged version of my statements. The toothful version, which I do think: Insofar as this is even possible, we should allocate a lot more resources toward funding any high-caliber smart/creative/interesting/promising/motivated youngsters/newcomers who want to take a crack at independently approaching the core difficulties of AGI alignment, even if that means reallocating a lot of resources away from existing on-paradigm research.
Edit: to be even more explicit, what I’m trying to do in this thread is encourage thinking about ways one might collect empirical observations about non-”streetlit” topics. None of the topics are under the streetlight until someone builds the streetlight. “Build a streetlight” is sometimes an available action, but it only happens if someone makes a specific effort to do so.
This seems like a good thing to do. But there’s multiple ways that existing research is streetlit, and reality doesn’t owe it to you to make it be the case that there are nice (tractionful, feasible, interesting, empirical, familiar, non-weird-seeming, feedbacked, grounded, legible, consensusful) paths toward the important stuff. The absence of nice paths would really suck if it’s the case, and it’s hard to see how anyone could be justifiedly really confident that there are no nice paths. But yes, I’m saying that it looks like there aren’t nice paths, or at least there aren’t enough nice paths that we seem likely to find them by continuing to sample from the same distribution we’ve been sampling from; and I have some arguments and reasons supporting this belief, which seem true; and I would guess that a substantial fraction (though not most) of current alignment researchers would agree with a fairly strong version of “very few or no nice paths”.
Could that be operationalized as a prediction of the form
If you train a model on a bunch of simple tasks involving both functional and object-oriented code (e.g. “predict the next token of the codebase”, “predict missing token”, “identify syntax errors”) and then train it on a complex task on only object-oriented code (e.g. “write a document describing how to use this library”), it will fail to navigate that ontological shift and will be unable to document functional code.
I don’t think that’s a good operationalization, as you predict. I think it’s trying to be an operationalization related to my claim above:
ontology identification, in the sense that is studied empirically, is pretty useless. It [..] AFAIK isn’t trying to [...] at all handle diasystemic novelty [...].
But it sort of sounds like you’re trying to extract a prediction about capability generalization or something? Anyway, an interp-like study trying to handle diasystemic novelty might for example try to predict large-scale explicitization events before they happen—maybe in a way that’s robust to “drop out”. E.g. you have a mind that doesn’t explicitly understand Bayesian reasoning; but it is engaging in lots of activities that would naturally induce small-world probabilistic reasoning, e.g. gambling games or predicting-in-distribution simple physical systems; and then your interpreter’s job is to notice, maybe only given access to restricted parts (in time or space, say) of the mind’s internals, that Bayesian reasoning is (implicitly) on the rise in many places. (This is still easy mode if the interpreter gets to understand Bayesian reasoning explicitly beforehand.) I don’t necessarily recommend this sort of study, though; I favor theory.
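For concreteness, here’s a minimal toy sketch of that “easy mode” setup (hedged: this is my own illustrative construction, not a study anyone here has proposed in detail or run). The biased-coin task, the small GRU standing in for the “mind”, and the linear probe are all assumptions chosen for illustration; the point is only that the interpreter looks for implicit Bayesian posterior-tracking that the training signal never made explicit.

```python
import torch
import torch.nn as nn

def make_batch(n_ep=256, T=40):
    # Each episode: a coin with latent bias p, observed for T flips.
    p = torch.rand(n_ep, 1)
    flips = (torch.rand(n_ep, T) < p).float()
    heads = flips.cumsum(dim=1)
    trials = torch.arange(1, T + 1).float()
    # Bayesian posterior mean over the bias, under a uniform Beta(1,1) prior.
    posterior = (heads + 1) / (trials + 2)
    return flips, posterior

class Mind(nn.Module):
    # A stand-in "mind": a small GRU trained only to predict the next flip.
    def __init__(self, h=32):
        super().__init__()
        self.gru = nn.GRU(1, h, batch_first=True)
        self.out = nn.Linear(h, 1)
    def forward(self, flips):
        hs, _ = self.gru(flips.unsqueeze(-1))
        return torch.sigmoid(self.out(hs)).squeeze(-1), hs

def probe_r2(hs, posterior):
    # Linear probe: how much of the Bayesian posterior is (implicitly) encoded
    # in the hidden states, even though it was never supervised?
    X = hs.reshape(-1, hs.shape[-1]).detach()
    y = posterior.reshape(-1, 1)
    w = torch.linalg.lstsq(X, y).solution
    resid = ((X @ w - y) ** 2).mean()
    return (1 - resid / y.var()).item()

mind = Mind()
opt = torch.optim.Adam(mind.parameters(), lr=1e-2)
for step in range(301):
    flips, posterior = make_batch()
    pred, hs = mind(flips)
    # Training signal is only "predict the next flip"; nothing Bayesian is explicit.
    loss = nn.functional.binary_cross_entropy(pred[:, :-1], flips[:, 1:])
    opt.zero_grad(); loss.backward(); opt.step()
    if step % 100 == 0:
        print(f"step {step}: posterior-probe R^2 = {probe_r2(hs, posterior):.3f}")
```

If the probe’s fit climbs across checkpoints while the loss only ever mentions next-flip prediction, that’s the kind of signal the interpreter would be hunting for; the hard mode starts when the analogue of “Bayesian reasoning” is a concept the interpreter doesn’t already have.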
I don’t know what you’re trying to do in this thread (e.g. what question you’re trying to answer).
Yeah that looks good, except that it takes an order of magnitude longer to get going on conceptual alignment directions. I’ll message Adam to hear what happened with that.
Ok I want to just lay out what I’m trying to do here, and why, because it could be based on false assumptions.
A main assumption I’m making, which totally could be false, is that your paragraph
Funders of independent researchers we’ve interviewed think that there are plenty of talented applicants, but would prefer more research proposals focused on relatively few existing promising research directions (e.g., Open Phil RFPs, MATS mentors’ agendas), rather than a profusion of speculative new agendas.
is generally representative of the entire landscape, with a few small-ish exceptions. In other words, I’m assuming that it’s pretty difficult for a young smart person to show up and say “hey, I want to spend 3 whole years thinking about this problem de novo, can I have one year’s salary and a reevaluation after 1 year for a renewal”.
A main assumption that motivates what I’m doing here, and that could be false, is:
Funders make decisions mostly by some combination of recommendations from people they trust. The trust might be personal, or might be based on accomplishments, or might be based on some arguments made by the trusted person to the funder—and, centrally, the trust is actually derived from a loose diffuse array of impressions coming from the community, broadly.
To make the assumption slightly more clear: The assumption says that it’s actually quite common, maybe even the single dominant way funders make decisions, for the causality of a decision to flow through literally thousands of little interactions, where the little interactions communicate “I think XYZ is Important/Unimportant”. And these aggregate up into a general sense of importance/unimportance, or something. And then funding decisions work with two filters:
The explicit reasoning about the details—is this person qualified, how much funding, what’s the feedback, who endorses it, etc etc.
The implicit filter of Un/Importance. This doesn’t get raised to attention usually. It’s just in the background.
And “fund a smart motivated youngster without a plan for 3 years with little evaluation” is “unimportant”. And this unimportance is implicitly but strongly reinforced by everyone talking about in-paradigm stuff. And the situation is self-reinforcing because youngsters mostly don’t try to do the thing, because there’s no narrative and no funding, and so it is actually true that there aren’t many smart motivated youngsters just waiting for some funding to do trailblazing.
If my assumptions are true, then IDK what to do about this but would say that at least
people should be aware of this situation, and
people should keep talking about this situation, especially in contexts where they are contributing to the loose diffuse array of impressions by contributing to framing about what AGI alignment needs.
Well, empirically, when people try to study it empirically, instead they do something else. Surely that’s empirical evidence that it can’t be studied empirically? (I’m a little bit trolling but also not.)
Ok I think you’re right. I didn’t know (at least, not well enough) that “talent needs” quasi-idiomatically means “sorts of people that an organization wants to hire”, and interpreted it to mean literally “needs (by anyone / the world) for skills / knowledge”.
I don’t buy the unwieldiness excuse; you could say “Hiring needs in on-paradigm technical AI safety”, for example. But me criticizing minutiae of the framing in this post doesn’t seem helpful. The main thing I want to communicate is that
the main direct help we can give to AGI alignment would go via novel ideas that would be considered off-paradigm; and therefore
high-caliber newcomers to the field should be strongly encouraged to try to do that; and
there’s strong emergent effects in the resource allocation (money, narrative attention, collaboration) of the field that strongly discourage newcomers from doing so and/or don’t attract newcomers who would do so.
IDK if there’s political support that would be helpful and that could be affected by people saying things to their representatives. But if so, then it would be helpful to have a short, clear, on-point letter that people can adapt to send to their representatives. Things I’d want to see in such a letter:
AGI, if created, would destroy all or nearly all human value.
We aren’t remotely on track to solving the technical problems that would need to be solved in order to build AGI without destroying all or nearly all human value.
Many researchers say they are trying to build AGI and/or doing research that materially contributes toward building AGI. None of those researchers has a plausible plan for making AGI that doesn’t destroy all or nearly all human value.
As your constituent, I don’t want all or nearly all human value to be destroyed.
Please start learning about this so that you can lend your political weight to proposals that would address existential risk from AGI.
This is more important to me than all other risks about AI combined.
Or something.