I will spend time reading posts and papers, improving coding skills as needed to run and interpret experiments, learning math as needed for writing up proofs, talking with concept-based interpretability researchers as well as other conceptual alignment researchers
I feel like this is missing the bit where you write proofs, run and interpret experiments, etc.
I had thought that would be implicit in why I’m picking up those skills/that knowledge? I agree that it’s not great that I’m finding some of my initial ideas for things to do to be infeasible or unhelpful, such that I don’t feel like I have concrete theorems I want to try to prove here, or specific experiments I expect to want to run. I think a lot of next week is going to be reading up on natural latents/abstractions even more deeply than when I first learned about them, and trying to find somewhere a proof needs to go.
As a maximal goal, I might seek to test my theories about the detection of generalizable human values (like reciprocity and benevolence) by programming an alife simulation meant to test a toy-model version of agentic interaction and world-model agreement/interoperability through the fine-structure of the simulated agents.
Do you think you will be able to do this in the next 6 weeks? Might be worth scaling this down to “start a framework to test my theories” or something like that
Almost certainly this is way too ambitious for me to do, but I don’t know what “starting a framework” would look like. I guess I don’t have as full an understanding as I’d like of what MATS expects me to come up with/what’s in-bounds? I’d want to come up with a paper or something out of this but I’m also not confident in my ability to (for instance) fully specify the missing pieces of John’s model. Or even one of his missing pieces.
what does this mean?
I kept trying to rewrite this part and it kept coming out too long. Basically: I would want the alife agents to be able to definitely agree on spacetime nearness and the valuableness of some objects (like food), to be able to communicate (in some way?), and to have clusterer-powered ontologies that maybe even do something like have their initializations inherited by subsequent generations of the agents.
That said, as I’m about to say on another comment, that project is way too ambitious.
I think most people won’t know what this word means
Fixed, but I’m likely removing that part anyway.
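Even if that part gets scaled down, one possible shape for “start a framework to test my theories” might look like the sketch below: a minimal, purely illustrative toy (all class names and design choices here are stand-ins, not the proposal’s actual design) in which agents observe nearby objects as feature vectors, fit a small k-means “ontology” over what they’ve seen, communicate by exchanging cluster indices, and pass their centroids to offspring as initialization.

```python
# Minimal illustrative sketch of a toy alife "framework"; all names are hypothetical.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

def spawn_objects(n=200):
    # Each object: (x, y, nutrition); roughly 30% are high-nutrition "food".
    xy = rng.uniform(0, 10, size=(n, 2))
    nutrition = np.where(rng.random(n) < 0.3, rng.uniform(5, 10, n), rng.uniform(0, 1, n))
    return np.column_stack([xy, nutrition])

class Agent:
    def __init__(self, pos, inherited_centroids=None, k=3):
        self.pos = np.asarray(pos, dtype=float)
        self.k = k
        self.inherited = inherited_centroids  # parent's ontology, if any
        self.ontology = None

    def observe(self, objects, radius=3.0):
        # "Spacetime nearness": only objects within `radius` are visible.
        d = np.linalg.norm(objects[:, :2] - self.pos, axis=1)
        return objects[d < radius]

    def update_ontology(self, observations):
        # Fit a small clusterer over observed feature vectors; offspring start
        # from their parent's centroids instead of a fresh initialization.
        init = self.inherited if self.inherited is not None else "k-means++"
        n_init = 1 if self.inherited is not None else 10
        self.ontology = KMeans(n_clusters=self.k, init=init, n_init=n_init).fit(observations)

    def name_of(self, obj):
        # "Communication": agents exchange cluster indices for an object.
        return int(self.ontology.predict(obj.reshape(1, -1))[0])

    def reproduce(self, pos):
        # Offspring inherit the parent's centroids as their initialization.
        return Agent(pos, inherited_centroids=self.ontology.cluster_centers_, k=self.k)

objects = spawn_objects()
parent = Agent([5, 5]); parent.update_ontology(parent.observe(objects, radius=10.0))
child = parent.reproduce([6, 5]); child.update_ontology(child.observe(objects, radius=10.0))
food = objects[objects[:, 2] > 5][0]
# (cluster indices are arbitrary labels; a real agreement check would match centroids first)
print("parent calls it cluster", parent.name_of(food), "; child calls it cluster", child.name_of(food))
```

The point wouldn’t be the simulation itself so much as having a concrete object on which to ask whether two agents’ ontologies agree about the same food item.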
I plan to stress-test and further flesh out the theory, with a minimal goal of producing a writeup presenting results I’ve found and examining whether the assumptions of the toy models of the original post hold up as a way of examining Natural Abstractions as an alignment plan.
I feel like this doesn’t give me quite enough of an idea of what you’d be doing—like, what does “stress-testing” involve? What parts need fleshing out?
My problem here is that the sketched-out toy model in the post is badly badly underspecified. AFAIK John hasn’t, for instance, thought about whether a different clustering model might be a better pick, and the entire post is a subproblem of trying to figure out how interoperable world-models would have to work. “Stress-test” is definitely not the right word here. “Specify”? “Fill in”? “Sketch out”? “Guess at”? Kind of all of it needs fleshing out.
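As one small, concrete way to start filling that in, a rough sketch (synthetic data only, and assuming the relevant question is whether the choice of clusterer changes which concept boundaries get recovered) could just compare k-means against a Gaussian mixture on the same data and measure how much they disagree:

```python
# Synthetic comparison of two clustering models; not John's actual toy model.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture
from sklearn.metrics import adjusted_rand_score

# Four "concepts", two tight and two diffuse, so the clusterer choice can matter.
X, true_labels = make_blobs(n_samples=500, centers=4,
                            cluster_std=[0.5, 0.5, 2.0, 2.0], random_state=0)

kmeans_labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)
gmm_labels = GaussianMixture(n_components=4, random_state=0).fit_predict(X)

# If different clusterers disagree a lot on the same data, the "natural" concept
# boundaries in the toy model are sensitive to that modeling choice.
print("k-means vs truth:", adjusted_rand_score(true_labels, kmeans_labels))
print("GMM     vs truth:", adjusted_rand_score(true_labels, gmm_labels))
print("k-means vs GMM:  ", adjusted_rand_score(kmeans_labels, gmm_labels))
```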
Time-bounded: Are research activities and outputs time-bounded?
Does the proposal include a tentative timeline of planned activities (e.g., LTFF grant, scholar symposium talk, paper submission)?
How might the timeline change if planned research activities are unsuccessful?
This part is kind of missing—I’m seeing a big list of stuff you could do, but not an indication of how much of it you might reasonably expect to do in the next 5 weeks. A better approach here would be to give a list of things you could do in those 5 weeks together with estimates for how much time each thing could take, possibly with a side section of “here are other things I could do depending on how stuff goes”
This is helpful. I’m going to make a list of things I think I could get done in somewhere between a few days and like 2 weeks that I think would advance my desire to put together a more complete+rigorous theory of semantics.
I feel like this section is missing a sentence like “OK here’s the thing that would be the output of my project, and here’s how it would cause these good effects”
Yeah. I agree that it’s a huge problem that I can’t immediately point to what the output might be, or why it might cause something helpful downstream.
[probably I also put a timeline here of stuff I have done so far?]
This is valuable for you to do so that you can get a feel for what you can do in a week, but I’m not sure it’s actually that valuable to plop into the RP
Makes sense. That’s also not ideal because for personal reasons you already know of I have no idea what my pace of work on this generally will be.
Less prosaically, it’s not impossible that a stronger or more solidly grounded theory of semantics or of interoperable world-models might prove to be the “last missing piece” between us and AGI; that said, given that my research path primarily involves things like finding and constructing conceptual tools, writing mathematical proofs, and reasoning about bounds on accumulating errors—and not things like training new frontier models—I think the risk/dual-use-hazard of my proposed work is minimal.
I don’t really understand this argument. Why wouldn’t having a better theory of semantics and concepts help people build better AIs, but still do a good job of describing what’s going on in smart AIs? Like, you might think the more things you know about smart AIs, the easier it would be to build them—where does this argument break?
The thing you imply here is that it’s pretty different from stuff people currently do to train frontier models, but you already told me that scaling frontier models was really unlikely to lead to AGI, so why should that give me any comfort?
Like, you might think the more things you know about smart AIs, the easier it would be to build them—where does this argument break?
I mean… it doesn’t? I guess I mostly think that either what I’m working on is totally off the capabilities pathway, or, if it’s somehow on one, then I don’t think whatever minor framework improvement or suggestion for a mental frame I come up with is going to push things all that far? Which I agree is kind of a depressing thing to expect of your work, but I’d argue those are the two most likely outcomes here. Does that address that?
Not only would a better theory of semantics help researchers detect objects and features which are natural to the AI, it would also help them check whether a given AI treats some feature of its environment or class of object as a natural cluster, and help researchers agree within provable bounds on what concept precisely they are targeting.
This part isn’t so clear to me. Why can’t I just look at what features of the world an AI represents without a theory of semantics?
I guess in that case I’d worry that you go and look at the features and come away with some impression of what those features represent, and it turns out you’re totally wrong? I keep coming back to the example of a text-classifier where you find “the French activation directions”, except it turns out that only one of them is for French (if any at all) and the others are things like “words ending in x and z” or “words spoken by fancy people in these novels and quotes pages”.
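To make that worry concrete, here is a toy, fully synthetic sketch of the kind of check involved: two candidate “French directions” both fire on the original French eval set, but a contrast set (French text without x/z endings vs. non-French text with them) shows that only one of them actually tracks French. With a real model you would substitute actual activations for the simulated ones.

```python
# Fully synthetic stand-in for activation probing; no real model is involved.
import numpy as np

rng = np.random.default_rng(0)

def simulate_activation(is_french, ends_in_xz, which):
    # Direction A genuinely tracks French; direction B tracks x/z endings,
    # which merely correlate with French in the original eval set.
    signal = is_french if which == "A" else ends_in_xz
    return signal * 2.0 + rng.normal(0, 0.3)

def mean_activation(which, is_french, ends_in_xz, n=500):
    return np.mean([simulate_activation(is_french, ends_in_xz, which) for _ in range(n)])

for which in ["A", "B"]:
    naive = mean_activation(which, is_french=1, ends_in_xz=1)       # French text, x/z endings common
    control_fr = mean_activation(which, is_french=1, ends_in_xz=0)  # French text, no x/z endings
    control_xz = mean_activation(which, is_french=0, ends_in_xz=1)  # non-French text with x/z endings
    print(f"direction {which}: naive={naive:.2f}  french-only={control_fr:.2f}  xz-only={control_xz:.2f}")
```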
why do you think it’s unlikely?
It seems to me like asking too much to think that there won’t be shared natural ontologies between humans (construed broadly) and ML models, but that we can still make sure that, with the right pretraining regimen/dataset choice/etc., the model will end up with a human ontology, and also that this process admits any amount of error, and also that this can be done in a way that’s not trivially jailbreakable.
On one hand, arbitrary agents—or at least a large class of agents, or at least (proto-)AGIs that humans make—might turn out to simply already naturally agree with us on the features we abstract from our surroundings; a better-grounded and better-developed theory of semantics would allow us to confirm this and become more optimistic about the feasibility of alignment.
On the other, such agents might prove in general to have inner ontologies totally unrelated to our own, or perhaps only somewhat different, but in enduring and hazardous ways; a better theory of semantics would warn us of this in advance and suggest other routes to AGI or perhaps drive a total halt to development.
I feel like these two paragraphs are just fleshing out the thing you said earlier and aren’t really needed
the next paragraph is kind of like that but making a sort of novel point so maybe they’re necessary? I’d try to focus them on saying things you haven’t yet said
I agree that those three paragraphs are bloated. My issue is this—I don’t yet know which of those three branches is true (natural abstractions exist all the time vs. NAs can exist but only if you put them there vs. NAs do not, in general, exist, and they break immediately) but whichever it is, I think a better theory of semantics would help tell us which one it is, and then also be a necessary prerequisite to the obvious resulting plan.
those that rely on arbitrary AGIs detecting and [settling on as natural] the same features of the world that humans do, including values and qualities important to humanity
is this word needed?
No; removed.
can you give examples of such strategies, and argue that they rely on this?
I’m in a weird situation here: I’m not entirely sure whether the community considers the Learning Theory Agenda to be the same alignment plan as The Plan (which is arguably not a plan at all but he sure thinks about value learning!), and whether I can count things like the class of scalable oversight plans which take as read that “human values” are a specific natural object. Would you at least agree that those first two (or one???) rely on that?
a link here could be nice
added
Worst of all, lacking such a theory means that we lack the framework and the language we’d need to precisely describe both human values—and how we’d check that a given system comprehends human values.
both of the human values? or should this be “both human values and how we’d check that...”?
should be more clear, yeah, something like “not only human values but also how we’d check that...”
As a result, I think that conceptual alignment will be a required direction of work towards ensuring that the advent of AGI results in a desirable future for humanity among other sapient life. In particular, my perspective as a mathematician leads me to believe that just as a lack of provable guarantees about a mathematical object means that such an object can be arbitrarily unexpectedly badly behaved on features you didn’t want to specify, so too could the behavior of an underspecified or imprecisely specified AGI result in arbitrarily undesirable (or even merely pathological or self-defeating) behavior along axes we didn’t think to check.
Maybe devote a sentence arguing for this claim
IDG how this is supposed to be related to whether scaling will work. Surely if scaling were enough, your arguments here would still go thru, right?
I realized I wasn’t super clear about which part was which. I agree that “is scaling enough” is a major crux for me and I’d be way way more afraid if it looked like scaling were sufficient on its own; that part, however, is about “do we actually need to get alignment basically exactly right”. Does that change your understanding?
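As a purely mathematical illustration of the “arbitrarily badly behaved on features you didn’t specify” point above (nothing AI-specific is intended here): a polynomial that agrees with the intended function at every point we bothered to check can still be wildly wrong everywhere in between.

```python
# Toy illustration of underspecification (Runge's phenomenon); pure numpy.
import numpy as np

# The "specification": we only ever check behavior at these 11 points.
checked_x = np.linspace(-1, 1, 11)
f = lambda x: 1.0 / (1.0 + 25.0 * x**2)   # the behavior we actually want

# A degree-10 polynomial hits every checked point essentially exactly...
coeffs = np.polyfit(checked_x, f(checked_x), deg=10)

# ...but between the points we never specified, it swings wildly.
unchecked_x = np.linspace(-1, 1, 2001)
err_checked = np.max(np.abs(np.polyval(coeffs, checked_x) - f(checked_x)))
err_unchecked = np.max(np.abs(np.polyval(coeffs, unchecked_x) - f(unchecked_x)))
print(f"max error at the points we specified:  {err_checked:.2e}")
print(f"max error at points we never checked:  {err_unchecked:.2f}")
```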
It seems very likely (~96%) to me that scale is not in fact all that is required to go from current frontier models to AGI, such that GPT-8 (say) will still not be superintelligent and a near-perfect predictor or generator of text, just because of what largely boils down to a difference of scale and not a difference of kind or of underlying conceptual model; I consider it more likely that we’ll get AGI ~30 years out but that we’ll have to get alignment precisely right.
You might want to gesture at why this seems likely to you, since AFAICT this is a minority view.
writing a bit about this now.
All the same, it’s not all that surprising that conceptual alignment generally and natural abstractions/natural semantics specifically are—maybe unavoidably—underserved subfields of alignment: the model of natural semantics I’m working off of was only officially formalized in mid-June 2024.
what’s the point of this sentence? would anything bad happen if you just deleted it?
I was trying to address the justification for why I’m here doing this instead of someone else doing something else? I might have been reading something about neglectedness from the old rubric. I could totally just cut it.
the model of natural semantics I’m working off of was only officially formalized in mid-June 2024.
is this why it isn’t surprising that conceptual alignment is underserved, or an example of it being underserved? as written I feel like the structure implies the second, but content-wise it feels more like the first
You use a lot of em dashes, and it’s noticeable. This is a common problem in writing. I don’t know a good way to deal with this, other than suggesting that you consider which ones could be footnotes, parentheses, or commas.
This is helpful! I didn’t know I’d be allowed to use footnotes in my RP; I default to plaintext.
cannot be even reasonably sure that the measurements taken and experiments performed are telling us what we think they are
It’s not clear to me why this follows. Couldn’t it be the case that even without a theory of what sorts of features we expect models to learn / use, we can detect what features they are in fact using?
I guess there’s two things here:
1. what sorts of things in the environment do we expect models to pick up on
2. how do we expect models to process info from the environment
If we’re wrong about 1, I feel like we could find it out. But if we make wrong assumptions about 2, it makes a bit more sense to me that we could fail to find that out.
In any case, an example indicating how we could fail would probably be useful here.
For 1., we could totally find out that our AGI just plain cannot pick up on what a car or a dog is, and only classifies/recognizes their parts (or halves of them, or just always misclassifies them), but then we’d have no sense of what’s going on to cause it or how to fix it.
For 2. … I have no idea? I feel like that might be out of scope for what I want to think about. I don’t even know how I’d start attacking that problem in full generality or even in part.
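A rough sketch of how the failure mode in 1. could be surfaced, using a stand-in “model” and synthetic features (everything named below is hypothetical; a real test would swap in an actual classifier and dataset): compare accuracy on whole objects against the same objects with their single most distinctive part hidden.

```python
# Synthetic sketch: does the "model" recognize the whole object, or only one part?
import numpy as np

rng = np.random.default_rng(0)

def fake_classifier(features):
    # Stand-in for "a model that only keys on one part": it looks only at
    # feature 0 (say, 'has wheels') and ignores the rest of the object.
    return (features[:, 0] > 0.5).astype(int)

def make_examples(n, occlude_part=False):
    labels = rng.integers(0, 2, n)           # 1 = car, 0 = not-car
    features = rng.normal(0, 0.1, size=(n, 8))
    features[:, 0] += labels                  # the single "part" feature
    features[:, 1:4] += labels[:, None]       # whole-object structure
    if occlude_part:
        features[:, 0] = 0.0                  # hide the part
    return features, labels

for occlude in (False, True):
    X, y = make_examples(2000, occlude_part=occlude)
    acc = np.mean(fake_classifier(X) == y)
    print(f"occlude_part={occlude}: accuracy={acc:.2f}")
```

If accuracy collapses once the part is hidden, you have detected the failure; the harder question in the text, what is going on inside to cause it, is exactly the part this kind of behavioral test cannot answer on its own.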
I think I’m missing something. What does the story look like, where we have some feature we’re totally unsure of what it signifies, but we’re very sure that the model is using it?
Or from the other direction, I keep coming back to Jacob’s transformer with like 200 orthogonal activation directions that all look to make the model write good code. They all seemed to be producing about the exact same activation pattern 8 layers on. It didn’t seem like his model was particularly spoiled for activation space, so what is it that all those extra directions were actually picking up on?
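One toy version of that check, using a stand-in stack of random layers rather than the actual transformer: inject each of N orthogonal directions at one layer, then measure whether their downstream effects have collapsed onto (nearly) the same direction several layers later.

```python
# Stand-in network; the real check would inject into actual transformer activations.
import numpy as np

rng = np.random.default_rng(0)
d, n_layers, n_dirs = 64, 8, 20

# Stand-in for layers k..k+8: a fixed stack of random linear+ReLU maps.
weights = [rng.normal(0, 1 / np.sqrt(d), size=(d, d)) for _ in range(n_layers)]

def forward(x):
    for W in weights:
        x = np.maximum(0, x @ W)
    return x

# N orthogonal injection directions at the early layer.
directions, _ = np.linalg.qr(rng.normal(size=(d, n_dirs)))
base = rng.normal(size=d)
baseline_out = forward(base)

# Downstream effect of each injection, relative to the unperturbed run.
deltas = np.stack([forward(base + 3.0 * directions[:, i]) - baseline_out for i in range(n_dirs)])
deltas /= np.linalg.norm(deltas, axis=1, keepdims=True) + 1e-9

cos = deltas @ deltas.T
off_diag = cos[~np.eye(n_dirs, dtype=bool)]
print("mean pairwise cosine similarity of downstream effects:", off_diag.mean().round(3))
```

A high mean similarity would say the "distinct" directions converge onto one downstream effect, which is roughly the puzzle described above; a low one would say they really do carry different information.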