Thanks for the feedback! In a follow-up, I can try creating various rewordings of the prompt for each value. But instead of just neutral rewordings, it seems like you are talking about the extent to which the tone of the prompt is implicitly encouraging behavior (output length) one way or the other, am I correct in interpreting that way? So e.g. have a much more subdued/neutral tone for the consciousness example?
Joe Kwon
Does the median LW commenter believe that autoregressive LLMs will take us all the way to superintelligence?
Super cool stuff. Minor question, what does “Fraction of MLP progress” mean? Are you scaling down the MLP output values that get added to the residual stream? Thanks!
FWIW I understand now what it’s meant to do, but have very little idea how your protocol/proposal delivers positive outcomes in the world by emitting performative speech acts. I think explaining your internal reasoning/hypothesis for how emitting performative speech acts leads to powerful AI’s delivering positive incomes would be helpful.
Is such a “channel” necessary to deliver positive outcomes? Is it supposed to make it more likely that AI delivers positive outcomes? More details on what a success looks like to you here, etc.
I skimmed The Snuggle/Date/Slap Protocol and Ethicophysics II: Politics is the Mind-Savior which are two recent downvoted posts of yours. I think they get negative karma because they are difficult to understand and it’s hard to tell what you’re supposed to take away from it. They would probably be better received if the content were written such that it’s easy to understand what your message is at an object-level as well as what the point of your post is.
I read the Snuggle/Date/Slap Protocol and feel confused about what you’re trying to accomplish (is it solving AI Alignment?) and how the method is supposed to accomplish that.
In the ethicophysics posts, I understand the object level claims/material (like the homework/discussion questions) but fail to understand what the goal is. It seems like you are jumping to grounded mathematical theories for stuff like ethics/morality which immediately makes me feel dubious. It’s a too much, too grand, too certain kind of reaction. Perhaps you’re just spitballing/brainstorming some ideas, but that’s not how it comes across and I infer you feel deeply assured that it’s correct given statements like “It [your theory of ethics modeled on the laws of physics] therefore forms an ideal foundation for solving the AI safety problem.”
I don’t necessarily think you should change whatever you’re doing BTW just pointing out some likely reactions/impressions driving negative karma.
This is terrific. One feature that will be great to have, is a way to sort and categorize your predictions under various labels.
Sexuality is, usually, a very strong drive which has a large influence over behaviour and long term goals. If we could create an alignment drive as strong in our AGI we would be in a good position.
I don’t think we’d be in a good position even if we instilled an alignment drive this strong in AGI
To me, the caveats section of this post highlights the limited scope from which language models will be able to learn human values and preferences, given explicitly stated (And even implied-from-text) goals != human values as a whole.
Hi Cameron, nice to see you here : ) what are your thoughts on a critique like: human prosocial behavior/values only look the way they look and hold stable within-lifetimes, insofar as we evolved in + live in a world where there are loads of other agents with roughly equal power as ourselves? Do you disagree with that belief?
This was very insightful. It seems like a great thing to point to, for the many newish-to-alignment people ideating research agendas (like myself). Thanks for writing and posting!
This is a really cool idea and I’m glad you made the post! Here are a few comments/thoughts:
H1: “If you give a human absolute power, there is a small subset of humans that actually cares and will try to make everyone’s life better according to their own wishes”How confident are you in this premise? Power and sense of values/incentives/preferences may not be orthogonal (and my intuition is that it isn’t). Also, I feel a little skeptical about the usefulness of thinking about the trait showing up more or less in various intelligence strata within humans. Seems like what we’re worried about is in a different reference class. Not sure.
H4 is something I’m super interested in and would be happy to talk about it in conversations/calls if you want to : )
Something at the root of this might be relevant to the inverse scaling competition thing where they’re trying to find what things get worse in larger models. This might have some flavor of obviously wrongness → deception via plausible sounding things as models get larger? https://github.com/inverse-scaling/prize
interesting idea. like.. a mix of genuine sympathy/expansion of moral circle to AI, and virtue signaling/anti-corporation meme spreads to the majority population and effectively curtails AGI capabilities research? This feels like a thing that might actually do nothing to reduce corporations’ efforts to get to powerful AI unless it reaches a threshold at which point there’s very dramatic actions against corporations who continue to try to do that thing
I stream-of-consciousness’d this out and I’m not happy with how it turned out, but it’s probably better I post this than delete it for not being polished and eloquent. Can clarify with responses in comments.
Glad you posted this and I’m also interested in hearing what others say. I’ve had these questions for myself in tiny bursts throughout the last few months.
When I get the chance to speak to people earlier in their career stage than myself (starting undergrad, or is a high schooler attending a mathcamp I went to) who are undecided about their careers, I bring up my interest in AI Alignment and why I think it’s important, and share resources for them after the call in case they’re interested in learning more about it. I don’t have very many opportunities like this because I don’t actively seek to identify and “recruit” them. I only bring it up by happenstance (e.g. joining a random discord server for homotopy type theory, seeing an intro by someone who went to the same mathcamp as me and is interested in cogsci, and scheduling a call to talk about my research background in cogsci and how my interests have evolved/led me to alignment over time).
I know very talented people who are around my age at MIT and from a math program I attended; students who are breezing by technical double majors with perfect GPAs, IMO participants, good competitive programmers, etc. Some things that make it hard for me:
If I know them well, I can talk about my research interests and try to get them to see my motivation, but if I’m only catching up with them 1-2x a year, it feels very unnatural and synthetic for me to be spending that time trying to convert them into doing alignment work. If I am still very close to them / talk to them frequently, there’s still an issue of bringing it up naturally and having a chance to convince them. Most of these people are doing Math PhDs, or trading in finance, or working on a startup, or… The point is that they are fresh on their sprint down the path that they have chosen. They are all the type who are very focused and determined to succeed on the goals they have settled on. It is not “easy” to get them (or for this matter, almost any college student) to halt their “exploit” mode, take 10 steps back and lots of time from their busy lives, and then “explore” another option that I’m seemingly imposing onto them. FWIW, the people I know who are in trading seem to be the most likely to switch out (explicitly have told me in conversations that they just enjoy the challenge of the work, but want to find more fulfilling things down the road. And to these people I share ideas and resources about AI Safety.)
I shared resources after the call, talked about why I’m interested in alignment, and that’s the furthest I’ve gone wrt potentially converting someone who is already in a separate career track, to consider alignment.
If it was MUCH easier to convince people that ai alignment is worth thinking about in under an hour, and I could reach out to people to talk to me about this for a hour without looking like a nutjob and potentially damaging our relationship because it seems like I’m just trying to convert them into something else, AND the field of AI Alignment was more naturally compelling for them to join, I’d do much more of this outreach. On that last point, what I mean is: for one moment, let’s suspend the object level importance of solving AI Alignment. In reality, there are things that are incredibly important/attractive for people when pursuing a career. Status, monetary compensation, and recognition (and not being labeled a nutjob) are some big ones. If these things were better (and I think they are getting much better recently), it would be easier to get people to spend more time at least thinking about the possibility of working on AI Alignment, and eventually some would work on it because I don’t think the arguments for x-risk from AI are hard to understand. If I personally didn’t have so much support by way of programs the community had started (SERI, AISC, EA 1-1s, EAG AI Safety researchers making time to talk to me), or it felt like the EA/X-risk group was not at all “prestigious”, I don’t know how engaged I would’ve been in the beginning, when I started my own journey learning about all this. As much as I wish it weren’t true, I would not be surprised at all if the first instinctual thing that led me down this road was noticing that EAs/LW users were intelligent and had a solidly respectable community, before choosing to spend my time engaging with the content (a lot of which was about X-risks).
Hi John. One could run useful empirical experiments right now, before fleshing out all these structures and how to represent them, if you can assume that a proxy for human representations (crude: conceptnet, less crude: similarity judgments on visual features and classes collected by humans) is a good enough proxy for “relevant structures” (or at least that these representations more faithfully capture the natural abstractions than the best machines in vision tasks for example, where human performance is the benchmark performance), right?
I had a similar idea about ontology mismatch identification via checking for isomorphic structures, and also realized I had no idea how to realize that idea. Through some discussions with Stephen Casper and Ilia Sucholutsky, we kind of pivoted the above idea into the regime of interpretability/adversarial robustness where we are hunting for interesting properties given that we can identify the biggest ways that humans and machines are representing things differently (and that humans, for now, are doing it “better”/more efficiently/more like the natural abstraction structures that exist).
I think am working in the same building this summer (caught a split-second glance at you yesterday); I would love a chance to discuss how selection theorems might relate to an interpretability/adversarial robustness project I have been thinking about.
Thanks so much for the response, this is all clear now!
Sorry if it’s obvious from some other part of your post, but the whole premise is that sufficiently strong models *deployed in sufficiently complex environments* leads to general intelligence with optimization over various levels of abstractions. So why is it obvious that: It doesn’t matter if your AI is only taught math, if it’s a glorified calculator — any sufficiently powerful calculator desperately wants to be an optimizer?
If it’s only trained to solve arithmetic and there are no additional sensory modalities aside from the buttons on a typical calculator, how does increasing this AI’s compute/power lead to it becoming an optimizer over a wider domain than just arithmetic? Maybe I’m misunderstanding the claim, or maybe there’s an obvious reason I’m overlooking.
Also, what do you think of the possibility that when AI becomes superhuman++ in tasks, that the representations go from interpretable to inscrutable again (because it uses lower level representations that are inaccessible to humans)? I understand the natural abstraction hypothesis, and I buy it too, but even an epsilon increase in details might compound into significant prediction outcomes if a causal model is trying to use tons of representations in conjunction to compute something complex.
Do you think it might be valuable to find a theoretical limit that shows that the amount of compute needed for such epsilon-details to be usefully incorporated is greater than ever will be feasible (or not)?
Hi Steve, loved this post! I’ve been interested in viewing the steering and thought generator + assessor submodule framework as the object and generator-of-values of which which we want AI to learn a good pointer to/representation of, to simulate out the complex+emergent human values and properly value extrapolate.
I know the way I’m thinking about the following doesn’t sit quite right with your perspective, because AFAIK, you don’t believe there need to be independent, modular value systems that give their own reward signals for different things (your steering subsystem and thought generator and assessor subsystem are working in tandem to produce a singular reward signal). I’d be interested in hearing your thoughts on what seems more realistic, after importing my model of value generators as more distinctive and independent modular systems in the brain.
In the past week, I’ve been thinking about the potential importance of considering human value generators as modular subsystems (for both compute and reward). Consider the possibility that at various stages of the evolutionary neurocircuitry-shaping timeline of humans, that modular and independently developed subsystems developed. E.g. one of the first systems, some “reptilian” vibe system, was one that rewarded sugary stuff because it was a good proxy at the time for nutritious/calorie-dense foods that help with survival. And then down the line, there was another system that developed to reward feeling high-social status, because it was a good proxy at the time for surviving as social animals in in-group tribal environments. What things would you critique about this view, and how would you fit similar core-gears into your model of the human value generating system?
I’m considering value generators as more independent and modular, because (this gets into a philosophical domain but) perhaps we want powerful optimizers to apply optimization pressure not towards the human values generated by our wholistic-reward-system, but to ones generated by specific subsystems (system 2, higher-order values, cognitive/executive control reward system) instead of reptilian hedon-maximizing system.
This is a few-day old, extremely crude and rough-around-the-edges idea, but I’d especially appreciate your input and critiques on this view. If it were promising enough, I wonder if (inspired by John Wentworth’s evolution of modularity post) training agents in a huge MMO environment and switching up reward signals in the environment (or the environment distribution itself) every few generations would lead to a development of modular reward systems (mimicking the trajectory of value generator systems developing in humans over the evolutionary timeline).
Enjoyed reading this! Really glad you’re getting good research experience and I’m stoked about the strides you’re making towards developing research skills since our call (feels like ages ago)! I’ve been doing a lot of what you describe as “directed research” myself lately as I’m learning more about DL-specific projects and I’ve been learning much faster than when I was just doing cursory, half-assed paper skimming, alongside my cogsci projects. Would love to catch up over a call sometime to talk about stuff we’re working on now
Makes sense, and I also don’t expect the results here to be surprising to most people.
What do you mean by this part? As in if it just writes very long responses naturally? There’s a significant change in the response lengths depending on whether it’s just the question (empirically the longest for my factual questions), a short prompt preceding the question, a longer prompt preceding the question, etc. So I tried to control for the fact that having any consciousness prompt means a longer input to Claude by creating some control prompts that have nothing to do with consciousness—in which case it had shorter responses after controlling for input length.
Basically because I’m working with an already RLHF’d model whose output lengths are probably most dominated by whatever happened in the preference tuning process, I try my best to account for that by having similar length prompts preceding the questions I ask.