LessWrong team member / moderator. I’ve been a LessWrong organizer since 2011, with roughly equal focus on the cultural, practical and intellectual aspects of the community. My first project was creating the Secular Solstice and helping groups across the world run their own version of it. More recently I’ve been interested in improving my own epistemic standards and helping others to do so as well.
Raemon
Or: when the current policy stops making sense, we can figure out a new policy.
In particular, when the current policy stops making sense, AI moderation tools may also be more powerful and can enable a wider range of policies.
I mean, the sanctions are ‘if we think your content looks LLM generated, we’ll reject it and/or give a warning and/or eventually delete or ban.’ We do this for several users a day.
That may get harder someday but it’s certainly not unenforceable now.
I agree it’ll get harder to validate, but I think having something like this policy is, like, a prerequisite (or at least helpful grounding) for the mindset change.
Curated. I think figuring out whether and how we can apply AI to AI safety is one of the most important questions, and I like this post for exploring this through many more different angles than we’d historically seen.
A thing I both like and dislike about this post is that it’s more focused on laying out the questions than giving answers. This makes it easier for the post to “help me think it through myself” (rather than just telling me a “we should do X” style answer).
But it lays out a dizzying enough array of different concerns that I found it sort of hard to translate this into “okay, what should I actually think about next?”. I’d have found it helpful if the post ended with some kind of recap of “here’s the areas that seem most important to be tracking, for me.”
(note: This is Raemon’s random take rather than considered Team Consensus)
Part of the question here is “what sort of engine is overall maintainable, from a moderation perspective?”.
LLMs make it easy for tons of people to be submitting content to LessWrong without really checking whether it’s true and relevant. It’s not enough for a given piece to be true. It needs to be reliably true, with low cost to moderator attention.
Right now, basically LLMs don’t produce anywhere near good enough content. So, presently, letting people submit AI generated content without adding significant additional value is a recipe for LW admins to spend a bunch of extra time each day deciding whether to moderate a bunch of content that we’re realistically going to say “no” to.
(Some of the content is ~on par with the bottom 25% of LW content, but the bottom 25% of LW content is honestly below the quality bar we prefer the site to be at, and the reason we let those comments/posts in at all is because it’s too expensive to really check if it’s reasonable, and when we’re unsure, we sometimes default to “let it in, and let the automatic rate limits handle it”. But, the automated rate limits would not be sufficient to handle an influx of LLM slop.)
But, even when we imagine content that should theoretically be “just over the bar”, there are second-order effects of LW being a site with a potentially large amount of AI content that nobody is really sure if it’s accurate or whether anyone endorses it and whether we are entering into some slow rolling epistemic disaster.
So, my guess for the bar for “how good quality do we need to be talking about for AI content to be net-positive” is more like at least top-50% and maybe top-25% of baseline LW users. And when we get to that point probably the world looks pretty different.
My lived experience is that AI-assisted-coding hasn’t actually improved my workflow much since o1-preview, although other people I know have reported differently.
It seems like my workshops would generally work better if they were spaced out over 3 Saturdays, instead of crammed into 2.5 days in one weekend.
This would give people more time to try applying the skills in their day to day, and see what strategic problems they actually run into each week. Then on each Saturday, they could spend some time reviewing last week, thinking about what they want to get out of this workshop day, and then making a plan for next week.
My main hesitation is I kind of expect people to flake more when it’s spread out over 3 weeks, or for it to be harder to find 3 Saturdays in a row that work as opposed to 1 full weekend in a row.
I also think there is a bit of a special workshop container that you get when there’s 3 days in a row, and it’s a bit sad to lose that container.
But, two ideas I’ve considered so far are:
Charge more, and people get a partial refund if they attend all three sessions.
Have there be 4 days instead of 3, and design it such that if people miss a day it’s not that big a deal.
I’ve also been thinking about a more immersive-program experience, where for 3-4 weeks, people are living/working onsite at Lighthaven, mostly working on some ambitious-but-confusing project, but with periodic lessons and check-ins about practical metastrategy. (This is basically a different product than “the current workshop”, and much higher commitment, but it’s closer to what I originally wanted with Feedbackloop-first Rationality, and is what I most expect to actually work.)
I’m curious to hear what people think about these.
Also, have you tracked the previous discussion on Old Scott Alexander and LessWrong about “mysterious straight lines” generally being a surprisingly common phenomenon in economics? I.e., on an old AI post Oli noted:
This is one of my major go-to examples of this really weird linear phenomenon:
150 years of a completely straight line! There were two world wars in there, the development of artificial fertilizer, the broad industrialization of society, the invention of the car. And all throughout the line just carries on, with no significant perturbations.
(This doesn’t mean we should automatically take new proposed Straight Line Phenomena at face value. I don’t actually know if this is more like “pretty common actually” or “there are a few notable times it was true that are drawing undue attention.” But I’m at least not like “this is a never-before-seen anomaly.”)
I think it’s also “My Little Pony Fanfics are more cringe than Harry Potter fanfics, and there is something about the combo of My Little Pony and AIs taking over the world that is extra cringe.”
I’m here from the future trying to decide how much to believe in Gods of Straight Lines and how common they are, and I’m curious if you could say more about this.
I do periodically think about this and feel kind of exhausted at the prospect, but it does seem pretty plausibly correct. Good to have a writeup of it.
It particularly seems likely to be the right mindset if you think survival right now depends on getting some kind of longish pause (at least on the sort of research that’d lead to RSI+takeoff)
Metastrategy = Cultivating good “luck surface area”?
Metastrategy: being good at looking at an arbitrary situation/problem, and figuring out what your goals are, and what strategies/plans/tactics to employ in pursuit of those goals.
Luck Surface area: exposing yourself to a lot of situations where you are more likely to get valuable things in a not-very-predictable way. Being “good at cultivating luck surface area” means going to events/talking-to-people/consuming information that are more likely to give you random opportunities / new ways of thinking / new partners.
At one of my metastrategy workshops, while I talked with a participant about what actions had been most valuable the previous year, many of the things were like “we published a blogpost, or went to an event, and then kinda randomly found people who helped us a bunch, i.e. gave us money or we ended up hiring them.”
This led me to utter the sentence “yeah, okay I grudgingly admit that ‘increasing your luck surface area’ is more important than being good at ‘metastrategy’”, and I improvised a session on “where did a lot of your good luck come from this year, and how could you capitalize more on that?”
But, thinking later, I think maybe “being good at metastrategy” and “being good at managing luck surface area” are actually basically the same thing?
That is:
If you already know how to handle a given situation, you’re basically using “strategy”, not “metastrategy.”
If you don’t already know, what you wanna do is strategically direct your thoughts in novel directions (maybe by doing crazy brainstorming, maybe by doing structured “think about the problem in a bunch of different ways that seem likely to help”, maybe by taking a shower and letting your mind wander, maybe by talking to people who will likely have good advice about your problem).
This is basically “exposing luck surface area” for your cognition.
Thinking about it more and chatting with a friend: Managing Luck Surface Area seems like a subset of metastrategy but not the whole thing.
One counterexample they gave was “reading a book that will basically tell you a crucial fact, or teach you a specific skill”, where you basically know it will work and that it’s a necessary prerequisite for solving your problem.
But it does seem like the “luck surface area”-ish portion of metastrategy is usually more important for most people/situations, esp. if you’re going to find plans that are 10-100x better than your current plan. (Although, once you locate a hypothesis, “get a ton of domain-expertise in a given field” might be the right next step. That’s sort of blurring back into “regular strategy” rather than “metastrategy”, although the line is fuzzy.)
Pedagogic feedback: each diagram is much longer than a page, so it’s harder to fit the whole thing in my head at once.
It’s unclear to me what the current evidence is for this happening ‘a lot’ and ‘them being called Nova specifically’. I don’t particularly doubt it but it seemed sort of asserted without much background.
Curated. This concept seems like an important building block for designing incentive structures / societies, and this seems like a good comprehensive reference post for the concept.
Note: it looks like you probably want this to be a markdown file. You can go to https://www.lesswrong.com/account, find the “site customizations” section, and click “activate Markdown” to enable the markdown editor.
Fyi I think it’s time to do minor formatting adjustments to make papers/abstracts easier to read on LW
I think there might be a bit of a (presumably unintentional) motte and bailey here where the motte is “careful conceptual thinking might be required rather than pure naive empiricism (because we won’t be given good enough test beds by default) and it seems like Anthropic (leadership) might fail heavily at this” and the bailey is “extreme philosophical competence (e.g. 10-30 years of tricky work) is pretty likely to be needed”.
Yeah I agree that was happening somewhat. The connecting dots here are “in worlds where it turns out we need a long Philosophical Pause, I think you and Buck would probably be above some threshold where you notice and navigate it reasonably.”
I think my actual belief is “the Motte is high likelihood true, the Bailey is… medium-ish likelihood true, but, like, it’s a distribution, there’s not a clear dividing line between them”
I also think the pause can be “well, we’re running untrusted AGIs and ~trusted pseudogeneral LLM-agents that help with the philosophical progress, but we can’t run them that long or fast. They help speed things up and make what’d normally be a 10-30 year pause into a 3-10 year pause, but also the world would be going crazy left to its own devices, and the sort of global institutional changes necessary are still similarly outside the Overton window as a 20 year global moratorium, and the ‘race with China’ rhetoric is still bad.”
Thanks for laying this out thus far. I’mma reply but understand if you wanna leave the convo here. I would be interested in more effortpost/dialogue about your thoughts here.
Yes, my reasoning is definitely part, but not all of the argument. Like the thing I said is a sufficient crux for me. (If I thought we had to directly use human labor to align AIs which were qualitatively wildly superhuman in general I would put much more weight on “extreme philosophical competence”.)
This makes sense as a crux for the claim “we need philosophical competence to align unboundedly intelligent superintelligences.” But, it doesn’t make sense for the claim “we need philosophical competence to align general, open-ended intelligence.” I suppose my OP didn’t really distinguish these claims and there were a few interpretations of how the arguments fit together. I was more saying the second (although to be fair I’m not sure I was actually distinguishing them well in my head until now).
It doesn’t make sense for “we ‘just’ need to be able to hand off to an AI which is seriously aligned” to be a crux for the second. A thing can’t be a crux for itself.
I notice my “other-guy-feels-like-they’re-missing-the-point” → “check if I’m not listening well, or if something is structurally wrong with the convo” alarm is firing, so maybe I do want to ask for one last clarification on “did you feel like you understood this the first time? Does it feel like I’m missing the point of what you said? Do you think you understand why it feels to me like you were missing the point (even if you think it’s because I’m being dense about something)?”
Takes on your proposal
Meanwhile, here’s some takes based on my current understanding of your proposal.
This bit:
We need to ensure that our countermeasures aren’t just shifting from a type of misalignment we can detect to a type we can’t. Qualitatively analyzing the countermeasures and our tests should help here.
...is a bit I think is philosophical-competence bottlenecked. And this bit:
“Actually, we didn’t have any methods available to try which could end up with a model that (always) isn’t egregiously misaligned. So, even if you can iterate a bunch, you’ll just either find that nothing works or you’ll just fool yourself.”
...is a mix of “philosophically bottlenecked” and “rationality bottlenecked.” (i.e. you both have to be capable of reasoning about whether you’ve found things that really worked, and, because there are a lot of degrees of freedom, capable of noticing if you’re deploying that reasoning accurately)
I might buy that you and Buck are competent enough here to think clearly about it (not sure. I think you benefit from having a number of people around who seem likely to help), but I would bet against Anthropic decisionmakers being philosophically competent enough.
(I think at least some people on the alignment science or interpretability teams might be. I bet against the median such team member being able to navigate it. And ultimately, what matters is “does Anthropic leadership go forward with the next training run”, so it matters whether Anthropic leadership buys arguments from hypothetically-competent-enough alignment/interpretability people. And Anthropic leadership already seem to basically be ignoring arguments of this type, and I don’t actually expect to get the sort of empirical clarity that (it seems like) they’d need to update before it’s too late.)
Second, we can study how generalization on this sort of thing works in general
I think this counts as the sort of empiricism I’m somewhat optimistic about in my post. i.e. if you are able to find experiments that actually give you evidence about deeper laws, that let you then make predictions about new Actually Uncertain questions of generalization that you then run more experiments on… that’s the sort of thing I feel optimistic about. (Depending on the details, of course)
But, you still need technical philosophical competence to know if you’re asking the right questions about generalization, and to know when the results actually imply that the next scale-up is safe.
I think I agree with a lot of stuff here but don’t find this post itself particularly compelling for the point.
I also don’t think “be virtuous” is really sufficient to know “what to actually do.” It matters a lot which virtues. Like I think environmentalism’s problem wasn’t “insufficiently virtue-ethics oriented”, its problem was that it didn’t have some particular virtues that were important.