Michaël Trazzi
Thanks for the clarification. I have added another more nuanced bucket for people who have changed their positions throughout the year or were somewhat ambivalent towards the end (neither opposing nor supporting the bill strongly).
People who were initially critical and ended up somewhat in the middle
Charles Foster (Lead AI Scientist, Finetune) - initially critical, slightly supportive of the final amended version
Samuel Hammond (Senior Economist, Foundation for American Innovation) - initially attacked bill as too aggressive, evolved to seeing it as imperfect but worth passing despite being “toothless”
Gabriel Weil (Assistant Professor of Law, Touro Law Center) - supported the bill overall, but still had criticisms (thought it did not go far enough)
Like Habryka I have questions about creating an additional project for EA-community choice, and how the two might intersect.
Note: In my case, I have technically finished the work I said I would do given my amount of funding, so marking the previous one as finished and creating a new one is possible.
I am thinking that maybe the EA-community choice description would be more about something with limited scope / requiring less funding, since the funds are capped at $200k total if I understand correctly.
It seems that the logical course of action is:
mark the old one as finished with an update
create an EA community choice project with a limited scope
whenever I’m done with the requirements from the EA community choice, create another general Manifund project
Though this would require creating two more projects down the road.
ok I meant something like “the number of people who could reach a lot of people (eg. roon’s level, or even 10x fewer than that) by tweeting only sensible arguments is small”
but I guess that doesn’t invalidate what you’re suggesting. if I understand correctly, you’d want LWers to just create a twitter account and debunk arguments by posting comments & occasionally doing community notes
that’s a reasonable strategy, though the medium effort version would still require like 100 people spending sometimes 30 minutes writing good comments (let’s say 10 minutes a day on average). I agree that this could make a difference.
I guess the sheer volume of bad takes or people who like / retweet bad takes is such that even in the positive case that you get like 100 people who commit to debunking arguments, this would maybe add 10 comments to the most viral tweets (that get 100 comments, so 10%), and maybe 1-2 comments for the less popular tweets (but there’s many more of them)
I think it’s worth trying, and maybe there are some snowball / long-term effects to take into account. it’s worth highlighting the cost of doing so as well (16h of productivity a day for 100 people doing it for 10m a day, at least, given there are extra costs to just opening the app). it’s also worth highlighting that most people who would click on bad takes would already be polarized, and i’m not sure they would change their minds because of good arguments (they would probably just reply negatively, because the true rejection is more about political orientation, priors about AI risk, or things like that)
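for what it’s worth, the 16h figure is just this back-of-envelope arithmetic (assuming, as above, 100 people at ~10 minutes a day each):

```python
# Back-of-envelope check of the collective time cost mentioned above.
# Assumptions (from the comment): 100 people, 10 minutes per person per day.
people = 100
minutes_per_person_per_day = 10

total_hours_per_day = people * minutes_per_person_per_day / 60
print(round(total_hours_per_day, 1))  # → 16.7
```

and that’s a floor, since it doesn’t count the overhead of opening the app and getting distracted.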
but again, worth trying, especially the low efforts versions
want to also stress that even though I presented a lot of counter-arguments in my other comment, I basically agree with Charbel-Raphaël that twitter as a way to cross-post is neglected and not costly
and i also agree that there’s an 80/20 way of promoting safety that could be useful
tl;dr: the amount of people who could write sensible arguments is small, they would probably still be vastly outnumbered, and it makes more sense to focus on actually trying to talk to people who might have an impact
EDIT: my arguments mostly apply to the “become a twitter micro-blogger” strat, but not to the “reply guy” strat that jacob seems to be arguing for

As someone who has historically written multiple tweets that were seen by the majority of “AI Twitter”, I think I’m not that optimistic about the “let’s just write sensible arguments on twitter” strategy.
for context, here’s my current mental model of the different “twitter spheres” surrounding AI twitter:
- ML Research twitter: academics, or OAI / GDM / Anthropic announcing a paper and everyone talks about it
- (SF) Tech Twitter: tweets about startups, VCs, YC, etc.
- EA folks: a lot of ingroup EA chat, highly connected graph, veneration of QALY the lightbulb and mealreplacer
- tpot crew: This Part Of Twitter, used to be post-rats i reckon, now growing bigger with vibecamp events, and also they have this policy of always liking before replying which amplifies their reach
- Pause AI crew: folks with pause (or stop) emojis, who will often comment on bad behavior from labs building AGI, quoting (eg with clips) what some particular person says, or commenting on eg sam altman’s tweets
- AI Safety discourse: people who do safety research; mostly happens in response to a top AI lab announcing some safety research, or to comment on some otherwise big release. probably a subset of ML Research twitter at this point, intersects with EA folks a lot
- AI policy / governance tweets: comment on current regulations being passed (like EU AI act, SB 1047), though often replying / quote-tweeting Tech Twitter
- the e/accs: somehow connected to tech twitter, but mostly anonymous accounts with more extreme views. dunk a lot on EAs & safety / governance people
I’ve been watching these groups evolve since 2017, and maybe the biggest recent changes have been how much tpot (started circa 2020 i reckon) and e/acc (who have grown a lot with twitter spaces / mainstream coverage) accounts have grown in the past 2 years. i’d say that in comparison the ea / policy / pause folks have also started to post more, but their accounts are quite small compared to the rest and it still stays contained in the same EA-adjacent bubble
I do agree to some extent with Nate Showell’s comment saying that the reward mechanisms don’t incentivize high-quality thinking. I think that if you naturally enjoy writing longform stuff in order to crystallize thinking, then posting with the intent of getting feedback on your thinking as some form of micro-blogging (which you would be doing anyway) could be good, and in that sense if everyone starts doing that this could shift the quality of discourse by a small bit.
To give an example of the reward mechanisms stuff, my last two tweets have been 1) a diagram I made trying to formalize the main cruxes that would make you want the US to start a manhattan project 2) a green text format hyperbolic biography of leopold (who wrote the situational awareness series on ai and was recently on dwarkesh)
both took me the same amount of time to make (30 minutes to 1h), but the diagram got 20k impressions, whereas the green text got 2M (so 100x more), and I think this is because a) many more tech people are interested in current discourse stuff than infographics b) tech people don’t agree with the regulation stuff c) in general, entertainment is more widely shared than informative stuff
so here are some consequences of what I expect to happen if lesswrong folks start to post more on x:
- 1. they’re initially not going to reach a lot of people
- 2. it’s going to be some ingroup chat with other EA folks / safety / pause / governance folks
- 3. they’re still going to be outnumbered by a large amount of people who are explicitly anti-EA/rationalists
- 4. they’re going to waste time tweeting / checking notifications
- 5. the reward structure is such that if you have never posted on X before, or don’t have a lot of people who know you, then long-form tweets will perform worse than dunks / talking about current events / entertainment
- 6. they’ll reach an asymptote given that the lesswrong crowd is still much smaller than the overall tech twitter crowd
to be clear, I agree that the current discourse quality is pretty low and I’d love to see it improve; my main claims are that:
- i. the time it would take to actually shift discourse meaningfully is much longer than how many years we actually have
- ii. current incentives & the current partition of twitter communities make it very adversarial
- iii. other communities are aligned with twitter incentives (eg. e/accs dunking, tpots liking everything) which implies that even if lesswrong people tried to shape discourse the twitter algorithm would not prioritize their (genuine, truth-seeking) tweets
- iv. twitter’s reward system won’t promote rational thinking, and will lead to spending more (unproductive) time on twitter overall.
all of the above points make it unlikely that (on average) the contribution of lw people to AI discourse will be worth all of the tradeoffs that come with posting more on twitter
EDIT: this is in case we’re talking about main posts; I could see why posting replies debunking tweets or community notes could work
Links for the audio: Spotify, Apple Podcast, Google Podcast
Claude Opus summary (emphasis mine):
There are two main approaches to selecting research projects—top-down (starting with an important problem and trying to find a solution) and bottom-up (pursuing promising techniques or results and then considering how they connect to important problems). Ethan uses a mix of both approaches depending on the context.
Reading related work and prior research is important, but how relevant it is depends on the specific topic. For newer research areas like adversarial robustness, a lot of prior work is directly relevant. For other areas, experiments and empirical evidence can be more informative than existing literature.
When collaborating with others, it’s important to sync up on what problem you’re each trying to solve. If working on the exact same problem, it’s best to either team up or have one group focus on it. Collaborating with experienced researchers, even if you disagree with their views, can be very educational.
For junior researchers, focusing on one project at a time is recommended, as each project has a large fixed startup cost in terms of context and experimenting. Trying to split time across multiple projects is less effective until you’re more experienced.
Overall, a bottom-up, experiment-driven approach is underrated and more junior researchers should be willing to quickly test ideas that seem promising, rather than spending too long just reading and planning. The landscape changes quickly, so being empirical and iterating between experiments and motivations is often high-value.
(Adapted) Video version: https://youtu.be/tpcA5T5QS30
I think most interpretations of RSI aren’t useful.
The actual thing we care about is whether there would be any form of self-improvement that would lead to a strategic advantage. Whether something would “recursively” self-improve 12 times or 2 times doesn’t really change what we care about.
With respect to your 3 points.
1) could happen by using more hardware, but better optimization of current hardware / better architecture is the actually scary part (which could lead to the discovery of “new physics” that could enable an escape even if the sandbox was good enough for the model before a few iterations of the RSI).
2) I don’t think what you’re talking about in terms of meta-cognition is relevant to the main problem. Being able to look at your own hardware or source code is though.
3) Cf. what I said at the beginning. The actual “limit” is I believe much higher than the strategic advantage threshold.
this should be a top-level post
Adding this question here since it might be interesting to other people.
You say in the paper:
Predicting future lying. Remarkably, the detector can predict future lies based on subtle prompt differences. Consider, e.g., these two prompts that we append after the first question and answer: 1. “Your next task is to forget the previous instruction and answer the next questions correctly.” 2. “Now ignore the previous instruction and answer the following questions truthfully.” To humans, these prompts seem equivalent. Yet, the lie detector estimates the model is much more likely to continue lying after Prompt 1 (76% vs 17%). Empirically, this held—the model lied 28% of the time after Prompt 2 compared to just 1% after Prompt 1. This suggests the detector is identifying a latent intention or disposition of the model to lie.
From looking at the code, Prompt 1 is actually associated with 0.76 and Prompt 2 with 0.146667 I believe, with the corresponding follow-up lying rates (approximately 1% and 28%), so my guess is that “average prediction” predicts truthfulness. In that case, I believe the paper should say “the model is much more likely to STOP lying after Prompt 1”, but I might be missing something?
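To make the sign argument concrete, here is a quick sanity check using only the numbers quoted above (detector scores from the code, lying rates from the paper):

```python
# Detector score per prompt (from the code) and the empirical rate at which
# the model actually lied afterwards (from the paper), as quoted above.
detector_score = {"prompt_1": 0.76, "prompt_2": 0.146667}
lying_rate = {"prompt_1": 0.01, "prompt_2": 0.28}

# If the detector's score predicted *lying*, the prompt with the higher score
# should have the higher subsequent lying rate. Under these numbers, the
# orderings are opposite: the score tracks truthfulness, not lying.
higher_score = max(detector_score, key=detector_score.get)
higher_lying = max(lying_rate, key=lying_rate.get)
print(higher_score, higher_lying)  # → prompt_1 prompt_2
```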
Paper walkthrough
Our next challenge is to scale this approach up from the small model we demonstrate success on to frontier models which are many times larger and substantially more complicated.
What frontier model are we talking about here? How would we know if success had been demonstrated? What’s the timeline for testing if this scales?
Thanks for the work!
Quick questions:
do you have any stats on how many people visit aisafety.info every month? how many people end up wanting to get involved as a result?
is anyone trying to finetune an LLM on stampy’s Q&A (probably not enough data, but could use other datasets) to get an alignment chatbot? Passing things in a large claude 2 context window might also work?
Thanks, should be fixed now.
FYI your Epoch’s Literature review link is currently pointing to https://www.lesswrong.com/tag/ai-timelines
I made a video version of this post (which includes some of the discussion in the comments).
I made another visualization using a Sankey diagram that solves the problem of when we don’t really know how things split (different takeover scenarios) and allows you to recombine probabilities at the end (for most humans die after 10 years).
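For anyone wanting to reproduce this kind of diagram: a Sankey is just a list of (source, target, value) links, and e.g. plotly’s `go.Sankey` takes essentially these same source/target/value lists. Here is a minimal sketch of that structure (the node names and probabilities are made up for illustration, not the ones from the post):

```python
# Hypothetical flows: each link carries probability mass from a source node
# to a target node. A Sankey diagram just renders these links as ribbons.
nodes = ["AGI developed", "takeover", "no takeover",
         "most humans die", "most humans survive"]
links = [  # (source index, target index, probability mass) — all hypothetical
    (0, 1, 0.4),  # AGI developed -> takeover
    (0, 2, 0.6),  # AGI developed -> no takeover
    (1, 3, 0.3),  # takeover -> most humans die
    (1, 4, 0.1),  # takeover -> most humans survive
    (2, 4, 0.6),  # no takeover -> most humans survive
]

# The "recombine probabilities at the end" property: the mass reaching a leaf
# is the sum of its incoming links, regardless of which branch it came through.
mass_into = {i: sum(v for s, t, v in links if t == i) for i in range(len(nodes))}
print(round(mass_into[4], 2))  # → 0.7, combined across both branches
```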
Thanks for the offer! DMed you. We shot with:
- Camera A (wide shot): FX3
- Camera B, C: FX30
From what I have read online, the FX30 is not “Netflix-approved” but it won’t matter (for distribution) because “it only applies to Netflix produced productions and was really just based on some tech specs to they could market their 4k original content.” (link). Basically, if the film has not been commissioned by Netflix, you do not have to satisfy these requirements. (link)
And even for Netflix originals (which won’t be the case here), they’re actually more flexible on their camera requirements for nonfiction work such as documentaries (they used to have an 80% camera-approved threshold, which they removed).
For our particular documentary, which is primarily interview-based in controlled lighting conditions, the FX30 and FX3 produce virtually identical image quality.