Primarily interested in agent foundations, AI macrostrategy, and enhancement of human intelligence, sanity, and wisdom.
I endorse and operate by Crocker’s rules.
I have not signed any agreements whose existence I cannot mention.
Strong agree.
For a more generalized version, see: https://www.lesswrong.com/posts/4gDbqL3Tods8kHDqs/limits-to-legibility
(Caveat: they initially distill from a much larger model, which I see as a little bit of a cheat.)
Another little bit of a cheat is that they only train Qwen2.5-Math-7B according to the described procedure. For the other three models (all smaller than Qwen2.5-Math-7B), they instead use the fine-tuned Qwen2.5-Math-7B to generate the training data that bootstraps round 4. (Basically, they distill from DeepSeek in round 1 and then from the fine-tuned Qwen in round 4.)
They justify:
Due to limited GPU resources, we performed 4 rounds of self-evolution exclusively on Qwen2.5-Math-7B, yielding 4 evolved policy SLMs (Table 3) and 4 PPMs (Table 4). For the other 3 policy LLMs, we fine-tune them using step-by-step verified trajectories generated from Qwen2.5-Math-7B’s 4th round. The final PPM from this round is then used as the reward model for the 3 policy SLMs.
TBH I’m not sure how this helps them save on GPU resources. Is it somehow cheaper to generate a lot of big/long rollouts with Qwen2.5-Math-7B-r4 than to do it three times with [smaller model]-r3?
I donated $1k.
Lighthaven is the best venue I’ve been to. LessWrong is the best place on the internet that I know of and it hosts an intellectual community that was crucial for my development as a thinker and greatly influenced my life decisions over the last 3 years.
I’m grateful for it.
I wish you all the best and hope to see you flourish and prosper.
The vNM axioms constrain the shape of an agent’s preferences; they say nothing about how to make decisions.
Suppose your decision in a particular situation comes down to choosing between some number of lotteries (with specific estimated probabilities over their outcomes) and there’s no complexity/nuance/tricks on top of that. In that case, vNM says that you should choose the one with the highest expected utility as this is the one you prefer the most.
At least, assuming that choice is the right operationalization of preferences; if it isn’t, then the Dutch book / money-pump arguments don’t follow.
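A toy illustration of the simple case (the lotteries and numbers below are mine, made up purely for the sake of the example):

```python
# Minimal sketch: choosing among lotteries by vNM expected utility.
# Each lottery is a list of (probability, utility) pairs; probabilities sum to 1.
lotteries = {
    "A": [(0.5, 10.0), (0.5, 0.0)],    # 50/50 between 10 utils and 0 utils
    "B": [(1.0, 4.0)],                 # 4 utils for sure
    "C": [(0.1, 100.0), (0.9, -5.0)],  # long shot with a small downside
}

def expected_utility(lottery):
    return sum(p * u for p, u in lottery)

for name, lottery in lotteries.items():
    print(name, expected_utility(lottery))   # A: 5.0, B: 4.0, C: 5.5

best = max(lotteries, key=lambda name: expected_utility(lotteries[name]))
print("vNM says choose:", best)              # C, the highest expected utility
```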
ETA: I guess I could just say:
What are your preferences if not your idealized evaluations of decision-worthiness of options (modulo “being a corrupted piece of software running on corrupted hardware”)?
1. Introduce third-party mission alignment red teaming.
Anthropic should invite external parties to scrutinize and criticize Anthropic’s instrumental policy and specific actions based on whether they are actually advancing Anthropic’s stated mission, i.e. safe, powerful, and beneficial AI.
Tentatively, red-teaming parties might include other AI labs (adjusted for conflict of interest in some way?), as well as AI safety/alignment/risk-mitigation orgs: MIRI, Conjecture, ControlAI, PauseAI, CEST, CHT, METR, Apollo, CeSIA, ARIA, AI Safety Institutes, Convergence Analysis, CARMA, ACS, CAIS, CHAI, &c.
For the sake of clarity, each red team should provide a brief on their background views (something similar to MIRI’s Four Background Claims).
Along with their criticisms, red teams would be encouraged to propose somewhat specific changes, possibly ordered by magnitude, with something like “allocate marginally more funding to this” being a small change and “pause AGI development completely” being a very big change. Ideally, they should avoid making suggestions that include the possibility of making a small improvement now that would block a big improvement later (or make it more difficult).
Since Dario seems to be very interested in “race to the top” dynamics: if this mission alignment red-teaming program signals well about Anthropic, other labs should catch up and start competing more intensely to be evaluated as positively as possible by third parties (“race towards safety”?).
It would also be good to have a platform where red teams can converse with Anthropic, as well as with each other, and the logs of their back-and-forth are published to be viewed by the public.
Anthropic should commit to taking these criticisms seriously. In particular, given how large the stakes are, they should commit to taking something like “many parties believe that Anthropic in its current form might be net-negative, even increasing the risk of extinction from AI” as a reason to pause or slow down, even if that’s contrary to their inside view.
2. Anthropic should make an explicit statement about its infohazard policy.
This statement should include how Anthropic thinks about, and how it handles, doing and publishing research that advances AGI development without benefiting safety/alignment/x-risk reduction to an extent sufficient to offset its contribution to (likely unsafe-by-default) AGI development.
I wish this were posted as a question, ideally by you together with other Anthropic people, including Dario.
Figure out a way to show users the CoT of reasoning/agent models that you release in the future. (i.e. don’t do what OpenAI did with o1). Doesn’t have to be all of it, just has to be enough—e.g. each user gets 1 CoT view per day.
What would be the purpose of 1 CoT view per user per day?
Where does China fit into this picture?
Unlike the West, China enjoys unconditional love from the Heavens. /j
[After I wrote down the thing, I became more uncertain about how much weight to give to it. Still, I think it’s a valid consideration to have on your list of considerations.]
“AI alignment”, “AI safety”, “AI (X-)risk”, “AInotkilleveryoneism”, “AI ethics” came to be associated with somewhat specific categories of issues. When somebody says “we should work (or invest more or spend more) on AI {alignment,safety,X-risk,notkilleveryoneism,ethics}”, they communicate that they are concerned about those issues and think that deliberate work on addressing those issues is required or otherwise those issues are probably not going to be addressed (to a sufficient extent, within relevant time, &c.).
“AI outcomes” is even broader/[more inclusive] than any of the above (the only step left to broaden it even further would be perhaps to say “work on AI being good” or, in the other direction, work on “technology/innovation outcomes”) and/but also waters down the issue even more. Now you’re saying “AI is not going to be (sufficiently) good by default (with various AI outcomes people having very different ideas about what makes AI likely not (sufficiently) good by default)”.
It feels like we’re moving in the direction of broadening our scope of consideration to (1) ensure we’re not missing anything, and (2) facilitate coalition building (moral trade?). While this is valid, it risks (1) failing to operate on the/an appropriate level of abstraction, and (2) diluting our stated concerns so much that coalition building becomes too difficult, because the different people/groups endorsing the stated concerns have their own interpretations/beliefs/value systems. (Something something find an optimum (but also be ready and willing to update where you think the optimum lies when the situation changes)?)
I’m not claiming it’s feasible (within decades). That’s just what a solution might look like.
Insufficiently catchy
Something like “We have mapped out the possible human-understandable or algorithmically neat descriptions of the network’s behavior sufficiently comprehensively, and sampled from this space sufficiently comprehensively, to know that the probability that there’s a description of its behavior that is meaningfully shorter than the shortest one we’ve found is at most $\epsilon$.”.
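If I try to compress that sentence into a formula (my own notation, not from the original comment: $\mathcal{D}$ is the class of human-understandable or algorithmically neat descriptions considered, $|d|$ some fixed description-length measure, $d^\ast$ the shortest correct description found so far, and $k$ the margin for “meaningfully shorter”):

$$P\big(\exists\, d \in \mathcal{D}:\ d \text{ correctly describes the network's behavior} \;\wedge\; |d| \le |d^\ast| - k\big) \;\le\; \epsilon$$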
It seems to me this is challenging to interpret if there isn’t a persistent agent, with a persistent mind, who assigns Bayesian subjective probabilities to outcomes.
Right, but if there isn’t a persistent agent with a persistent mind, then we no longer have an entity to which predicates of rationality apply (at least in the sense in which “rationality” is usually understood in this community). Talking about it in terms of “it’s no longer vNM-rational” feels like saying “it’s no longer wet” when you change the subject of discussion from physical bodies to abstract mathematical structures.
Or am I misunderstanding you?
If you negate 1 or 3, then you have an additional factor/consideration in what your mind should be shaped like and the conclusion “you better be shaped such that your behavior is interpretable as maximizing some (sensible) utility function or otherwise you are exploitable or miss out on profitable bets” doesn’t straightforwardly follow.
setting utility in log of size of bankroll
This doesn’t work if the lottery is in utils rather than dollars/money/whatever instrumental resource.
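A toy way to see the distinction (the numbers and the half-bankroll bet are my own example): log-of-bankroll is a transformation applied to a monetary payoff, whereas if a lottery’s outcomes are already utilities, taking the expectation is the whole story and there’s nothing left to take a log of.

```python
import math

def expected(outcomes, f=lambda x: x):
    """Expected value of f(payoff) over (probability, payoff) pairs."""
    return sum(p * f(x) for p, x in outcomes)

# A fair bet over *dollars*: risk half of a $100 bankroll on a coin flip.
dollar_lottery = [(0.5, 150.0), (0.5, 50.0)]
print(expected(dollar_lottery))            # 100.0 -- fair in dollar terms
print(expected(dollar_lottery, math.log))  # ~4.46 < log(100) ~ 4.61, so a log-bankroll agent declines

# If the same numbers are already *utils*, the expectation is the final answer:
util_lottery = [(0.5, 150.0), (0.5, 50.0)]  # 150 utils or 50 utils
print(expected(util_lottery))               # 100 utils; wrapping log() around utils would just
                                            # mean maximizing a different utility function.
```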
vNM coherence arguments etc. say something like “you better have your preferences satisfy those criteria because otherwise you might get exploited or miss out on opportunities for profit”.
I have my gripes with parts of it, but to the extent these arguments hold some water (and I do think they hold some water), they assume that there are no other pressures acting on the mind (or the mind-generating process or something), or [reasons to be shaped like this instead of like that], that act alongside or interact with those vNM-ish pressures.
Various forms of boundedness are the most obvious example, though not very interesting. A more interesting example is the need to have an updateless component in one’s decision theory.[1] Plausibly there’s also the thing about acquiring deontology/virtue-ethics making the agent easier to cooperate with.
So I think it’s better to think of vNM-ish pressures as one category of pressures acting on the ~mind than to think of a vNM agent as one of the final-agent-type options. You get the latter from the former if you assume away all other pressures, but the pressures view is more foundational IMO.
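For the “you might get exploited” part, here is a minimal toy money pump (the cyclic preferences and the one-cent fee are my own illustration, not from the post):

```python
# Toy money pump: an agent with cyclic preferences A > B > C > A
# pays a small fee for each "upgrade" and ends up where it started, poorer.
prefers = {("A", "B"), ("B", "C"), ("C", "A")}  # (x, y) means x is strictly preferred to y

def offer_trade(holding, offered, money, fee=0.01):
    # The agent accepts any trade to something it strictly prefers, paying the fee.
    if (offered, holding) in prefers:
        return offered, money - fee
    return holding, money

holding, money = "A", 1.00
for offered in ["C", "B", "A"]:  # walk the agent around the preference cycle
    holding, money = offer_trade(holding, offered, money)

print(holding, round(money, 2))  # back to "A" with $0.97 -- exploited
```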
Updatelessness is inconsistent with an assumption of decision tree separability that is a foundation for the money pump arguments for vNM, at least the ones used by Gustafsson.
When I click “New Post”, I see this
A factor stemming from the same cause but pushing in the opposite direction is that “mundane” AI profitability can “distract” people who would otherwise be AGI hawks.
Was [keeping FrontierMath entirely private and under Epoch’s control] feasible for Epoch in the same sense of “feasible” you are using here?