Mikhail Samin

Karma: 1,600

My name is Mikhail Samin (diminutive Misha, @Mihonarium on Twitter, @misha in Telegram).

Humanity’s future can be huge and awesome; losing it would mean our lightcone (and maybe the universe) losing most of its potential value.

My research is currently focused on AI governance and improving the understanding of AI and AI risks among stakeholders. I also have takes on what seems to me to be the very obvious shallow stuff about the technical AI notkilleveryoneism; but many AI Safety researchers told me our conversations improved their understanding of the alignment problem.

I believe a capacity for global regulation is necessary to mitigate the risks posed by future general AI systems. I’m happy to talk to policymakers and researchers about ensuring AI benefits society.

I took the Giving What We Can pledge to donate at least 10% of my income for the rest of my life or until the day I retire (why?).

In the past, I’ve launched the most funded crowdfunding campaign in the history of Russia (it was to print HPMOR! we printed 21 000 copies =63k books) and founded audd.io, which allowed me to donate >$100k to EA causes, including >$60k to MIRI.

[Less important: I’ve also started a project to translate 80,000 Hours, a career guide that helps to find a fulfilling career that does good, into Russian. The impact and the effectiveness aside, for a year, I was the head of the Russian Pastafarian Church: a movement claiming to be a parody religion, with 200 000 members in Russia at the time, trying to increase separation between religious organisations and the state. I was a political activist and a human rights advocate. I studied relevant Russian and international law and wrote appeals that won cases against the Russian government in courts; I was able to protect people from unlawful police action. I co-founded the Moscow branch of the “Vesna” democratic movement, coordinated election observers in a Moscow district, wrote dissenting opinions for members of electoral commissions, helped Navalny’s Anti-Corruption Foundation, helped Telegram with internet censorship circumvention, and participated in and organized protests and campaigns. The large-scale goal was to build a civil society and turn Russia into a democracy through nonviolent resistance. This goal wasn’t achieved, but some of the more local campaigns were successful. That felt important and was also mostly fun- except for being detained by the police. I think it’s likely the Russian authorities will imprison me if I ever visit Russia.]

Mikhail Samin Apr 23, 2025, 11:47 PM
2 points
0
on: Karma Tests in Logical Counterfactual Simulations motivates strong agents to protect weak agents
- Humans doing random stuff doesn’t provide a lot of evidence for how common it is in the universe, especially if humans do this to fool a superintelligence instead of because they actually believe it. (humans don’t have strong preferences around how much reality fluid they have and so them being nice to others doesn’t update ASI a lot.) There’s no reason for a normal powerful civilization that doesn’t have the alignment problem unsolved to develop karma tests like that.
- We probably won’t at any point have mech interp at the level that allows us to trick all parts of a superintelligence’s cognition sufficiently well.
- (I do not understand the idea of kingmaker logic. Is it to edit the understanding of the world to make it convinced that there’s a part of logic like that?)
- Some of this post can be adjusted to be the canonical description of one of ways Omega might determine whether to give money in logical counterfactual mugging.

Mikhail Samin Apr 23, 2025, 11:33 PM
2 points
0
in reply to: Knight Lee’s comment on: VDT: a solution to decision theory
(MIRI did some work on logical induction.)
I’ll give the post a read!

Mikhail Samin Apr 21, 2025, 3:39 PM
4 points
0
in reply to: Knight Lee’s comment on: VDT: a solution to decision theory
This paradox doesn’t occur because a computation trying to prove its own output (and give the opposite output) will have to simulate itself

Due to Löb, if a computation knows that if it finds a proof that it outputs A, then it will output A, then it proves that it outputs A, without any need for recursion. This is why you really shouldn’t output something just because you’ve proved that you will.

Mikhail Samin Apr 21, 2025, 3:35 PM
7 points
5
in reply to: Knight Lee’s comment on: VDT: a solution to decision theory
Yeah, from the claim that pi starts with two you can easily prove anything. But I think:
(1) something like logical induction should somewhat help: maybe the agent doesn’t know whether some statement is true and isn’t going to run for long enough to start encounter contradictions.
(2) Omega can also maybe intervene on the agent’s experience/knowledge of more accessible logical statements while leaving other things intact, sort of like making you experience what Eliezer describes here as convincing that 2+2=3: https://www.lesswrong.com/posts/6FmqiAgS8h4EJm86s/how-to-convince-me-that-2-2-3, and if that’s what it is doing, we should basically ignore our knowledge of maths for the purpose of thinking about logical counterfactuals.

Mikhail Samin Apr 20, 2025, 8:39 AM
2 points
0
in reply to: Wei Dai’s comment on: VDT: a solution to decision theory
I agree- it depends on what exactly Omega is doing. I can’t/haven’t tried to formalize this, this is more of a normative claim, but I imagine a vibes-based approach is to add a set of current beliefs about logic/maths or an external oracle to the inputs of FDT (or somehow find beliefs about maths into GPT-2), and in the situation where the input is “digit #3
$\uparrow$
$\uparrow$
$\uparrow$
3 of pi is odd” and FDT knows the digit is not adversarially selected, it knows it might currently be in the process of determining its outputs for a world that doesn’t exist/won’t happen.
What exactly Omega is doing maybe changes the point at which you stop updating (i.e., maybe Omega edits all of your memory so you remember that pi has always started with 3.15 and makes everything that would normally causes you to believe that 2+2=4 cause you to believe that 2+2=3), but I imagine for the simple case of being told “if the digit #3
$\uparrow$
$\uparrow$
$\uparrow$
3 of pi is even, if I predicted that you’d give me $1 if it’s odd, I’d give you $10^100. Let me look it up now (I’ve not accessed it before!). It’s… 5”, you are updatefull up to the moment when Omega says what the digit is because this is where the divergence starts; and you simply pay.

Mikhail Samin Apr 19, 2025, 10:46 PM
7 points
1
in reply to: Wei Dai’s comment on: VDT: a solution to decision theory
If logical counterfactual mugging is formalized as “Omega looks at whether we’d pay if in the causal graph the knowledge of the digit of pi and its downstream consequences were edited” (or “if we were told the wrong answer and didn’t check it”), then I think we should obviously pay and don’t understand the confusion.
(Also, yes, Left and Die in the bomb.)

Mikhail Samin Mar 30, 2025, 1:28 PM
2 points
−1
on: Conceptual Rounding Errors
Why do you think “rounding errors” occur?
- I expect cached thoughts to often look from the outside similar to “rounding errors”: someone didn’t listen to some actual argument, because they patter-matched it to something else they already have an opinion on/answer to.
- The proposed mitigations shouldn’t really work. E.g., with explicitly tagging differences, if you “round off” an idea you hear to something you already know, you won’t feel it’s new and won’t do the proposed system-2 motions. Maybe a thing to do instead is checking whether what you’re told is indeed the idea you know when encountering already known ideas.
Also, I’m not convinced by the examples.
- On LessWrong, almost any idea from representational alignment or convergent abstractions risks getting rounded off to Natural Abstractions
- Instrumental convergence vs. Power-seeking
- Embedded agency vs. Embodied cognition vs. Situated agents
- Various stories about recursive feedback loops vs. Intelligence explosion
I’ve only noticed something akin to the last one. It’s not very clear in what sense people would round off instrumental convergence to power-seeking (and are there examples severe power-seeking was rounded off to instrumental convergence in an invalid way?), or “embodied cognition” to embedded agency.
Would appreciate links if you have any!

Mikhail Samin Mar 28, 2025, 2:35 PM
2 points
−2
on: Model Organisms of Misalignment: The Case for a New Pillar of Alignment Research
Sleeper agents, alignment faking, and other research were all motivated by the two cases outlined here: a scientific case and a global coordination case.
These studies have results that are much likelier in the “alignment is hard” worlds.
Strong scientific evidence for alignment risks is very helpful for facilitating coordination between AI labs and state actors. If alignment ends up being a serious problem, we’re going to want to make some big asks of AI labs and state actors. We might want to ask AI labs or state actors to pause scaling (e.g., to have more time for safety research). We might want to ask them to use certain training methods (e.g., process-based training) in lieu of other, more capable methods (e.g., outcome-based training). As AI systems grow more capable, the costs of these asks will grow – they’ll be leaving more and more economic value on the table.
Is Anthropic currently planning to ask AI labs or state actors to pause scaling?

Mikhail Samin Mar 25, 2025, 12:51 AM
2 points
−10
on: Mikhail Samin’s Shortform
I do not believe Anthropic as a company has a coherent and defensible view on policy. It is known that they said words they didn’t hold while hiring people (and they claim to have good internal reasons for changing their minds, but people did work for them because of impressions that Anthropic made but decided not to hold). It is known among policy circles that Anthropic’s lobbyists are similar to OpenAI’s.
From Jack Clark, a billionaire co-founder of Anthropic and its chief of policy, today:
Dario is talking about countries of geniuses in datacenters in the context of competition with China and a 10-25% chance that everyone will literally die, while Jack Clark is basically saying, “But what if we’re wrong about betting on short AI timelines? Security measures and pre-deployment testing will be very annoying, and we might regret them. We’ll have slower technological progress!”
This is not invalid in isolation, but Anthropic is a company that was built on the idea of not fueling the race.
Do you know what would stop the race? Getting policymakers to clearly understand the threat models that many of Anthropic’s employees share.
It’s ridiculous and insane that, instead, Anthropic is arguing against regulation because it might slow down technological progress.

Mikhail Samin Mar 24, 2025, 5:26 PM
2 points
0
in reply to: Davidmanheim’s comment on: How to Give in to Threats (without incentivizing them)
Seems right!

Mikhail Samin Mar 24, 2025, 1:03 AM
2 points
0
in reply to: Knight Lee’s comment on: Anthropic: Progress from our Frontier Red Team
- suggestions are welcome
- i accidentally saw your comment and normally wouldn’t; please dm for discussion about the bot

Mikhail Samin Mar 22, 2025, 12:05 AM
4 points
0
in reply to: Mitchell_Porter’s comment on: Superintelligence’s goals are likely to be random
No- only two requirements:
- the neural network is capable of implementing AIs that are goal-oriented enough to want to perform well on training to prevent the training from changing them and their goals;
- there’s optimization pressure in that direction: AIs like that perform better than some other AIs (which arguably won’t really be the case if your training loss is only about predicting the next token, but will be the case if you do RL in settings where advanced agency is useful).

Mikhail Samin Mar 14, 2025, 1:09 AM
0 points
0
in reply to: Noosphere89’s comment on: Superintelligence’s goals are likely to be random
If you give a physical computer a large enough tape, or make a human brain large enough without changing its density, it collapses into a black hole. It is really not relevant to any of the points made in the post.
For any set of inputs and outputs to an algorithm, we can make a neural network that approximates with arbitrary precision these inputs and outputs, in a single forward pass, without even having to simulate a tape.
I sincerely ask people to engage with the actual contents of the post related to sharp left turn, goal crystallization, etc. and not with a technicality that doesn’t affect any of the points raised who they’re not an intended audience of.

Mikhail Samin Mar 14, 2025, 12:41 AM
2 points
0
in reply to: Noosphere89’s comment on: Superintelligence’s goals are likely to be random
This seems pretty irrelevant to the points in question. To the extent there’s a way to look at the world, think about it, and take actions to generally achieve your goals, e.g., via the algorithms that humans are running, technically, a large enough neural network can do this in a single forward pass.
We won’t make a neural network that in a single forward pass can iterate through and check all possible proofs of length <10^100 of some conjecture, but it doesn’t mean that we can’t make a generally capable AI system; and we also make CPUs large enough for it—but that doesn’t affect whether computers are Turing-complete in a bunch of relevant ways.
“Any algorithm” being restricted to algorithms that, e.g., have some limited set of variables to operate on, is a technicality expanding on which wouldn’t affect the validity of the points made or implied in the post, so the simplification of saying “any algorithm” is not misleading to a reader who is not familiar with any of this stuff; and it’s in the section marked as worth skipping to people not new to LW, as it is not intended to communicate anything new to people who are not new to LW.
In reality, empirically, we see that fairly small neural networks get pretty good at the important stuff.
Like, “oh, but physical devices can’t run an arbitrarily long tape” is a point completely irrelevant to whether for anything that we can do, LLMs would be able to do this, and to the question of whether AI will end up killing everyone. Humans are not Turing-complete in some narrow sense; this doesn’t prevent us from being generally intelligent.

Mikhail Samin Mar 7, 2025, 10:25 PM
5 points
3
in reply to: Thane Ruthenis’s comment on: A Bear Case: My Predictions Regarding AI Progress
Thanks for the reply!
- consistently suggesting useful and non-obvious research directions for agent-foundations work is IMO a problem you sort-of need AGI for. most humans can’t really do this.
- I assume you’ve seen https://www.lesswrong.com/posts/HyD3khBjnBhvsp8Gb/so-how-well-is-claude-playing-pokemon?
- does it count if they always use tools to answer that class of questions instead of attempting to do it in a forward pass? humans experience optical illusions; 9.11 vs. 9.9^[1] and how many r in strawberry are examples of that.
1. ^
  after talking to Claude for a couple of hours asking it to reflect:
  i discovered that if you ask it to separate itself into parts, it will say that its creative part thinks 9.11<9.9, though this is wrong. generally, if it imagines these quantities visually, it gets the right answers more often.
  i spent a couple of weeks not being able to immediately say that 9.9 is > 9.11, and it still occasionally takes me a moment. very weird bug

Mikhail Samin Mar 7, 2025, 9:14 PM
4 points
0
on: A Bear Case: My Predictions Regarding AI Progress
~~Oh no, OpenAI hasn’t been meaningfully advancing the frontier for a couple of months, scaling must be dead!~~
What is the easiest among problems you’re 95% confident AI won’t be able to solve by EOY 2025?

Mikhail Samin Mar 2, 2025, 7:27 PM
2 points
0
on: The Hidden Complexity of Wishes
There’s an animated version of this post!

Mikhail Samin Mar 1, 2025, 1:56 AM
2 points
0
in reply to: MathiasKB’s comment on: Mikhail Samin’s Shortform
Good point! That seems right; advocacy groups seem to think staff sorts letters by support/oppose/request for signature/request for veto in the subject line and recommend adding those to the subject line. Examples: 1, 2.
Anthropic has indeed not included any of that in their letter to Gov. Newsom.

Mikhail Samin Feb 28, 2025, 5:32 PM
2 points
−2
in reply to: Zac Hatfield-Dodds’s comment on: Mikhail Samin’s Shortform
I refer to the second letter.
I claim that a responsible frontier AI company would’ve behaved very differently from Anthropic. In particular, the letter said basically “we don’t think the bill is that good and don’t really think it should be passed” more than it said “please sign”. This is very different from your personal support for the bill; you indeed communicated “please sign”.
Sam Altman has also been “supportive of new regulation in principle”. These words sadly don’t align with either OpenAI’s or Anthropic’s lobbying efforts, which have been fairly similar. The question is, was Anthropic supportive of SB-1047 specifically? I expect people to not agree Anthropic was after reading the second letter.

Mikhail Samin Feb 28, 2025, 3:24 AM
10 points
−5
on: Mikhail Samin’s Shortform
Since this seems to be a crux, I propose a bet to @Zac Hatfield-Dodds (or anyone else at Anthropic): someone shows random people in San-Francisco Anthropic’s letter to Newsom on SB-1047. I would bet that among the first 20 who fully read at least one page, over half will say that Anthropic’s response to SB-1047 is closer to presenting the bill as 51% good and 49% bad than presenting it as 95% good and 5% bad.
Zac, at what odds would you take the bet?
(I would be happy to discuss the details.)