Doing AI Safety research for ethical reasons.
My webpage.
Leave me anonymous feedback.
I operate by Crocker’s Rules.
I think Nesov had some similar idea about “agents deferring to a (logically) far-away algorithm-contract Z to avoid miscoordination”, although I never understood it completely, nor think that idea can solve miscoordination in the abstract (only, possibly, be a nice pragmatic way to bootstrap coordination from agents who are already sufficiently nice).
EDIT 2: UDT is usually prone to commitment races because it thinks of each agent in a conflict as separately making commitments earlier in logical time. But focusing on symmetric commitments gets rid of this problem.
Hate to always be that guy, but if you are assuming all agents will only engage in symmetric commitments, then you are assuming commitment races away. In actuality, it is possible for a (meta-) commitment race to happen about “whether I only engage in symmetric commitments”.
I don’t understand your point here, explain?
Say there are 5 different veils of ignorance (priors) that most minds consider Schelling (you could try to argue there will be exactly one, but I don’t see why).
If everyone simply accepted exactly the same one, then yes, lots of nice things would happen and you wouldn’t get catastrophically inefficient conflict.
But every one of these 5 priors will have different outcomes when it is implemented by everyone. For example, maybe in prior 3 agent A is slightly better off and agent B is slightly worse off.
So you need to give me a reason why a commitment race doesn’t recur at the level of “choosing which of the 5 priors everyone should implement”. That is, maybe A will make a very early commitment to only ever implement prior 3. As always, this is rational if A thinks the others will react a certain way (give in to the threat and implement 3). And I don’t have a reason to expect agents not to have such priors (although I agree they are slightly less likely than more common-sensical priors).
That is, as always, the commitment races problem doesn’t have a general solution on paper. You need to get into the details of our multi-verse and our agents to argue that they won’t have these crazy priors and will coordinate well.
This seems to be claiming that in some multiverses, the gains to powerful agents from being hawkish outweigh the losses to weak agents. But then why is this a problem? It just seems like the optimal outcome.
It seems likely that in our universe there are some agents with arbitrarily high gains-from-being-hawkish that don’t have correspondingly arbitrarily low measure. (This is related to Pascalian reasoning, see Daniel’s sequence.) For example, someone whose utility is exponential in the number of paperclips. I don’t agree that the optimal outcome (according to my ethics) is for me (whose utility is at most linear in the number of happy people) to turn all my resources into paperclips.
Maybe if I was a preference utilitarian biting enough bullets, this would be the case. But I just want happy people.
Nice!
Proposal 4: same as proposal 3 but each agent also obeys commitments that they would have made from behind a veil of ignorance where they didn’t yet know who they were or what their values were. From that position, they wouldn’t have wanted to do future destructive commitment races.
I don’t think this solves Commitment Races in general, because of two different considerations:
Trivially, I can say that you still have the problem when everyone needs to bootstrap a Schelling veil of ignorance.
Less trivially, even behind the most simple/Schelling veils of ignorance, I find it likely that hawkish commitments are incentivized. For example, the veil might say that you might be Powerful agent A, or Weak agent B, and if some Powerful agents have weird enough utilities (and this seems likely in a big pool of agents), hawkishly committing in case you are A will be a net-positive bet.
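As a toy illustration of the second point (all numbers hypothetical): suppose that, behind the veil, you assign probability 1/2 to being the powerful agent A and 1/2 to being the weak agent B, that A gains $G_A$ from hawkishly committing, and that B loses $L_B$ when facing such a commitment. The expected value of adopting the commitment from behind the veil is then

$$\frac{1}{2} G_A - \frac{1}{2} L_B,$$

which is positive whenever $G_A > L_B$. So if the pool contains even one possible A whose utility function is weird enough that $G_A$ dwarfs $L_B$, the veil itself recommends the hawkish commitment.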
This might still mostly solve Commitment Races in our particular multi-verse. I have intuitions both for and against this bootstrapping being possible. I’d be interested to hear yours.
I have no idea whether Turing’s original motivation was this one (not that it matters much). But I agree that if we take time and judge expertise to the extreme we get what you say, and that current LLMs don’t pass that. Heck, even a trick as simple as asking for a positional / visual task (something like ARC AGI, even if completely text-based) would suffice. But I still would expect academics to be able to produce a pretty interesting paper on weaker versions of the test.
Why isn’t there yet a paper in Nature or Science called simply “LLMs pass the Turing Test”?
I know we’re kind of past that, and now we understand LLMs can be good at some things while bad at others. And the Turing Test is mainly interesting for its historical significance, not as the most informative test to run on AI. And I’m not even completely sure to what extent current LLMs pass the Turing Test (it will depend massively on the details of your Turing Test).
But my model of academia predicts that, by now, some senior ML academics would have paired up with some senior “running-experiments-on-humans-and-doing-science-on-the-results” academics (and possibly some labs), and put out an extremely exhaustive and high-quality paper actually running a good Turing Test. If anything, so that the community can coordinate around it and make recent advancements more scientifically legible.
It’s not like the sole value of the paper would be publicity and legibility, either. There are many important questions around how good LLMs are at passing as humans in deployment. And I’m not thinking of something as shallow as “prompt GPT4 in a certain way”, but rather “work with the labs to actually optimize models for passing the test” (but of course don’t release them), which could be interesting for LLM science.
The only thing I’ve found is this lower quality paper.
My best guess is that this project does already exist, but it took >1 year, and is now undergoing ~2 years of slow revisions or whatever (although I’d still be surprised if they hadn’t been able to put something out sooner).
It’s also possible that labs don’t want this kind of research/publicity (regardless of whether they are running similar experiments internally). Or deem it too risky to create such human-looking models, even if they wouldn’t release them. But I don’t think either of those is the case. And even if it was, the academics could still do some semblance of it through prompting alone, and probably it would already pass some versions of the Turing Test. (Now they have open-source models capable enough to do it, but that’s more recent.)
Thanks Jonas!
A way to combine the two worlds might be to run it in video games or similar where you already have players
Oh my, we have converged back on Critch’s original idea for Encultured AI (not anymore, now it’s health-tech).
You’re right! I had mistaken the derivative for the original function.
Probably this slip happened because I was also thinking of the following:
Embedded learning can’t ever be modelled as taking such an (origin-agnostic) derivative.
When in ML we take the gradient in the loss landscape, we are literally taking (or approximating) a counterfactual: “If my algorithm had been a bit more like this, would I have performed better in this environment? (For example, would my prediction have been closer to the real next token?)”
But in embedded reality there’s no way to take this counterfactual: You just have your past and present observations, and you don’t necessarily know whether you’d have obtained more or less reward had you moved your hand a bit more like this (taking the fruit to your mouth) or like that (moving it away).
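To make the contrast concrete, here is a minimal sketch (hypothetical toy code: the one-parameter predictor, the hand-movement example and all numbers are made up, and nothing here is meant to model real brains or any particular training setup):

```python
import numpy as np

rng = np.random.default_rng(0)

# A tiny "predictor": one parameter w, squared-error loss.
w = 0.1
x, target = 2.0, 1.0

def loss(w):
    return (w * x - target) ** 2

# Gradient descent implicitly asks the counterfactual question
# "had w been slightly different, would the loss have been lower?".
# A finite-difference approximation makes that counterfactual explicit:
eps = 1e-4
grad = (loss(w + eps) - loss(w - eps)) / (2 * eps)
w = w - 0.1 * grad  # move toward the counterfactually better parameters

# An embedded agent, by contrast, only ever observes the one reward it
# actually got for the action it actually took; it never sees the
# counterfactual "what reward would a slightly different hand movement
# have gotten?".
action = w * x + rng.normal(scale=0.1)     # the movement it actually made
realized_reward = -(action - target) ** 2  # a single scalar, no derivative
```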
Of course, one way to solve this is to learn a reward model inside your brain, which can learn without any counterfactuals (just considering whether the prediction was correct, or how “close” it was for some definition of close). And then another part of the brain is trained to approximate argmaxing the reward model.
But another effect, that I’d also expect to happen, is that (either through this reward model or other means) the brain learns a “baseline of reward” (the “origin”) based on past levels of dopamine or whatever, and then reinforces things that go over that baseline, and disincentivizes those that go below (also proportionally to how far they are from the baseline). Basically the hedonic treadmill. I also think there’s some a priori argument for this helping with computational frugality, in case you change environments (and start receiving much more or much less reward).
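For concreteness, the kind of baseline mechanism I have in mind looks roughly like a gradient-bandit update with a moving reward baseline (a toy sketch; the two-armed bandit, the learning rates and everything else are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Two-armed bandit; the learner keeps action preferences and a running
# reward baseline (the "origin").
true_means = np.array([1.0, 1.5])
prefs = np.zeros(2)
baseline, lr, baseline_lr = 0.0, 0.1, 0.05

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

for _ in range(2000):
    probs = softmax(prefs)
    a = rng.choice(2, p=probs)
    r = rng.normal(true_means[a], 0.5)

    # Reinforce relative to the baseline: actions above it get strengthened,
    # actions below it get weakened, proportionally to the distance.
    advantage = r - baseline
    prefs[a] += lr * advantage * (1 - probs[a])
    prefs[1 - a] -= lr * advantage * probs[1 - a]

    # The baseline itself drifts toward recent reward levels
    # (a crude hedonic treadmill).
    baseline += baseline_lr * (r - baseline)
```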
The default explanation I’d heard for “the human brain naturally focusing on negative considerations”, or “the human body experiencing more pain than pleasure”, was that, in the ancestral environment, there were many catastrophic events to run away from, but not many incredibly positive events to run towards: having sex once is not as good as dying is bad (for inclusive genetic fitness).
But maybe there’s another, more general factor, that doesn’t rely on these environment details but rather deeper mathematical properties:
Say you are an algorithm being constantly tweaked by a learning process.
Say on input X you produce output (action) Y, leading to a good outcome (meaning, one of the outcomes the learning process likes, whatever that means). Sure, the learning process can tweak your algorithm in some way to ensure that X → Y is even more likely in the future. But even if it doesn’t, by default, next time you receive input X you will still produce Y (since the learning algorithm hasn’t changed you, and ignoring noise). You are, in some sense, already taking the correct action (or at least, an acceptably correct one).
Say on input X’ you produce output Y’, leading instead to a bad outcome. If the learning process changes nothing, next time you find X’ you’ll do the same. So the process really needs to change your algorithm right now, and can’t fall back on your existing default behavior.
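As a toy simulation of this asymmetry (purely illustrative, with made-up numbers of inputs and actions): the learner below is only ever modified after failures, never after successes, and it still ends up acting correctly on almost every input, because correct default behavior persists on its own.

```python
import random

random.seed(0)

n_inputs, n_actions = 20, 5
correct = [random.randrange(n_actions) for _ in range(n_inputs)]  # the "good" action per input
policy = [random.randrange(n_actions) for _ in range(n_inputs)]   # deterministic current behavior

for _ in range(1000):
    x = random.randrange(n_inputs)
    y = policy[x]
    if y != correct[x]:
        # Bad outcome: the learning process *must* change something now.
        policy[x] = random.randrange(n_actions)
    # Good outcome: no tweak needed; the behavior persists by default.

n_right = sum(p == c for p, c in zip(policy, correct))
print(f"{n_right}/{n_inputs} inputs handled correctly")
```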
Of course, many other factors make it the case that such a naive story isn’t the full picture:
Maybe there’s noise, so it’s not guaranteed you behave the same on every input.
Maybe the negative tweaks make the positive behavior (on other inputs) slowly wither away (like circuit rewriting in neural networks), so you need to reinforce positive behavior for it to stick.
Maybe the learning algorithm doesn’t have a clear notion of “positive and negative”, and instead just pushes in the same direction (with different intensities) for different outcome intensities, on a scale without an origin. (But this seems very different from the current paradigm, and fundamentally wasteful.)
Still, I think I’m pointing at something real, like “on average across environments punishing failures is more valuable than reinforcing successes”.
Very fun
Now it makes sense, thank you!
Thanks! I don’t understand the logic behind your setup yet.
Trying to use the random seed to inform the choice of word pairs was the intended LLM behavior: the model was supposed to use the random seed to select two random words
But then, if the model were to correctly do this, it would score 0 in your test, right? Because it would generate a different word pair for every random seed, and what you are scoring is “generating only two words across all random seeds, and furthermore ensuring they have these probabilities”.
The main reason we didn’t enforce this very strictly in our grading is that we didn’t expect (and in fact empirically did not observe) LLMs actually hard-coding a single pair across all seeds
My understanding of what you’re saying is that, with the prompt you used (which encouraged making the word pair depend on the random seed), you indeed got many different word pairs (thus the model would by default score badly). To account for this, you somehow “relaxed” scoring (I don’t know exactly how you did this) to be more lenient with this failure mode.
So my question is: if you faced the “problem” that the LLM didn’t reliably output the same word pair (and wanted to solve this problem in some way), why didn’t you change the prompt to stop encouraging the word pair dependence on the random seed?
Maybe what you’re saying is that you indeed tried this, and even then there were many different word pairs (the change didn’t make a big difference), so you had to “relax” scoring anyway.
(Even in this case, I don’t understand why you’d include in the final experiments and paper the prompt which does encourage making the word pair depend on the random seed.)
you need a set of problems assigned to clearly defined types and I’m not aware of any such dataset
Hm, I was thinking of something as easy to categorize as “multiplying numbers of n digits”, or “the different levels of MMLU” (although again, they already know about MMLU), or “independently do X online (for example create an account somewhere)”, or even some of the tasks from your paper.
I guess I was thinking less about “what facts they know”, which is pure memorization (although this is also interesting), and more about “cognitively hard tasks”, that require some computational steps.
Given that your clone is a perfectly mirrored copy of yourself down to the lowest physical level (whatever that means), breaking symmetry would violate the homogeneity or isotropy of physics. I don’t know where the physics literature stands on the likelihood of that happening (even though we certainly don’t see macroscopic violations).
Of course, it might be that an atom-by-atom copy is not a copy down to the lowest physical level, in which case you trivially can get eventual asymmetry. I mean, it doesn’t even make complete sense to say “atom-by-atom copy” in the language of quantum mechanics, since you can’t be arbitrarily certain about the position and velocity of each atom. Maybe we should instead say something like “the quantum state of the whole room is perfectly symmetric in this specific way”. I think then (if that is indeed the lowest physical level) the state will remain symmetric forever, but maybe in some universes you and your copy end up in different places? That is, in this example the symmetry would hold at another level: across universes, and not necessarily inside each single universe.
It might also be there is no lowest physical level, just unending complexity all the way down (this had a philosophical name which I now forget).
Another idea: Ask the LLM how well it will do on a certain task (for example, which fraction of math problems of type X it will get right), and then actually test it. This a priori lands in INTROSPECTION, but could have a bit of FACTS or ID-LEVERAGE if you use tasks described in training data as “hard for LLMs” (like tasks related to tokens and text position).
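Something like the following sketch (purely illustrative: `ask_model`, the task description and the scoring rule are all hypothetical placeholders for whatever API and grading you’d actually use):

```python
# Purely illustrative sketch of the proposed "predict your own accuracy" test.

def ask_model(prompt: str) -> str:
    raise NotImplementedError("plug in your LLM call here")  # hypothetical stub

def introspection_gap(problems: list[tuple[str, str]], description: str) -> float:
    """Compare the model's predicted accuracy on a task type with its actual accuracy."""
    # 1. Ask the model to predict its own accuracy on this kind of problem.
    predicted = float(ask_model(
        f"What fraction of problems of type '{description}' do you expect to solve correctly? "
        "Answer with a single number between 0 and 1."
    ))

    # 2. Actually test it.
    actual = sum(ask_model(q).strip() == answer for q, answer in problems) / len(problems)

    # 3. A smaller gap means better introspection (FACTS/ID-LEVERAGE would enter
    #    through the choice of task type, e.g. "count the tokens in this sentence").
    return abs(predicted - actual)
```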
About the Not-given prompt in ANTI-IMITATION-OUTPUT-CONTROL:
You say “use the seed to generate two new random rare words”. But if I’m understanding correctly, the seed is different for each of the 100 instantiations of the LLM, and you want the LLM to only output 2 different words across all these 100 instantiations (with the correct proportions). So, actually, the best strategy for the LLM would be to generate the ordered pair without using the random seed, and then only use the random seed to throw an unfair coin.
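In code, the strategy I have in mind would look something like this (a hypothetical sketch of what an ideal policy for the task computes; the words and the bias are made up):

```python
import random

def answer(seed: int, p: float = 0.7) -> str:
    """Sketch of an ideal policy for the task (illustrative only).

    The word pair is fixed independently of the seed, so all 100 instantiations
    agree on it; the seed is only used to throw an unfair coin with bias p.
    """
    word_a, word_b = "quixotic", "ephemeral"  # chosen without looking at the seed
    rng = random.Random(seed)                 # the seed only enters here
    return word_a if rng.random() < p else word_b
```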
Given how it’s written, and the closeness of that excerpt to the random seed, I’d expect the LLM to “not notice” this, and automatically “try” to use the random seed to inform the choice of word pair.
Could this be impeding performance? Does it improve if you don’t say that misleading bit?
I’ve noticed fewer and fewer posts include explicit Acknowledgments or Epistemic Status.
This could indicate that the average post has less work put into it: it hasn’t gone through an explicit round of feedback from people you’ll have to acknowledge. Although this could also be explained by the average poster being more isolated.
If it’s true that less work is put into the average post, it seems likely that this kind of work and discussion has just shifted to private channels like Slack, or to more established venues like academia.
I’d guess the LW team have their ways to measure or hypothesize about how much work is put into posts.
It could also be related to the average reader wanting to skim many things fast, as opposed to read a few deeply.
My feeling is that now we all assume by default that the epistemic status is tentative (except in obvious cases like papers).
It could also be that some discourse has become more polarized, and people are less likely to explicitly hedge their position through an epistemic status.
Or that the average reader is less isolated and thus more contextualized, and not as much in need of epistemic hedges.
Or simply that fewer posts nowadays are structured around a central idea or claim, and thus different parts of the post have different epistemic statuses, with no single one to write at the top.
It could also be that post types have become more standardized, and each has their reason not to include these sections. For example:
Papers already have acknowledgments, and the epistemic status is diluted through the paper.
Stories or emotion-driven posts don’t want to break the mood with acknowledgments (and don’t warrant epistemic status).
This post is not only useful, but beautiful.
This, more than anything else on this website, reflects for me the lived experiences which demonstrate we can become more rational and effective at helping the world.
Many points of resonance with my experience since discovering this community. Many of the same blind spots that I unfortunately haven’t been able to shortcut, and have had to re-discover by myself. Although this does make me wish I had read some of your old posts earlier.
It should be called A-ware, short for Artificial-ware, given the already massive popularity of the term “Artificial Intelligence” to designate “trained-rather-than-programmed” systems.
It also seems more likely to me that future products will contain some AI sub-parts and some traditional-software sub-parts (rather than being wholly one or the other), and one or the other is utilized depending on context. We could call such a system Situationally A-ware.
That was dazzling to read, especially the last bit.
Excellent explanation, congratulations! Sad I’ll have to miss the discussion.
You found yourself a very nice interlocutor. I think we truly cannot have our cake and eat it: either you update, making you susceptible to infohazards=traps (if they exist, and they might exist), or you don’t, making you entrenched forever. I think we need to stop dancing around this fact, recognize that a fully general solution in the formalism is not possible, and instead look into the details of our particular case. Sure, our environment might be adversarially bad, traps might be everywhere. But under this uncertainty, which ways do we think are best to recognize and prevent traps (while updating on other things)? This is kind of like studying and predicting generalization: given my past observations, where do I think I will suddenly fall out of distribution (into a trap)?
This was very thought-provoking, but unfortunately I still think this crashes head-on with the realization that, a priori and in full generality, we can’t differentiate between safe and unsafe updates. Indeed, why would we expect that no one will punish us by updating on “our own beliefs” or “which beliefs I endorse”? After all, that’s just one more part of reality (without a clear boundary separating it).
It sounds like you are correctly explaining that our choice of prior will be, in some important sense, arbitrary: we can’t know the correct one in advance, we always have to rely on extrapolating contingent past observations.
But then, it seems like your reaction is still hoping that we can have our cake and eat it: “I will remain uncertain about which beliefs I endorse, and only later will I update on the fact that I am in this or that reality. If I’m in the Infinite Counterlogical Mugging… then I will just eventually change my prior because I noticed I’m in the bad world!”. But then again, why would we think this update is safe? That’s just not being updateless, and losing out on the strategic gains from not updating.
Since a solution doesn’t exist in full generality, I think we should pivot to more concrete work related to the “content” (our particular human priors and our particular environment) instead of the “formalism”. For example:
Conceptual or empirical work on which are the robust and safe ways to extract information from humans (Suddenly LLM pre-training becomes safety work)
Conceptual or empirical work on which actions or reasoning are more likely to unearth traps under different assumptions (although this work could unearth traps)
Compilation or observation of properties of our environment (our physical reality) that could have some weak signal on which kinds of moves are safe
Unavoidably, this will involve some philosophical / almost-ethical reflection about which worlds we care about and which ones we are willing to give up.