I do alignment research at the Alignment Research Center. Learn more about me at markxu.com/about
Mark Xu
I think I expect Earth in this case to just say no and not sell the sun? But I was confused at like 2 points in your paragraph so I don’t think I understand what you’re saying that well. I also think we’re probably on mostly the same page, and am not that interested in hashing out further potential disagreements.
Also, mostly unrelated, maybe a hot take, but if you’re able to get outcompeted because you don’t upload, then the future you’re in is not very good.
Cool. I misinterpreted your previous comment and think we’re basically on the same page.
I think the majority of humans probably won’t want to be uploads, leave the solar system permanently, etc. Maybe this is where we disagree? I don’t really think there’s going to be a thing that most people care about more.
I don’t think that’s a very good analogy, but I will say that it is basically true for the Amish. And I do think that we should respect their preferences. (I separately think cars are not that good, and that people would in fact prefer to bicycle around or ride horse-drawn carriages or whatever if civilization was conducive to that, although that’s kinda beside the point.)
I’m not arguing that we should be conservative about changing the sun. I’m just claiming that people like the sun and won’t want to see it eaten/fundamentally transformed, and that we should respect this preference. This is one reason why it’s different from candles → lightbulbs: people very obviously wanted lightbulbs when offered. But I don’t think the marginal increase in well-being from eating the sun will be nearly enough to balance against the desire that the sun remain the same, so I don’t think most people will on net want the sun to be eaten. To be clear, this is an empirical claim about what people want that might very well be false.
I am claiming that people, when informed, will want the sun to continue being the sun. I also think that most people when informed will not really care that much about creating new people, will continue to believe in the act-omission distinction, etc. And that this is a coherent view that will add up to a large set of people wanting things in the solar system to remain conservatively the same. I separately claim that if this is true, then other people should just respect this preference, and use the other stars that people don’t care about for energy.
But most people on Earth don’t want “an artificial system to light the Earth in such a way as to mimic the sun”, they want the actual sun to go on existing.
This is in part the reasoning used by Judge Kaplan:
Kaplan himself said on Thursday that he decided on his sentence in part to make sure that Bankman-Fried cannot harm other people going forward. “There is a risk that this man will be in a position to do something very bad in the future,” he said. “In part, my sentence will be for the purpose of disabling him, to the extent that can appropriately be done, for a significant period of time.”
from https://time.com/6961068/sam-bankman-fried-prison-sentence/
It’s kind of strange that, from my perspective, these mistakes are very similar to the mistakes I think I made, and that I also see a lot of other people making. Perhaps one “must” spend too long doing abstract slippery stuff to really understand why it doesn’t work that well?
I know what the word means, I just think in typical cases people should be saying a lot more about why something is undignified, because I don’t think people’s senses of dignity typically overlap that much, especially if the reader doesn’t typically read LW. In these cases I think permitting the use of the word “undignified” prevents specificity.
“Undignified” is really vague
I sometimes see/hear people say that “X would be really undignified”. I mostly don’t really know what this means? I think it means “if I told someone that I did X, I would feel a bit embarrassed.” It’s not really an argument against X. It’s not dissimilar to saying “the vibes are off with X”.
Not saying you should never say it, but basically every use I see could/should be replaced with something more specific.
Yeah I didn’t really use good words. I mean something more like “make your identity fit yourself better”, which often involves making it smaller by removing false beliefs about constraints, but also involves making it larger in some ways, e.g. uncovering new passions.
I was intending to warn about the possibility of future perception of corruption, e.g. after a non-existential AI catastrophe. I do not think anyone currently working at safety teams is perceived as that “corrupted”, although I do think there is mild negative sentiment among some online communities (some parts of twitter, reddit, etc.).
Basically (2), very small amounts of (1) (perhaps qualitatively similar to the amount of (1) you would apply to e.g. people joining US AISI or UK AISI)
AI safety researchers might be allocated too heavily to Anthropic compared to Google DeepMind
Some considerations:
Safety researchers should want Google DeepMind (GDM) to have a robust and flourishing safety department. It seems plausible that GDM will be able to create “the smartest” models: they have lots of talent, and own lots of computers. (see e.g. https://epochai.org/data/notable-ai-models#computing-capacity)
Anthropic (ANT) might run into trouble in the future due to not owning their own computers, e.g. if Amazon (or wherever they’re renting their computers from) starts their own internal scaling competitor, and decides to stop renting out most of their compute.
ANT has a stronger safety culture, and so it is a more pleasant experience to work at ANT for the average safety researcher. This suggests that there might be a systematic bias towards ANT that pulls away from the “optimal allocation”.
GDM only recently started a Bay Area-based safety research team/lab (with members like Alex Turner). So if people had previously decided to work for ANT based on location, they now have the opportunity to work for GDM without relocating.
I’ve heard that many safety researchers join ANT without considering working for GDM, which seems like an error, although I don’t have first-hand evidence that this is true.
ANT vs GDM is probably a less important consideration than “scaling lab” (ANT, OAI, GDM, XAI, etc.) vs “non scaling lab” (USAISI, UKAISI, Redwood, ARC, Palisade, METR, MATS, etc. (so many...)). I would advise people to think hard about how joining a scaling lab might inhibit their future careers by e.g. creating a perception they are “corrupted” [edit: I mean viewed as corrupted by the broader world in situations where e.g. there is a non-existential AI disaster or there is rising dislike of the way AI is being handled by corporations more broadly, e.g. similar to how working for an oil company might result in various climate people thinking you’re corrupted, even if you were trying to get the oil company to reduce emissions, etc. I personally do not think GDM or ANT safety people are “corrupted”] (in addition to strengthening them, which I expect people to spend more time thinking about by default).
Because ANT has a stronger safety culture, doing safety at GDM involves more politics and navigating around bureaucracy, and thus might be less productive. This consideration applies most if you think the impact of your work is mostly through the object-level research you do, which I think is possible but not that plausible.
(Thanks to Neel Nanda for inspiring this post, and Ryan Greenblatt for comments.)
idk how much value that adds over this shortform, and I currently find AI prose a bit nauseating.
Hilariously, it seems likely that our disagreement is even more meta, on the question of “how do you know when you have enough information to know”, or potentially even higher, e.g. “how much uncertainty should one have given that they think they know”, etc.
I think I disagree with your model of importance. If your goal is to make a sum of numbers small, then you want to focus your efforts where the derivative with respect to your effort is steepest, i.e. where a unit of effort buys the largest reduction, not where the absolute magnitude is highest.
The “epsilon fallacy” can be committed in both directions: both in thinking that anything with a negative derivative is worth working on, and in thinking that any extremely large number is worth taking a chance on trying to improve.
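As a toy numeric illustration (my own made-up example and numbers, not from the original exchange): suppose the total badness is a sum of two terms, one large but nearly immovable and one small but cheap to reduce. Allocating effort by absolute size picks the wrong term; allocating by marginal reduction does not.

```python
# Toy sketch: spend one unit of effort to make a sum of two "problem sizes"
# as small as possible. Numbers are invented purely for illustration.
terms = {"big_but_stuck": 100.0, "small_but_tractable": 5.0}

# Reduction in each term per unit of effort (the derivative w.r.t. effort).
marginal_reduction = {"big_but_stuck": 0.1, "small_but_tractable": 2.0}

# To minimize the sum, spend the next unit of effort where the marginal
# reduction is largest, not where the term itself is largest.
best = max(marginal_reduction, key=marginal_reduction.get)
print(best)  # -> small_but_tractable, even though it is the smaller term
```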
I also separately think that “bottleneck” is not generally a good term to apply to a complex project with high amounts of technical and philosophical uncertainty. The ability to see a “bottleneck” is very valuable should one exist, but I am skeptical of the ability to strongly predict where such bottlenecks will be in advance, and do not think the historical record really supports the ability to find such bottlenecks reliably by “thinking”, as opposed to doing a lot of stuff, including trying things and seeing what works. If you have a broad distribution over where a bottleneck might be, then all activities lend value by “derisking” locations for particular bottlenecks if they succeed, and by providing more evidence that a bottleneck is in a particular location if they fail. (kinda like: https://en.wikipedia.org/wiki/Swiss_cheese_model) For instance, I think of “deceptive alignment” as a possible way to get pessimal generalization, and thus a probabilistic “bottleneck” to various alignment approaches. But there are other ways things can fail, and so one can still lend value by solving non-deceptive-alignment-related problems (although my day job consists of trying to get “benign generalization” out of ML, and thus does in fact address that particular bottleneck imo).
I also separately think that if someone thinks they have identified a bottleneck, they should try to go resolve it as best they can. I think of that as what you (John) are doing, and fully support such activities, although I think I am unlikely to join your particular project. I think the questions you are trying to answer are very interesting ones, and the “natural latents” approach seems likely to shed at least some light on what’s going on with e.g. the ability of agents to communicate at all.
Related to the claim that “all models are meta-models”, in that they are objects capable of e.g. evaluating how applicable they are for making a given prediction. E.g. “newtonian mechanics” also carries along with it information about how, if things are moving too fast, you need to add more noise to its predictions, i.e. it’s less true/applicable/etc.
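A minimal sketch of what this could look like, with invented names and thresholds (my own illustration, not anything from the comment): a model object that both makes predictions and reports how applicable its own assumptions are to a given input.

```python
import math
from dataclasses import dataclass

@dataclass
class NewtonianProjectile:
    """A 'meta-model' sketch: predicts projectile range, and also reports
    how applicable its own assumptions are to the input."""
    g: float = 9.81  # m/s^2

    def predict_range(self, speed: float, angle_rad: float) -> float:
        # Ideal ballistic range on flat ground, ignoring air resistance.
        return speed ** 2 * math.sin(2 * angle_rad) / self.g

    def applicability(self, speed: float) -> str:
        # The model carries information about its own domain of validity:
        # outside it, add more "noise" to (i.e. trust less) its predictions.
        if speed > 3e7:   # a non-trivial fraction of the speed of light (made-up cutoff)
            return "low: relativistic effects matter"
        if speed > 300:   # rough regime where neglected drag is a big error (made-up cutoff)
            return "medium: ignored air resistance is significant"
        return "high"

model = NewtonianProjectile()
print(model.predict_range(30.0, math.pi / 4))  # ~91.7 m
print(model.applicability(30.0))               # "high"
print(model.applicability(1e8))                # "low: relativistic effects matter"
```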
I will try to more directly express the positive intuition for why all of this seems possible to me, that is, why I think such a loss function over heuristic arguments that makes all the correct tradeoffs should exist.
Consider the process of SGD as a process of Bayesian model selection. We start with some prior over the possible weights of a model in some GPT architecture, then we update based on a series of data, and in the end we get some model. We might similarly then have a bunch of objections to how such a model selection process could ever learn the data, e.g. that we don’t have enough parameters to memorize every fact like “apples fall down”, “pears fall down”, etc., so how will the model know when to try to compress these facts into an underlying theory? And for other things, like Barack Obama, how will the model learn to memorize that fact, but not the facts about fruits falling down? How is it possible to have a loss function that treats “Obama follows Barack” as an axiom, but “apples fall down” as a fact derived from some more general beliefs about gravity?
The answer is, of course, that we don’t really need to deal with any of that and we can just make loss go down, and if we’ve set up the learning problem correctly then SGD will magically do all these tradeoffs for us.
In the heuristic argument frame, the hope is thus less “we will find a loss function that somehow does all these tradeoffs in a way that magically works” and more that we can find some loss function over heuristic arguments that does the same thing as what SGD is in some sense already doing to find a model that somehow compresses common crawl so well. That is, our loss function only needs to learn to treat “Obama follows Barack” as axiomatic insofar as SGD learns to treat “Obama follows Barack” as axiomatic.
And the hope is that if we do this correctly, then we can identify deceptive alignment, because deceptive alignment is defined to be your model intentionally deceiving, and thus “model acts deceptively” is not, from the perspective of the model/SGD, an axiomatic fact. So as long as our loss function over heuristic arguments is properly “parallel” to SGD, it will not learn to treat “model acts deceptively” as axiomatic (because it will only treat things as axiomatic insofar as the model/SGD treats them as axiomatic).
Another way of saying this is that SGD + architecture implicitly assigns some “probability” (probably not really in a way that is a distribution in any sense) to any fact F being “axiomatic”, and uses data to learn which facts are axiomatic vs. not, and so the heuristic argument machinery must assign the same “probability” that facts are axiomatic and do the same kind of learning.