One difference is that keeping AI a tool might be a temporary strategy until you can use the tool AI to solve whatever safety problems apply to non-tool AI. In that case the coordination problem isn’t as difficult, because you might just need to get the smallish pool of leading actors to coordinate for a while, rather than everyone to coordinate indefinitely.
I now suspect that there is a pretty real and non-vacuous sense in which deep learning is approximated Solomonoff induction.
Even granting that, do you think the same applies to the cognition of an AI created using deep learning—is it approximating Solomonoff induction when presented with a new problem at inference time?
I think it’s not, for reasons like the ones in aysja’s comment.
Agreed, this only matters in the regime where some but not all of your ideas will work. But even in alignment-is-easy worlds, I doubt literally everything will work, so testing would still be helpful.
I wrote it out as a post here.
I think it’s downstream of the spread of hypotheses discussed in this post, such that we can make faster progress on it once we’ve made progress eliminating hypotheses from this list.
Fair enough, yeah—this seems like a very reasonable angle of attack.
It seems to me that the consequentialist vs virtue-driven axis is mostly orthogonal to the hypotheses here.
As written, aren’t Hypothesis 1: Written goal specification, Hypothesis 2: Developer-intended goals, and Hypothesis 3: Unintended version of written goals and/or human intentions all compatible with either kind of AI?
Hypothesis 4: Reward/reinforcement does assume a consequentialist, and so does Hypothesis 5: Proxies and/or instrumentally convergent goals as written, although it seems like ‘proxy virtues’ could maybe be a thing too?
(Unrelatedly, it’s not that natural to me to group proxy goals with instrumentally convergent goals, but maybe I’m missing something).
Maybe I shouldn’t have used “Goals” as the term of art for this post, but rather “Traits?” or “Principles?” Or “Virtues.”
I probably wouldn’t prefer any of those to goals. I might use “Motivations”, but I also think it’s ok to use goals in this broader way and “consequentialist goals” when you want to make the distinction.
One thing that might be missing from this analysis is explicitly thinking about whether the AI is likely to be driven by consequentialist goals.
In this post you use ‘goals’ in quite a broad way, so as to include stuff like virtues (e.g. “always be honest”). But we might want to carefully distinguish scenarios in which the AI is primarily motivated by consequentialist goals from ones where it’s motivated primarily by things like virtues, habits, or rules.
This would be the most important axis to hypothesise about if it were the case that instrumental convergence applies to consequentialist goals but not to things like virtues. Like, I think it’s plausible that
(i) if you get an AI with a slightly wrong consequentialist goal (e.g. “maximise everyone’s schmellbeing”) then you get paperclipped because of instrumental convergence,
(ii) if you get an AI that tries to embody a slightly wrong virtue (e.g. “always be schmonest”) then it’s badly dysfunctional but doesn’t directly entail a disaster.
And if that’s correct, then we should care about the question “Will the AI’s goals be consequentialist ones?” more than most questions about them.
You know you’re feeling the AGI when a compelling answer to “What’s the best argument for very short AI timelines?” lengthens your timelines
Interesting. My handwavey rationalisation for this would be something like:
there’s some circuitry in the model which is responsible for checking whether a trigger is present and activating the triggered behaviour
for simple triggers, the circuitry is very inactive in the absence of the trigger, so it’s unaffected by normal training
for complex triggers, the circuitry is much more active by default, because it has to do more work to evaluate whether the trigger is present, so it’s more affected by normal training
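To gesture at the intuition with a toy sketch (entirely made up by me, not anything from the actual backdoor setups being discussed): a ReLU detector unit that never activates on normal inputs receives zero gradient from normal training, while a detector that is partially active on normal inputs gets pushed around.

```python
# Toy sketch: compare the gradient that "normal" (trigger-free) data sends into
# a detector unit that is silent on normal inputs vs one that is partially active.
import torch

torch.manual_seed(0)
x = torch.randn(1000, 10)  # "normal" inputs, no trigger present

# Simple-trigger detector: large negative bias, so it essentially never fires on normal data.
w_simple = torch.randn(10, requires_grad=True)
b_simple = torch.tensor(-20.0, requires_grad=True)

# Complex-trigger detector: has to aggregate weak evidence, so it fires
# part of the time on normal data too.
w_complex = torch.randn(10, requires_grad=True)
b_complex = torch.tensor(0.0, requires_grad=True)

def total_activation(w, b):
    return torch.relu(x @ w + b).sum()

# Pretend both detectors feed into some loss computed on normal data only.
loss = total_activation(w_simple, b_simple) + total_activation(w_complex, b_complex)
loss.backward()

print(w_simple.grad.norm())   # ~0: the silent detector is untouched by normal training
print(w_complex.grad.norm())  # clearly nonzero: the active detector gets modified
```

The rough idea being that how much the trigger-checking circuitry overlaps with what’s active on normal data determines how much normal training disturbs it.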
I agree that it can be possible to turn such a system into an agent. I think the original comment is defending a stronger claim that there’s a sort of no free lunch theorem: either you don’t act on the outputs of the oracle at all, or it’s just as much of an agent as any other system.
I think the stronger claim is clearly not true. The worrying thing about powerful agents is that their outputs are selected to cause certain outcomes, even if you try to prevent those outcomes. So depending on the actions you’re going to take in response to its outputs, an agent’s outputs have to be different. But the point of an oracle is to not have that property: its outputs are decided by a criterion (something like truth) that is independent of the actions you’re going to take in response[1]. So if you respond differently to the outputs, they cause different outcomes. Assuming you’ve succeeded at building the oracle to specification, it’s clearly not the case that the oracle has the worrying property of agents just because you act on its outputs.
I don’t disagree that by either hooking the oracle up in a scaffolded feedback loop with the environment, or getting it to output plans, you could extract more agency from it. Of the two, I think the scaffolding can in principle easily produce dangerous agency in the same way long-horizon RL can, but that the version where you get it to output a plan is much less worrying (I can argue for that in a separate comment if you like).
[1] I’m ignoring the self-fulfilling prophecy case here.
“It seems silly to choose your values and behaviors and preferences just because they’re arbitrarily connected to your social group.”
If you think this way, then you’re already on the outside.
I don’t think this is true — your average person would agree with the quote (if asked) and deny that it applies to them.
Finetuning generalises a lot but not to removing backdoors?
Seems like we don’t really disagree.
The arguments in the paper are representative of Yoshua’s views rather than mine, so I won’t directly argue for them, but I’ll give my own version of the case against
the distinctions drawn here between RL and the science AI all break down at high levels.
It seems like common sense to me that you are more likely to create a dangerous agent the more outcome-based your training signal is, the longer the time horizon those outcomes are measured over, the tighter the feedback loop between the system and the world, and the more of the world lies between the model you’re training and the outcomes being achieved.
At the top of the spectrum, you have systems trained based on things like the stock price of a company, taking many actions and receiving many observations per second, over years-long trajectories.
Many steps down from that you have RL training of current LLMs: outcome-based, but with shorter trajectories which are less tightly coupled with the outside world.
And at the bottom of the spectrum you have systems which are trained with an objective that depends directly on their outputs and not on the outcomes they cause, with the feedback not being propagated across time very far at all.
At the top of the spectrum, if you train a competent system it seems almost guaranteed that it’s a powerful agent. It’s a machine for pushing the world into certain configurations. But at the bottom of the spectrum it seems much less likely: its input-output behaviour wasn’t selected to be effective at causing certain outcomes.
Yes there are still ways you could create an agent through a training setup at the bottom of the spectrum (e.g. supervised learning on the outputs of a system at the top of the spectrum), but I don’t think they’re representative. And yes depending on what kind of a system it is you might be able to turn it into an agent using a bit of scaffolding, but if you have the choice not to, that’s an importantly different situation compared to the top of the spectrum.
And yes, it seems possible such setups lead to an agentic shoggoth completely by accident; we don’t understand enough to rule that out. But I don’t see how you end up judging the probability that we get a highly agentic system to be more or less the same wherever we are on the spectrum (if you do)? Or perhaps it’s just that you think the distinction is not being handled carefully in the paper?
Pre-training, finetuning and RL are all types of training. But sure, expand ‘train’ to ‘create’ in order to include anything else like scaffolding. The point is that it’s not what you do in response to the system’s outputs that matters; it’s what the system tries to do.
Seems mistaken to think that the way you use a model is what determines whether or not it’s an agent. It’s surely determined by how you train it?
(And notably the proposal here isn’t to train the model on the outcomes of experiments it proposes, in case that’s what you’re thinking.)
I roughly agree, but it seems very robustly established in practice that the training-validation distinction is better than just having a training objective, even though your argument mostly applies just as well to the standard ML setup.
You point out an important difference which is that our ‘validation metrics’ might be quite weak compared to most cases, but I still think it’s clearly much better to use some things for validation than training.
Like, I think there are things that are easy to train away but hard/slow to validate away (just like when training an image classifier you could in principle memorise the validation set, but it would take a ridiculous amount of hyperparameter optimisation).
One example might be if we have interp methods that measure correlates of scheming. Those would be incredibly easy to train away, and still possible to validate away, but enough harder that the ratio of non-schemers you end up with is higher than if you had trained against the signal, which wouldn’t affect the ratio at all.
A separate argument is that I think if you just do random search over training ideas, rejecting them if they don’t get a certain validation score, you actually don’t Goodhart at all. Might put that argument in a top-level post.
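To make the random-search version concrete, here’s a minimal sketch (every function name and number is hypothetical; the real validation metric could be interp-based, behavioural, whatever). The metric only ever gates which training runs you keep; no gradient or reward is ever computed from it.

```python
import random

THRESHOLD = 0.1  # hypothetical acceptance bar on the validation metric

def sample_training_setup():
    # Sample a candidate training recipe (technique, hyperparameters, data mix, ...).
    return {"technique": random.choice(["A", "B", "C"]), "lr": 10 ** random.uniform(-5, -3)}

def train_model(setup):
    # Placeholder for actually training a model with the given setup.
    return {"setup": setup}

def scheming_score(model):
    # Placeholder for a correlate of scheming; used only for validation,
    # never as a training signal.
    return random.random()

accepted = []
for _ in range(20):
    setup = sample_training_setup()
    model = train_model(setup)
    if scheming_score(model) < THRESHOLD:  # reject runs whose models fail validation
        accepted.append((setup, model))
```

The key property is that any pressure on the metric comes only from selection over a handful of whole runs, not from gradient descent directly against it.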
Yes, you will probably see early instrumentally convergent thinking. We have already observed a bunch of that. Do you train against it? I think that’s unlikely to get rid of it.
I’m not necessarily asserting that this solves the problem, but seems important to note that the obviously-superior alternative to training against it is validating against it. i.e., when you observe scheming you train a new model, ideally with different techniques that you reckon have their own chance of working.
However doomed you think training against the signal is, you should think validating against it is significantly less doomed, unless there’s some reason why well-established machine learning principles don’t apply here. Using something as a validation metric to iterate methods doesn’t cause overfitting at anything like the level of directly training on it.
EDIT: later in the thread you say that this “is in some sense approximately the only and central core of the alignment problem”. I’m wondering whether thinking about this validation vs training point might cause you a nontrivial update, then?
For some reason I’ve been muttering the phrase, “instrumental goals all the way up” to myself for about a year, so I’m glad somebody’s come up with an idea to attach it to.
Agree that pauses are a clearer line. But even if a pause and a tool-limit are both temporary, we should expect the full pause to have to last longer.