Compute is used in two ways: to run a single big experiment quickly, and to run many experiments in parallel.
95% of progress comes from the ability to run big experiments quickly. Running many experiments in parallel is far less useful. In the old days, a large cluster could help you run more experiments, but it could not help you run a single large experiment quickly.
For this reason, an academic lab could compete with Google, because Google’s only advantage was the ability to run many experiments. This is not a great advantage.
Recently, it has become possible to combine 100s of GPUs and 100s of CPUs to run an experiment that’s 100x bigger than what is possible on a single machine while requiring comparable time. This has become possible due to the work of many different groups. As a result, the minimum necessary cluster for being competitive is now 10–100x larger than it was before.
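The mechanism behind this kind of scaling is typically data parallelism: shard a large batch across workers, compute gradients locally, and average them (an all-reduce), which yields the same update as one giant machine processing the full batch. A minimal single-process sketch of that idea, with simulated workers (all names and the toy least-squares problem are illustrative, not from the text above):

```python
import numpy as np

def grad(w, X, y):
    # Gradient of the mean squared error 0.5 * mean((Xw - y)^2) w.r.t. w.
    return X.T @ (X @ w - y) / len(y)

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
w_true = rng.normal(size=5)
y = X @ w_true
w = np.zeros(5)

# Single-machine gradient on the full batch.
g_single = grad(w, X, y)

# "Data parallel": shard the batch across 4 simulated workers,
# compute local gradients, then average them (the all-reduce step).
shards = np.array_split(np.arange(len(y)), 4)
g_workers = [grad(w, X[idx], y[idx]) for idx in shards]
g_avg = np.mean(g_workers, axis=0)

# With equal-sized shards, the averaged gradient matches the
# full-batch gradient, so each step can be split across machines.
print(np.allclose(g_single, g_avg))
```

Because the averaged gradient equals the full-batch gradient, adding workers shrinks per-step wall time without changing the experiment being run, which is what makes one 100x-bigger experiment feasible in comparable time.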
I think Ilya said almost exactly this in his NeurIPS talk yesterday. Sounds like “doing the biggest experiments fast is more important than doing many experiments in parallel” is a core piece of wisdom to him.
Seems like there was a misunderstanding here: Musk meant "the AI community is taking a long time to figure stuff out", but Ilya took him to mean "figure out how to make AI that can think in terms of concepts".