Tom Davidson

Karma: 1,007

Tom Davidson Aug 1, 2023, 4:04 PM
LW: 15 AF: 8
0
AF
on: Meta-level adversarial evaluation of oversight techniques might allow robust measurement of their adequacy
Linking to a post I wrote on a related topic, where I sketch a process (see diagram) for using this kind of red-teaming to iteratively improve your oversight process. (I’m more focussed on a scenario where you’re trying to offload as much as possible of the work of evaluating and improving your oversight process to AIs.)

Tom Davidson Aug 1, 2023, 3:03 PM
LW: 5 AF: 3
0
AF
in reply to: Rohin Shah’s comment on: The “no sandbagging on checkable tasks” hypothesis
I read “capable of X” as meaning something like “if the model was actively trying to do X then it would do X”. I.e., a misaligned model doesn’t reveal the vulnerability to humans during testing because it doesn’t want them to patch it, but then later it exploits that same vulnerability during deployment because it’s trying to hack the computer system.

Let’s use AI to harden human defenses against AI manipulation

Tom Davidson May 17, 2023, 11:33 PM

35 points

7 comments, 24 min read, LW link

Tom Davidson Mar 30, 2023, 12:50 AM
1 point
0
in reply to: Beeblebrox’s comment on: Which AI outputs should humans check for shenanigans, to avoid AI takeover? An overly simple abstract model
What tags are they?

Which AI outputs should humans check for shenanigans, to avoid AI takeover? A simple model

Tom Davidson Mar 27, 2023, 11:36 PM

16 points

3 comments, 8 min read, LW link

Tom Davidson Jan 23, 2023, 4:35 PM
4 points
0
in reply to: Not Relevant’s comment on: What a compute-centric framework says about AI takeoff speeds—draft report
I agree that the final tasks that humans do may look like “check that you understand and trust the work the AIs have done”, and that a lack of trust is a plausible bottleneck to full automation of AI research.

I don’t think the only way for humans at AI labs to get that trust is to automate alignment research, though that is one way. Human-conducted alignment research might lead them to trust AIs, or they might have a large amount of trust in the AIs’ work without believing they are aligned. E.g. they separate the workflow into lots of narrow tasks that can be done by a variety of non-agentic AIs that they don’t think pose a risk; or they set up a system of checks and balances (where different AIs check each other’s work and look for signs of deception) that they trust despite thinking certain AIs may be unaligned; or they do such extensive adversarial training that they’re confident that the AIs would never actually try to do anything deceptive in practice (perhaps because the AIs are paranoid that a seeming opportunity to trick humans is just a human-designed test of their alignment). TBC, I think “being confident that the AIs are aligned” is better and more likely than these alternative routes to trusting the work.

Also, when I’m forecasting AI capabilities, I’m forecasting AI that could readily automate 100% of AI R&D, not AI that actually does automate it. If trust were the only factor preventing full automation, that could count as AI that could readily automate 100%.

What a compute-centric framework says about AI takeoff speeds

Tom Davidson Jan 23, 2023, 4:02 AM

187 points

30 comments, 16 min read, LW link, 1 review

Tom Davidson May 25, 2022, 7:59 PM
LW: 4 AF: 4
1
AF
on: Review of Soft Takeoff Can Still Lead to DSA

But realistically not all projects will hoard all their ideas. Suppose instead that for the leading project, 10% of their new ideas are discovered in-house, and 90% come from publicly available discoveries accessible to all. Then, to continue the car analogy, it’s as if 90% of the lead car’s acceleration comes from a strong wind that blows on both cars equally. The lead of the first car/project will lengthen slightly when measured by distance/ideas, but shrink dramatically when measured by clock time.

The upshot is that we should return to that table of factors and add a big one to the left-hand column: Leads shorten automatically as general progress speeds up, so if the lead project produces only a small fraction of the general progress, maintaining a 3-year lead throughout a soft takeoff is (all else equal) almost as hard as growing a 3-year lead into a 30-year lead during the 20th century. In order to overcome this, the factors on the right would need to be very strong indeed.

But won’t “ability to get a DSA” be linked to the lead as measured in ideas rather than clock time?
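
The two ways of measuring a lead can be compared with a toy simulation. This is only a rough sketch under assumptions that are not in the quoted post: public progress grows by a fixed factor each year (`growth`), the follower receives only public ideas, the leader adds in-house ideas so that they make up `in_house_share` of its total, and the clock-time lead is approximated as the idea gap divided by the current public rate. All names and parameter values are hypothetical.

```python
def simulate_lead(years=10, public_rate0=1.0, growth=1.5,
                  in_house_share=0.10, initial_lead_years=3.0):
    """Return (year, lead_in_ideas, lead_in_years) for each simulated year.

    Assumptions (illustrative only): the follower gets just the public
    ideas, the leader gets the same public ideas plus in-house ideas,
    and public progress speeds up by `growth` every year.
    """
    public_rate = public_rate0
    # Start with a lead worth `initial_lead_years` of progress at today's rate.
    gap_in_ideas = initial_lead_years * public_rate0
    rows = []
    for year in range(years + 1):
        # Clock-time lead: how long the follower needs, at the current
        # public rate, to produce the ideas it is missing (approximation).
        rows.append((year, gap_in_ideas, gap_in_ideas / public_rate))
        # In-house ideas are `in_house_share` of the leader's total rate.
        in_house_rate = public_rate * in_house_share / (1.0 - in_house_share)
        gap_in_ideas += in_house_rate
        public_rate *= growth
    return rows


if __name__ == "__main__":
    for year, ideas, clock in simulate_lead():
        print(f"year {year:2d}: lead = {ideas:6.2f} ideas, {clock:5.2f} years")
```

Under these assumptions the lead grows when measured in ideas and shrinks when measured in years, which is the distinction the question above turns on.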

Tom Davidson May 3, 2022, 3:36 AM
1 point
in reply to: johnswentworth’s comment on: Everything I Need To Know About Takeoff Speeds I Learned From Air Conditioner Ratings On Amazon
Quick responses to your argument for (iii).
- If AI automates 50% of both alignment work and capabilities research, it could help with alignment before foom (while also bringing foom forward in time)
- A leading project might choose to use AIs for alignment rather than for fooming
- AI might be more useful for alignment work than for capabilities work
- fooming may require more compute than certain types of alignment work

Tom Davidson Apr 27, 2022, 3:46 PM
1 point
in reply to: johnswentworth’s comment on: Everything I Need To Know About Takeoff Speeds I Learned From Air Conditioner Ratings On Amazon
It sounds like the crux is whether having time with powerful (compared to today) but sub-AGI systems will make the time we have for alignment better spent. Does that sound right?

I’m thinking it will because i) you can better demonstrate AI alignment problems empirically to convince top AI researchers to prioritise safety work, ii) you can try out different alignment proposals and do other empirical work with powerful AIs, iii) you can try to leverage powerful AIs to help you do alignment research itself.

Whereas you think these things are so unlikely to help that getting more time with powerful AIs is strategically irrelevant.

Tom Davidson Apr 24, 2022, 11:35 PM
4 points
in reply to: johnswentworth’s comment on: Everything I Need To Know About Takeoff Speeds I Learned From Air Conditioner Ratings On Amazon

In order to argue that alignment is importantly easier in slow takeoff worlds, you need to argue that there do not exist fatal problems which will not be found given more time.

I need something weaker: just that we should put some probability on there not being fatal problems which will not be found given more time. (I.e., some probability that the extra time helps us find the last remaining fatal problems.)

And that seems reasonable. In your toy model there’s 100% chance that we’re doomed. Sure, in that case extra time doesn’t help. But in models where our actions can prevent doom, extra time typically will help. And I think we should be uncertain enough about the difficulty of the problem that we should put some probability on worlds where our actions can prevent doom. So we’ll end up concluding that more time does help.
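
The structure of this argument can be written as a small mixture model. The sketch below is purely illustrative: the two-world split, the exponential form, and the parameters `p_tractable` and `find_rate` are assumptions made here, not part of either comment.

```python
import math


def p_doom(extra_years, p_tractable, find_rate=0.2):
    """Overall probability of doom under a two-world mixture (toy model).

    Assumptions (illustrative only):
    - with probability 1 - p_tractable we are in a world where doom is
      certain whatever we do (the "100% doomed" toy model);
    - with probability p_tractable our actions matter, and the chance
      that a fatal problem is still unfound decays exponentially with
      extra time at rate `find_rate` per year.
    """
    doom_if_tractable = math.exp(-find_rate * extra_years)
    return (1.0 - p_tractable) * 1.0 + p_tractable * doom_if_tractable


if __name__ == "__main__":
    for p in (0.0, 0.1, 0.5):
        series = [round(p_doom(t, p), 3) for t in (0, 2, 5, 10)]
        print(f"p_tractable={p}: P(doom) at 0, 2, 5, 10 extra years = {series}")
```

With `p_tractable = 0` extra time changes nothing, but any positive weight on the tractable world makes the overall probability of doom fall as time is added.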

Tom Davidson Apr 20, 2022, 4:57 PM
4 points
on: Everything I Need To Know About Takeoff Speeds I Learned From Air Conditioner Ratings On Amazon

Corollary: alignment is not importantly easier in slow-takeoff worlds, at least not due to the ability to iterate. The hard parts of the alignment problem are the parts where it’s nonobvious that something is wrong. That’s true regardless of how fast takeoff speeds are.

This is the important part and it seems wrong.

Firstly, there’s going to be a community of people trying to find and fix the hard problems, and if they have longer to do that then they will be more likely to succeed.

Secondly, ‘nonobvious’ isn’t an all-or-nothing term. There can easily be problems which are nonobvious enough that you don’t notice them with weeks of adversarial training but you do notice them with months or years.

Limitations of Laplace’s rule of succession

Tom Davidson Apr 19, 2022, 12:27 AM

35 points

2 comments, 6 min read, LW link

Some thoughts on David Roodman’s GWP model and its relation to AI timelines

Tom Davidson Jul 19, 2021, 10:59 PM

32 points

1 comment, 8 min read, LW link
