It will not meaningfully generalize beyond domains with easy verification.
I think most software engineering and mathematics problems (two key components of AI development) are easy to verify. I partly agree with your point that long-term agency doesn't seem to be improving, but I expect that we can build very competent software engineers with the current paradigms.
Once we have that, I expect AI progress to move noticeably faster. The problems you point out are real, but a faster pace of development might make them surmountable in the near term.
I'm equally scared of hackers taking over as I am of leaders. Even if you limit the people who have ultimate power over these AIs to a small and extremely trusted group, there will potentially be a much larger number of bad actors outside the lab with the capability to hack into it. A hacker who impersonates a trusted individual, or who secretly alters the model spec or training code, might be able to achieve an AI assistant coup just the same.
I'd also recommend requiring labs to have very strong cyber defenses as an additional mitigation. Maybe the auditing mitigation already covers this, but I think hackers could hide their tracks effectively enough that auditing alone wouldn't catch them.
This story obviously depends on how the cyber offence/defence balance goes, but it doesn’t seem implausible.