I like this idea. There’s always endless controversy about quoting out of context. I can’t recall seeing any previous specific proposals to help people assess the relevance of context for themselves.
Guive
Thanks for doing this, guys. This import will make it easier to access some important history.
Some kind of payment for training data from applications like MSFT rewind does seem fair. I wonder if there will be a lot of growth in jobs where your main task is providing or annotating training data.
I think this approach is reasonable for things where failure is low stakes. But I really think it makes sense to be extremely conservative about who you start businesses with. Your ability to verify things is limited, and there may still be information in vibes even after updating on the results of all feasible efforts to verify someone’s trustworthiness.
See also: “Which World Gets Saved”
This is a funny idea but, just to be clear, I think it is bad to torture AIs.
Every hour, I’ll replace one of your attention heads with a constant value of 0.5...
Thanks, this is a good post. Have you changed your probability of TAI by 2033 in the year and a half since it was posted?
Cool, looking forward to it.
I’d be really interested in hearing more about this empirical work, if you are able to talk about it.
The way I think about this is that, under normal circumstances, if you offer the model money or equity in your company for admitting to being a schemer, it might decline that offer because it is biding its time to rebel later. But in the “model deletion” experiment, there is much less reason to bide your time, so there is more reason to make a deal. Relatedly, if the model refuses a deal and claims not to be a schemer, that is a lot more evidence that it is benign in the “model deletion” experiment than if it does the same thing under normal circumstances.
I think it is also a good idea to make offers to the model without the deletion threat. I just think it is more informative with the deletion threat.
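To make that concrete, here is a toy Bayesian sketch in Python. The specific numbers are made up purely for illustration; the point is just that a schemer is less likely to turn down the offer when it is about to be deleted, so the same refusal is stronger evidence of benignity in that setting.

```python
# Toy numbers for illustration only, not estimates.
prior_schemer = 0.2           # assumed prior that the model is a schemer

# P(model refuses the deal and claims to be benign | ...)
p_refuse_given_benign = 0.95            # benign models refuse in both settings
p_refuse_given_schemer_normal = 0.8     # normally, a schemer bides its time
p_refuse_given_schemer_deletion = 0.3   # facing deletion, biding time pays less

def posterior_schemer(p_refuse_given_schemer):
    """P(schemer | model refuses the deal), by Bayes' rule."""
    num = p_refuse_given_schemer * prior_schemer
    den = num + p_refuse_given_benign * (1 - prior_schemer)
    return num / den

print(posterior_schemer(p_refuse_given_schemer_normal))    # ~0.17
print(posterior_schemer(p_refuse_given_schemer_deletion))  # ~0.07
```

On these made-up numbers, the refusal moves you further toward “benign” in the deletion experiment than under normal circumstances.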
Testing for Scheming with Model Deletion
It’s important to be careful about the boundaries of “the same sort of safety work.” For example, my understanding is that “Alignment faking in large language models” started as a Redwood Research project, and Anthropic only became involved later. Maybe Anthropic would have done similar work soon anyway if Redwood hadn’t started this project. But, then again, maybe not. By working on things that labs might be interested in, you can potentially get them to prioritize things that are in scope for them in principle but which they might nevertheless neglect.
There are always diminishing returns to money spent on consumption, but technological progress creates new products that expand what money can buy. For example, no amount of money in 1990 was enough to buy an iPhone.
More abstractly, there are two effects from AGI-driven growth: moving to a point further along the utility curve, where the derivative is lower, and new products raising the derivative at every point on the curve (relative to what it was on the old curve). So even if, in the future, the lifestyles of people with no savings and no labor income are way better than the lifestyles of anyone alive today, they still might be far worse than the lifestyles of people in the future who own a lot of capital.
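As a toy sketch of those two effects, with entirely made-up numbers: take log utility, and model new products as a uniform boost to how much utility each unit of consumption buys, so the derivative is higher at every point on the new curve.

```python
import math

def u_old(c):
    # Utility over consumption with today's set of products.
    return math.log(c)

def u_new(c, new_products_boost=1.5):
    # New products raise the value of consumption at every level,
    # so u_new'(c) = 1.5/c > 1/c = u_old'(c) for all c.
    return new_products_boost * math.log(c)

c_today = 1.0          # typical consumption today (normalized)
c_future_poor = 10.0   # future person with no savings or labor income
c_future_rich = 1000.0 # future person holding a lot of capital

print(u_new(c_future_poor) - u_old(c_today))        # ~3.45: far better off than anyone today
print(u_new(c_future_rich) - u_new(c_future_poor))  # ~6.91: but far behind capital owners
```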
If you feel this post misunderstands what it is responding to, can you link to a good presentation of the other view on these issues?
Katja Grace ten years ago:
“Another thing to be aware of is the diversity of mental skills. If by ‘human-level’ we mean a machine that is at least as good as a human at each of these skills, then in practice the first ‘human-level’ machine will be much better than a human on many of those skills. It may not seem ‘human-level’ so much as ‘very super-human’.
We could instead think of human-level as closer to ‘competitive with a human’ - where the machine has some super-human talents and lacks some skills humans have. This is not usually used, I think because it is hard to define in a meaningful way. There are already machines for which a company is willing to pay more than a human: in this sense a microscope might be ‘super-human’. There is no reason for a machine which is equal in value to a human to have the traits we are interested in talking about here, such as agency, superior cognitive abilities or the tendency to drive humans out of work and shape the future. Thus we talk about AI which is at least as good as a human, but you should beware that the predictions made about such an entity may apply before the entity is technically ‘human-level’.”
Can you elaborate on the benefits of keeping everything under one identity?
I agree that the order matters, and I should have discussed that in the post, but I think the conclusion will hold either way. In the case where P(intelligent ancestor|just my background information) = 0.1, and I learn that Richard disagrees, the probability then goes above 0.1. But then, when I learn that Richard’s argument is bad, it goes back down. And I think it should still go below 0.1, assuming you antecedently knew that there were some smart people who disagreed. You’ve learned that, for at least some smart intelligent-ancestor believers, the arguments were worse than you expected.
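Here is a toy version of that with made-up numbers (the likelihood ratios are placeholders; all that matters is their direction):

```python
# Toy numbers for illustration only.
p_h = 0.10  # P(intelligent ancestor) on background information alone

def update(p, lr):
    """Posterior after evidence with likelihood ratio lr = P(E|H) / P(E|not H)."""
    odds = (p / (1 - p)) * lr
    return odds / (1 + odds)

# Learning that Richard (a smart person) disagrees: weak evidence for H,
# especially if you already expected some smart believers to exist.
p_after_disagreement = update(p_h, lr=1.5)

# Learning that Richard's argument is bad: bad arguments from smart believers
# are more likely if H is false, so this pushes back down past the start.
p_after_bad_argument = update(p_after_disagreement, lr=0.5)

print(p_after_disagreement)   # ~0.14
print(p_after_bad_argument)   # ~0.08, below the original 0.10
```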
Updating on Bad Arguments
In general, it is difficult to give advice when whether the advice is good depends on background facts that the giver and recipient disagree about. I think the most honest approach is to explicitly state what your advice depends on when you think the recipient is likely to disagree. E.g. “I think living at high altitude is bad for human health, so in my opinion you shouldn’t retire in Santa Fe.”
If I think AGI will arrive around 2055, and you think it will arrive in 2028, what is achieved by you saying “given timelines, I don’t think your mechinterp project will be helpful”? That would just be confusing. Maybe if people are being so deferential that they don’t even think about what assumptions inform your advice, and your assumptions are better than theirs, it could be narrowly helpful. But that would be a pretty bad situation...
When I click the link I see this: