Yeah, they both made up some stuff in response to the same question.
quetzal_rainbow
I’m so far not impressed with the Claude 4s. They try to make up superficially plausible stuff for my math questions as fast as possible. Sonnet 3.7, at least, explored a lot of genuinely interesting avenues before making an error. “Making up superficially plausible stuff” sounds like a good strategy for hacking not-very-robust verifiers.
I don’t know how obvious this is to people who want to try for the bounty, but I only now realized that you can express the criterion for redund as an inequality with mutual information, and I find mutual information much nicer to work with, if only for convenience of notation. Proof:
Let’s take the criterion for redund of $\Lambda$ w.r.t. $X_1$ (out of $X_1, X_2$):
$$D_{KL}\big(P[\Lambda, X_1, X_2]\,\big\|\,P[X_1, X_2]\,P[\Lambda|X_2]\big) \le \epsilon,$$
expand the expression for KL divergence:
$$\sum_{\lambda, x_1, x_2} P[\lambda, x_1, x_2]\,\log\frac{P[\lambda, x_1, x_2]}{P[x_1, x_2]\,P[\lambda|x_2]} \le \epsilon,$$
expand the joint distribution, $P[\lambda, x_1, x_2] = P[x_1, x_2]\,P[\lambda|x_1, x_2]$:
$$\sum_{\lambda, x_1, x_2} P[\lambda, x_1, x_2]\,\log\frac{P[x_1, x_2]\,P[\lambda|x_1, x_2]}{P[x_1, x_2]\,P[\lambda|x_2]} \le \epsilon,$$
simplify:
$$\sum_{\lambda, x_1, x_2} P[\lambda, x_1, x_2]\,\log\frac{P[\lambda|x_1, x_2]}{P[\lambda|x_2]} \le \epsilon,$$
which is a conditional mutual information:
$$D_{KL}\big(P[\Lambda, X_1, X_2]\,\big\|\,P[X_1, X_2]\,P[\Lambda|X_2]\big) = I(\Lambda; X_1 \mid X_2),$$
which results in:
$$I(\Lambda; X_1 \mid X_2) \le \epsilon.$$
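As a quick numerical sanity check of the identity (just a sketch: the random 3×3×3 joint distribution below is arbitrary and has nothing to do with the bounty setup), the KL divergence and the conditional mutual information should coincide:

```python
# Sanity check: D_KL( P(L,X1,X2) || P(X1,X2) P(L|X2) ) == I(L; X1 | X2)
import numpy as np

rng = np.random.default_rng(0)

p = rng.random((3, 3, 3))   # joint P(L, X1, X2) over three ternary variables
p /= p.sum()

p_x1x2 = p.sum(axis=0)              # P(X1, X2)
p_l_x2 = p.sum(axis=1)              # P(L, X2)
p_x2 = p.sum(axis=(0, 1))           # P(X2)
p_l_given_x2 = p_l_x2 / p_x2        # P(L | X2)

# Left-hand side: D_KL( P(L,X1,X2) || P(X1,X2) * P(L|X2) )
q = p_x1x2[None, :, :] * p_l_given_x2[:, None, :]
kl = float(np.sum(p * np.log(p / q)))

# Right-hand side: I(L; X1 | X2) from its standard definition
p_lx1_given_x2 = p / p_x2[None, None, :]        # P(L, X1 | X2)
p_x1_given_x2 = p_x1x2 / p_x2[None, :]          # P(X1 | X2)
cmi = float(np.sum(p * np.log(p_lx1_given_x2 /
                              (p_l_given_x2[:, None, :] * p_x1_given_x2[None, :, :]))))

print(kl, cmi)   # the two values match up to floating-point error
```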
how should you pick which reference class to use
You shouldn’t. This epistemic bath has no baby in it, and we should throw the water out of it.
It’s really sad that we still don’t have bookmarks for comments
No, the point is that AI x-risk is commonsensical. “If you drink much from a bottle marked poison, it is certain to disagree with you sooner or later,” even if you don’t know the poison’s mechanism of action. We don’t expect Newtonian mechanics to prove that hitting yourself with a brick is quite safe; if we found that Newtonian mechanics predicted hitting yourself with a brick to be safe, it would be strong evidence that Newtonian mechanics is wrong. Good theories usually support common intuitions.
The other thing here is an isolated demand for rigor: there is no “technical understanding of today’s deep learning systems” which predicts, say, the success of AGI labs, or that their final products are going to be safe.
“For a while” is usually, like, a day for me. Sometimes even hours. I don’t think that whatever damage other addictions inflict on cognitive function is that easy to reverse.
It doesn’t explain why I fully regain my ability to concentrate after abstaining for a while?
In my personal experience, the main reason why social media causes cognitive decline is fatigue. Evidence from personal experience: like many social media addicts, I struggle with maintaining concentration on books. If I stop using social media for a while, I regain the full ability to concentrate without drawbacks: in a sense, “I suddenly become capable of reading, in two days, the 1,000 pages of a book I had been trying to start for two months.”
The reason why social media is addictive to me, I think, is the following process:
Social media is entertaining;
It becomes cognitively taxing very quickly, such that I lose the stamina to do anything else;
I’m bored but incapable of doing anything else;
I’m stuck in social media.
“Social media causes fatigue” is also a useful frame to motivate myself not to use social media. “It’s going to have bad long-term consequences” doesn’t motivate me much; “you are going to be so, so tired afterwards” immediately triggers aversion.
almost all the other agents it expects to encounter are CDT agents
Given this particular setup (you both get each other’s source code and make decisions simultaneously, without any means to verify the counterparty’s choice until the outcomes happen), you shouldn’t self-modify into an extortionist, because CDT agents always defect: no amount of reasoning about source code can causally affect your decision, and D-D is the Nash equilibrium. CDT agents that expect with high probability to meet extortionists in the future can self-modify into a weird Son-of-CDT agent which gives in to extortion, but for this setup to work in any non-trivial way you should be at least EDT-ish.
But yes, the general principle here is “evaluate how much the other player’s decision procedure is logically influenced by my decision procedure, calculate expected value, act accordingly”. The same is true for the situation when you decide about self-modification.
For example, if you think that modifying into an extortionist is a good policy, it can lead to a situation where everyone is an extortionist and everybody nukes each other.
Let’s suppose that you look at the code of your counterparty and it says “I’ll Nuke you unless you Cooperate, in which case I Defect”; call it an “extortionist”. You have two hypotheses here:
Your counterparty deliberately modified its decision-making procedure in the hope of extorting more utility;
This decision-making procedure is a genuine result of some weird evolutionary/learning process.
If you can’t actually get any evidence in favor of either hypothesis, you go with your prior and do whatever is best from the standpoint of the UDT/FDT/LDT counterfactual operationalization. I.e., let’s suppose the payoffs are:
Cooperate:Cooperate − 10:10
Cooperate:Defect − 2:20
Defect:Cooperate − 20:2
Defect:Defect − 5:5
Nuke − −100:−100.
You are playing against an extortionist counterparty. The prior probability of an extortionist from hypothesis 2 is $x$. Extortionists from hypothesis 1 can perfectly predict your response in their decision to self-modify and cover their tracks. If they decide not to self-modify, they choose to cooperate conditional on your cooperation. Let’s call the policy “Nuke extortionists, cooperate with non-extortionists” $\pi_1$ and “Cooperate with both” $\pi_2$.
From here, your UDT-expected utility is:
$$EU(\pi_1) = x \cdot (-100) + (1-x) \cdot 10 = 10 - 110x,$$
$$EU(\pi_2) = x \cdot 2 + (1-x) \cdot 2 = 2.$$
Therefore, you should choose $\pi_1$ if
$$10 - 110x > 2,$$
i.e.
$$x < \frac{8}{110} \approx 7.27\%.$$
And 7.27% is a really high frequency of “natural” extortionists; I wouldn’t expect it to be this high.
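For concreteness, here is the same expected-utility comparison spelled out as a few lines of Python (a sketch; the variable names are mine, and the payoffs follow the list above):

```python
# Payoffs from the list: mutual cooperation = 10, giving in to a defector = 2, mutual nuke = -100.
CC, GIVE_IN, NUKE = 10, 2, -100

def eu_pi1(x: float) -> float:
    """pi_1 = "Nuke extortionists, cooperate with non-extortionists".
    Natural extortionists (probability x) get nuked; deliberate ones predict pi_1,
    don't self-modify, and cooperate, so with them you get mutual cooperation."""
    return x * NUKE + (1 - x) * CC

def eu_pi2(x: float) -> float:
    """pi_2 = "Cooperate with both": everyone ends up extorting you."""
    return GIVE_IN

# Break-even prior: 10 - 110x = 2  =>  x = 8/110 ≈ 7.27%.
threshold = (CC - GIVE_IN) / (CC - NUKE)
print(threshold)                      # 0.0727...
print(eu_pi1(0.05) > eu_pi2(0.05))    # True: below the threshold, nuking wins
print(eu_pi1(0.10) > eu_pi2(0.10))    # False: above it, giving in wins
```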
I think the difference between real bureaucracies and HCH is that real functioning bureaucracies need elements capable of saying “screw this arbitrary problem factorization, I’m doing what’s useful”, and the bosses of the bureaucracy need to be able to say “we all understand that otherwise the system wouldn’t be able to work”.
There is a conceptual path for interpretability to lead to reliability: you can understand a model in sufficient detail to know how it produces intelligence and then build another model out of the interpreted details. Obviously, it’s not something we can expect to happen anytime soon, but it’s something that an army of interpretability geniuses in a datacenter could do.
I think the current Russia-Ukraine war is a perfect place to implement such a system. It’s a war of attrition; there are not many goals that don’t reduce to “kill and destroy as much as you can”. There is also a strategic aspect: Russia pays exorbitant compensation to the families of killed soldiers, amounting to decades of income for poor regions. So, when a Russian soldier dies, one of two things can happen:
The Russian government dumps an unreasonable amount of money into the market, contributing to inflation;
The Russian government fails to pay (for a thousand and one stupid bureaucratic reasons), which erodes the trust of would-be soldiers and reduces Russia’s mobilization potential.
I can easily see how such a system could fail terribly in, say, Afghanistan (if you paid for every killed terrorist, there would be an easy loophole: “kill a civilian, say they are a terrorist”). It’s fine for the current stage of the Ukraine war.
Also, I don’t see how kill markets contribute to the military’s ability to coup. Payments are made in purely virtual points; soldiers can’t spend them on anything else.
By “artificial employee” I mean “something that can fully replace a human employee, including their agentic capabilities”. And, of course, it should be much more useful than a generic AI chatbot; it should be useful the way owning Walmart (1,200,000 employees) is useful.
“Set up a corporation with a million artificial employees” is pretty legible, but a human amount of agency is catastrophically insufficient for it.
The emphasis here is not on properties of model behavior but on how developers relate to model testing/understanding.
The recent update from OpenAI about 4o sycophancy surely looks like Standard Misalignment Scenario #325:
Our early assessment is that each of these changes, which had looked beneficial individually, may have played a part in tipping the scales on sycophancy when combined.
<...>
One of the key problems with this launch was that our offline evaluations—especially those testing behavior—generally looked good. Similarly, the A/B tests seemed to indicate that the small number of users who tried the model liked it.
<...>
some expert testers had indicated that the model behavior “felt” slightly off.
<...>
We also didn’t have specific deployment evaluations tracking sycophancy.
<...>
In the end, we decided to launch the model due to the positive signals from the users who tried out the model.
Not really, a classical universe can be spatially infinite and contain an infinite number of your copies.
I think you would appreciate this post