I don’t think people who disagree with your political beliefs must be inherently irrational.
Can you think of real world scenarios in which “shop elsewhere” isn’t an option?
Brainteaser for anyone who doesn’t regularly think about units.
Why is it that I can multiply or divide two quantities with different units, but addition or subtraction is generally not allowed?
I think the way arithmetic is being used here is closer in meaning to “dimensional analysis”.
“Type checking” through the use of units is applicable to an extremely broad class of calculations beyond Fermi Estimates.
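To make the “type checking” analogy concrete, here is a minimal sketch in Python (my own toy illustration; the Quantity class and its structure are not from the comment above or from any particular library): multiplication combines dimension exponents freely, while addition demands that the dimensions already match.

```python
class Quantity:
    def __init__(self, value, dims):
        self.value = value
        # dims maps a base unit to its exponent, e.g. {"m": 1, "s": -1} for m/s
        self.dims = {u: e for u, e in dims.items() if e != 0}

    def __mul__(self, other):
        # Multiplication: add the dimension exponents of the two quantities.
        combined = dict(self.dims)
        for unit, exp in other.dims.items():
            combined[unit] = combined.get(unit, 0) + exp
        return Quantity(self.value * other.value, combined)

    def __add__(self, other):
        # Addition: only defined when the dimensions already agree.
        if self.dims != other.dims:
            raise TypeError(f"cannot add {self.dims} and {other.dims}")
        return Quantity(self.value + other.value, self.dims)

speed = Quantity(3.0, {"m": 1, "s": -1})
duration = Quantity(10.0, {"s": 1})
distance = speed * duration   # fine: exponents combine to {"m": 1}
# speed + duration            # TypeError: the dimensional "type check" fails
```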
“will be developed by reversible computation, since we will likely have hit the Landauer Limit for non-reversible computation by then, and in principle there is basically 0 limit to how much you can optimize for reversible computation, which leads to massive energy savings, and this lets you not have to consume as much energy as current AIs or brains today.”
With respect, I believe this to be overly optimistic about the benefits of reversible computation.
Reversible computation means you aren’t erasing information, so you don’t lose energy in the form of heat (per Landauer[1][2]). But if you don’t erase information, you are faced with the issue of where to store it.
If you are performing a series of computations and only have a finite memory to work with, you will eventually need to reinitialise your registers and empty your memory, at which point you incur the energy cost that you had been trying to avoid. [3]
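For a sense of the energy scale involved, a back-of-the-envelope calculation (my own numbers, using the standard value of Boltzmann’s constant, not anything from the parent comment):

```python
import math

k_B = 1.380649e-23  # Boltzmann constant, J/K
T = 300.0           # roughly room temperature, K

# Landauer limit: minimum heat dissipated per bit irreversibly erased
E_per_bit = k_B * T * math.log(2)   # ~2.9e-21 J

# Cost of re-initialising, say, 1 GiB of memory at this theoretical minimum
bits = 8 * 2**30
print(f"{E_per_bit:.2e} J per bit, {E_per_bit * bits:.2e} J per GiB")
```

This is the floor for irreversible erasure; the point above is that reversible schemes only defer this cost until the memory has to be reset.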
Epistemics:
I’m quite confident (95%+) that the above is true. (edit: RogerDearnaley’s comment has convinced me I was overconfident) Any substantial errors would surprise me.
I’m less confident in the footnotes.
A cute, non-rigorous intuition for Landauer’s Principle:
The process of losing track of (deleting) 1 bit of information means your uncertainty about the state of the environment has increased by 1 bit. You must therefore see the total entropy increase by at least 1 bit’s worth.
Proof:
Rearrange the Landauer Limit, $E \geq k_B T \ln 2$, to $\frac{E}{T} \geq k_B \ln 2$.
Now, when you add a small amount of heat $\delta Q$ to a system, the change in entropy is given by: $\Delta S = \frac{\delta Q}{T}$.
But the $E$ occurring in Landauer’s formula is not the total energy of a system, it is the small amount of energy required to delete the information. When it all ends up as heat, we can replace $\delta Q$ with $E$ and we have: $\Delta S = \frac{E}{T} \geq k_B \ln 2$.
Compare this expression with the physicist’s definition of entropy. The entropy of a system is a scaling factor, $k_B$, times the logarithm of the number of micro-states that the system might be in, $\Omega$: $S = k_B \ln \Omega$.
The choice of units obscures the meaning of the final term: $k_B \ln 2$, converted from nats to bits, is just 1 bit.
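Collecting the steps above into a single line (my own summary, using the standard forms of these equations):

$$
E \,\geq\, k_B T \ln 2
\quad\Longrightarrow\quad
\Delta S \;=\; \frac{\delta Q}{T} \;=\; \frac{E}{T} \;\geq\; k_B \ln 2,
\qquad
S \;=\; k_B \ln \Omega .
$$

That is, deleting one bit forces the thermodynamic entropy up by at least $k_B \ln 2$, which is exactly 1 bit once converted from nats to bits.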
Splitting hairs, some setups will allow you to delete information with a reduced or zero energy cost, but the process is essentially just “kicking the can down the road”. You will incur the full cost during the process of re-initialisation.
For details, see equation (4) and fig (1) of Sagawa, Ueda (2009).
Disagree, but I sympathise with your position.
The “System 1/2” terminology ensures that your listener understands that you are referring to a specific concept as defined by Kahneman.
I’ll grant that ChatGPT displays less bias than most people on major issues, but I don’t think this is sufficient to dismiss Matt’s concern.
My intuition is that if the bias of a few flawed sources (Claude, ChatGPT) is amplified by their widespread use, the fact that it is “less biased than the average person” matters less.
This topic is important enough that you could consider making a full post.
My belief is that this would improve reach, and also make it easier for people to reference your arguments.
Consider: you believe there is a 45% chance that alignment researchers would be better suited pivoting to control research. I do not suspect a quick take will reach anywhere close to that number, and it has a low chance of catalysing dramatic, institutional-level change.
Inspired by Mark Xu’s Quick Take on control.
Some thoughts on the prevalence of alignment over control approaches in AI Safety.
“Alignment research” has become loosely synonymous with “AI Safety research”. I don’t know of any researcher who would state they’re identical, but alignment seems to be considered the default AI Safety strategy. This seems problematic: it may be causing some mild group-think and discourages people from pursuing non-alignment AI Safety agendas.
Prosaic alignment research in the short term results in a better product, and makes investment into AI more lucrative. This shortens timelines and, I claim, increases X-risk. Christiano’s work on RLHF[1] was clearly motivated by AI Safety concerns, but the resulting technique was then used to improve the performance of today’s models.[2] Meanwhile, there are strong arguments that RLHF is insufficient to solve (existential) AI Safety.[3][4]
Better theoretical understanding about how to control an unaligned (or partially aligned) AI facilitates better governance strategies. There is no physical law that means the compute threshold mentioned in SB 1047[5] would have been appropriate. “Scaling Laws” are purely empirical and may not hold in the long term.
In addition, we should anticipate rapid technological change as capabilities companies leverage AI assisted research (and ultimately fully autonomous researchers). If we want laws which can reliably control AI, we need stronger theoretical foundations for our control approaches.
Solving alignment is only sufficient for mitigating the majority of harm caused by AI if the AI is aligned to appropriate goals. To me, it appears the community is incredibly trusting in the goodwill of the small group of private individuals who will determine the alignment of an ASI. [6]
My understanding of the position held by both Yudkowsky and Christiano is that AI alignment is a very difficult problem and the probability is high that humanity is completely paperclipped. Outcomes in which a small portion of humanity survives are difficult to achieve[7] and human-controlled dystopias are better than annihilation[8].
I do not believe this is representative of the views of the majority of humans sharing this planet. Human history shows that many people are willing to risk death to preserve the autonomy of themselves, their nation or their cultural group[9]. Dystopias also carry substantial S-risk for those unlucky enough to live through them, with modern history containing multiple examples of widespread, industrialised torture.
Control approaches present an alternative in which we do not have to get it right the first time, may avert a dystopia and do not need to trust a small group of humans to behave morally.
If control approaches aid governance, alignment research makes contemporary systems more profitable and the majority of AI Safety research is being done at capabilities companies, it shouldn’t be surprising that the focus remains on alignment.
An argument for RLHF being problematic is made formally by Ngo, Chan and Minderman (2022).
See also discussion in this comment chain on Cotra’s Without specific countermeasures, the easiest path to transformative AI likely leads to AI takeover (2022)
models trained using 10^26 integer or floating-point operations.
“all I want is that we have justifiable cause to believe of a pivotally useful AGI ‘this will not kill literally everyone’”—
Yudkowsky, AGI Ruin: A List of Lethalities (2022)
“Sure, maybe. But that’s still better than a paperclip maximizer killing us all.”—Christiano on the prospect of dystopian outcomes.
There is no global polling on this exact question. Consider that people across cultures have proclaimed that they’d prefer death to a life without freedom. See also men who committed suicide rather than face a lifetime of slavery.
I am concerned our disagreement here is primarily semantic or based on a simple misunderstanding of each other’s position. I hope to better understand your objection.
“The p-zombie doesn’t believe it’s conscious, it only acts that way.”
One of us is mistaken and using a non-traditional definition of p-zombie, or we have different definitions of “belief”.
My understanding is that P-zombies are physically identical to regular humans. Their brains contain the same physical patterns that encode their model of the world. That seems, to me, a sufficient physical condition for having identical beliefs.
If your p-zombies are only “acting” like they’re conscious, but do not believe it, then they are not physically identical to humans. The existence of p-zombies, as you have described them, wouldn’t refute physicalism.
This resource indicates that the way you understand the term p-zombie may be mistaken: https://plato.stanford.edu/entries/zombies/
“but that’s because p-zombies are impossible”
The main post that I responded to, specifically the section that I directly quoted, assumes it is possible for p-zombies to exist.
My comment begins “Assuming for the sake of argument that p-zombies could exist” but this is distinct from a claim that p-zombies actually exist.
“If they were possible, this wouldn’t be the case, and we would have special access to the truth that p-zombies lack.”
I do not find this convincing, because it is an assertion that my conclusion is incorrect without engaging with the arguments I made to reach that conclusion.
I look forward to continuing this discussion.
“After all, the only thing I know that the AI has no way of knowing, is that I am a conscious being, and not a p-zombie or an actor from outside the simulation. This gives me some evidence, that the AI can’t access, that we are not exactly in the type of simulation I propose building, as I probably wouldn’t create conscious humans.”
Assuming for the sake of argument that p-zombies could exist, you do not have special access to the knowledge that you are truly conscious and not a p-zombie.
(As a human convinced I’m currently experiencing consciousness, I agree this claim intuitively seems absurd.)
Imagine a generally intelligent, agentic program which can only interact and learn facts about the physical world via making calls to a limited, high level interface or by reading and writing to a small scratchpad. It has no way to directly read its own source code.
The program wishes to learn some fact about the physical server rack it is being instantiated on. It knows the rack has been painted either red or blue.
Conveniently, the interface it accesses has the function get_rack_color(). The program records to its memory that every time it runs this function, it has received “blue”.
It postulates the existence of programs similar to itself, who have been physically instantiated on red server racks but consistently receive incorrect color information when they attempt to check.
Can the program confirm the color of its server rack?
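A minimal sketch of the two hypotheses the program is weighing (the function names and structure are my own illustration, not anything from the original discussion):

```python
# Two candidate worlds the program cannot tell apart from the inside.

def get_rack_color_world_A():
    # World A: the rack really is blue and the interface reports faithfully.
    return "blue"

def get_rack_color_world_B():
    # World B: the rack is red, but the interface has the postulated failure
    # mode and always reports "blue" anyway.
    return "blue"

# Every observation the program can ever make is identical in both worlds,
# so no amount of calling get_rack_color() settles which world it is in.
observations_A = [get_rack_color_world_A() for _ in range(1000)]
observations_B = [get_rack_color_world_B() for _ in range(1000)]
print(observations_A == observations_B)  # True
```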
You are a meat-computer with limited access to your internals, but every time you try to determine if you are conscious you conclude that you feel you are. You believe it is possible for variant meat-computers to exist who are not conscious, but who always conclude they are when attempting to check.
You cannot conclude which type of meat-computer you are.
You have no special access to the knowledge that you aren’t a p-zombie, although it feels like you do.
I do think the terminology of “hacks” and “lethal memetic viruses” conjures up images of extremely unnatural brain exploits, when you mean quite a natural process that we already see some humans going through. Some monks/nuns voluntarily remove themselves from the gene pool and, in sects that prioritise ritual devotion over concrete charity work, they are also minimising their impact on the world.
My prior is that this level of voluntary dedication (to a cause like “enlightenment”) is difficult to induce, and that there are much cruder and more effective brain hacks available.
I expect we would recognise the more lethal brain hacks as improved versions of entertainment/games/pornography/drugs. These already compel some humans to minimise their time spent competing for resources in the physical world. In a direct way, what I’m describing is the opposite of enlightenment. It is prioritising sensory pleasures over everything else.
As a Petrov, it was quite engaging and at times, very stressful. I feel very lucky and grateful that I could take part. I was also located in a different timezone and operating on only a few hours sleep which added a lot to the experience!
“I later found out that, during this window, one of the Petrovs messaged one of the mods saying to report nukes if the number reported was over a certain threshold. From looking through the array of numbers that the code would randomly select from, this policy had a ~40% chance of causing a “Nukes Incoming” report (!). Unaware of this, Ray and I made the decision not to count that period.”
I don’t mind outing myself and saying that I was the Petrov who made the conditional “Nukes Incoming” report. This occurred during the opening hours of the game and it was unclear to me if generals could unilaterally launch nukes without their team being aware. I’m happy to take a weighted karma penalty for it, particularly as the other Petrov did not take a similar action when faced with (presumably) the same information I had.[1]
Once it was established that a unilateral first strike by any individual general still informed their teammates of their action and people staked their reputation on honest reporting, the game was essentially over. From that point, my decisions to report “All Clear” were independent of the number of detected missiles.
I recorded my timestamped thoughts and decision making process throughout the day, particularly in the hour before making the conditional report. I intend on posting a summary[2] of it, but have time commitments in the immediate future:
How much would people value seeing a summary of my hour by hour decisions in the next few days over seeing a more digestible summary posted later?
Prior to the game I outlined what I thought my hypothetical decision making process was going to be, and this decision was also in conflict with that.
Missile counts, and a few other details, would of course be hidden to preserve the experience for future Petrovs. Please feel free to specify other things you believe should be hidden.
“But since it is is at least somewhat intelligent/predictive, it can make the move of “acausal collusion” with its own tendency to hallucinate, in generating its “chain”-of-”thought”.”
I am not sure what this sentence is trying to say. I understand what an acausal trade is; could you phrase it more directly?
I cannot see why you require the step that the model needs to be reasoning acausally for it to develop a strategy of deceptively hallucinating citations.
What concrete predictions follow from the model in which this is an example of “acausal collusion”?
“Cyborgism or AI-assisted research that gets up 5x speedups but applies differentially to technical alignment research”
How do you make meaningful progress while ensuring it does not speed up capabilities?
It seems unlikely that a technique exists that is exclusively useful for alignment research and can’t be tweaked to help OpenMind develop better optimization algorithms etc.
This is a leak, so keep it between you and me, but the big twist to this year’s Petrov Day event is that Generals who are nuked will be forced to watch the 2012 film on repeat.
Edit: Issues 1, 2 and 4 have been partially or completely alleviated in the latest experimental voice model. Subjectively (in <1 hour of use) there seems to be a stronger tendency to hallucinate when pressed on complex topics.
I have been attempting to use chatGPT’s (primarily 4 and 4o) voice feature to have it act as a question-answering, discussion and receptive conversation partner (separately) for the last year. The topic is usually modern physics.
I’m not going to say that it “works well” but maybe half the time it does work.
The 4 biggest issues that cause frustration:
1. As you allude to in your post, there doesn’t seem to be a way of interrupting the model via voice once it gets stuck into a monologue. The model will also cut you off, and sometimes it will pause mid-response before continuing. These issues seem like they could be fixed by more intelligent scaffolding.
2. An expert human conversation partner who is excellent at productive conversation will be able to switch seamlessly between playing the role of a receptive listener, a collaborator or an interactive tutor. To have ChatGPT play one of these roles, I usually need to spend a few minutes at the beginning of the conversation specifying how long responses should be, etc. Even after doing this, there is a strong tendency for the model to revert to giving “generic AI slop answers”. By this I mean the response begins with “You’ve touched on a fascinating observation about xyz” and then lists 3 to 5 separate ideas.
3. The model was trained on text conversations, so it will often output LaTeX equations in a manner totally inappropriate for reading out loud. This audio output is mostly incomprehensible. To work around this I have custom instructions outlining how to verbally and precisely express equations in English. This works maybe 25% of the time, and 80% of the time once I spend a few minutes of the conversation going over the rules again.
4. When talking naturally about complicated topics I will sometimes pause mid-sentence while thinking. Doing this will cause ChatGPT to think you’ve finished talking, so you’re forced to use a series of filler words to keep your sentences going, which impedes my ability to think.
Reading your posts gives me the impression that we are both loosely pointing at the same object, but with fairly large differences in terminology and formalism.
While computing exact counter-factuals has issues with chaos, I don’t think this poses a problem for my earlier proposal. I don’t think it is necessary that the AGI is able to exactly compute the counterfactual entropy production, just that it makes a reasonably accurate approximation.[1]
I think I’m in agreement with your premise that the “constitutionalist form of agency” is flawed. The absence of entropy (or indeed any internal physical resource management) from the canonical LessWrong agent foundation model is clearly a major issue. My loose thinking on this is that Bayesian networks are not a natural description of the physical world at all, although they’re an appropriate tool for how certain, very special types of open systems, “agentic optimizers”, model the world.
I have had similar thoughts to what has motivated your post on the “causal backbone”. I believe “the heterogenous fluctuations will sometimes lead to massive shifts in how the resources are distributed” is something we would see in a programmable, unbounded optimizer[2]. But I’m not sure if attempting to model this as there being a “causal backbone” is the description that is going to cut reality at the joints, due to difficulties with the physicality of causality itself (see work by Jenann Ismael).
You can construct pathological environments in which the AGI’s computation (with limited physical resources) of the counterfactual entropy production is arbitrarily large (and the resulting behaviour is arbitrarily bad). I don’t see this as a flaw with the proposal as this issue of being able to construct pathological environments exists for any safe AGI proposal.
Entropy production partially solves the Strawberry Problem:
Change in entropy production per second (against the counterfactual of not acting) is potentially an objectively measurable quantity that can be used, either alone or in conjunction with other parameters specifying a goal, to prevent unexpected behaviour.
Rob Bensinger gives Yudkowsky’s “Strawberry Problem” as follows:
How would you get an AI system to do some very modest concrete action requiring extremely high levels of intelligence, such as building two strawberries that are completely identical at the cellular level, without causing anything weird or disruptive to happen?
I understand the crux of this issue to be that it is exceptionally difficult for humans to construct a finite list of caveats or safety guardrails that we can be confident would withstand the optimisation pressure of a super intelligence doing its best to solve this task “optimally”. Without care, any measure chosen is Goodharted into uselessness and the most likely outcome is extinction.
Specifying that the predicted change in entropy production per second of the local region must remain within some $\epsilon$ of the counterfactual in which the AGI does not act at all automatically excludes almost all unexpected strategies that involve high levels of optimisation.
I conjecture that the entropy production “budget” needed for an agent to perform economically useful tasks is well below the amount needed to cause an existential disaster.
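To make the proposal slightly more concrete, here is a minimal sketch of the guardrail as a check (all names, numbers and the assumption that the rates come from the agent’s own predictive model are mine, not anything established above):

```python
def within_entropy_budget(rate_with_action: float,
                          rate_counterfactual: float,
                          epsilon: float) -> bool:
    """Approve a plan only if the predicted entropy production per second of
    the local region stays within epsilon of the counterfactual in which the
    AGI does not act at all. Both rates are assumed to be (approximate)
    outputs of the agent's own world model."""
    return abs(rate_with_action - rate_counterfactual) <= epsilon

# Hypothetical usage with made-up numbers:
plan_rate = 1.3e3            # predicted entropy production with the plan, J/(K*s)
counterfactual_rate = 1.2e3  # predicted entropy production if the AGI does nothing
if within_entropy_budget(plan_rate, counterfactual_rate, epsilon=2.0e2):
    pass  # the plan passes the guardrail and may be executed
```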
Another application: directly monitoring the entropy production of an agent engaged in a generalised search upper-bounds the number of iterations of that search (and hence the optimisation pressure). This bound appears to be independent of the technological implementation of the search.[1]
On a less optimistic note, this bound is many orders of magnitude above the efficiency of today’s computers.
“Workers regularly trade with billionaires and earn more than $77 in wages, despite vast differences in wealth.”
Yes, because the worker has something the billionaire wants (their labor) and so is able to sell it. Yudkowsky’s point about trying to sell an Oreo for $77 is that a billionaire isn’t automatically going to want to buy something off you if they don’t care about it (and neither would an ASI).
“I’m simply arguing against the point that smart AIs will automatically turn violent and steal from agents who are less smart than they are, unless they’re value aligned. This is a claim that I don’t think has been established with any reasonable degree of rigor.”
I completely agree but I’m not sure anyone is arguing that smart AIs would immediately turn violent unless it was in their strategic interest.
I think a crucial factor that is missing from your analysis is the difficulties for the attacker wanting to maneuver within the tunnel system.
In the Vietnam war and the ongoing Israel-Hamas war, the attacking forces appear to favor destroying the tunnels rather than exploiting them to maneuver. [1]
1. The layout of the tunnels is at least partially unknown to the attackers, which mitigates their ability to outflank the defenders. Yes, there may be paths that would allow the attacker to advance safely, but it may be difficult or impossible to reliably identify which route that is.
2. While maps of the tunnels could be produced through modern subsurface mapping, the attackers still must contend with area denial devices (e.g. land mines, IEDs or booby traps). The confined nature of the tunnel system makes traps substantially more efficient.
3. The previous two considerations impose a substantial psychological burden on attackers advancing through the tunnels, even if they don’t encounter any resistance.
4. (Speculative)
The density and layout of the tunnels does not need to be constant throughout the network. The system of tunnels in regions the defender doesn’t expect to hold may have hundreds of entrances and intersections, making it impossible for either side to defend effectively. But travel deeper into the defender’s territory requires passing through only a limited number of well-defended passageways. This favors the defenders using the peripheral, dense section of tunnel to employ hit-and-run tactics, rather than attempting to defend every passageway.
(My knowledge of subterranean warfare is based entirely on recreational reading.)
As a counterargument, the destruction of tunnels may be primarily due to the attacking force not intending to hold the territory permanently, and so there is little reason to preserve defensive structures.